�
�܋f, � �p � d dl Z d dlZd dlmZmZ ddlmZmZ ej d� � Z G d� d� � Z
dS )� N)�Optional�Union� )�LanguageFilter�ProbingStates% [a-zA-Z]*[�-�]+[a-zA-Z]*[^a-zA-Z�-�]?c �` � e Zd ZdZej fdeddfd�Zdd�Zede e
fd�� � Zede e
fd�� � Zd e
eef defd
�Zedefd�� � Zdefd�Zed
e
eef defd�� � Zed
e
eef defd�� � Zed
e
eef defd�� � ZdS )�
CharSetProbergffffff�?�lang_filter�returnNc � � t j | _ d| _ || _ t j t � � | _ d S )NT) r � DETECTING�_state�activer
�logging� getLogger�__name__�logger)�selfr
s �L/opt/cloudlinux/venv/lib64/python3.11/site-packages/chardet/charsetprober.py�__init__zCharSetProber.__init__, s1 � �"�,������&����'��1�1����� c �( � t j | _ d S �N)r r
r �r s r �resetzCharSetProber.reset2 s � �"�,����r c � � d S r � r s r �charset_namezCharSetProber.charset_name5 s � ��tr c � � t �r ��NotImplementedErrorr s r �languagezCharSetProber.language9 s � �!�!r �byte_strc � � t �r r )r r# s r �feedzCharSetProber.feed= s � �!�!r c � � | j S r )r r s r �statezCharSetProber.state@ s
� ��{�r c � � dS )Ng r r s r �get_confidencezCharSetProber.get_confidenceD s � ��sr �bufc �2 � t j dd| � � } | S )Ns ([ -])+� )�re�sub)r* s r �filter_high_byte_onlyz#CharSetProber.filter_high_byte_onlyG s � ��f�&��c�2�2���
r c � � t � � }t � | � � }|D ]Z}|� |dd� � � |dd� }|� � � s|dk rd}|� |� � �[|S )u7
We define three types of bytes:
alphabet: english alphabets [a-zA-Z]
international: international characters [\x80-\xFF]
marker: everything else [^a-zA-Z\x80-\xFF]
The input buffer can be thought to contain a series of words delimited
by markers. This function works to filter all words that contain at
least one international character. All contiguous sequences of markers
are replaced by a single space ascii character.
This filter applies to all scripts which do not use English characters.
N���� �r, )� bytearray�INTERNATIONAL_WORDS_PATTERN�findall�extend�isalpha)r* �filtered�words�word� last_chars r �filter_international_wordsz(CharSetProber.filter_international_wordsL s� � � �;�;��
,�3�3�C�8�8���
'�
'�D��O�O�D��"��I�&�&�&� �R�S�S� �I��$�$�&�&�
!�9�w�+>�+>� � ��O�O�I�&�&�&�&��r c �v � t � � }d}d}t | � � � d� � } t | � � D ]U\ }}|dk r|dz }d}�|dk r<||k r4|s2|� | ||� � � |� d� � d}�V|s|� | |d � � � |S )
a[
Returns a copy of ``buf`` that retains only the sequences of English
alphabet and high byte characters that are not between <> characters.
This filter can be applied to all scripts which contain both English
characters and extended ASCII characters, but is currently only used by
``Latin1Prober``.
Fr �c� >r � <r, TN)r3 �
memoryview�cast� enumerater6 )r* r8 �in_tag�prev�curr�buf_chars r �remove_xml_tagszCharSetProber.remove_xml_tagsn s� � � �;�;��������o�o�"�"�3�'�'��'��n�n� � �N�D�(� �4����a�x������T�!�!��$�;�;�v�;� �O�O�C��T� �N�3�3�3��O�O�D�)�)�)���� � (�
�O�O�C����J�'�'�'��r )r N)r �
__module__�__qualname__�SHORTCUT_THRESHOLDr �NONEr r �propertyr �strr r"