:py:mod:`rannet.tokenizer`
==========================

.. py:module:: rannet.tokenizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   rannet.tokenizer.SpecialTokens
   rannet.tokenizer.RanNetWordPieceTokenizer


.. py:class:: SpecialTokens(unused_num: int = 1000)

   .. py:property:: tokens
      :type: List[str]


   .. py:method:: __contains__(token: str) -> bool

      Check whether the input token is one of the special tokens.

      :param token: The token to check.
      :type token: str
      :rtype: bool


.. py:class:: RanNetWordPieceTokenizer(vocab: Optional[Union[str, Dict[str, int]]] = None, special_tokens: Optional[SpecialTokens] = None, clean_text: bool = True, handle_chinese_chars: bool = True, strip_accents: Optional[bool] = None, lowercase: bool = True, wordpieces_prefix: str = '##')

   Bases: :py:obj:`tokenizers.implementations.base_tokenizer.BaseTokenizer`

   RanNet WordPiece Tokenizer

   .. py:method:: from_file(vocab: str, **kwargs)
      :staticmethod:


   .. py:method:: train(files: Union[str, List[str]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: Optional[SpecialTokens] = None, show_progress: bool = True, wordpieces_prefix: str = '##')

      Train the model using the given files.


   .. py:method:: train_from_iterator(iterator: Union[Iterator[str], Iterator[Iterator[str]]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: Optional[SpecialTokens] = None, show_progress: bool = True, wordpieces_prefix: str = '##', length: Optional[int] = None)

      Train the model using the given iterator.


   .. py:method:: rematch_to_text(offsets: List[Tuple[int, int]]) -> List[List[int]]

      Map token offsets back to character positions in the original text.

      >>> text = 'hello [PAD] world'
      >>> t = tokenizer.encode(text)
      >>> mapping = tokenizer.rematch_to_text(t.offsets)
      >>> for ch_pos in mapping:
      ...     print(text[ch_pos[0]: ch_pos[-1] + 1])
      hello
      [PAD]
      world
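
The ``rematch_to_text`` example suggests that each token's ``(start, end)`` offset pair is expanded into a list of character positions, so that ``ch_pos[0]`` and ``ch_pos[-1]`` bound the token in the source text. The following is a minimal, library-independent sketch of that idea. The helper name ``offsets_to_char_positions`` and the hard-coded offsets are assumptions made for illustration; this is not RanNet's actual implementation, and it assumes half-open offsets of the kind the ``tokenizers`` library reports.

```python
from typing import List, Tuple


def offsets_to_char_positions(offsets: List[Tuple[int, int]]) -> List[List[int]]:
    """Expand half-open (start, end) offsets into explicit character indices.

    Hypothetical helper: it mirrors the input/output types of
    ``rematch_to_text`` but is not taken from the RanNet source.
    """
    return [list(range(start, end)) for start, end in offsets]


text = "hello [PAD] world"

# Assumed offsets for three tokens; a real tokenizer would supply these
# through the ``offsets`` attribute of its encoding result.
offsets = [(0, 5), (6, 11), (12, 17)]

mapping = offsets_to_char_positions(offsets)

# Recover each token's surface form from its first and last character index.
# Empty position lists (zero-width offsets) are skipped.
spans = [text[pos[0]: pos[-1] + 1] for pos in mapping if pos]
print(spans)  # ['hello', '[PAD]', 'world']
```

Returning explicit index lists rather than ``(start, end)`` pairs makes it straightforward to skip tokens with no characters in the source text, which show up as empty lists.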