:py:mod:`rannet.tokenizer`
==========================

.. py:module:: rannet.tokenizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   rannet.tokenizer.SpecialTokens
   rannet.tokenizer.RanNetWordPieceTokenizer


.. py:class:: SpecialTokens(unused_num: int = 1000)


   .. py:property:: tokens
      :type: List[str]


   .. py:method:: __contains__(token: str) -> bool

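      Check whether the input token is one of the special tokens.

      :param token: The token to look up.
      :type token: str
      :returns: ``True`` if ``token`` is a special token, ``False`` otherwise.
      :rtype: bool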


.. py:class:: RanNetWordPieceTokenizer(vocab: Optional[Union[str, Dict[str, int]]] = None, special_tokens: Optional[SpecialTokens] = None, clean_text: bool = True, handle_chinese_chars: bool = True, strip_accents: Optional[bool] = None, lowercase: bool = True, wordpieces_prefix: str = '##')


   Bases: :py:obj:`tokenizers.implementations.base_tokenizer.BaseTokenizer`

   RanNet WordPiece Tokenizer

   .. py:method:: from_file(vocab: str, **kwargs)
      :staticmethod:


   .. py:method:: train(files: Union[str, List[str]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: Optional[SpecialTokens] = None, show_progress: bool = True, wordpieces_prefix: str = '##')

      Train the model using the given files.


   .. py:method:: train_from_iterator(iterator: Union[Iterator[str], Iterator[Iterator[str]]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: Optional[SpecialTokens] = None, show_progress: bool = True, wordpieces_prefix: str = '##', length: Optional[int] = None)

      Train the model using the given iterator.
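
      The ``iterator`` argument accepts either a stream of strings or a
      stream of batches of strings. The sketch below shows one way to build
      the batched form with the standard library. ``batch_iterator`` and the
      sample ``corpus`` are hypothetical names introduced for illustration,
      and the commented-out call assumes that, as in the Hugging Face
      ``tokenizers`` library, batches may be plain lists and ``length`` is the
      total number of sequences, used for progress reporting.

      .. code-block:: python

         from itertools import islice
         from typing import Iterable, Iterator, List


         def batch_iterator(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
             """Yield successive lists of at most ``batch_size`` lines."""
             it = iter(lines)
             while True:
                 batch = list(islice(it, batch_size))
                 if not batch:
                     return
                 yield batch


         # Tiny in-memory corpus, for illustration only.
         corpus = ["hello world", "rannet tokenizer", "word piece", "training data", "last line"]
         batches = list(batch_iterator(corpus, batch_size=2))

         # Hypothetical call, not executed here (requires rannet to be installed):
         # tokenizer = RanNetWordPieceTokenizer()
         # tokenizer.train_from_iterator(
         #     batch_iterator(corpus, batch_size=2),
         #     vocab_size=30000,
         #     length=len(corpus),  # assumed: number of sequences, not batches
         # )

      Batching keeps memory use bounded when the corpus is read lazily, for
      example from a file object, instead of being loaded as one list.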


   .. py:method:: rematch_to_text(offsets: List[Tuple[int, int]]) -> List[List[int]]

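      Map token offsets back to character positions in the original text.
      In the example below, each entry of the result holds the character
      indices covered by one token, so slicing from its first to its last
      index recovers that token's text.

      >>> text = 'hello [PAD] world'
      >>> t = tokenizer.encode(text)
      >>> mapping = tokenizer.rematch_to_text(t.offsets)
      >>> for ch_pos in mapping:
      ...     print(text[ch_pos[0]: ch_pos[-1]+1])
      hello
      [PAD]
      world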
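
      The idea behind the example above can be sketched without the
      library. This is not the implementation used by ``rematch_to_text``:
      ``offsets_to_char_indices`` is a hypothetical helper, the offsets are
      written by hand, and each offset is assumed to be a half-open
      ``(start, end)`` span. The real method may treat special tokens or
      empty spans differently.

      .. code-block:: python

         from typing import List, Tuple


         def offsets_to_char_indices(offsets: List[Tuple[int, int]]) -> List[List[int]]:
             """Expand half-open (start, end) spans into lists of character indices."""
             return [list(range(start, end)) for start, end in offsets]


         text = 'hello [PAD] world'
         # Hand-written offsets for illustration; real ones would come from
         # tokenizer.encode(text).offsets.
         offsets = [(0, 5), (6, 11), (12, 17)]
         mapping = offsets_to_char_indices(offsets)

         # Skip empty entries so that indexing ch_pos[0] cannot fail.
         pieces = [text[ch_pos[0]: ch_pos[-1] + 1] for ch_pos in mapping if ch_pos]
         print(pieces)

      Guarding against empty entries matters in this sketch because a span
      with equal start and end expands to an empty list.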