rannet.tokenizer
Module Contents
Classes
RanNetWordPieceTokenizer | RanNet WordPiece Tokenizer
- class rannet.tokenizer.SpecialTokens(unused_num: int = 1000)
- property tokens: List[str]
- __contains__(token: str) → bool
Check whether the input token is one of the special tokens.
- Parameters:
token (str)
- Returns:
bool
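The membership check above can be illustrated with a small stand-in class. This is only a sketch: `SpecialTokensSketch`, the BERT-style token names, and the `[unusedN]` naming scheme are assumptions made for illustration, not the actual contents of `rannet.tokenizer.SpecialTokens`.

```python
from typing import List


class SpecialTokensSketch:
    """Hypothetical stand-in for rannet.tokenizer.SpecialTokens."""

    # Assumed BERT-style named tokens; the real class may define others.
    NAMED = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

    def __init__(self, unused_num: int = 1000):
        self.unused_num = unused_num

    @property
    def tokens(self) -> List[str]:
        # Named tokens first, then the reserved placeholder tokens
        # (the "[unusedN]" naming is an assumption).
        unused = [f"[unused{i}]" for i in range(self.unused_num)]
        return self.NAMED + unused

    def __contains__(self, token: str) -> bool:
        # Enables the `token in special_tokens` syntax.
        return token in self.tokens


special = SpecialTokensSketch(unused_num=3)
print(special.tokens)
print("[PAD]" in special)
print("hello" in special)
```

With the real class, the equivalent check would presumably read `"[PAD]" in SpecialTokens()`.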
- class rannet.tokenizer.RanNetWordPieceTokenizer(vocab: str | Dict[str, int] | None = None, special_tokens: SpecialTokens | None = None, clean_text: bool = True, handle_chinese_chars: bool = True, strip_accents: bool | None = None, lowercase: bool = True, wordpieces_prefix: str = '##')
Bases: tokenizers.implementations.base_tokenizer.BaseTokenizer
RanNet WordPiece Tokenizer
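To show what the `wordpieces_prefix` parameter refers to, here is a minimal sketch of greedy longest-match WordPiece splitting as commonly used in BERT-style tokenizers. The function `wordpiece_split`, the toy vocabulary, and the `[UNK]` fallback are hypothetical; the sketch does not claim to reproduce RanNet's internal implementation.

```python
from typing import Dict, List


def wordpiece_split(word: str, vocab: Dict[str, int],
                    prefix: str = "##", unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token": 0, "##izer": 1, "##s": 2, "[UNK]": 3}
print(wordpiece_split("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_split("xyz", toy_vocab))         # ['[UNK]']
```

Pieces that continue a word are marked with the prefix, so the original word can be rebuilt by stripping `##` and concatenating.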
- static from_file(vocab: str, **kwargs)
- train(files: str | List[str], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: SpecialTokens | None = None, show_progress: bool = True, wordpieces_prefix: str = '##')
Train the model using the given files.
- train_from_iterator(iterator: Iterator[str] | Iterator[Iterator[str]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: SpecialTokens | None = None, show_progress: bool = True, wordpieces_prefix: str = '##', length: int | None = None)
Train the model using the given iterator.
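The `iterator` argument accepts either single strings or batches of strings. The sketch below builds a batched iterator from an in-memory corpus; `batch_iterator` and `corpus` are hypothetical names, and the commented call at the end assumes that `length` is the total number of sequences (as in the underlying `tokenizers` library, where it is used for progress reporting) and that RanNet passes it through unchanged.

```python
from typing import Iterator, List


def batch_iterator(lines: List[str], batch_size: int = 2) -> Iterator[List[str]]:
    """Yield successive batches of non-empty, stripped lines."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(cleaned), batch_size):
        yield cleaned[i:i + batch_size]


corpus = ["hello world", "  ", "word pieces are fun", "train a tokenizer"]
batches = list(batch_iterator(corpus))
num_sequences = sum(len(batch) for batch in batches)
print(batches)
print(num_sequences)

# Assumed usage, following the signature documented above (not run here):
# tokenizer = RanNetWordPieceTokenizer()
# tokenizer.train_from_iterator(batch_iterator(corpus),
#                               vocab_size=30000,
#                               length=num_sequences)
```

Batching keeps memory use bounded when the lines come from a large file or dataset rather than a list.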
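- rematch_to_text(offsets: List[Tuple[int, int]]) → List[List[int]]
Map token offsets back to character positions in the original text. In the example below, the first and last elements of each entry index the first and last character of the corresponding token.
>>> text = 'hello [PAD] world'
>>> t = tokenizer.encode(text)
>>> mapping = tokenizer.rematch_to_text(t.offsets)
>>> for ch_pos in mapping:
...     print(text[ch_pos[0]: ch_pos[-1]+1])
hello
[PAD]
world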
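The shape of the returned mapping can be reproduced without a trained tokenizer. The sketch below assumes that each offset is a half-open `(start, end)` pair and that the mapping simply lists the character indices it covers; `rematch_sketch` is a hypothetical helper, and the offsets are written by hand rather than produced by `encode`.

```python
from typing import List, Tuple


def rematch_sketch(offsets: List[Tuple[int, int]]) -> List[List[int]]:
    """Expand each half-open (start, end) offset into character indices."""
    return [list(range(start, end)) for start, end in offsets]


text = "hello [PAD] world"
offsets = [(0, 5), (6, 11), (12, 17)]  # hand-written, not tokenizer output
mapping = rematch_sketch(offsets)
for ch_pos in mapping:
    if ch_pos:  # a zero-width offset would expand to an empty list
        print(text[ch_pos[0]: ch_pos[-1] + 1])
```

Under these assumptions the loop prints `hello`, `[PAD]`, and `world` on separate lines, matching the doctest above.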