rannet.tokenizer

Module Contents

Classes

SpecialTokens

RanNetWordPieceTokenizer

RanNet WordPiece Tokenizer

class rannet.tokenizer.SpecialTokens(unused_num: int = 1000)
property tokens: List[str]
__contains__(token: str) → bool

Check whether the input token is one of the special tokens.

Parameters:

token (str): the token to look up

Returns:

bool
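
The membership check can be pictured with a small stand-in class. This is a minimal sketch, not the library's implementation: the class name `SpecialTokensSketch`, the base token list, and the `[unusedN]` naming are assumptions borrowed from BERT-style vocabularies.

```python
from typing import List


class SpecialTokensSketch:
    """Hypothetical stand-in for SpecialTokens; token names are assumed."""

    # Assumed BERT-style base tokens, not taken from the rannet source.
    BASE = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

    def __init__(self, unused_num: int = 1000):
        self.unused_num = unused_num

    @property
    def tokens(self) -> List[str]:
        # Base tokens followed by the reserved placeholder tokens.
        unused = [f'[unused{i}]' for i in range(self.unused_num)]
        return self.BASE + unused

    def __contains__(self, token: str) -> bool:
        # Enables the `token in special_tokens` syntax.
        return token in self.tokens


special = SpecialTokensSketch(unused_num=3)
print('[PAD]' in special)    # True
print('hello' in special)    # False
print(len(special.tokens))   # 8 (5 base tokens + 3 unused placeholders)
```

Defining `__contains__` is what lets callers write `token in special_tokens` instead of searching the `tokens` list themselves.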

class rannet.tokenizer.RanNetWordPieceTokenizer(vocab: str | Dict[str, int] | None = None, special_tokens: SpecialTokens | None = None, clean_text: bool = True, handle_chinese_chars: bool = True, strip_accents: bool | None = None, lowercase: bool = True, wordpieces_prefix: str = '##')

Bases: tokenizers.implementations.base_tokenizer.BaseTokenizer

RanNet WordPiece Tokenizer

static from_file(vocab: str, **kwargs)
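
`from_file` takes a path to a vocabulary file, while the constructor also accepts a `Dict[str, int]`. The sketch below shows how the two forms could relate, assuming the common WordPiece layout of one token per line with the line number as the token id. That layout and the helper name `load_vocab` are assumptions; this page does not confirm them.

```python
import os
import tempfile
from typing import Dict


def load_vocab(path: str) -> Dict[str, int]:
    """Hypothetical helper: read a one-token-per-line vocab file into a dict."""
    vocab: Dict[str, int] = {}
    with open(path, encoding='utf-8') as f:
        for index, line in enumerate(f):
            token = line.rstrip('\n')
            if token:  # skip blank lines, but keep the line number as the id
                vocab[token] = index
    return vocab


# Write a tiny vocab file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, 'vocab.txt')
    with open(vocab_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(['[PAD]', '[UNK]', 'hello', '##lo', 'world']) + '\n')
    vocab = load_vocab(vocab_path)

print(vocab)
```

The `##lo` entry illustrates the default `wordpieces_prefix`, which marks a piece that continues a word rather than starting one.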
train(files: str | List[str], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: SpecialTokens | None = None, show_progress: bool = True, wordpieces_prefix: str = '##')

Train the model on the given file or list of files.

train_from_iterator(iterator: Iterator[str] | Iterator[Iterator[str]], vocab_size: int = 30000, min_frequency: int = 2, limit_alphabet: int = 1000, initial_alphabet: List[str] = [], special_tokens: SpecialTokens | None = None, show_progress: bool = True, wordpieces_prefix: str = '##', length: int | None = None)

Train the model on the given iterator, which may yield single strings or batches of strings.
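
A batched iterator for an in-memory corpus could look like the sketch below. The helper name `batch_iterator` and the sample corpus are hypothetical. The call to `train_from_iterator` is left commented out so the block runs without rannet installed, and it assumes `length` is the total number of sequences, as in the underlying `tokenizers` library.

```python
from typing import Iterator, List


def batch_iterator(lines: List[str], batch_size: int = 2) -> Iterator[List[str]]:
    """Hypothetical helper: yield the corpus in fixed-size batches."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


corpus = ['hello world', 'hello there', 'word pieces', 'sub word units', 'the end']
batches = list(batch_iterator(corpus, batch_size=2))
print(batches)

# With rannet installed, the generator could be passed straight to the trainer
# (assumed usage, based only on the signature above):
# tokenizer = RanNetWordPieceTokenizer()
# tokenizer.train_from_iterator(batch_iterator(corpus), length=len(corpus))
```

Batching keeps memory use flat for large corpora, since only one slice of the data is materialized at a time.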

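rematch_to_text(offsets: List[Tuple[int, int]]) → List[List[int]]

Map token offsets back to character positions in the original text, one list of positions per token.

>>> # tokenizer is an already constructed RanNetWordPieceTokenizer
>>> text = 'hello [PAD] world'
>>> t = tokenizer.encode(text)
>>> mapping = tokenizer.rematch_to_text(t.offsets)
>>> for ch_pos in mapping:
...     print(text[ch_pos[0]: ch_pos[-1] + 1])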
hello
[PAD]
world
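
The shape of the returned mapping can be reproduced without the library. The following is a minimal sketch, not the actual implementation: `offsets_to_positions` is a hypothetical function, the offsets are written by hand, and each span is assumed to be half-open (start inclusive, end exclusive).

```python
from typing import List, Tuple


def offsets_to_positions(offsets: List[Tuple[int, int]]) -> List[List[int]]:
    """Hypothetical re-implementation: expand (start, end) spans into index lists."""
    # Assumes half-open spans: start is inclusive, end is exclusive.
    return [list(range(start, end)) for start, end in offsets]


text = 'hello [PAD] world'
# Hand-written offsets standing in for `tokenizer.encode(text).offsets`.
offsets = [(0, 5), (6, 11), (12, 17)]
mapping = offsets_to_positions(offsets)

# Recover each token's surface text from its first and last character position.
pieces = [text[ch_pos[0]: ch_pos[-1] + 1] for ch_pos in mapping]
print(pieces)  # ['hello', '[PAD]', 'world']
```

In this sketch a span with equal start and end produces an empty list, so indexing `ch_pos[0]` would raise `IndexError`; callers may want to skip empty entries.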