rannet.dataloader
DataLoader for pretraining
Module Contents
Classes
DataLoader | dataloader for pretraining
BertMlmDataLoader | DataLoader with BERT MLM setting
Seq2SeqLMDataLoader | DataLoader for seq2seq
Functions
subfinder | find sub-array positions
- rannet.dataloader.subfinder(array: List, sub_array: List) -> List[int]
Find the start positions of a sub-array within an array. Example:
>>> array = [0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2]
>>> sub_array = [1, 2, 3]
>>> subfinder(array, sub_array)
[2, 6]
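The doctest above can be reproduced with a simple sliding-window comparison. The following is a minimal sketch, not the library's implementation: the name `find_subarray_positions` is hypothetical, and returning an empty list for an empty `sub_array` is an assumption.

```python
from typing import List


def find_subarray_positions(array: List, sub_array: List) -> List[int]:
    """Return every start index at which sub_array occurs in array.

    Hypothetical stand-in for subfinder; the empty-input behavior is assumed.
    """
    n, m = len(array), len(sub_array)
    if m == 0 or m > n:
        return []
    # Compare each window of length m against sub_array.
    return [i for i in range(n - m + 1) if array[i:i + m] == sub_array]


positions = find_subarray_positions([0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2], [1, 2, 3])
print(positions)  # [2, 6]
```

The trailing `[1, 2]` at the end of the array is not reported because only complete matches count.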
- class rannet.dataloader.DataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, max_length: int = 512)
Dataloader for pretraining
- abstract process_sentence(text) -> Tuple[List[int], List[int]]
- tfrecord_serialize(instances, instance_keys=['token_ids', 'mask_ids'])
Convert instances to TFRecord format
- static load_tfrecord(record_paths, batch_size, sequence_length=512, buffer_size=None)
Load a dataset from TFRecord files
- get_random_token(token_id: int) -> int
- truncate_pad_sequence(sequence: List[int], padding_value=0) -> List[int]
- process_paragraph(texts: List[str] | List[Dict[str, str]])
- Parameters:
texts – Union[List[str], List[Dict[str, str]]]. For NOLAN-style: [{"word": "xxx", "sentence": "xxx"}, ]; for BERT-style: ["sentence 1", "xxx", ]
- process(corpus: List[List], record_path: str, workers=4)
Process the corpus
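The signature of `truncate_pad_sequence` suggests that sequences are brought to a fixed length before serialization. The sketch below illustrates that idea under stated assumptions: the function name `truncate_or_pad` is hypothetical, the target length is passed explicitly rather than read from the loader's `max_length`, and truncation and padding are both assumed to happen on the right. The library's actual behavior may differ.

```python
from typing import List


def truncate_or_pad(sequence: List[int], max_length: int, padding_value: int = 0) -> List[int]:
    """Cut sequence to max_length, then right-pad with padding_value (assumed behavior)."""
    truncated = sequence[:max_length]
    # The multiplier is zero when the sequence already fills max_length.
    return truncated + [padding_value] * (max_length - len(truncated))


print(truncate_or_pad([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(truncate_or_pad([5, 6, 7, 8, 9, 10], 4))  # [5, 6, 7, 8]
```

Fixed-length outputs of this kind are what allow `load_tfrecord` to take a single `sequence_length` argument.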
- class rannet.dataloader.BertMlmDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, word_segment: Callable, mask_rate: float = 0.15, max_length: int = 512)
Bases: DataLoader
DataLoader with BERT MLM setting
- process_sentence(obj: str) -> Tuple[List[int], List[int]]
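For readers unfamiliar with the BERT MLM setting, the sketch below shows the standard BERT masking recipe: each token is selected with probability `mask_rate`, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Everything here is an assumption for illustration: the function `mlm_mask`, its parameters, and the convention that unselected positions carry `0` in the second list are hypothetical, and the word-level grouping that `word_segment` presumably enables is not modeled. It is not the code behind `process_sentence`.

```python
import random
from typing import List, Tuple


def mlm_mask(
    token_ids: List[int],
    mask_rate: float,
    mask_token_id: int,
    vocab_size: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, mask_ids) following the standard BERT 80/10/10 rule.

    mask_ids holds the original token at selected positions and 0 elsewhere
    (an assumed convention that treats id 0 as padding).
    """
    input_ids: List[int] = []
    mask_ids: List[int] = []
    for token_id in token_ids:
        if rng.random() >= mask_rate:
            # Not selected: keep the token and emit no prediction target.
            input_ids.append(token_id)
            mask_ids.append(0)
            continue
        mask_ids.append(token_id)
        roll = rng.random()
        if roll < 0.8:
            input_ids.append(mask_token_id)              # replace with the mask token
        elif roll < 0.9:
            input_ids.append(rng.randrange(vocab_size))  # replace with a random token
        else:
            input_ids.append(token_id)                   # keep the original token
    return input_ids, mask_ids


rng = random.Random(0)
inputs, targets = mlm_mask([11, 12, 13, 14, 15, 16], 0.15, 4, 100, rng)
print(inputs, targets)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; the token ids, mask token id, and vocabulary size in the usage lines are placeholders.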
- class rannet.dataloader.Seq2SeqLMDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, mask_rate: float = 0.15, max_source_length: int | None = None, max_target_length: int | None = None)
DataLoader for seq2seq
- process(source_text: str, target_text: str) -> Tuple[List[int], List[int], List[int]]