rannet.dataloader

DataLoader for pretraining

Module Contents

Classes

DataLoader

DataLoader for pretraining

BertMlmDataLoader

DataLoader with BERT MLM setting

Seq2SeqLMDataLoader

DataLoader for seq2seq

Functions

subfinder(array, sub_array) → List[int]

find sub-array positions

rannet.dataloader.subfinder(array: List, sub_array: List) → List[int]

Find the start positions of every occurrence of sub_array in array.

Example:

>>> array = [0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2]
>>> sub_array = [1, 2, 3]
>>> subfinder(array, sub_array)
[2, 6]
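The behaviour shown in the example can be reproduced with a short sliding-window comparison. The following is a minimal sketch, not the library's actual implementation; the name subfinder_sketch and the handling of an empty sub_array (returning an empty list) are assumptions made for illustration.

```python
from typing import List


def subfinder_sketch(array: List, sub_array: List) -> List[int]:
    """Return every start index at which sub_array occurs in array.

    Hypothetical stand-in for rannet.dataloader.subfinder; the real
    function may differ, especially for edge cases.
    """
    n, m = len(array), len(sub_array)
    # Assumption: an empty or over-long sub_array yields no matches.
    if m == 0 or m > n:
        return []
    # Compare each window of length m against sub_array.
    return [i for i in range(n - m + 1) if array[i:i + m] == sub_array]


if __name__ == "__main__":
    print(subfinder_sketch([0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2], [1, 2, 3]))
```

Note that the trailing [1, 2] in the example array is not reported, since only complete occurrences of sub_array count as matches.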

class rannet.dataloader.DataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, max_length: int = 512)

DataLoader for pretraining

abstract process_sentence(text) → Tuple[List[int], List[int]]

tfrecord_serialize(instances, instance_keys=['token_ids', 'mask_ids'])

convert to tfrecord

static load_tfrecord(record_paths, batch_size, sequence_length=512, buffer_size=None)

load dataset from tfrecord

get_random_token(token_id: int) → int

truncate_pad_sequence(sequence: List[int], padding_value=0) → List[int]
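The reference gives no description for truncate_pad_sequence, but the name and signature suggest that it brings a token sequence to a fixed length. The sketch below illustrates that idea under explicit assumptions: that the target length is the loader's max_length, that truncation drops tokens from the end, and that padding is appended on the right. The function name truncate_pad is hypothetical.

```python
from typing import List


def truncate_pad(sequence: List[int], max_length: int,
                 padding_value: int = 0) -> List[int]:
    """Cut or right-pad sequence so that it has exactly max_length items.

    Hypothetical illustration; the truncation side and padding side used
    by rannet's truncate_pad_sequence are assumptions.
    """
    truncated = sequence[:max_length]
    # Append padding only when the sequence is shorter than max_length.
    return truncated + [padding_value] * (max_length - len(truncated))


if __name__ == "__main__":
    print(truncate_pad([5, 6, 7], max_length=5))
    print(truncate_pad([5, 6, 7, 8, 9, 10], max_length=4))
```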
process_paragraph(texts: List[str] | List[Dict[str, str]])

Parameters:

texts – Union[List[str], List[Dict[str, str]]]. For NOLAN-style input: [{"word": "xxx", "sentence": "xxx"}, ]; for BERT-style input: ["sentence 1", "xxx", ]

process(corpus: List[List], record_path: str, workers=4)

process corpus

class rannet.dataloader.BertMlmDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, word_segment: Callable, mask_rate: float = 0.15, max_length: int = 512)

Bases: DataLoader

DataLoader with BERT MLM setting
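For readers unfamiliar with the BERT MLM setting, the sketch below shows the standard BERT masking recipe: each token is selected with probability mask_rate, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. This is an illustration of the general technique, not this class's code. The function name, the mask_token_id and vocab_size parameters, and the use of 0 as the label for unselected positions are all assumptions, and the sketch masks individual tokens, ignoring whatever role word_segment plays in the real loader.

```python
import random
from typing import List, Optional, Tuple


def mlm_mask_sketch(token_ids: List[int], mask_token_id: int, vocab_size: int,
                    mask_rate: float = 0.15,
                    rng: Optional[random.Random] = None
                    ) -> Tuple[List[int], List[int]]:
    """Apply BERT-style masking and return (input_ids, label_ids).

    Hypothetical helper: label_ids holds the original token at selected
    positions and 0 elsewhere (an assumed convention).
    """
    rng = rng or random.Random()
    input_ids: List[int] = []
    label_ids: List[int] = []
    for token_id in token_ids:
        if rng.random() < mask_rate:
            label_ids.append(token_id)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                input_ids.append(mask_token_id)              # 80%: mask token
            elif roll < 0.9:
                input_ids.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                input_ids.append(token_id)                   # 10%: unchanged
        else:
            input_ids.append(token_id)
            label_ids.append(0)  # position not selected for prediction
    return input_ids, label_ids


if __name__ == "__main__":
    ids, labels = mlm_mask_sketch([11, 12, 13, 14], mask_token_id=3,
                                  vocab_size=100, rng=random.Random(0))
    print(ids, labels)
```

Passing a seeded random.Random makes the masking reproducible, which helps when comparing serialized records across runs.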

process_sentence(obj: str) → Tuple[List[int], List[int]]

class rannet.dataloader.Seq2SeqLMDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, mask_rate: float = 0.15, max_source_length: int | None = None, max_target_length: int | None = None)

DataLoader for seq2seq

process(source_text: str, target_text: str) → Tuple[List[int], List[int], List[int]]