:py:mod:`rannet.dataloader`
===========================

.. py:module:: rannet.dataloader

.. autoapi-nested-parse::

   DataLoader for pretraining



Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   rannet.dataloader.DataLoader
   rannet.dataloader.BertMlmDataLoader
   rannet.dataloader.Seq2SeqLMDataLoader



Functions
~~~~~~~~~

.. autoapisummary::

   rannet.dataloader.subfinder



.. py:function:: subfinder(array: List, sub_array: List) -> List[int]

   Find the start positions of every occurrence of ``sub_array`` in ``array``.

   Example:

   >>> array = [0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2]
   >>> sub_array = [1, 2, 3]
   >>> subfinder(array, sub_array)
   [2, 6]


.. py:class:: DataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, max_length: int = 512)

   Base data loader for pretraining.

   .. py:method:: process_sentence(text) -> Tuple[List[int], List[int]]
      :abstractmethod:


   .. py:method:: tfrecord_serialize(instances, instance_keys=['token_ids', 'mask_ids'])

      Serialize instances to TFRecord format.


   .. py:method:: load_tfrecord(record_paths, batch_size, sequence_length=512, buffer_size=None)
      :staticmethod:

      Load a dataset from TFRecord files.


   .. py:method:: get_random_token(token_id: int) -> int


   .. py:method:: truncate_pad_sequence(sequence: List[int], padding_value=0) -> List[int]


   .. py:method:: process_paragraph(texts: Union[List[str], List[Dict[str, str]]])

      :param texts: Input texts, of type ``Union[List[str], List[Dict[str, str]]]``.
                    For NOLAN-style: ``[{"word": "xxx", "sentence": "xxx"}, ]``;
                    for BERT-style: ``["sentence 1", "xxx", ]``.


   .. py:method:: process(corpus: List[List], record_path: str, workers=4)

      Process a corpus.



.. py:class:: BertMlmDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, word_segment: Callable, mask_rate: float = 0.15, max_length: int = 512)

   Bases: :py:obj:`DataLoader`

   DataLoader with the BERT MLM setting.

   .. py:method:: process_sentence(obj: str) -> Tuple[List[int], List[int]]



.. py:class:: Seq2SeqLMDataLoader(tokenizer: rannet.tokenizer.RanNetWordPieceTokenizer, mask_rate: float = 0.15, max_source_length: Optional[int] = None, max_target_length: Optional[int] = None)

   DataLoader for seq2seq.

   .. py:method:: process(source_text: str, target_text: str) -> Tuple[List[int], List[int], List[int]]
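
The behaviour documented for ``subfinder`` (returning the start index of each occurrence of a sub-list) can be illustrated with a minimal sketch. This is not the library's actual implementation, only a stand-alone function that reproduces the documented example. Returning an empty list for an empty ``sub_array`` is an assumption made here, since the source does not specify that case.

```python
from typing import List


def subfinder(array: List, sub_array: List) -> List[int]:
    """Illustrative sketch: start index of every occurrence of sub_array in array."""
    n = len(sub_array)
    if n == 0:
        # Assumption: an empty pattern matches nowhere (not specified by the docs).
        return []
    # Slide a window of length n over the array and compare each slice.
    return [i for i in range(len(array) - n + 1) if array[i:i + n] == sub_array]


if __name__ == "__main__":
    array = [0, 0, 1, 2, 3, 5, 1, 2, 3, 1, 2]
    print(subfinder(array, [1, 2, 3]))
```

The trailing ``1, 2`` in the example array is not reported because the window comparison requires the full sub-list to be present.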