Upgrade dataset (PaddlePaddle#214)
* update squad metric

* fix bug and add some doc

* update quick_start

* minor update

* minor fix

* update doc

* update doc

* update iterdataset

* minor fix

* fix doc

* fix doc

* fix doc

* upgrade dataset

* fix menu
smallv0221 authored Apr 2, 2021
1 parent d1c11f3 commit ef1719b
Showing 4 changed files with 90 additions and 28 deletions.
6 changes: 5 additions & 1 deletion docs/data_prepare/data_preprocess.rst
@@ -203,4 +203,8 @@ PaddleNLP has multiple built-in collate functions, which work together with :class:`paddle.io.BatchSampler
Stack(dtype="int64") # label
}): fn(samples)
As you can see, the :func:`Dict` function maps the keys of a single example to functions such as :func:`Pad`, which suits the case where a single example is a dictionary, while :func:`Tuple` maps by the index of each part of a single example. So note that the :func:`convert_example` method and the :func:`batchify_fn` method must match.
As you can see, the :func:`Dict` function maps the keys of a single example to functions such as :func:`Pad`, which suits the case where a single example is a dictionary, while :func:`Tuple` maps by the index of each part of a single example.

So it is **important** that the :func:`convert_example` method and the :func:`batchify_fn` method match.

The subsequent workflow is the same as the data preprocessing based on pretrained models.
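To make the pairing concrete, here is a minimal sketch (illustrative only; it assumes examples with ``input_ids`` and ``labels`` fields and the :func:`Pad`, :func:`Stack`, :func:`Dict` and :func:`Tuple` helpers from ``paddlenlp.data``):

.. code-block::

    from paddlenlp.data import Dict, Pad, Stack, Tuple

    # convert_example returns a dict -> pair it with Dict (fields matched by key)
    dict_batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=0),   # pad token ids to equal length
        'labels': Stack(dtype="int64")         # stack labels into a batch
    }): fn(samples)

    # convert_example returns a tuple/list -> pair it with Tuple (fields matched by index)
    tuple_batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=0),    # index 0: input_ids
        Stack(dtype="int64")       # index 1: labels
    ): fn(samples)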
21 changes: 13 additions & 8 deletions docs/data_prepare/dataset_self_defined.rst
@@ -7,38 +7,39 @@
Creating a dataset from local files
------------------------------------

When creating a dataset from local files, we **recommend** providing a read function based on the format of the local dataset and passing it to :class:`MapDataset` or :class:`IterDataset` to create the dataset.
When creating a dataset from local files, we **recommend** providing a read function based on the format of the local dataset and passing it to :func:`load_dataset` to create the dataset.
Take the data from the :obj:`waybill_ie` waybill information extraction task as an example:

.. code-block::
from paddlenlp.datasets import MapDataset, IterDataset
from paddlenlp.datasets import load_dataset
def read(data_path):
with open(data_path, 'r', encoding='utf-8') as f:
# skip the header line
next(f)
for line in f():
for line in f:
words, labels = line.strip('\n').split('\t')
words = words.split('\002')
labels = labels.split('\002')
yield {'tokens': words, 'labels': labels}
# MapDataset requires the __getitem__() and __len__() methods, so the generator needs to be converted to a list
map_ds = MapDataset(list(read(data_path)))
# IterDataset requires the __iter__() method, which a generator satisfies
iter_ds = IterDataset(read(data_path))
map_ds = load_dataset(read, data_path='train.txt', lazy=False)
iter_ds = load_dataset(read, data_path='train.txt', lazy=True)
We recommend writing the data reading code as a generator, which makes it easier to build both :class:`MapDataset` and :class:`IterDataset`. We also recommend returning each example as a dictionary, which makes it easier to track the data flow.

In fact, :class:`MapDataset` meets the requirements in the vast majority of cases. Generally, we only consider :class:`IterDataset` when the dataset is too large to be loaded into memory at once. Anyone can easily define their own dataset.

.. note::

Note that only datasets initialized from :class:`DatasetBuilder` can automatically convert the labels in the data into ids (see :doc:`How to contribute a dataset <../community/contribute_dataset>` for the detailed conditions).

A custom dataset like the one in the example above needs to add the label-to-id conversion in its custom convert-to-feature method.

Arguments of the custom read function can be passed directly to :func:`load_dataset` as keyword arguments. For custom datasets, the :attr:`lazy` argument is **required**.
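For example, keyword arguments that match the read function's signature are forwarded to it. The sketch below is illustrative only; the ``max_lines`` parameter is hypothetical and not part of the waybill data format:

.. code-block::

    from paddlenlp.datasets import load_dataset

    def read(data_path, max_lines=None):
        with open(data_path, 'r', encoding='utf-8') as f:
            next(f)  # skip the header line
            for i, line in enumerate(f):
                if max_lines is not None and i >= max_lines:
                    break
                words, labels = line.strip('\n').split('\t')
                yield {'tokens': words.split('\002'), 'labels': labels.split('\002')}

    # data_path and max_lines match parameters of read() and are forwarded to it.
    # lazy is required for custom datasets: False -> MapDataset, True -> IterDataset.
    train_ds = load_dataset(read, data_path='train.txt', max_lines=1000, lazy=False)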

Creating a dataset from :class:`paddle.io.Dataset/IterableDataset`
-------------------------------------------------------------------

@@ -105,6 +106,10 @@
print([data for data in list_ds]) # ['a', 'b', 'c', 'd']
print([data for data in gen_ds]) # [0, 1, 2, 3, 4]
.. note::

Note that a dataset created by passing a **generator** object directly to :class:`IterDataset`, as in the example above, can only be iterated over **once**.
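A minimal sketch of this one-pass behaviour (illustrative only):

.. code-block::

    from paddlenlp.datasets import IterDataset

    one_pass_ds = IterDataset((i for i in range(3)))
    print([data for data in one_pass_ds])  # [0, 1, 2]
    print([data for data in one_pass_ds])  # [] -- the generator is already exhausted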

As with regular Python objects, we can create PaddleNLP datasets from third-party datasets in the same way, as long as the conditions above are met.

For example, a HuggingFace Dataset:
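An illustrative sketch of the idea (it assumes the HuggingFace ``datasets`` library and the SQuAD dataset):

.. code-block::

    from datasets import load_dataset as hf_load_dataset
    from paddlenlp.datasets import MapDataset

    hf_train = hf_load_dataset('squad', split='train')  # provides __getitem__ and __len__
    train_ds = MapDataset(hf_train)
    print(train_ds[0])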
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -43,6 +43,7 @@
Dataset list <data_prepare/dataset_list>
Loading datasets <data_prepare/dataset_load>
Custom datasets <data_prepare/dataset_self_defined>
Data preprocessing <data_prepare/data_preprocess>

.. toctree::
:maxdepth: 2
@@ -66,6 +67,12 @@
High-performance inference deployment <advanced_guide/deployment>
Large-scale distributed training <advanced_guide/distributed_training>

.. toctree::
:maxdepth: 2
:caption: Community contributions

How to contribute a dataset <community/contribute_dataset>

.. toctree::
:maxdepth: 2
:caption: API Reference
84 changes: 65 additions & 19 deletions paddlenlp/datasets/experimental/dataset.py
@@ -18,7 +18,7 @@
import os
import warnings
import sys
import itertools
import inspect

import paddle.distributed as dist
from paddle.io import Dataset, IterableDataset
@@ -54,23 +54,34 @@ def import_main_class(module_path):
return module_main_cls


def load_dataset(path,
def load_dataset(path_or_read_func,
name=None,
data_files=None,
splits=None,
lazy=None,
**kwargs):

reader_cls = import_main_class(path)
if not name:
reader_instance = reader_cls(lazy=lazy, **kwargs)
if inspect.isfunction(path_or_read_func):
assert lazy is not None, "lazy can not be None in custom mode."
kwargs['name'] = name
kwargs['data_files'] = data_files
kwargs['splits'] = splits
custom_kwargs = {}
for name in inspect.signature(path_or_read_func).parameters.keys():
if name in kwargs.keys():
custom_kwargs[name] = kwargs[name]

reader_instance = SimpleBuilder(lazy=lazy, read_func=path_or_read_func)
return reader_instance.read(**custom_kwargs)
else:
reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)

datasets = reader_instance.read_datasets(
data_files=data_files, splits=splits)
reader_cls = import_main_class(path_or_read_func)
if not name:
reader_instance = reader_cls(lazy=lazy, **kwargs)
else:
reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)

return datasets
datasets = reader_instance.read_datasets(
data_files=data_files, splits=splits)
return datasets
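# Illustrative usage of the updated load_dataset (a sketch; the 'msra_ner' dataset
# name and 'train.txt' path are assumptions, not part of this diff):
#   built-in mode: load_dataset('msra_ner', splits=('train', 'test'))
#   custom mode:   load_dataset(read, data_path='train.txt', lazy=False)
# In custom mode `lazy` must be provided, and keyword arguments that match the
# read function's signature (e.g. data_path) are forwarded to it via SimpleBuilder.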


class MapDataset(Dataset):
@@ -208,14 +219,26 @@ def _filter(self, data):

def __iter__(self):
num_samples = 0
self.data, data = itertools.tee(self.data)
for example in data:
if (not self._filter_pipline or
self._filter(self._filter_pipline)) and self._shard_filter(
num_samples=num_samples):
yield self._transform(
example) if self._transform_pipline else example
num_samples += 1
if inspect.isfunction(self.data):
for example in self.data():
if (not self._filter_pipline or
self._filter(self._filter_pipline)
) and self._shard_filter(num_samples=num_samples):
yield self._transform(
example) if self._transform_pipline else example
num_samples += 1
else:
if inspect.isgenerator(self.data):
warnings.warn(
'Receiving generator as data source, data can only be iterated once'
)
for example in self.data:
if (not self._filter_pipline or
self._filter(self._filter_pipline)
) and self._shard_filter(num_samples=num_samples):
yield self._transform(
example) if self._transform_pipline else example
num_samples += 1
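# Note (illustrative comment): when self.data is a callable, each __iter__ call
# obtains a fresh generator, so the dataset can be iterated multiple times; a raw
# generator object is consumed after a single pass, hence the warning above.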

def filter(self, fn):
"""
@@ -449,3 +472,26 @@ def get_vocab(self):
Return vocab file path of the dataset if specified.
"""
return None


class SimpleBuilder(DatasetBuilder):
def __init__(self, lazy, read_func):
self._read = read_func
self.lazy = lazy

def read(self, **kwargs):
if self.lazy:

def generate_examples():
generator = self._read(**kwargs)
for example in generator:
yield example

return IterDataset(generate_examples)
else:
examples = self._read(**kwargs)
if hasattr(examples, '__len__') and hasattr(examples,
'__getitem__'):
return MapDataset(examples)
else:
return MapDataset(list(examples))
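# SimpleBuilder is the internal adapter used by load_dataset in custom mode:
# lazy=True wraps the read function in an IterDataset (re-read on every iteration),
# while lazy=False materializes the examples into a MapDataset.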
