Lazy_dataset is a helper to deal with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g. a mapping function to read data from HDD). When someone iterates over the dataset, all transformations are applied.
Supported transformations (a few are sketched right after this list):

- `dataset.map(map_fn)`: Apply the function `map_fn` to each example (`builtins.map`).
- `dataset[2]`: Get the example at index `2`.
- `dataset['example_id']`: Get the example that has the example id `'example_id'`.
- `dataset[10:20]`: Get a sub-dataset that contains only the examples in the slice 10 to 20.
- `dataset.filter(filter_fn, lazy=True)`: Drop examples for which `filter_fn(example)` is false (`builtins.filter`).
- `dataset.concatenate(*others)`: Concatenate two or more datasets (`numpy.concatenate`).
- `dataset.intersperse(*others)`: Combine two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
- `dataset.zip(*others)`: Zip two or more datasets.
- `dataset.shuffle(reshuffle=False)`: Shuffle the dataset. When `reshuffle` is `True`, the data is reshuffled each time you iterate over it.
- `dataset.tile(reps, shuffle=False)`: Repeat the dataset `reps` times and concatenate the copies (`numpy.tile`).
- `dataset.cycle()`: Repeat the dataset endlessly (`itertools.cycle`, but without caching).
- `dataset.groupby(group_fn)`: Group examples together. In contrast to `itertools.groupby`, a preceding sort is not necessary, as in pandas (`itertools.groupby`, `pandas.DataFrame.groupby`).
- `dataset.sort(key_fn, sort_fn=sorted)`: Sort the examples by the values `key_fn(example)` (`list.sort`).
- `dataset.batch(batch_size, drop_last=False)`: Batch `batch_size` examples together as a list; usually followed by a map (`tensorflow.data.Dataset.batch`).
- `dataset.random_choice()`: Get a random example (`numpy.random.choice`).
- `dataset.cache()`: Cache the examples in RAM (similar to ESPnet's `keep_all_data_on_mem`).
- `dataset.diskcache()`: Cache the examples in a cache directory on the local filesystem (useful on clusters with slow network filesystems).
- ...
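As a quick illustration of the indexing entries in this list, here is a minimal sketch with a made-up toy dataset (it assumes `lazy_dataset.new` accepts arbitrary values as examples, not only dicts):

>>> import lazy_dataset
>>> ds = lazy_dataset.new({'a': 1, 'b': 2, 'c': 3, 'd': 4})
>>> ds[2]  # example at index 2
3
>>> ds['a']  # example with the example id 'a'
1
>>> list(ds[1:3])  # slicing returns a sub-dataset
[2, 3]

A full walkthrough with dict examples follows: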
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
... 'example_id_1': {
... 'observation': [1, 2, 3],
... 'label': 1,
... },
... 'example_id_2': {
... 'observation': [4, 5, 6],
... 'label': 2,
... },
... 'example_id_3': {
... 'observation': [7, 8, 9],
... 'label': 3,
... },
... }
>>> for example_id, example in examples.items():
... example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
... example['label'] *= 10
... return example
>>> ds = ds.map(transform)
>>> for example in ds:
... print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
... print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
DictDataset(len=3)
MapDataset(_pickle.loads)
MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
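The remaining transformations compose in the same way. Another minimal sketch with made-up integer examples; the expected outputs follow from the descriptions in the transformation list above:

>>> import lazy_dataset
>>> ds = lazy_dataset.new({'a': 1, 'b': 2, 'c': 3, 'd': 4})
>>> list(ds.batch(2))
[[1, 2], [3, 4]]
>>> list(ds.batch(3, drop_last=True))  # the incomplete last batch is dropped
[[1, 2, 3]]
>>> list(ds.sort(lambda example: -example))  # sort by key_fn(example)
[4, 3, 2, 1]
>>> list(ds.tile(2))
[1, 2, 3, 4, 1, 2, 3, 4]
>>> other = lazy_dataset.new({'e': 5, 'f': 6})
>>> list(ds.concatenate(other))
[1, 2, 3, 4, 5, 6]
>>> len(ds.shuffle())  # order is random, the length is unchanged
4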
See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
If you just want to use it, install it directly with pip:
pip install lazy_dataset
If you want to make changes or need the most recent version, clone the repository and install it as follows:
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
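As a quick smoke test for either installation, creating a dataset from any small dict should print the dataset representation seen in the example above:

>>> import lazy_dataset
>>> lazy_dataset.new({'a': 1, 'b': 2})
DictDataset(len=2)
MapDataset(_pickle.loads)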