Commit

Deprecate abstract "DataSet" in favor of "Dataset" (#2746)
* Deprecate abstract "DataSet" in favor of "Dataset"
* Update docs references to abstract "DataSet" class
* Fix `AbstractDataSet` reference to `AbstractDataset`
* Update docs/source/conf.py
* Change remaining `_DEPRECATED_ERROR_CLASSES` to `_DEPRECATED_CLASSES`, update type hints
* Update RELEASE.md
* Reformat kedro/io/__init__.py with Black and isort
* Remove duplicate imports
* Move imports

Signed-off-by: Deepyaman Datta <[email protected]>
deepyaman authored Aug 14, 2023
1 parent ae5f6ff commit 16dd1df

Showing 77 changed files with 274 additions and 254 deletions.
6 changes: 6 additions & 0 deletions RELEASE.md
@@ -29,6 +29,12 @@
## Breaking changes to the API

## Upcoming deprecations for Kedro 0.19.0
* Renamed abstract dataset classes, in accordance with the [Kedro lexicon](https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide#kedro-lexicon). Dataset classes ending with "DataSet" are deprecated and will be removed in 0.19.0. Note that all of the below classes are also importable from `kedro.io`; only the module where they are defined is listed as the location.

| Type | Deprecated Alias | Location |
| -------------------------- | -------------------------- | --------------- |
| `AbstractDataset` | `AbstractDataSet` | `kedro.io.core` |
| `AbstractVersionedDataset` | `AbstractVersionedDataSet` | `kedro.io.core` |
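
A minimal migration sketch, assuming the deprecated aliases keep resolving to the new classes until 0.19.0 and emit a `DeprecationWarning` when accessed:

```python
# Sketch only: both spellings are assumed to point at the same implementation
# during the deprecation window; the old one is expected to warn when accessed.
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from kedro.io import AbstractDataset  # new, preferred spelling
    from kedro.io import AbstractDataSet  # deprecated alias, to be removed in 0.19.0

print(AbstractDataSet is AbstractDataset)  # expected: True (assumption)
print([str(w.message) for w in caught])    # expected: one deprecation message
```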

# Release 0.18.12

2 changes: 1 addition & 1 deletion docs/source/data/data_catalog.md
@@ -783,7 +783,7 @@ gear = cars["gear"].values
The following steps happened behind the scenes when `load` was called:

- The value `cars` was located in the Data Catalog
- The corresponding `AbstractDataSet` object was retrieved
- The corresponding `AbstractDataset` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
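
Roughly speaking, the steps above amount to the following sketch (the filepath is illustrative; in practice it comes from the `cars` entry in `catalog.yml`):

```python
# Equivalent-behaviour sketch of what the catalog did for `cars`; the path is made up.
from kedro_datasets.pandas import CSVDataSet

cars_dataset = CSVDataSet(filepath="data/01_raw/company/cars.csv")  # the AbstractDataset object
cars = cars_dataset.load()  # load() ultimately delegates to pandas.read_csv
gear = cars["gear"].values
```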

18 changes: 9 additions & 9 deletions docs/source/data/kedro_io.md
@@ -1,7 +1,7 @@
# Kedro IO


In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DataSetError](/kedro.io.DataSetError).
In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataset](/kedro.io.AbstractDataset) and [kedro.io.DataSetError](/kedro.io.DataSetError).

## Error handling

@@ -21,9 +21,9 @@ except DataSetError:
```
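
The hunk above shows only the tail of the example. A fuller sketch of the pattern, assuming a catalog in which `cars` has not been registered:

```python
# Error-handling sketch; the empty catalog below is hypothetical.
from kedro.io import DataCatalog, DataSetError

io = DataCatalog(data_sets={})  # no datasets registered, so "cars" cannot be found

try:
    cars_df = io.load("cars")
except DataSetError:
    print("Loading 'cars' raised a DataSetError.")
```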


## AbstractDataSet
## AbstractDataset

To understand what is going on behind the scenes, you should study the [AbstractDataSet interface](/kedro.io.AbstractDataSet). `AbstractDataSet` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation.
To understand what is going on behind the scenes, you should study the [AbstractDataset interface](/kedro.io.AbstractDataset). `AbstractDataset` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
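
As a concrete illustration of that contract, a minimal subclass only has to provide the three private methods (`InMemoryTextDataset` is an invented name for this sketch, not a dataset that ships with Kedro):

```python
# Minimal AbstractDataset subclass sketch.
from typing import Any, Dict

from kedro.io import AbstractDataset


class InMemoryTextDataset(AbstractDataset[str, str]):
    def __init__(self, initial: str = ""):
        self._value = initial

    def _load(self) -> str:
        return self._value

    def _save(self, data: str) -> None:
        self._value = data

    def _describe(self) -> Dict[str, Any]:
        return {"length": len(self._value)}


dataset = InMemoryTextDataset("hello")
print(dataset.load())    # the public load() wraps _load() with uniform error handling
dataset.save("goodbye")  # likewise, save() wraps _save()
```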

If you have a dataset called `parts`, you can make direct calls to it like so:

@@ -33,13 +33,13 @@ parts_df = parts.load()

We recommend using a `DataCatalog` instead (for more details, see [the `DataCatalog` documentation](../data/data_catalog.md)) as it has been designed to make all datasets available to project members.

For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataSet`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).
For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataset`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).


## Versioning

In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also:
1. extend `kedro.io.core.AbstractVersionedDataSet` AND
1. extend `kedro.io.core.AbstractVersionedDataset` AND
2. add `version` namedtuple as an argument to its `__init__` method AND
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation)
@@ -55,10 +55,10 @@ from pathlib import Path, PurePosixPath

import pandas as pd

from kedro.io import AbstractVersionedDataSet
from kedro.io import AbstractVersionedDataset


class MyOwnDataSet(AbstractVersionedDataSet):
class MyOwnDataSet(AbstractVersionedDataset):
def __init__(self, filepath, version, param1, param2=True):
super().__init__(PurePosixPath(filepath), version)
self._param1 = param1
@@ -314,7 +314,7 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`:
| Argument | Required | Supported types | Description |
| ----------------- | ------------------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | Yes | `str` | Path to the folder containing partitioned data. If path starts with the protocol (e.g., `s3://`) then the corresponding `fsspec` concrete filesystem implementation will be used. If protocol is not specified, local filesystem will be used |
| `dataset` | Yes | `str`, `Type[AbstractDataSet]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
| `dataset` | Yes | `str`, `Type[AbstractDataset]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
| `credentials`     | No                             | `Dict[str, Any]`                                  | Protocol-specific options that will be passed to the `fsspec.filesystem` call, for more details see the section below |
| `load_args` | No | `Dict[str, Any]` | Keyword arguments to be passed into `find()` method of the corresponding filesystem implementation |
| `filepath_arg` | No | `str` (defaults to `filepath`) | Argument name of the underlying dataset initializer that will contain a path to an individual partition |
@@ -326,7 +326,7 @@ Dataset definition should be passed into the `dataset` argument of the `Partitio

##### Shorthand notation

Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataSet](/kedro.io.AbstractDataSet).
Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataset](/kedro.io.AbstractDataset).
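
For instance, a sketch of both shorthand forms in Python (the paths and bucket below are placeholders):

```python
# Shorthand sketch: the underlying dataset given as a string and as a class object.
from kedro.io import PartitionedDataSet
from kedro_datasets.pandas import CSVDataSet

by_string = PartitionedDataSet(
    path="s3://my-bucket/partitions/",
    dataset="pandas.CSVDataSet",  # resolved to kedro_datasets.pandas.CSVDataSet
)

by_class = PartitionedDataSet(
    path="data/02_intermediate/partitions/",
    dataset=CSVDataSet,  # any subclass of AbstractDataset is accepted
)
```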

##### Full notation

4 changes: 2 additions & 2 deletions docs/source/deployment/dask.md
@@ -44,14 +44,14 @@ from kedro.framework.hooks.manager import (
_register_hooks_setuptools,
)
from kedro.framework.project import settings
from kedro.io import AbstractDataSet, DataCatalog
from kedro.io import AbstractDataset, DataCatalog
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node
from kedro.runner import AbstractRunner, run_node
from pluggy import PluginManager


class _DaskDataSet(AbstractDataSet):
class _DaskDataSet(AbstractDataset):
"""``_DaskDataSet`` publishes/gets named datasets to/from the Dask
scheduler."""

36 changes: 18 additions & 18 deletions docs/source/extend_kedro/custom_datasets.md
@@ -24,13 +24,13 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

## The anatomy of a dataset

At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataSet](/kedro.io.AbstractDataSet) and provide an implementation for the following abstract methods:
At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataset](/kedro.io.AbstractDataset) and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `_describe`

`AbstractDataSet` is generically typed with an input data type for saving data, and an output data type for loading data.
`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
This typing is optional however, and defaults to `Any` type.

Here is an example skeleton for `ImageDataSet`:
@@ -43,10 +43,10 @@ from typing import Any, Dict

import numpy as np

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -108,11 +108,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def __init__(self, filepath: str):
"""Creates a new instance of ImageDataSet to load / save image data for given filepath.
@@ -169,7 +169,7 @@ Similarly, we can implement the `_save` method as follows:


```python
class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def _save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
# using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
@@ -193,7 +193,7 @@ You can open the file to verify that the data was written back correctly.
The `_describe` method is used for printing purposes. The convention in Kedro is for the method to return a dictionary describing the attributes of the dataset.

```python
class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def _describe(self) -> Dict[str, Any]:
"""Returns a dict that describes the attributes of the dataset."""
return dict(filepath=self._filepath, protocol=self._protocol)
@@ -214,11 +214,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -301,7 +301,7 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l
Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time.
```
To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the
[AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to:
[AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
@@ -320,11 +320,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractVersionedDataSet
from kedro.io import AbstractVersionedDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -391,14 +391,14 @@ The difference between the original `ImageDataSet` and the versioned `ImageDataS
import numpy as np
from PIL import Image

-from kedro.io import AbstractDataSet
-from kedro.io import AbstractDataset
-from kedro.io.core import get_filepath_str, get_protocol_and_path
+from kedro.io import AbstractVersionedDataSet
+from kedro.io import AbstractVersionedDataset
+from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
-class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.

Example:
@@ -537,7 +537,7 @@ These parameters are then passed to the dataset constructor so you can use them
import fsspec
class ImageDataSet(AbstractVersionedDataSet):
class ImageDataSet(AbstractVersionedDataset):
def __init__(
self,
filepath: str,
2 changes: 1 addition & 1 deletion docs/source/extend_kedro/plugins.md
@@ -196,7 +196,7 @@ When you are ready to submit your code:
## Supported Kedro plugins

- [Kedro-Datasets](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets), a collection of all of Kedro's data connectors. These data
connectors are implementations of the `AbstractDataSet`
connectors are implementations of the `AbstractDataset`
- [Kedro-Docker](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker), a tool for packaging and shipping Kedro projects within containers
- [Kedro-Airflow](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow), a tool for converting your Kedro project into an Airflow project
- [Kedro-Viz](https://github.com/kedro-org/kedro-viz), a tool for visualising your Kedro pipelines
4 changes: 2 additions & 2 deletions docs/source/kedro.io.rst
@@ -11,8 +11,8 @@ kedro.io
:toctree:
:template: autosummary/class.rst

kedro.io.AbstractDataSet
kedro.io.AbstractVersionedDataSet
kedro.io.AbstractDataset
kedro.io.AbstractVersionedDataset
kedro.io.CachedDataSet
kedro.io.CachedDataset
kedro.io.DataCatalog
18 changes: 9 additions & 9 deletions docs/source/nodes_and_pipelines/nodes.md
@@ -287,18 +287,18 @@ def report_accuracy(y_pred: pd.Series, y_test: pd.Series):
</details>


The `ChunkWiseDataset` is a variant of the `pandas.CSVDataset` where the main change is to the `_save` method that appends data instead of overwriting it. You need to create a file `src/<package_name>/chunkwise.py` and put this class inside it. Below is an example of the `ChunkWiseCSVDataset` implementation:
The `ChunkWiseCSVDataset` is a variant of the `pandas.CSVDataSet` where the main change is to the `_save` method that appends data instead of overwriting it. You need to create a file `src/<package_name>/chunkwise.py` and put this class inside it. Below is an example of the `ChunkWiseCSVDataset` implementation:

```python
import pandas as pd

from kedro.io.core import (
get_filepath_str,
)
from kedro.extras.datasets.pandas import CSVDataset
from kedro.extras.datasets.pandas import CSVDataSet


class ChunkWiseCSVDataset(CSVDataset):
class ChunkWiseCSVDataset(CSVDataSet):
"""``ChunkWiseCSVDataset`` loads/saves data from/to a CSV file using an underlying
filesystem. It uses pandas to handle the CSV file.
"""
@@ -319,20 +319,20 @@ After that, you need to update the `catalog.yml` to use this new dataset.

```diff
+ y_pred:
+ type: <package_name>.chunkwise.ChunkWiseCSVDataSet
+ type: <package_name>.chunkwise.ChunkWiseCSVDataset
+ filepath: data/07_model_output/y_pred.csv
```

With these changes, when you run `kedro run` in your terminal, you should see `y_pred`` being saved multiple times in the logs as the generator lazily processes and saves the data in smaller chunks.
With these changes, when you run `kedro run` in your terminal, you should see `y_pred` being saved multiple times in the logs as the generator lazily processes and saves the data in smaller chunks.

```
...
INFO Loading data from 'y_train' (MemoryDataset)... data_catalog.py:475
INFO Running node: make_predictions: make_predictions([X_train,X_test,y_train]) -> [y_pred] node.py:331
INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
INFO Completed 2 out of 3 tasks sequential_runner.py:85
INFO Loading data from 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:475
INFO Loading data from 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:475
... runner.py:105
```
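
For reference, the chunk-wise saving shown in the log relies on the node function returning a generator, so each yielded chunk is saved as soon as it is produced. A minimal sketch (the placeholder "model" and chunk size are purely illustrative):

```python
# Generator-node sketch: yielding chunks is what makes "Saving data to 'y_pred'"
# appear several times in the log above. The prediction logic is a placeholder.
from typing import Iterator

import pandas as pd


def make_predictions(
    X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.Series
) -> Iterator[pd.DataFrame]:
    most_frequent = y_train.mode()[0]  # stand-in for a real model
    chunk_size = 100
    for start in range(0, len(X_test), chunk_size):
        chunk = X_test.iloc[start : start + chunk_size]
        yield pd.DataFrame({"y_pred": most_frequent}, index=chunk.index)
```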