Deprecate abstract "DataSet" in favor of "Dataset" #2746

Merged · 15 commits · Aug 14, 2023
Changes from 11 commits
6 changes: 6 additions & 0 deletions RELEASE.md
@@ -20,6 +20,12 @@
## Breaking changes to the API

## Upcoming deprecations for Kedro 0.19.0
* Renamed abstract dataset classes, in accordance with the [Kedro lexicon](https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide#kedro-lexicon). Dataset classes ending with "DataSet" are deprecated and will be removed in 0.19.0. Note that all of the below classes are also importable from `kedro.io`; only the module where they are defined is listed as the location.

| Type | Deprecated Alias | Location |
| -------------------------- | -------------------------- | --------------- |
| `AbstractDataset` | `AbstractDataSet` | `kedro.io.core` |
| `AbstractVersionedDataset` | `AbstractVersionedDataSet` | `kedro.io.core` |
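
As an editor's illustration of the deprecation announced in the entry above (a sketch only, not necessarily the mechanism this PR uses), one way to keep the old names importable while steering users to the new ones is a module-level `__getattr__` (PEP 562) that emits a `DeprecationWarning`:

```python
# Hypothetical deprecation shim; names and placement are assumptions.
import warnings

from kedro.io.core import AbstractDataset, AbstractVersionedDataset

_DEPRECATED_ALIASES = {
    "AbstractDataSet": AbstractDataset,
    "AbstractVersionedDataSet": AbstractVersionedDataset,
}


def __getattr__(name: str):
    # Resolve a deprecated alias lazily and warn the caller.
    if name in _DEPRECATED_ALIASES:
        new_cls = _DEPRECATED_ALIASES[name]
        warnings.warn(
            f"{name} has been renamed to {new_cls.__name__}, "
            "and the old alias will be removed in Kedro 0.19.0",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_cls
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```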

# Release 0.18.11

2 changes: 1 addition & 1 deletion docs/source/data/data_catalog.md
@@ -578,7 +578,7 @@ gear = cars["gear"].values
The following steps happened behind the scenes when `load` was called:

- The value `cars` was located in the Data Catalog
- The corresponding `AbstractDataSet` object was retrieved
- The corresponding `AbstractDataset` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
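
For context, an editor's sketch of the catalog usage these steps describe (the filepath and the `kedro_datasets` import are assumptions):

```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataSet  # assumes kedro-datasets is installed

catalog = DataCatalog({"cars": CSVDataSet(filepath="data/01_raw/cars.csv")})
cars = catalog.load("cars")  # locates "cars", then calls the dataset's load()
gear = cars["gear"].values
```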

18 changes: 9 additions & 9 deletions docs/source/data/kedro_io.md
@@ -1,7 +1,7 @@
# Kedro IO


In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DataSetError](/kedro.io.DataSetError).
In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataset](/kedro.io.AbstractDataset) and [kedro.io.DataSetError](/kedro.io.DataSetError).

## Error handling

@@ -21,9 +21,9 @@ except DataSetError:
```


## AbstractDataSet
## AbstractDataset

To understand what is going on behind the scenes, you should study the [AbstractDataSet interface](/kedro.io.AbstractDataSet). `AbstractDataSet` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation.
To understand what is going on behind the scenes, you should study the [AbstractDataset interface](/kedro.io.AbstractDataset). `AbstractDataset` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` methods, and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
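
To make the contract concrete, here is a minimal editor's sketch of a subclass (the `TextDataset` name and local-file behaviour are hypothetical, illustrative only):

```python
from typing import Any, Dict

from kedro.io import AbstractDataset


class TextDataset(AbstractDataset[str, str]):
    """Hypothetical dataset that reads and writes a local text file."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> str:
        with open(self._filepath, encoding="utf-8") as f:
            return f.read()

    def _save(self, data: str) -> None:
        with open(self._filepath, "w", encoding="utf-8") as f:
            f.write(data)

    def _describe(self) -> Dict[str, Any]:
        # Used when logging information about this dataset instance
        return {"filepath": self._filepath}
```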

If you have a dataset called `parts`, you can make direct calls to it like so:

@@ -33,13 +33,13 @@ parts_df = parts.load()

We recommend using a `DataCatalog` instead (for more details, see [the `DataCatalog` documentation](../data/data_catalog.md)) as it has been designed to make all datasets available to project members.

For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataSet`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).
For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataset`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).


## Versioning

In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also:
1. extend `kedro.io.core.AbstractVersionedDataSet` AND
1. extend `kedro.io.core.AbstractVersionedDataset` AND
2. add `version` namedtuple as an argument to its `__init__` method AND
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation)
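
On the `catalog.yml` side mentioned above, enabling versioning is a single flag; an editor's sketch (the entry name and filepath are assumptions):

```yaml
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv
  versioned: true
```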
@@ -55,10 +55,10 @@ from pathlib import Path, PurePosixPath

import pandas as pd

from kedro.io import AbstractVersionedDataSet
from kedro.io import AbstractVersionedDataset


class MyOwnDataSet(AbstractVersionedDataSet):
class MyOwnDataSet(AbstractVersionedDataset):
def __init__(self, filepath, version, param1, param2=True):
super().__init__(PurePosixPath(filepath), version)
self._param1 = param1
@@ -314,7 +314,7 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`:
| Argument | Required | Supported types | Description |
| ----------------- | ------------------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | Yes | `str` | Path to the folder containing partitioned data. If path starts with the protocol (e.g., `s3://`) then the corresponding `fsspec` concrete filesystem implementation will be used. If protocol is not specified, local filesystem will be used |
| `dataset` | Yes | `str`, `Type[AbstractDataSet]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
| `dataset` | Yes | `str`, `Type[AbstractDataset]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
| `credentials` | No | `Dict[str, Any]` | Protocol-specific options that will be passed to the `fsspec.filesystem` call, for more details see the section below |
| `load_args` | No | `Dict[str, Any]` | Keyword arguments to be passed into `find()` method of the corresponding filesystem implementation |
| `filepath_arg` | No | `str` (defaults to `filepath`) | Argument name of the underlying dataset initializer that will contain a path to an individual partition |
@@ -326,7 +326,7 @@ Dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`.

##### Shorthand notation

Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataSet](/kedro.io.AbstractDataSet).
Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataset](/kedro.io.AbstractDataset).
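
An editor's sketch of shorthand notation in `catalog.yml` (the entry name and bucket path are assumptions):

```yaml
my_partitioned_dataset:
  type: PartitionedDataSet
  path: s3://my-bucket/partitioned-data  # hypothetical bucket
  dataset: pandas.CSVDataSet  # the underlying dataset class, as a string
```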

##### Full notation
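
The body of this subsection is collapsed in the diff. As a rough editor's sketch of what full notation generally looks like (all argument values are assumptions), the `dataset` key takes a dictionary whose `type` names the dataset class and whose remaining keys are passed to that dataset's constructor:

```yaml
my_partitioned_dataset:
  type: PartitionedDataSet
  path: s3://my-bucket/partitioned-data
  dataset:
    type: pandas.CSVDataSet
    load_args:
      sep: ";"
  filename_suffix: ".csv"
```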

4 changes: 2 additions & 2 deletions docs/source/deployment/dask.md
@@ -44,14 +44,14 @@ from kedro.framework.hooks.manager import (
_register_hooks_setuptools,
)
from kedro.framework.project import settings
from kedro.io import AbstractDataSet, DataCatalog
from kedro.io import AbstractDataset, DataCatalog
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node
from kedro.runner import AbstractRunner, run_node
from pluggy import PluginManager


class _DaskDataSet(AbstractDataSet):
class _DaskDataSet(AbstractDataset):
"""``_DaskDataSet`` publishes/gets named datasets to/from the Dask
scheduler."""

36 changes: 18 additions & 18 deletions docs/source/extend_kedro/custom_datasets.md
@@ -24,13 +24,13 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

## The anatomy of a dataset

At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataSet](/kedro.io.AbstractDataSet) and provide an implementation for the following abstract methods:
At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataset](/kedro.io.AbstractDataset) and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `_describe`

`AbstractDataSet` is generically typed with an input data type for saving data, and an output data type for loading data.
`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
This typing is optional, however, and defaults to the `Any` type.

Here is an example skeleton for `ImageDataSet`:
@@ -43,10 +43,10 @@ from typing import Any, Dict

import numpy as np

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -108,11 +108,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def __init__(self, filepath: str):
"""Creates a new instance of ImageDataSet to load / save image data for given filepath.
@@ -169,7 +169,7 @@ Similarly, we can implement the `_save` method as follows:


```python
class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def _save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
# using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
@@ -193,7 +193,7 @@ You can open the file to verify that the data was written back correctly.
The `_describe` method is used for printing purposes. The convention in Kedro is for the method to return a dictionary describing the attributes of the dataset.

```python
class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
def _describe(self) -> Dict[str, Any]:
"""Returns a dict that describes the attributes of the dataset."""
return dict(filepath=self._filepath, protocol=self._protocol)
@@ -214,11 +214,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet
from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -301,7 +301,7 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l
Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time.
```
To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the
[AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to:
[AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
@@ -320,11 +320,11 @@ import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractVersionedDataSet
from kedro.io import AbstractVersionedDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -391,14 +391,14 @@ The difference between the original `ImageDataSet` and the versioned `ImageDataSet`
import numpy as np
from PIL import Image

-from kedro.io import AbstractDataSet
-from kedro.io import AbstractDataset
-from kedro.io.core import get_filepath_str, get_protocol_and_path
+from kedro.io import AbstractVersionedDataSet
+from kedro.io import AbstractVersionedDataset
+from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
-class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.

Example:
@@ -537,7 +537,7 @@ These parameters are then passed to the dataset constructor so you can use them
import fsspec
class ImageDataSet(AbstractVersionedDataSet):
class ImageDataSet(AbstractVersionedDataset):
def __init__(
self,
filepath: str,
2 changes: 1 addition & 1 deletion docs/source/extend_kedro/plugins.md
@@ -196,7 +196,7 @@ When you are ready to submit your code:
## Supported Kedro plugins

- [Kedro-Datasets](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets), a collection of all of Kedro's data connectors. These data
connectors are implementations of the `AbstractDataSet`
connectors are implementations of the `AbstractDataset`
- [Kedro-Docker](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker), a tool for packaging and shipping Kedro projects within containers
- [Kedro-Airflow](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow), a tool for converting your Kedro project into an Airflow project
- [Kedro-Viz](https://github.com/kedro-org/kedro-viz), a tool for visualising your Kedro pipelines
4 changes: 2 additions & 2 deletions docs/source/kedro.io.rst
@@ -11,8 +11,8 @@ kedro.io
:toctree:
:template: autosummary/class.rst

kedro.io.AbstractDataSet
kedro.io.AbstractVersionedDataSet
kedro.io.AbstractDataset
kedro.io.AbstractVersionedDataset
kedro.io.CachedDataSet
kedro.io.CachedDataset
kedro.io.DataCatalog
4 changes: 2 additions & 2 deletions docs/source/nodes_and_pipelines/nodes.md
@@ -203,14 +203,14 @@ import fsspec
import pandas as pd

from kedro.io.core import (
AbstractVersionedDataSet,
AbstractVersionedDataset,
Version,
get_filepath_str,
get_protocol_and_path,
)


class ChunkWiseCSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame]):
class ChunkWiseCSVDataSet(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame]):
"""``ChunkWiseCSVDataSet`` loads/saves data from/to a CSV file using an underlying
filesystem. It uses pandas to handle the CSV file.
"""
8 changes: 4 additions & 4 deletions docs/source/nodes_and_pipelines/run_a_pipeline.md
@@ -57,13 +57,13 @@ If the built-in Kedro runners do not meet your requirements, you can also define

```python
# in src/<package_name>/runner.py
from kedro.io import AbstractDataSet, DataCatalog, MemoryDataSet
from kedro.io import AbstractDataset, DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline
from kedro.runner.runner import AbstractRunner
from pluggy import PluginManager


from kedro.io import AbstractDataSet, DataCatalog, MemoryDataSet
from kedro.io import AbstractDataset, DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline
from kedro.runner.runner import AbstractRunner

@@ -74,13 +74,13 @@ class DryRunner(AbstractRunner):
necessary data exists.
"""

def create_default_data_set(self, ds_name: str) -> AbstractDataSet:
def create_default_data_set(self, ds_name: str) -> AbstractDataset:
"""Factory method for creating the default data set for the runner.

Args:
ds_name: Name of the missing data set
Returns:
An instance of an implementation of AbstractDataSet to be used
An instance of an implementation of AbstractDataset to be used
for all unregistered data sets.

"""
8 changes: 4 additions & 4 deletions kedro/extras/datasets/README.md
**Member Author commented:** The changes across `kedro/extras/datasets` could be reverted--no harm, no foul.
@@ -4,9 +4,9 @@
> `kedro.extras.datasets` is deprecated and will be removed in Kedro 0.19,
> install `kedro-datasets` instead by running `pip install kedro-datasets`.

Welcome to `kedro.extras.datasets`, the home of Kedro's data connectors. Here you will find `AbstractDataSet` implementations created by QuantumBlack and external contributors.
Welcome to `kedro.extras.datasets`, the home of Kedro's data connectors. Here you will find `AbstractDataset` implementations created by QuantumBlack and external contributors.

## What `AbstractDataSet` implementations are supported?
## What `AbstractDataset` implementations are supported?

We support a range of data descriptions, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames and more. We even allow support for working with images.

@@ -16,7 +16,7 @@ These data descriptions are supported with the APIs of `pandas`, `spark`, `networkx`

Here is a full list of [supported data descriptions and APIs](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.html).

## How can I create my own `AbstractDataSet` implementation?
## How can I create my own `AbstractDataset` implementation?


Take a look at our [instructions on how to create your own `AbstractDataSet` implementation](https://kedro.readthedocs.io/en/stable/extend_kedro/custom_datasets.html).
Take a look at our [instructions on how to create your own `AbstractDataset` implementation](https://kedro.readthedocs.io/en/stable/extend_kedro/custom_datasets.html).
2 changes: 1 addition & 1 deletion kedro/extras/datasets/__init__.py
@@ -1,5 +1,5 @@
"""``kedro.extras.datasets`` is where you can find all of Kedro's data connectors.
These data connectors are implementations of the ``AbstractDataSet``.
These data connectors are implementations of the ``AbstractDataset``.

.. warning::

4 changes: 2 additions & 2 deletions kedro/extras/datasets/api/api_dataset.py
@@ -6,14 +6,14 @@
import requests
from requests.auth import AuthBase

from kedro.io.core import AbstractDataSet, DatasetError
from kedro.io.core import AbstractDataset, DatasetError

# NOTE: kedro.extras.datasets will be removed in Kedro 0.19.0.
# Any contribution to datasets should be made in kedro-datasets
# in kedro-plugins (https://github.com/kedro-org/kedro-plugins)


class APIDataSet(AbstractDataSet[None, requests.Response]):
class APIDataSet(AbstractDataset[None, requests.Response]):
"""``APIDataSet`` loads the data from HTTP(S) APIs.
It uses the python requests library: https://requests.readthedocs.io/en/latest/

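
As a usage illustration for the class above, an editor's sketch (the endpoint URL is hypothetical):

```python
from kedro.extras.datasets.api import APIDataSet

data_set = APIDataSet(url="https://example.com/data.json")  # hypothetical endpoint
response = data_set.load()  # returns a requests.Response
payload = response.json()
```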
2 changes: 1 addition & 1 deletion kedro/extras/datasets/biosequence/__init__.py
@@ -1,4 +1,4 @@
"""``AbstractDataSet`` implementation to read/write from/to a sequence file."""
"""``AbstractDataset`` implementation to read/write from/to a sequence file."""

__all__ = ["BioSequenceDataSet"]
