Improved documentation for configuring dataset parameters in the data catalog #3969

Merged
merged 10 commits into from
Jun 28, 2024
4 changes: 2 additions & 2 deletions RELEASE.md
@@ -13,8 +13,8 @@
* The utility method `get_pkg_version()` is deprecated and will be removed in Kedro 0.20.0.

## Documentation changes

* Improved documentation for configuring dataset parameters in the data catalog
* Extended documentation with an example of logging customisation at runtime

## Community contributions

40 changes: 40 additions & 0 deletions docs/source/data/data_catalog.md
@@ -30,6 +30,46 @@
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
```

### Configuring dataset parameters in `catalog.yml`

The dataset configuration in `catalog.yml` is defined as follows:
1. The top-level key is the dataset name, which identifies the dataset in the catalog: `shuttles` and `weather` in the example below.
2. The next level contains the dataset's keys. The first is the mandatory `type` key, which defines the dataset to use.
The remaining keys are dataset parameters, which vary with the dataset implementation.
For the full list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
3. Some dataset parameters can be configured further, depending on the libraries underlying the dataset implementation.
In the example below, the configuration of the `shuttles` dataset includes the `load_args` parameter, whose contents are `pandas` options for loading Excel files.
Similarly, the `save_args` parameter in the configuration of the `weather` dataset is defined by the Snowpark `saveAsTable` method.
To find the options that such a parameter accepts, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
For those parameters we provide a reference to the configuration parameters of the underlying library. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you can find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method, which defines the full set of accepted parameters.

```{note}
Kedro datasets delegate `load_args` and `save_args` directly to the underlying implementation.
```
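
The delegation described in the note can be pictured with a minimal Python sketch. `ToyExcelDataset` and `fake_read_excel` are hypothetical names invented for this illustration and are not part of Kedro or kedro-datasets. The sketch only assumes that a dataset stores `load_args` and unpacks them as keyword arguments when calling the underlying reader.

```python
from typing import Any, Callable, Dict, Optional


def fake_read_excel(filepath: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a library reader such as pandas.read_excel.

    It returns what it received so the forwarding is visible.
    """
    return {"filepath": filepath, "options": kwargs}


class ToyExcelDataset:
    """Hypothetical dataset; not the real kedro-datasets implementation."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        reader: Callable[..., Any] = fake_read_excel,
    ) -> None:
        self._filepath = filepath
        # Copy so later changes to the caller's dict do not leak in.
        self._load_args = dict(load_args or {})
        self._reader = reader

    def load(self) -> Any:
        # load_args are forwarded as keyword arguments to the reader.
        return self._reader(self._filepath, **self._load_args)


dataset = ToyExcelDataset(
    filepath="data/01_raw/shuttles.xlsx",
    load_args={"engine": "openpyxl"},
)
result = dataset.load()
```

In this toy version, whatever appears under `load_args` reaches the reader untouched, which is why the accepted options are documented by the underlying library rather than by the dataset itself.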

The example below shows the configuration of two datasets: `shuttles`, of type [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html), and `weather`, of type [snowflake.SnowparkTableDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.snowflake.SnowparkTableDataset.html).

```yaml
shuttles: # Dataset name
  type: pandas.ExcelDataset # Dataset type
  filepath: data/01_raw/shuttles.xlsx # pandas.ExcelDataset parameter
  load_args: # pandas.ExcelDataset parameter
    engine: openpyxl # Pandas option for loading Excel files

weather: # Dataset name
  type: snowflake.SnowparkTableDataset # Dataset type
  table_name: "weather_data"
  database: "meteorology"
  schema: "observations"
  credentials: snowflake_client
  save_args: # snowflake.SnowparkTableDataset parameter
    mode: overwrite # Snowpark saveAsTable input option
    column_order: name
    table_type: ''
```
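
To make the structure of an entry concrete, the following Python sketch mirrors the `shuttles` entry from the YAML above as a plain dictionary and separates the mandatory `type` key from the remaining dataset parameters. The `split_entry` helper is hypothetical and written only for this illustration; it is not how Kedro itself parses the catalog.

```python
from typing import Any, Dict, Tuple

# The `shuttles` entry from the YAML above, written as a Python dictionary.
catalog_config: Dict[str, Dict[str, Any]] = {
    "shuttles": {
        "type": "pandas.ExcelDataset",
        "filepath": "data/01_raw/shuttles.xlsx",
        "load_args": {"engine": "openpyxl"},
    },
}


def split_entry(entry: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: separate `type` from the dataset parameters."""
    if "type" not in entry:
        raise KeyError("every catalog entry needs a 'type' key")
    # Everything except `type` is treated as a dataset parameter.
    params = {key: value for key, value in entry.items() if key != "type"}
    return entry["type"], params


dataset_type, dataset_params = split_entry(catalog_config["shuttles"])
```

Here `dataset_type` holds the dataset type string, while `dataset_params` holds `filepath` and the nested `load_args`, matching the three levels described in the list above.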


### Dataset `type`

Kedro supports a range of connectors for CSV files, Excel spreadsheets, Parquet files, Feather files, HDF5 files, JSON documents, pickled objects, SQL tables, SQL queries, and more. They are supported using libraries such as pandas, PySpark, NetworkX, and Matplotlib.