Improved documentation for configuring dataset parameters in the data catalog (#3969)

* Extended data catalog configuration section

Signed-off-by: Elena Khaustova <[email protected]>

* Updated RELEASE.md

Signed-off-by: Elena Khaustova <[email protected]>

* Small updates

Signed-off-by: Elena Khaustova <[email protected]>

* Added some clarifications based on the review comment

Signed-off-by: Elena Khaustova <[email protected]>

* replace property with parameter for consistency

Signed-off-by: Elena Khaustova <[email protected]>

* Added a suggested note

Signed-off-by: Elena Khaustova <[email protected]>

* Applied review comments

Signed-off-by: Elena Khaustova <[email protected]>

* Split sentences and added datasets' names

Signed-off-by: Elena Khaustova <[email protected]>

---------

Signed-off-by: Elena Khaustova <[email protected]>
ElenaKhaustova authored Jun 28, 2024
1 parent 4f7c2f2 commit adfc593
Showing 2 changed files with 42 additions and 2 deletions.
4 changes: 2 additions & 2 deletions RELEASE.md
@@ -13,8 +13,8 @@
* The utility method `get_pkg_version()` is deprecated and will be removed in Kedro 0.20.0.

## Documentation changes

Extended documentation with an example of logging customisation at runtime
* Improved documentation for configuring dataset parameters in the data catalog
* Extended documentation with an example of logging customisation at runtime

## Community contributions

40 changes: 40 additions & 0 deletions docs/source/data/data_catalog.md
@@ -30,6 +30,46 @@ shuttles:
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
```
### Configuring dataset parameters in `catalog.yml`

The dataset configuration in `catalog.yml` is defined as follows:
1. The top-level key is the dataset name, which identifies the dataset in the catalog: `shuttles` and `weather` in the example below.
2. The next level contains several keys. The first is the mandatory `type` key, which defines the type of dataset to use.
   The remaining keys are dataset parameters, which vary with the dataset implementation.
   For the full list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
3. Some dataset parameters can be configured further, depending on the libraries underlying the dataset implementation.
   In the example below, the configuration of the `shuttles` dataset includes the `load_args` parameter, whose contents are defined by the `pandas` options for loading Excel files.
   Likewise, the `save_args` parameter in the configuration of the `weather` dataset is defined by the `snowpark` `saveAsTable` method.
   To find the options that such a parameter accepts, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
   For those parameters, the documentation links to the configuration parameters of the underlying library. For example, the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) references the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method, which defines the full set of accepted parameters.

```{note}
Kedro datasets pass any `load_args` and `save_args` directly to the underlying implementation.
```
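
The delegation described in the note can be pictured with a toy class. This is a minimal sketch, not Kedro's implementation: `ToyCSVDataset` is a hypothetical name, and it forwards `load_args` to the standard library's `csv.reader` rather than to pandas, so that the example runs without extra dependencies.

```python
import csv
import tempfile
from pathlib import Path


class ToyCSVDataset:
    """Hypothetical dataset that forwards load_args to csv.reader unchanged."""

    def __init__(self, filepath, load_args=None):
        self._filepath = Path(filepath)
        # Keep the arguments as given; the dataset does not interpret them.
        self._load_args = dict(load_args or {})

    def load(self):
        with self._filepath.open(newline="") as handle:
            # Every key in load_args becomes a keyword argument of the
            # underlying reader, which is what delegation means here.
            return list(csv.reader(handle, **self._load_args))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "shuttles.csv"
    path.write_text("id;engine\n1;ion\n")
    rows = ToyCSVDataset(path, load_args={"delimiter": ";"}).load()
```

In this sketch an unsupported key fails inside `csv.reader`, not inside the dataset class, which is why the underlying library's documentation is the reference for valid options.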

The example below shows the configuration of two datasets: `shuttles` of type [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) and `weather` of type [snowflake.SnowparkTableDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.snowflake.SnowparkTableDataset.html).

```yaml
shuttles: # Dataset name
  type: pandas.ExcelDataset # Dataset type
  filepath: data/01_raw/shuttles.xlsx # pandas.ExcelDataset parameter
  load_args: # pandas.ExcelDataset parameter
    engine: openpyxl # Pandas option for loading Excel files

weather: # Dataset name
  type: snowflake.SnowparkTableDataset # Dataset type
  table_name: "weather_data"
  database: "meteorology"
  schema: "observations"
  credentials: snowflake_client
  save_args: # snowflake.SnowparkTableDataset parameter
    mode: overwrite # Snowpark saveAsTable input option
    column_order: name
    table_type: ''
```
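
The split between the mandatory `type` key and the dataset parameters can also be expressed in code. The following is a minimal sketch, assuming the `shuttles` entry has already been parsed into a Python dictionary; `split_entry` is a hypothetical helper for illustration and is not part of Kedro.

```python
# The `shuttles` entry as a dictionary equivalent to its YAML definition.
catalog = {
    "shuttles": {
        "type": "pandas.ExcelDataset",
        "filepath": "data/01_raw/shuttles.xlsx",
        "load_args": {"engine": "openpyxl"},
    },
}


def split_entry(entry):
    """Separate the mandatory `type` key from the dataset parameters."""
    if "type" not in entry:
        raise ValueError("a dataset entry must define `type`")
    # Everything except `type` is a parameter of the dataset itself.
    params = {key: value for key, value in entry.items() if key != "type"}
    return entry["type"], params


dataset_type, params = split_entry(catalog["shuttles"])
```

Here `params` holds `filepath` and `load_args`, the keys that would be handed to the dataset's `__init__` method.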
### Dataset `type`

Kedro supports a range of connectors, for CSV files, Excel spreadsheets, Parquet files, Feather files, HDF5 files, JSON documents, pickled objects, SQL tables, SQL queries, and more. They are supported using libraries such as pandas, PySpark, NetworkX, and Matplotlib.