From adfc593bcd2f1b74676e7ab7c1a3b9c168b7257f Mon Sep 17 00:00:00 2001
From: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Date: Fri, 28 Jun 2024 12:45:32 +0100
Subject: [PATCH] Improved documentation for configuring dataset parameters in the data catalog (#3969)

* Extended data catalog configuration section

Signed-off-by: Elena Khaustova

* Updated RELEASE.md

Signed-off-by: Elena Khaustova

* Small updates

Signed-off-by: Elena Khaustova

* Added some clarifications based on the review comment

Signed-off-by: Elena Khaustova

* replace property with parameter for consistency

Signed-off-by: Elena Khaustova

* Added a suggested note

Signed-off-by: Elena Khaustova

* Applied review comments

Signed-off-by: Elena Khaustova

* Split sentences and added datasets' names

Signed-off-by: Elena Khaustova

---------

Signed-off-by: Elena Khaustova
---
 RELEASE.md                       |  4 ++--
 docs/source/data/data_catalog.md | 40 ++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/RELEASE.md b/RELEASE.md
index 0fbaa2af58..f6b37a2606 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -13,8 +13,8 @@
 * The utility method `get_pkg_version()` is deprecated and will be removed in Kedro 0.20.0.

 ## Documentation changes
-
-Extended documentation with an example of logging customisation at runtime
+* Improved documentation for configuring dataset parameters in the data catalog
+* Extended documentation with an example of logging customisation at runtime

 ## Community contributions

diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md
index 1bbc336651..1d597f1128 100644
--- a/docs/source/data/data_catalog.md
+++ b/docs/source/data/data_catalog.md
@@ -30,6 +30,46 @@ shuttles:
   load_args:
     engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
 ```
+
+### Configuring dataset parameters in `catalog.yml`
+
+The dataset configuration in `catalog.yml` is defined as follows:
+1. The top-level key is the dataset name, which identifies the dataset in the catalog: `shuttles` and `weather` in the example below.
+2. The next level contains multiple keys. The first is the mandatory `type` key, which defines the type of dataset to use.
+The remaining keys are dataset parameters and vary depending on the implementation.
+For the full list of dataset parameters, see {py:mod}`The kedro-datasets package documentation ` and navigate to the `__init__` method of the target dataset.
+3. Some dataset parameters can be configured further, depending on the libraries underlying the dataset implementation.
+In the example below, the configuration of the `shuttles` dataset includes the `load_args` parameter, which is defined by the `pandas` option for loading Excel files.
+The `save_args` parameter in the configuration of the `weather` dataset is defined by the `snowpark` `saveAsTable` method.
+For the full list of dataset parameters, see {py:mod}`The kedro-datasets package documentation ` and navigate to the target parameter in the `__init__` definition for the dataset.
+For those parameters we provide a reference to the configuration parameters of the underlying library. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you can find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method, which defines the full set of accepted parameters.
+
+```{note}
+Kedro datasets delegate any of the `load_args` / `save_args` directly to the underlying implementation.
+```
+
+The example below shows the configuration of two datasets: `shuttles` of type [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) and `weather` of type [snowflake.SnowparkTableDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.snowflake.SnowparkTableDataset.html).
+
+```yaml
+shuttles: # Dataset name
+  type: pandas.ExcelDataset # Dataset type
+  filepath: data/01_raw/shuttles.xlsx # pandas.ExcelDataset parameter
+  load_args: # pandas.ExcelDataset parameter
+    engine: openpyxl # pandas option for loading Excel files
+
+weather: # Dataset name
+  type: snowflake.SnowparkTableDataset # Dataset type
+  table_name: "weather_data"
+  database: "meteorology"
+  schema: "observations"
+  credentials: snowflake_client
+  save_args: # snowflake.SnowparkTableDataset parameter
+    mode: overwrite # Snowpark saveAsTable input option
+    column_order: name
+    table_type: ''
+```
+
+
 ### Dataset `type`

 Kedro supports a range of connectors, for CSV files, Excel spreadsheets, Parquet files, Feather files, HDF5 files, JSON documents, pickled objects, SQL tables, SQL queries, and more. They are supported using libraries such as pandas, PySpark, NetworkX, and Matplotlib.
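
The note added by this patch says that `load_args` / `save_args` are delegated directly to the underlying implementation. The following is a minimal sketch of that idea, not Kedro's actual code: `SketchJSONDataset`, `build_dataset`, and the `sketch.JSONDataset` type name are hypothetical, and the standard library's `json.loads` stands in for a library call such as `pandas.read_excel` so the example runs without extra dependencies.

```python
import json
from decimal import Decimal
from typing import Any, Callable, Dict, Optional


class SketchJSONDataset:
    """Hypothetical stand-in for a dataset class; not part of kedro-datasets."""

    def __init__(self, text: str, load_args: Optional[Dict[str, Any]] = None) -> None:
        self._text = text
        # Copy the options so later changes by the caller do not leak in.
        self._load_args = dict(load_args or {})

    def load(self) -> Any:
        # Delegation: every key in load_args is forwarded unchanged as a
        # keyword argument to the underlying library call (json.loads here).
        return json.loads(self._text, **self._load_args)


def build_dataset(
    name: str,
    config: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> Any:
    """Split a catalog entry into the mandatory `type` key and dataset parameters."""
    params = dict(config)
    try:
        dataset_type = params.pop("type")
    except KeyError:
        raise ValueError(f"Dataset '{name}' is missing the mandatory 'type' key") from None
    # All remaining keys are passed to the dataset's __init__ as parameters.
    return registry[dataset_type](**params)


# A catalog entry expressed as the dictionary a YAML parser would produce.
catalog_entry = {
    "type": "sketch.JSONDataset",          # assumed type name, for illustration only
    "text": '{"temperature": 21.5}',       # dataset parameter
    "load_args": {"parse_float": Decimal},  # option defined by json.loads, not by the dataset
}

dataset = build_dataset("weather", catalog_entry, {"sketch.JSONDataset": SketchJSONDataset})
loaded = dataset.load()
print(loaded)
```

Because the dataset class never inspects the contents of `load_args`, the valid keys are whatever the underlying function accepts, which is why the documentation points readers to the library reference for the full list.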