Improved documentation for configuring dataset parameters in the data catalog (#3969)

* Extended data catalog configuration section
* Updated RELEASE.md
* Small updates
* Added some clarifications based on the review comment
* replace property with parameter for consistency
* Added a suggested note
* Applied review comments
* Split sentences and added datasets' names

Signed-off-by: Elena Khaustova <[email protected]>
kedro-org · Jun 28, 2024 · adfc593 · adfc593
1 parent 4f7c2f2
commit adfc593

Showing 2 changed files with 42 additions and 2 deletions.
diff --git a/RELEASE.md b/RELEASE.md
@@ -13,8 +13,8 @@
 * The utility method `get_pkg_version()` is deprecated and will be removed in Kedro 0.20.0.
 
 ## Documentation changes
-
-Extended documentation with an example of logging customisation at runtime
+* Improved documentation for configuring dataset parameters in the data catalog
+* Extended documentation with an example of logging customisation at runtime
 
 ## Community contributions
 

diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md
@@ -30,6 +30,46 @@ shuttles:
   load_args:
     engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
 ```
+
+### Configuring dataset parameters in `catalog.yml`
+
+The dataset configuration in `catalog.yml` is defined as follows:
+1. The top-level key is the dataset name, which identifies the dataset in the catalog: `shuttles` and `weather` in the example below.
+2. The next level includes multiple keys. The first is the mandatory `type` key, which defines the type of dataset to use.
+The remaining keys are dataset parameters and vary depending on the implementation.
+For the full list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+3. Some dataset parameters can be configured further, depending on the libraries underlying the dataset implementation.
+In the example below, the configuration of the `shuttles` dataset includes the `load_args` parameter, whose options are defined by the `pandas` method for loading Excel files.
+Similarly, the `save_args` parameter in the configuration of the `weather` dataset is defined by the `snowpark` `saveAsTable` method.
+To find the options that these parameters accept, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
+For those parameters we provide a reference to the configuration parameters of the underlying library. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you can find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method, which defines the full set of accepted parameters.
+
+```{note}
+Kedro datasets pass all `load_args` and `save_args` directly to the underlying implementation.
+```
+
+The example below shows the configuration of two datasets: `shuttles` of type [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) and `weather` of type [snowflake.SnowparkTableDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.snowflake.SnowparkTableDataset.html).
+
+```yaml
+shuttles: # Dataset name
+  type: pandas.ExcelDataset # Dataset type
+  filepath: data/01_raw/shuttles.xlsx # pandas.ExcelDataset parameter
+  load_args: # pandas.ExcelDataset parameter
+    engine: openpyxl # pandas option for loading Excel files
+
+weather: # Dataset name
+  type: snowflake.SnowparkTableDataset # Dataset type
+  table_name: "weather_data"
+  database: "meteorology"
+  schema: "observations"
+  credentials: snowflake_client
+  save_args: # snowflake.SnowparkTableDataset parameter
+    mode: overwrite # Snowpark saveAsTable input option
+    column_order: name
+    table_type: ''
+```
+
+
 ### Dataset `type`
 
 Kedro supports a range of connectors, for CSV files, Excel spreadsheets, Parquet files, Feather files, HDF5 files, JSON documents, pickled objects, SQL tables, SQL queries, and more. They are supported using libraries such as pandas, PySpark, NetworkX, and Matplotlib.
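
The catalog structure added in the diff above (dataset name, mandatory `type` key, dataset parameters, and `load_args` passed through to the underlying library) can be illustrated with a minimal standard-library sketch. This is not Kedro's implementation: `SketchCSVDataset`, `DATASET_TYPES`, `build_dataset`, the `sketch.CSVDataset` type string, and the `opener` argument are all hypothetical names introduced for illustration, and Python's `csv` module stands in for pandas.

```python
import csv
import io


class SketchCSVDataset:
    """Hypothetical stand-in for a dataset class; not part of Kedro."""

    def __init__(self, filepath, load_args=None):
        self.filepath = filepath
        # load_args are stored untouched: they belong to the underlying reader.
        self.load_args = load_args or {}

    def load(self, opener=open):
        with opener(self.filepath) as handle:
            # Pass load_args straight through to the underlying library call.
            return list(csv.DictReader(handle, **self.load_args))


# Hypothetical registry mapping `type` strings to dataset classes.
DATASET_TYPES = {"sketch.CSVDataset": SketchCSVDataset}


def build_dataset(config):
    """Resolve the mandatory `type` key; the other keys become parameters."""
    params = dict(config)
    dataset_cls = DATASET_TYPES[params.pop("type")]
    return dataset_cls(**params)


# Same shape as a catalog.yml entry, written as a Python dict.
catalog_config = {
    "shuttles": {  # Dataset name
        "type": "sketch.CSVDataset",  # Dataset type
        "filepath": "shuttles.csv",  # Dataset parameter
        "load_args": {"delimiter": ";"},  # Forwarded to csv.DictReader
    }
}

catalog = {name: build_dataset(cfg) for name, cfg in catalog_config.items()}


def fake_open(_path):
    """In-memory replacement for open(), so the sketch needs no file on disk."""
    return io.StringIO("id;engine\n1;ion\n")


rows = catalog["shuttles"].load(opener=fake_open)
print(rows)
```

Because the sketch forwards `load_args` without inspecting them, any keyword accepted by `csv.DictReader` can be set in the configuration, which mirrors the note in the diff about `load_args` / `save_args` being passed to the underlying implementation.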