Improved documentation for configuring dataset parameters in the data catalog #3969

ElenaKhaustova · 2024-06-26T17:20:47Z

Description

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>

noklam · 2024-06-27T04:24:43Z

docs/source/data/data_catalog.md

+2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
+The rest of the keys are dataset properties and vary depending on the implementation.
+To get the extensive list of dataset properties, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+3. Some dataset properties can be further configured depending on the libraries underlying the dataset implementation.


What's an example of this, and how is it different from 2.? Where can users find information about this?

The example is in the following line; I've extended it a bit for clarity. The difference is that some parameters require referring to the underlying library's methods to get the full set of accepted parameters. This is not clear to some users, so we wanted to show it explicitly in the docs.
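To make the three levels concrete, here is a minimal sketch that mirrors a `catalog.yml` as a Python dictionary and separates each entry into its parts. The entries are trimmed and illustrative (the file paths, table name and argument values are assumptions, and a real entry may need further required parameters), and `split_entry` is a hypothetical helper, not part of Kedro.

```python
# Python mirror of a hypothetical catalog.yml; every value here is an illustrative assumption.
catalog = {
    "shuttles": {  # level 1: dataset name, used as the identifier in the catalog
        "type": "pandas.CSVDataset",  # level 2: the mandatory `type` key
        "filepath": "data/01_raw/shuttles.csv",  # level 2: a dataset parameter
        "load_args": {"sep": ","},  # level 3: contents defined by the underlying pandas reader
    },
    "weather": {
        "type": "snowflake.SnowparkTableDataset",
        "table_name": "weather_data",
        "save_args": {"mode": "overwrite"},  # level 3: contents defined by snowpark's saveAsTable
    },
}


def split_entry(name, entry):
    """Separate one catalog entry into name, type, dataset parameters and library arguments."""
    if "type" not in entry:
        raise KeyError(f"Dataset '{name}' is missing the mandatory 'type' key")
    # load_args / save_args are dataset parameters too, but their accepted keys
    # come from the underlying library, so they are reported separately.
    library_args = {k: v for k, v in entry.items() if k in ("load_args", "save_args")}
    dataset_params = {
        k: v for k, v in entry.items() if k != "type" and k not in library_args
    }
    return {
        "name": name,
        "type": entry["type"],
        "dataset_params": dataset_params,
        "library_args": library_args,
    }


for name, entry in catalog.items():
    print(split_entry(name, entry))
```

The point of the split is the one made in the reply above: keys under `dataset_params` are documented in the dataset's `__init__`, while the keys inside `load_args` / `save_args` are documented by the underlying library.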

#3919 (comment)


datajoely · 2024-06-27T09:30:49Z

docs/source/data/data_catalog.md

+The dataset configuration in `catalog.yml` is defined as follows:
+1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
+2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
+The rest of the keys are dataset properties and vary depending on the implementation.


I think it would be cool to do something like

Important
Kedro datasets make every intention to not make any assumptions and delegate any of the load_args / save_args directly to the underlying implementation.
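A minimal sketch of what this delegation means in practice, assuming a hypothetical `SketchCSVDataset` class built on the standard-library `csv` module. It is not Kedro's implementation; it only illustrates the pattern of forwarding `load_args` untouched to the underlying reader.

```python
import csv
import io


class SketchCSVDataset:
    """Hypothetical stand-in for a dataset class; not Kedro's implementation."""

    def __init__(self, text, load_args=None):
        self._text = text
        # Copy the arguments so later changes by the caller do not leak in.
        self._load_args = dict(load_args or {})

    def load(self):
        # No interpretation of load_args happens here: they are passed
        # straight through as keyword arguments to csv.DictReader.
        reader = csv.DictReader(io.StringIO(self._text), **self._load_args)
        return list(reader)


dataset = SketchCSVDataset("id;name\n1;Apollo\n", load_args={"delimiter": ";"})
rows = dataset.load()
print(rows)
```

Because the class makes no assumptions about the arguments, the set of valid keys is whatever the underlying reader accepts, which is why the docs need to point users to that library's reference.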


ankatiyar

small suggestions but LGTM! 🚀

ankatiyar · 2024-06-27T14:21:47Z

docs/source/data/data_catalog.md

@@ -36,20 +36,24 @@ shuttles:
 The dataset configuration in `catalog.yml` is defined as follows:
 1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
 2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.


Suggested change

-2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
+2. The next level includes multiple keys. The first one is the mandatory key, `type`, which defines the type of dataset to use.

ankatiyar · 2024-06-27T14:26:43Z

docs/source/data/data_catalog.md

+The rest of the keys are dataset parameters and vary depending on the implementation.
+To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
+In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.


nit: I would also add the dataset names here, e.g. `shuttles` when mentioning `load_args` and `weather` for the `save_args` example, and break this into two sentences, as shorter sentences are easier to read.

merelcht

I left some small grammatical suggestions, but otherwise looks all good 👍

merelcht · 2024-06-27T15:55:58Z

docs/source/data/data_catalog.md

+### Configuring dataset parameters in `catalog.yml`
+
+The dataset configuration in `catalog.yml` is defined as follows:
+1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.


Suggested change

-1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
+1. The top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.

merelcht · 2024-06-27T15:56:40Z

docs/source/data/data_catalog.md

+1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
+2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
+The rest of the keys are dataset parameters and vary depending on the implementation.
+To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.


Suggested change

-To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+To get the extensive list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.

merelcht · 2024-06-27T15:57:14Z

docs/source/data/data_catalog.md

+To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
+In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.
+To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.


Suggested change

-To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
+To get the extensive list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.

merelcht · 2024-06-27T15:58:33Z

docs/source/data/data_catalog.md

+3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
+In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.
+To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
+For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.


Suggested change

-For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
+For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you can find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
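Since the accepted `load_args` keys are defined by the underlying method, one way to sanity-check a configuration is to compare its keys against that method's signature. The sketch below assumes a hypothetical `read_table_stub` function standing in for the real library reader, and `unsupported_args` is an illustrative helper, not a Kedro or pandas API.

```python
import inspect


def read_table_stub(path, sheet_name=0, header=0, engine=None):
    """Hypothetical stand-in for an underlying library reader."""
    return {"path": path, "sheet_name": sheet_name, "header": header, "engine": engine}


def unsupported_args(func, args):
    """Return the keys in `args` that `func` would not accept as keyword arguments."""
    params = inspect.signature(func).parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        # func takes **kwargs, so its signature alone cannot rule anything out.
        return []
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {p.name for p in params if p.kind in keyword_kinds}
    return sorted(key for key in args if key not in accepted)


# "sheetname" is a deliberate typo for "sheet_name".
load_args = {"engine": "openpyxl", "sheetname": "Sheet1"}
print(unsupported_args(read_table_stub, load_args))
```

A signature check like this only catches misspelled or unknown keys; the meaning and allowed values of each argument still have to come from the library reference linked in the dataset documentation.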

merelcht · 2024-06-27T15:59:04Z

docs/source/data/data_catalog.md

+For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
+
+```{note}
+Kedro datasets make every intention to not make any assumptions and delegate any of the `load_args` / `save_args` directly to the underlying implementation.


Suggested change

-Kedro datasets make every intention to not make any assumptions and delegate any of the `load_args` / `save_args` directly to the underlying implementation.
+Kedro datasets delegate any of the `load_args` / `save_args` directly to the underlying implementation.


merelcht

LGTM! 👍

ElenaKhaustova added 3 commits June 26, 2024 18:16

Extended data catalog configuration section

01af7ce

Signed-off-by: Elena Khaustova <[email protected]>

Updated RELEASE.md

33ed052

Signed-off-by: Elena Khaustova <[email protected]>

Small updates

a5cbd7e

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova marked this pull request as ready for review June 26, 2024 18:01

ElenaKhaustova requested review from yetudada, astrojuanlu and merelcht as code owners June 26, 2024 18:01

ElenaKhaustova requested review from datajoely, ankatiyar and DimedS June 26, 2024 18:01

noklam reviewed Jun 27, 2024


Added some clarifications based on the review comment

c631616

Signed-off-by: Elena Khaustova <[email protected]>

datajoely reviewed Jun 27, 2024


ElenaKhaustova added 3 commits June 27, 2024 14:26

Merge branch 'main' into docs/3919-dataset-configuration

160c11e

replace property with parameter for consistency

08482c0

Signed-off-by: Elena Khaustova <[email protected]>

Added a suggested note

29c732d

Signed-off-by: Elena Khaustova <[email protected]>

ankatiyar approved these changes Jun 27, 2024


merelcht reviewed Jun 27, 2024


ElenaKhaustova added 2 commits June 27, 2024 23:41

Applied review comments

735d25b

Signed-off-by: Elena Khaustova <[email protected]>

Split sentences and added datasets' names

0931b97

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova requested review from datajoely and merelcht June 27, 2024 22:55

merelcht approved these changes Jun 28, 2024


Merge branch 'main' into docs/3919-dataset-configuration

df95b42

ElenaKhaustova merged commit adfc593 into main Jun 28, 2024
10 checks passed

ElenaKhaustova deleted the docs/3919-dataset-configuration branch June 28, 2024 11:45
