Documentation improvement about create a custom dataset #3654

avonarret · 2024-02-26T13:14:34Z

Description

The documentation at Advanced: Tutorial to create a custom dataset describes how custom datasets can be created. However, the documentation still lacks some details on how to "register" your own dataset, so it can be imported and used by the catalog.load method.

Documentation page (if applicable)

https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html

Context

Following the current documentation, I created the following file structure:

src/projectname/datasets
├── __init__.py
└── custom_dataset.py

After configuring catalog.yml as described in the docs and running the catalog.load method in a Jupyter notebook, the custom dataset was not recognized. The following error occurred:

DatasetError: An exception occurred when parsing config for dataset 'catalogname': Class 'projectname.datasets.CustomDataset' not found, is this a typo?

According to @astrojuanlu, a pip install . is required when using this structure (Slack conversation).

Possible steps to consider for the docs:

Create the custom dataset according to the current docs in src/projectname/datasets/datasetname.py, with an __init__.py alongside.
Rule out a possible pitfall by explicitly mentioning the lines that (in my case) were required in __init__.py. This file probably needs:

from .custom_dataset import CustomDataset
__all__ = ["CustomDataset"]

Install the project:

cd projectname
pip install .
In conf/base/catalog.yml, the type should be projectname.datasets.CustomDataset.
Here, projectname and CustomDataset have to be replaced with the corresponding names from the respective project.

With these steps I was able to successfully call catalog.load within the Jupyter notebook.
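To illustrate why the contents of __init__.py can matter for a short type path such as projectname.datasets.CustomDataset, here is a minimal sketch using only the Python standard library. The package names demo_empty and demo_reexport are hypothetical, and the sketch shows plain Python import behaviour; that Kedro resolves the catalog type in the same way is an assumption, not something confirmed in this thread.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_project(src: Path, name: str, init_body: str) -> None:
    """Create <src>/<name>/datasets/ with a custom_dataset.py module."""
    datasets = src / name / "datasets"
    datasets.mkdir(parents=True)
    (src / name / "__init__.py").write_text("")
    (datasets / "__init__.py").write_text(init_body)
    (datasets / "custom_dataset.py").write_text("class CustomDataset:\n    pass\n")


def short_path_resolves(name: str) -> bool:
    """Check whether <name>.datasets exposes CustomDataset directly."""
    module = importlib.import_module(f"{name}.datasets")
    return hasattr(module, "CustomDataset")


REEXPORT = 'from .custom_dataset import CustomDataset\n__all__ = ["CustomDataset"]\n'

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    make_project(src, "demo_empty", "")  # empty __init__.py
    make_project(src, "demo_reexport", REEXPORT)  # __init__.py re-exports the class
    sys.path.insert(0, str(src))
    importlib.invalidate_caches()
    try:
        empty_ok = short_path_resolves("demo_empty")
        reexport_ok = short_path_resolves("demo_reexport")
        # The full module path works regardless of what __init__.py contains.
        full = importlib.import_module("demo_empty.datasets.custom_dataset")
        full_ok = hasattr(full, "CustomDataset")
    finally:
        sys.path.remove(str(src))

print(empty_ok, reexport_ok, full_ok)
```

If this matches the behaviour in a given project, a type that spells out the full module path (for example projectname.datasets.custom_dataset.CustomDataset) would not depend on the re-export in __init__.py.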


astrojuanlu · 2024-02-26T13:40:54Z

Thanks for opening this issue @avonarret ! Adding it to our backlog

noklam · 2024-03-05T13:24:52Z

@avonarret Could you create a minimal project that we can use to reproduce this? I have done this many times and it doesn't require pip install ., so I would be surprised if it does not work now. I tested it quickly with the current main branch and it works as expected.

If the modules are importable, datasets is just one of the modules, so there is nothing special about it. That is, is 'projectname.datasets.CustomDataset' an importable object if you run it from kedro ipython?
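One way to answer that question from a kedro ipython session is a small helper like the one below. It is only a sketch: is_importable is a hypothetical name, not part of Kedro, and it simply splits the dotted path into a module and an attribute and tries to import them.

```python
import importlib


def is_importable(dotted_path: str) -> bool:
    """Return True if 'package.module.Name' can be imported and Name exists."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        # No dot in the path, so there is no module part to import.
        return False
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError as well as failures inside the module.
        return False
    return hasattr(module, attr)
```

For example, is_importable("projectname.datasets.CustomDataset") would be the check for the catalog entry discussed here, with projectname replaced by the real package name.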

laurensversluis · 2024-11-21T16:47:15Z

I am running into the same issue when accessing the catalog while debugging the standard test_project_path function.

astrojuanlu · 2024-11-25T06:51:49Z

@laurensversluis could you give a bit more detail on your setup?

Going back to the original @avonarret description, I am in line with @noklam and I'm not sure one needs pip install . to make custom datasets work. In fact, Kedro adds the appropriate directories to sys.path on startup:

kedro/kedro/framework/startup.py

Lines 137 to 141 in 63d7516

    
def _add_src_to_path(source_dir: Path, project_path: Path) -> None:
    _validate_source_path(source_dir, project_path)
    if str(source_dir) not in sys.path:
        sys.path.insert(0, str(source_dir))
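The effect of that sys.path insertion can be reproduced outside Kedro with the standard library alone. The sketch below is not Kedro code: add_src_to_path is a simplified stand-in that omits the validation step, and demo_sys_path_pkg is a hypothetical package created in a temporary directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_src_to_path(source_dir: Path) -> None:
    """Prepend source_dir to sys.path unless it is already there."""
    if str(source_dir) not in sys.path:
        sys.path.insert(0, str(source_dir))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    package = src / "demo_sys_path_pkg"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("VALUE = 42\n")

    # Before src is on sys.path, the package cannot be found.
    try:
        importlib.import_module("demo_sys_path_pkg")
        importable_before = True
    except ModuleNotFoundError:
        importable_before = False

    # After the insertion, the import succeeds without any pip install.
    add_src_to_path(src)
    importlib.invalidate_caches()
    module = importlib.import_module("demo_sys_path_pkg")
    importable_after = module.VALUE == 42

    sys.path.remove(str(src))

print(importable_before, importable_after)
```

This is consistent with the point made above: once the source directory is on sys.path, packages under it are importable in that process, whether or not the project has been installed.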

astrojuanlu · 2024-11-25T06:53:25Z

Also, eventually we should move towards higher level tools like uv that take care of doing the pip install -e . step for you.

avonarret · 2024-11-29T13:42:16Z

@avonarret Could you create a minimal project that we can use to reproduce this? I have done this many times and it doesn't require pip install ., so I would be surprised if it does not work now. I tested it quickly with the current main branch and it works as expected.

If the modules are importable, datasets is just one of the modules, so there is nothing special about it. That is, is 'projectname.datasets.CustomDataset' an importable object if you run it from kedro ipython?

@noklam Sorry for the late reply. In the course of our internal developments, we have realized that we currently do not need any preprocessing for our needs. We realized that the use of Kedro would be a bit overkill and that we are already well served with Airflow DAGs for our “simpler” tasks. At least in the current project - who knows what's to come.

I have now tried again to reproduce the problem I originally described. I followed the recent (most probably unchanged) documentation at https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html#project-setup again in a new setup environment:

python3 -m venv .venv
source .venv/bin/activate
pip install kedro
kedro new 
> my-project
> all
> yes
cd my-project/
pip install -r requirements.txt
kedro run

The created starter examples ran cleanly without any problems.

Then I created the src/my_project/datasets folder as described in the documentation, including an empty __init__.py as well as image_dataset.py (https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html#the-complete-example), and placed the corresponding configuration (https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html#integration-with-partitioneddataset) in conf/base/catalog.yml.

Result: I was able to successfully load the catalog with the custom dataset with kedro ipython in the CLI, as well as with a Jupyter notebook and %load_ext kedro.ipython, without the error occurring again. So I can't reproduce it anymore, and as of today there was no need for pip install . or for explicitly including the custom dataset class in __init__.py, at least to get the pokemon example custom dataset up and running. Sorry for the inconvenience.

Also, eventually we should move towards higher level tools like uv that take care of doing the pip install -e . step for you.

@astrojuanlu Since I didn't have to run a separate pip install when retesting, this probably won't be necessary for now? But basically I support the approach, should a similar need arise in other situations.

astrojuanlu added this to Kedro Framework Feb 26, 2024

astrojuanlu added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Feb 26, 2024

github-actions bot mentioned this issue Mar 1, 2024

Monthly issue metrics report #3671

Open

merelcht added this to the Improve Kedro documentation used by advanced users milestone Sep 23, 2024

astrojuanlu added support: needs more info and removed Component: Documentation 📄 Issue/PR for markdown and API documentation labels Nov 25, 2024

astrojuanlu removed this from Kedro Framework Nov 25, 2024

github-actions bot removed the support: needs more info label Nov 29, 2024
