Clean up `catalog.datasets` and `catalog._data_sets` #2999

noklam · 2023-09-01T12:28:47Z

Description

In #2998, an example is added to access metadata in data catalog. This is useful when user need to extend Kedro, i.e. kedro-viz or plugin developers. We provide an example as follow

class MetadataHook:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
    ):
        for dataset_name, dataset in catalog.datasets.__dict__.items():
            print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

It used an internal __dict__ which is not ideal. On the other hand, we also want to improve the usability of the class in standalone or interactive mode

Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721

Context

There are 3 ways to access dataset.

catalog.datasets
catalog._data_sets
catalog._get_dataset(name)

They are used inconsistently in the codebase and all behavior slightly differently. catalog.datasets is the only "public" one which allow user to interact with the "read-only" FrozenDataset.

However, there is caveat to use catalog.datasets. For example, if you have transcoding dataset, it get converted to some internal non-readable string.

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

With this hook, if you have a dataset name X_train@spark, the dataset_name will be X_train__spark which is unexpected when you try to access the dictionary. On the other hand, catalog.datasets is not subscriptable, so you cannot access it by index or name.

This is due to the addition of the catalog.dataset_name syntax, you cannot have a Python property with @ or # in it, and it should be a valid Python name (mostly designed for interactive mode). The conversion happens here:

kedro/kedro/io/data_catalog.py

Lines 89 to 99 in 0293dc1

    
           def _sub_nonword_chars(data_set_name: str) -> str: 
        
               """Replace non-word characters in data set names since Kedro 0.16.2. 
        
               Args: 
        
                   data_set_name: The data set name registered in the data catalog. 
        
               Returns: 
        
                   The name used in `DataCatalog.datasets`. 
        
               """ 
        
               return re.sub(WORDS_REGEX_PATTERN, "__", data_set_name)

All of these lead to a conclusion that we should try to simply the interface (both public and internal). DataCatalog and it attributes are non-readable, it make it really hard to work in interactive mode because you cannot see what's inside the attribute without looking at the source code.

The main challenge here is to make things consistent without breaking change (or if we need breaking change we should do it now before 0.19). We should try to keep catalog.datasets since this it is the public interface people relies on. Things that I am not sure about is that also tightly couple to kedro-viz? We should check when this is implemented.

Possible Implementation

We can improve the usablility with auto-completion and using dataclasses or make these object inherit from things like UserDict (We did similar thing for pipelines).

Possible Alternatives

The text was updated successfully, but these errors were encountered:

noklam · 2023-09-04T12:22:09Z

Initial attempt to make _FrozenDataset a "UserDict" or dictionary equivalent interface work fines.

Instead of

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

we can do

for dataset_name, dataset in catalog.datasets.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

We can do more design for this, but I think it's relatively flexible to change these because they are all internal class and won't cause any breaking issue.

This won't clean up datasets and _data_sets immediately, but would provide a cleaner user interface (interactive mode). User will still need _get_dataset for some other thing.

merelcht · 2024-10-22T10:46:11Z

@ElenaKhaustova can this ticket be closed?

merelcht · 2024-11-07T16:39:23Z

This has been addressed in the work on the new KedroDataCatalog.

noklam added the Issue: Feature Request New feature or improvement to existing feature label Sep 1, 2023

noklam changed the title ~~Clean upcatalog.datasets and catalog._data_sets~~ Clean up catalog.datasets and catalog._data_sets Sep 1, 2023

noklam added this to Kedro Framework Sep 1, 2023

noklam mentioned this issue Sep 1, 2023

Add hook example to access metadata #2998

Merged

6 tasks

merelcht added this to the Redesign Catalog and Datasets milestone Sep 4, 2023

noklam mentioned this issue Sep 25, 2023

Rename data_sets argument and _data_sets attribute of DataCatalog #3041

Closed

3 tasks

github-actions bot mentioned this issue Oct 1, 2023

Monthly issue metrics report #3100

Closed

merelcht closed this as completed Nov 7, 2024

github-project-automation bot moved this to Done in Kedro Framework Nov 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up `catalog.datasets` and `catalog._data_sets` #2999

Clean up `catalog.datasets` and `catalog._data_sets` #2999

noklam commented Sep 1, 2023 •

edited

Loading

noklam commented Sep 4, 2023 •

edited

Loading

merelcht commented Oct 22, 2024

merelcht commented Nov 7, 2024

Clean up catalog.datasets and catalog._data_sets #2999

Clean up catalog.datasets and catalog._data_sets #2999

Comments

noklam commented Sep 1, 2023 • edited Loading

Description

Context

Possible Implementation

Possible Alternatives

noklam commented Sep 4, 2023 • edited Loading

merelcht commented Oct 22, 2024

merelcht commented Nov 7, 2024

Clean up `catalog.datasets` and `catalog._data_sets` #2999

Clean up `catalog.datasets` and `catalog._data_sets` #2999

noklam commented Sep 1, 2023 •

edited

Loading

noklam commented Sep 4, 2023 •

edited

Loading