Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up catalog.datasets and catalog._data_sets #2999

Closed
noklam opened this issue Sep 1, 2023 · 3 comments
Closed

Clean up catalog.datasets and catalog._data_sets #2999

noklam opened this issue Sep 1, 2023 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Sep 1, 2023

Description

In #2998, an example is added to access metadata in data catalog. This is useful when user need to extend Kedro, i.e. kedro-viz or plugin developers. We provide an example as follow

class MetadataHook:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
    ):
        for dataset_name, dataset in catalog.datasets.__dict__.items():
            print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

It used an internal __dict__ which is not ideal. On the other hand, we also want to improve the usability of the class in standalone or interactive mode

Context

There are 3 ways to access dataset.

  • catalog.datasets
  • catalog._data_sets
  • catalog._get_dataset(name)

They are used inconsistently in the codebase and all behavior slightly differently. catalog.datasets is the only "public" one which allow user to interact with the "read-only" FrozenDataset.

However, there is caveat to use catalog.datasets. For example, if you have transcoding dataset, it get converted to some internal non-readable string.

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

With this hook, if you have a dataset name X_train@spark, the dataset_name will be X_train__spark which is unexpected when you try to access the dictionary. On the other hand, catalog.datasets is not subscriptable, so you cannot access it by index or name.

This is due to the addition of the catalog.dataset_name syntax, you cannot have a Python property with @ or # in it, and it should be a valid Python name (mostly designed for interactive mode). The conversion happens here:

def _sub_nonword_chars(data_set_name: str) -> str:
"""Replace non-word characters in data set names since Kedro 0.16.2.
Args:
data_set_name: The data set name registered in the data catalog.
Returns:
The name used in `DataCatalog.datasets`.
"""
return re.sub(WORDS_REGEX_PATTERN, "__", data_set_name)

All of these lead to a conclusion that we should try to simply the interface (both public and internal). DataCatalog and it attributes are non-readable, it make it really hard to work in interactive mode because you cannot see what's inside the attribute without looking at the source code.

The main challenge here is to make things consistent without breaking change (or if we need breaking change we should do it now before 0.19). We should try to keep catalog.datasets since this it is the public interface people relies on. Things that I am not sure about is that also tightly couple to kedro-viz? We should check when this is implemented.

Possible Implementation

We can improve the usablility with auto-completion and using dataclasses or make these object inherit from things like UserDict (We did similar thing for pipelines).

Possible Alternatives

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Sep 1, 2023
@noklam noklam changed the title Clean upcatalog.datasets and catalog._data_sets Clean up catalog.datasets and catalog._data_sets Sep 1, 2023
@merelcht merelcht added this to the Redesign Catalog and Datasets milestone Sep 4, 2023
@noklam
Copy link
Contributor Author

noklam commented Sep 4, 2023

Initial attempt to make _FrozenDataset a "UserDict" or dictionary equivalent interface work fines.

Instead of

for dataset_name, dataset in catalog.datasets.__dict__.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

we can do

for dataset_name, dataset in catalog.datasets.items():
    print(f"{dataset_name} metadata: \n  {str(dataset.metadata)}")

We can do more design for this, but I think it's relatively flexible to change these because they are all internal class and won't cause any breaking issue.

This won't clean up datasets and _data_sets immediately, but would provide a cleaner user interface (interactive mode). User will still need _get_dataset for some other thing.

@merelcht
Copy link
Member

@ElenaKhaustova can this ticket be closed?

@merelcht
Copy link
Member

merelcht commented Nov 7, 2024

This has been addressed in the work on the new KedroDataCatalog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Status: Done
Development

No branches or pull requests

2 participants