Add meaningful representation when printing a DataCatalog #3299

Galileo-Galilei · 2023-11-11T22:04:16Z

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

This PR focuses on solving (very partially) #1721. I focus only on making catalog printing more meaningful, instead of the current <kedro.io.data_catalog.DataCatalog at 0x5x231>.

The "best" representation is still not clear, and this PR aims to share my trials and errors publicly to make it more meaningful. Some requirements I'd like to meet for the "best representation" of the printed objects:

it should leverage existing underlying code as much as possible
it should be "copy-pasteable" to generate the object itself
it should be easy to read:
- for datasets, I think that "easy to read" means "look like a black formatted string"
- for the catalog, I think the information we are looking for is mostly the dataset names. I personally like having one dataset per line, even if it gives very long lines and breaks black formatting, because the details of each dataset are secondary. I am afraid that formatting the entire catalog with black can lead to hundreds of lines of printing. Such cluttered output would go against what we are trying to achieve.

In scope : Making print(catalog) informative

Out of scope :

Autocompletion
It is likely worth leveraging the AbstractDataset.__repr__ method to generate the DataCatalog.__repr__. However, fixing potential issues in each dataset's _describe method is out of scope.

Development notes

AbstractDataset.__repr__

I removed the __str__ method and fully replaced it with __repr__. When __str__ is not user defined, __repr__ is used instead, which is exactly what we want: consistency between the two methods.
I renamed _to_str to _prettify_dict_to_str (which looks slightly more informative, but not very nice either TBH)
I also changed the behaviour of this _to_str method: it used to customize the string representation of dict. We now rely on the default __str__ method of dict. The main change is that we keep quotes around strings, which is necessary to ensure the output can be copy-pasted and remains valid Python code.
I created a separate _build_str_representation method which builds the string on one line for ease of integration in the catalog. The __repr__ function only calls this method and formats the result with black, which often renders it on several lines.

# ⚙️ Internal representation: output of _build_str_representation
CSVDataSet(filepath="temp.csv", protocol={"sep": ",", "decimal": ".", "header": True}, save_args={"index": False, "sep": ";", "decimal": ",", "header": False})

# ✅ output of __repr__
CSVDataSet(
    filepath="temp.csv",
    protocol={"sep": ",", "decimal": ".", "header": True},
    save_args={"index": False, "sep": ";", "decimal": ",", "header": False},
)
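To make the one-line step concrete, here is a minimal sketch of how such a string could be built from the dictionary a dataset describes itself with. The function name `build_str_representation` and its signature are assumptions for illustration, not the code in this PR. It relies on `repr()` for each value, which is what keeps the quotes around strings.

```python
def build_str_representation(class_name: str, description: dict) -> str:
    """Build a one-line, constructor-like string (hypothetical helper)."""
    # repr() keeps quotes around strings, so the result stays valid Python
    # as long as the keys really are constructor arguments.
    args = ", ".join(f"{key}={value!r}" for key, value in description.items())
    return f"{class_name}({args})"


print(
    build_str_representation(
        "CSVDataSet",
        {"filepath": "temp.csv", "save_args": {"index": False}},
    )
)
```

Note that plain `repr()` uses single quotes; the double quotes in the outputs above come from the black formatting step.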

DataCatalog.__repr__

Implement a __repr__ method which prints one line per dataset, beginning with the dataset name

# ✅Approach 1 (suggested)
DataCatalog(
    data_sets={
        'ds': CSVDataSet(filepath='temp.csv', protocol={'sep': ';'}, save_args={'index': False}),
        'ds2': CSVDataSet(filepath='temp.csv', protocol={'sep': ',', 'decimal': '.', 'header': True}, save_args={'index': False, 'sep': ';', 'decimal': ',', 'header': False}),
        'ds3': JSONDataSet(filepath='temp.csv')
})

For comparison, here is what it would look like if we used black to render the string:

# ❌Approach 2 : black-like representation

DataCatalog(
    data_sets={
        "ds": CSVDataSet(
            filepath="temp.csv", protocol={"sep": ";"}, save_args={"index": False}
        ),
        "ds2": CSVDataSet(
            filepath="temp.csv",
            protocol={"sep": ",", "decimal": ".", "header": True},
            save_args={"index": False, "sep": ";", "decimal": ",", "header": False},
        ),
        "ds3": JSONDataSet(filepath="temp.csv"),
    }
)
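Approach 1 only needs each dataset to have a one-line repr. A rough sketch of the idea, using a stand-in class rather than real Kedro datasets (`FakeDataset` and `catalog_repr` are hypothetical names, not part of this PR):

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset with a one-line repr."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs

    def __repr__(self) -> str:
        args = ", ".join(f"{k}={v!r}" for k, v in self._kwargs.items())
        return f"{type(self).__name__}({args})"


def catalog_repr(datasets: dict) -> str:
    """Render one line per dataset, each starting with the dataset name."""
    lines = ",\n".join(
        f"        {name!r}: {dataset!r}" for name, dataset in datasets.items()
    )
    return "DataCatalog(\n    data_sets={\n" + lines + "\n    }\n)"


print(
    catalog_repr(
        {
            "ds": FakeDataset(filepath="temp.csv", save_args={"index": False}),
            "ds2": FakeDataset(filepath="temp.json"),
        }
    )
)
```

The number of printed lines grows with the number of datasets only, not with the number of arguments per dataset.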

⚠️ Points to discuss :

This introduces black as a dependency again while we've just moved to ruff. Do we want this?
Each dataset representation is very long and may be hard to read. Do people prefer approach 1 or 2?
"As is", the representation is incorrect and cannot be copy-pasted to generate a new catalog, because the _describe() method of many datasets is incorrect (but this is out of scope); see the protocol argument above in CSVDataSet. Is it OK to release it with an incorrect representation?

📝 TODO:

make it more robust to missing _describe
make it more robust to potentially invalid code which may crash black
add black as a dependency
write tests
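
One possible way to handle the robustness item, sketched here under the assumption that black is treated as optional rather than required: fall back to the one-line string when black is missing or cannot parse the input. `format_repr` is a hypothetical helper name; `black.format_str` and `black.Mode` are the calls I believe black exposes for formatting a string.

```python
def format_repr(one_line: str) -> str:
    """Return a black-formatted string when possible, else the input unchanged."""
    try:
        import black  # optional: only used if it happens to be installed
    except ImportError:
        return one_line
    try:
        return black.format_str(one_line, mode=black.Mode()).strip()
    except Exception:
        # For example, black cannot parse the string because _describe
        # produced something that is not valid Python.
        return one_line


# Not valid Python, so the string comes back unchanged either way.
print(format_repr("CSVDataSet(filepath=<unknown>)"))
```

With this shape, printing a dataset never fails because of the formatter; the worst case is the long one-line form.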

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Yolan Honoré-Rougé <[email protected]>

Galileo-Galilei · 2023-11-12T21:34:42Z

@astrojuanlu Any thoughts on the points to discuss above? I find it quite satisfying right now, but I want to double-check before creating tests and making the CI pass.

datajoely · 2023-11-13T09:57:55Z

Nice nice work! I think I like Black representation because it will mostly be viewed in a terminal with limited width

datajoely · 2023-11-13T09:58:18Z

Are we doing anything to mask credentials?

astrojuanlu · 2023-11-13T11:56:14Z

Thanks @Galileo-Galilei! My main points are

I see that this is basically a "proper" repr for the DataCatalog in the sense that copy-pasting its output into the interpreter would rebuild the same object. That's fine by me for now, as long as
we are properly masking credentials or not showing them at all, and
we are not introducing a runtime dependency with black, which I think would be problematic

noklam · 2023-11-14T05:37:36Z

Thank you for taking a stab at this already! I would also be cautious about introducing black just for formatting the repr. On the other hand, does rich handle this already, or does it offer some kind of prettify method?

noklam · 2024-03-08T00:06:59Z

@Galileo-Galilei Are you still interested in finishing this PR?

Galileo-Galilei · 2024-03-08T06:38:19Z

Nope, sorry, I'm closing it. I have very little time at the moment, and I have not come up with something totally satisfying so far, so it still requires a bit of thinking.

Galileo-Galilei added 2 commits November 6, 2023 23:09

first attempt for a pretty __repr__

1444750

Signed-off-by: Yolan Honoré-Rougé <[email protected]>

Update prettifying function

8c71cfb

Galileo-Galilei requested a review from merelcht as a code owner November 11, 2023 22:04

Galileo-Galilei marked this pull request as draft November 11, 2023 22:04

Galileo-Galilei added 2 commits November 12, 2023 20:39

Replace abstractdataset __str__ by __repr__ and use black for formatting

9719edb

Put all datasets repr in catalog on one line

fb746ac

Galileo-Galilei changed the title ~~[SPIKE - DO NOT MERGE] Add meaningful representation when printing a DataCatalog~~ Add meaningful representation when printing a DataCatalog Nov 12, 2023

Galileo-Galilei closed this Mar 8, 2024

merelcht deleted the dataset_repr branch October 31, 2024 15:55
