Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add meaningful representation when printing a DataCatalog #3299

Closed
wants to merge 4 commits into from

Conversation

Galileo-Galilei
Copy link
Member

@Galileo-Galilei Galileo-Galilei commented Nov 11, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

This PR focuses on solving (very partially) #1721. I focus only on making catalog printing more meaningful, instead of the current <kedro.io.data_catalog.DataCatalog at 0x5x231>.

The "best" representation is still not clear, and this PR aims sharing publicly trials and errors to make it more meaningful. Some requirements I'd like to meet for the "best representation" of the printed objects:

  • it should leverage existing underlying code as much as possible
  • it should be "copy-pasteable" to generate the object itself
  • it should be easy to read:
    • for datasets, I think that "easy to read" means "look like a black formatted string"
    • for the catalog, I think the information we are looking for is mostly the datasets names. I personally like having one dataset per line, even if it gives very long lines and breaks black formatting, because the details fo the datasets is mostly a detail. I am afraid than formatting the entire catalog with black can lead to hundreds of lines of printing. Such a cluttered output would be against what we are trying to achieve.

In scope : Making print(catalog) informative

Out of scope :

  • Autocompletion
  • it is likely worth leveraging the AbstractDataset.__repr__ method to generate the DataCatalog.__repr__. However, fixing potential issues in each dataset _describe is out of scope.

Development notes

  1. AbstractDataset.__repr__
  • I removed the __str__ method and fully replaced it by __repr__. When __str__ is not user defined, __repr__ is used instead which is exactly what we want: consistency between the 2 methods.
  • I renamed the _to_str as _prettify_dict_to_str (which looks slightly more informative, but not very nice either TBH)
  • I also changed the behaviour of this _to_str method : it used to customize the string representation of dict. We now rely on the default __str__ method of dict. The main change is that we keep quotes around strings, which is necessary to ensure the output can be copy pasted and remain valid python code.
  • I created a separated _build_str_representation which builds the string on one line for ease of integration in the catalog. The __repr__ function only calls this methods and format it with black, which often renders it no several lines.
# ⚙️ Internal representation: output of _build_str_representation
CSVDataSet(filepath="temp.csv", protocol={"sep": ",", "decimal": ".", "header": True}, save_args={"index": False, "sep": ";", "decimal": ",", "header": False})

# ✅ output of __repr__
CSVDataSet(
    filepath="temp.csv",
    protocol={"sep": ",", "decimal": ".", "header": True},
    save_args={"index": False, "sep": ";", "decimal": ",", "header": False},
)
  1. DataCatalog.__repr__
  • Implement a __repr__ method which prints one line per dataset beginning by the dataset name
# ✅Approach 1 (suggested)
DataCatalog(
    data_sets={
        'ds': CSVDataSet(filepath='temp.csv', protocol={'sep': ';'}, save_args={'index': False}),
        'ds2': CSVDataSet(filepath='temp.csv', protocol={'sep': ',', 'decimal': '.', 'header': True}, save_args={'index': False, 'sep': ';', 'decimal': ',', 'header': False}),
        'ds3': JSONDataSet(filepath='temp.csv')
})

For comparison, here is how it would have look like if we use black to render the string:

# ❌Approach 2 : black-like representation

DataCatalog(
    data_sets={
        "ds": CSVDataSet(
            filepath="temp.csv", protocol={"sep": ";"}, save_args={"index": False}
        ),
        "ds2": CSVDataSet(
            filepath="temp.csv",
            protocol={"sep": ",", "decimal": ".", "header": True},
            save_args={"index": False, "sep": ";", "decimal": ",", "header": False},
        ),
        "ds3": JSONDataSet(filepath="temp.csv"),
    }
)

⚠️ Points to discuss :

  • This introduces black as a dependency again while we've jsut moved to ruff. Do we want this?
  • Each dataset representation is very long and may be hard to read. Do people prefer approach 1 or 2?
  • "as is", the representation is incorrect and cannot be copy-pasted to generate a new catalog because the _describe() method of many dataset is incorrect (but this is out of scope), see the protocol argument above in CSVDataset. Is it ok to release it with incorrect representation?

📝 TODO:

  • make it more robust to missing _describe
  • make it more robust to potentially invalid code which may crashes black
  • add black as a dependency
  • write tests

Developer

Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@Galileo-Galilei Galileo-Galilei marked this pull request as draft November 11, 2023 22:04
@Galileo-Galilei
Copy link
Member Author

Galileo-Galilei commented Nov 12, 2023

@astrojuanlu Any thoughts of the points to discuss above ? I find it quite satisfying right now but I want to double check before creating tests and make the CI pass.

@Galileo-Galilei Galileo-Galilei changed the title [SPIKE - DO NOT MERGE] Add meaningful representation when printing a DataCatalog Add meaningful representation when printing a DataCatalog Nov 12, 2023
@datajoely
Copy link
Contributor

Nice nice work! I think I like Black representation because it will mostly be viewed in a terminal with limited width

@datajoely
Copy link
Contributor

Are we doing anything to mask credentials?

@astrojuanlu
Copy link
Member

Thanks @Galileo-Galilei! My main points are

  1. I see that this is basically a "proper" repr for the DataCatalog in the sense that copy-pasting its output on the interpreter would rebuild the same object. that's fine by me for now, as long as
  2. we are properly masking credentials or not showing them at all, and
  3. we are not introducing a runtime dependency with black, which I think would be problematic

@noklam
Copy link
Contributor

noklam commented Nov 14, 2023

thank you for taking a stab on this already! I will also be cautious to introduce black for just formatting the repr, on the other hand, does rich handle this already or does it offer some kind of prettify method?

@noklam
Copy link
Contributor

noklam commented Mar 8, 2024

@Galileo-Galilei Are you still interested to finish this PR?

@Galileo-Galilei
Copy link
Member Author

Nope, sorry, I close it. Very little time at the moment, and I did not come up with something totally satisfying for now so it still requires a bit of thinking.

@merelcht merelcht deleted the dataset_repr branch October 31, 2024 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants