
Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when they get printed #1721

Closed
noklam opened this issue Jul 21, 2022 · 19 comments

Comments

@noklam
Contributor

noklam commented Jul 21, 2022

Dataset discovery

(screenshot)

Running the variable in a notebook cell is the most common way to inspect a variable in Jupyter. Currently catalog gives us a useless memory address; it would be much nicer if it printed out what's available, potentially just wrapping catalog.list() for simplicity.

demo:

(screenshot)

Autocompletion

# Current Way - Autocompletion not possible
catalog.load("example_iris_data").head()

# Slightly more verbose way - but much easier to type with Tab-completion
catalog.datasets.example_iris_data.load().head()

Under the hood, auto-completion works by checking the result of dir(object)

Possible solutions:

  1. Modify __dir__ to show datasets, so users can do catalog. + Tab. It will show the other methods available in catalog too; we can probably make datasets show up at the top of the autocompletion list and the other methods at the bottom. Most likely it just has to be a list. Of course we have to implement __getattr__ so that catalog.dataset_name actually looks at catalog.datasets too.

demo:
(screenshot)

  2. Implement a dict-like interface, so that when users type catalog[ + Tab, it shows all the datasets.
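A minimal sketch of option 1, assuming a simplified stand-in class (not Kedro's actual DataCatalog implementation):

```python
# Hypothetical sketch: expose dataset names through __dir__ and __getattr__
# so that `catalog.<Tab>` offers them. Class and attribute names are
# illustrative, not Kedro's real API.
class Catalog:
    def __init__(self, datasets):
        self._datasets = datasets  # name -> dataset object

    def list(self):
        return list(self._datasets)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real methods
        # such as load/save still take precedence.
        try:
            return self._datasets[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Include dataset names alongside regular attributes so Tab
        # completion surfaces them (most shells sort the result anyway).
        return self.list() + sorted(super().__dir__())


catalog = Catalog({"example_iris_data": "<dataset>"})
assert "example_iris_data" in dir(catalog)
assert catalog.example_iris_data == "<dataset>"
```

Note that this breaks down for namespaced names like `ns.my_dataset`, which cannot be a single attribute; that complication is discussed below.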

Bonus

session and context also have useless __repr__s, but they are not as useful in the interactive workflow, so they have lower priority.

@antonymilne
Contributor

LOVE this idea. I've added it to the interactive workflow milestone which I still need to populate a bit more since I've had related thoughts that I haven't had a chance to write down properly yet.

Just a few notes for now:

  • the correct mechanism to access dataset_name through catalog needs to be formalised anyway, as per my comments in Improving the I/O transparency with kedro run #1691. Easy tab completion in the interactive workflow should definitely be a factor here. Note there's a complication: namespaced datasets have . in their name, which might make access through getattr awkward. So we need to think quite carefully about what the different options are and their pros and cons
  • one of the things we plan to add to the interactive workflow that would make this suggestion less important is the ability to open a notebook in the context of a node, e.g. %load_node node_name which would automatically populate Python variables containing all the input datasets pre-loaded. Tabbed completion of catalog would still be really cool, but not such high priority then
  • pipelines could probably also benefit from a better __repr__, but as you say catalog is definitely the most important
  • not important, but rich offers us __rich__ and __rich_repr__ which might be nice here to make things even snazzier. Ultimately I'd like all kedro objects (nodes, pipelines, datasets, etc.) to automatically have beautiful, clearly formatted str/repr, but that's some way off

@noklam
Contributor Author

noklam commented Aug 2, 2022

the correct mechanism to access dataset_name through catalog needs to be formalised anyway as per my comments in #1691. Easy tab completion in the interactive workflow should definitely be a factor here. Note there's a complication that namespaced datasets have . in their name which might make access through getattr awkward. So we need to think quite carefully about what are the different options and the pros and cons here

I agree it becomes tricky once namespaces are in the picture. In this case, I think a dict-like interface is more appropriate, similar to how we access parameters with nested .

one of the things we plan to add to the interactive workflow that would make this suggestion less important is the ability to open a notebook in the context of a node, e.g. %load_node node_name which would automatically populate Python variables containing all the input datasets pre-loaded. Tabbed completion of catalog would still be really cool, but not such high priority then

%load_node is an interesting idea. We had a similar discussion about this a few months ago; I think one of the challenges here is making MemoryDataSet work. If it is written into the catalog already, then it would be straightforward. I used to rely on the notebook for debugging whenever I needed to inspect data in a large pipeline. The reason I did that is that PyCharm's debug mode can get really slow when the data size is large; it can freeze even if you just print a few lines of data. The alternative is to run the pipeline up to the node of interest, then copy and paste the code of the node and start debugging. The tricky part is getting the required data loaded properly.

It may be worth discussing what session.run should return. For interactive/debugging purposes, it may make sense to keep the intermediate memory datasets. (IIRC, we currently only return node outputs that are not consumed by any other node.)

@antonymilne
Contributor

I just came across this page, which is well worth reading for some more ideas:

  • there are ways to customise tab completion beyond just __dir__
  • there are special Jupyter repr methods that can render things more nicely than the default __repr__
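For reference, a minimal sketch of the second point. The `_repr_html_` hook name is IPython's real convention (checked before falling back to `__repr__` when displaying a cell result); the class itself is an illustrative stand-in:

```python
# Illustrative class demonstrating Jupyter's rich-repr hooks. In a notebook,
# evaluating `catalog` in a cell renders _repr_html_ if it exists; plain
# repr() still returns __repr__.
class Catalog:
    def __init__(self, datasets):
        self._datasets = datasets

    def __repr__(self):
        # Plain-text fallback for terminals and debuggers.
        return f"Catalog({sorted(self._datasets)})"

    def _repr_html_(self):
        # Rich HTML rendering used by Jupyter when displaying the object.
        rows = "".join(
            f"<li><code>{name}</code></li>" for name in sorted(self._datasets)
        )
        return f"<b>Datasets</b><ul>{rows}</ul>"


catalog = Catalog({"b": 1, "a": 2})
assert repr(catalog) == "Catalog(['a', 'b'])"
```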

@datajoely
Contributor

you can also use __rich_repr__ now...

@noklam
Contributor Author

noklam commented Aug 31, 2023

I am working on #2676 and trying to debug it, and it's hard. I want to re-purpose this issue so it's not Jupyter-focused. If we want users to use this as a component, it needs a nicer public API and str. Right now it's hard to find out what is available without going through the source code.

In catalog.yml I have X_train@pyspark, but when I try to print it I get this:

X_train__pyspark metadata: 
  None

@noklam noklam changed the title from "Improve Jupyter workflow with autocompletion and better __str__ / __repr__" to "Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when they get printed" on Aug 31, 2023
@datajoely
Contributor

Yeah, the internal representation of namespaced datasets with a double underscore is really annoying; I hit this when doing an IDE prototype (#2821). It would be great if the catalog had a presentation-layer representation available in the public API.

@noklam
Contributor Author

noklam commented Aug 31, 2023

The clean-up will go into the (not yet defined) Redesign Catalog and Datasets milestone. But before that happens we can still make the public interface nicer, and maybe deprecate or unify the rest later. Good shout about namespaces; I haven't tried that.

@noklam noklam modified the milestones: Make it easier to use Kedro as a library, Using Kedro with existing projects Sep 1, 2023
@noklam
Contributor Author

noklam commented Sep 4, 2023

@noklam
Contributor Author

noklam commented Oct 23, 2023

One more: when working with CST, they often have a large catalog. A few things that I think are useful:

  • catalog.list(<regex>) - not many people are aware that this can help discover dataset names.
  • catalog.datasets. + Tab can use autocompletion (catalog.load doesn't).
    • More often I just do datasets = catalog.datasets, then datasets.my_dataset.load(), which is much faster to type (a large catalog usually means nested namespaces and long entries). There is a separate problem of . getting mapped to __, which isn't consistent and needs a better design.

I was wondering: could we make this better? Could we have catalog["xxx"] so it works like a dictionary and can use autocomplete easily?
Alternatively, if we prefer fewer API changes, is it possible to make catalog.load("xxx") auto-completable?
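A sketch of what the dict-like option could look like. `_ipython_key_completions_` is IPython's real hook for completing keys inside `catalog["`; the class itself is a made-up stand-in:

```python
# Illustrative dict-like catalog: catalog["<Tab>"] completes dataset names
# in IPython/Jupyter via the _ipython_key_completions_ protocol. This also
# sidesteps the namespace problem, since any string is a valid key.
class Catalog:
    def __init__(self, datasets):
        self._datasets = datasets

    def __getitem__(self, name):
        return self._datasets[name]

    def _ipython_key_completions_(self):
        # IPython calls this when completing inside catalog["...
        return list(self._datasets)


catalog = Catalog({"namespace.my_dataset": "<dataset>"})
assert catalog["namespace.my_dataset"] == "<dataset>"
assert "namespace.my_dataset" in catalog._ipython_key_completions_()
```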


@noklam
Contributor Author

noklam commented Oct 23, 2023

I did some more research and couldn't find any auto-completion feature for catalog.load(" (hit Tab). Autocompletion only works for catalog. (via __dir__) or catalog[" (via _ipython_key_completions_ or by inheriting the dict interface).

@noklam
Contributor Author

noklam commented Oct 23, 2023

I found a nicer solution: use _repr_pretty_ instead of __repr__. IMO overriding __repr__ isn't a big problem, but it's inconsistent about what it should do.

I think the full feature needs to be designed properly, but some of the non-breaking stuff like printing would already be useful, and we can change it afterwards. I would like to do this sooner rather than later.
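A sketch of what a `_repr_pretty_` implementation could look like. The hook name and the `(p, cycle)` signature are IPython's real protocol; the catalog class is illustrative:

```python
# Illustrative use of IPython's _repr_pretty_ hook: Jupyter calls it when
# displaying the object as a cell result, while repr(catalog) stays as-is.
class Catalog:
    def __init__(self, datasets):
        self._datasets = datasets

    def _repr_pretty_(self, p, cycle):
        # `p` is IPython's pretty-printer; `cycle` guards against recursion.
        if cycle:
            p.text("Catalog(...)")
            return
        with p.group(2, "Catalog(", ")"):
            for i, name in enumerate(sorted(self._datasets)):
                if i:
                    p.text(",")
                    p.breakable()  # allows line-wrapping for long catalogs
                p.text(name)
```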

@datajoely
Contributor

You can also do a Rich specific repr https://rich.readthedocs.io/en/stable/pretty.html#typing

@Galileo-Galilei
Member

Galileo-Galilei commented Nov 4, 2023

Hello,
I've been playing around with Kedro more "interactively" recently, especially to introduce data analysts to Kedro in notebooks, and these features spontaneously arose from the conversation. Like @noklam, I think it is already worth introducing a better __repr__ for Datasets and DataCatalog. Can I start a PR for this?

My two cents on the implementation details:

  • Autocomplete would be great, but the thread shows that this feature is much more complex to implement (particularly because of namespaces), so we should ship the pretty __repr__ first.
  • I am more in favour of implementing __repr__ rather than _repr_pretty_ or __rich_repr__ (at least at first), because I feel the latter can lead to an inconsistent experience in a debugger / a script / a notebook, so I'd prefer adding complexity incrementally.
  • I think it is fine to iterate frequently and change the repr "often", because it is not a breaking change and people rely on it for interactive use rather than production code.
  • I think _describe() is a good candidate to use in the __repr__ method for datasets, since it seems to have been introduced specifically for this purpose. The drawback is that it is not always implemented (especially for user-defined datasets), so we should at least have a fallback strategy. Maybe it is worth making this method abstract in the AbstractDataset class in the future?
  • The DataCatalog.__repr__ may need to be simpler than just calling {dataset.__repr__ for dataset in catalog.datasets}, because the print can be overcrowded, especially if there are a lot of datasets. On the other hand, just calling catalog.list() seems not informative enough. What do we want for the first draft?
  • The "credentials" specificity is natively handled by pydantic, but if we don't want to introduce pydantic as a dependency, we should handle it manually. It is possible to pass a callable with attrs, so we might give it a shot. Note that in the _describe() method most people don't show credentials, so it should be fine if we proxy this method to print the dataset.
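To illustrate the _describe()-with-fallback idea from the bullets above, a hypothetical sketch (class names and method bodies are made up for illustration, not Kedro's actual implementation):

```python
# Hypothetical base class: __repr__ is built from _describe() when a
# dataset implements it, with a generic fallback otherwise.
class AbstractDataset:
    def _describe(self):
        # User-defined datasets may not override this, hence the fallback.
        return {}

    def __repr__(self):
        described = self._describe()
        if not described:
            return f"{type(self).__name__}(...)"
        args = ", ".join(f"{k}={v!r}" for k, v in described.items())
        return f"{type(self).__name__}({args})"


class CSVDataset(AbstractDataset):
    def __init__(self, filepath, load_args=None):
        self._filepath = filepath
        self._load_args = load_args or {}

    def _describe(self):
        # Deliberately excludes credentials, as noted above.
        return {"filepath": self._filepath, "load_args": self._load_args}


print(repr(CSVDataset("temp.csv", {"sep": ";"})))
# CSVDataset(filepath='temp.csv', load_args={'sep': ';'})
```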

@astrojuanlu
Member

I agree with going ahead with __repr__ (especially in the face of #2928) and separating the autocomplete use case 👍🏽

@Galileo-Galilei
Member

Galileo-Galilei commented Nov 4, 2023

I've just started, and here is a fun fact: a __str__ method already exists in AbstractDataset, originally introduced in May 2019 🤯, and it leverages the _describe method to achieve something very similar to what is described above. The current state of printing in a notebook is described hereafter:

from kedro.extras.datasets.pandas import CSVDataSet

ds = CSVDataSet(
    filepath=r"temp.csv",
    load_args={"sep": ";"},
)
ds
repr(ds)
print(ds)
str(ds)
| Printing method | Current result | Result with a `__repr__` implemented |
| --- | --- | --- |
| `ds` | `<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet at 0x1ab62e5cf70>` | result of `__repr__` |
| `repr(ds)` | `'<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object at 0x000001AB62E88C70>'` | result of `__repr__` |
| `print(ds)` | `CSVDataSet(filepath=temp.csv, protocol={'sep': ;}, save_args={'index': False})` | same as current |
| `str(ds)` | `"CSVDataSet(filepath=temp.csv, protocol={'sep': ;}, save_args={'index': False})"` | same as current |

This tutorial claims that:

The object representation is returned as a string by the Python method, repr(). This method is called when the object's repr() method is used. If feasible, the text returned should be a legitimate Python expression that may be used to recreate the object.

... which is what our __str__ method does. So it seems the current __str__ should be moved to __repr__ to make it more discoverable by users, and we can eventually discuss adding a __str__ method later if necessary. WDYT?

@astrojuanlu
Member

I can confirm from the private repository (before open-sourcing) that the __str__ method has been there basically since forever.

Maybe we can just make __repr__ and __str__ do the same for now? I don't see the need to change __str__, as useless as it might be...

@astrojuanlu
Member

Re-reading this:

# Current Way - Autocompletion not possible
catalog.load("example_iris_data").head()

# Slightly more verbose way - but much easier to type with Tab-completion
catalog.datasets.example_iris_data.load().head()

Would autocompletion of dataset names as strings be possible? See a similar thing for pandas DataFrame columns:

(screenshot)

The problem with dynamic properties is that some dataset names that are valid in YAML would become illegal attribute names (same problem as with pandas columns), and it would also pollute the namespace of the DataCatalog (again, same problem).
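A quick check of why dynamic attributes break for some names but string keys don't (all names here are illustrative):

```python
# Names that are fine in YAML but can never be Python attributes: anything
# that is not a valid identifier, or that collides with a keyword.
import keyword

for name in ["companies@spark", "namespace.my_dataset", "class"]:
    ok_as_attr = name.isidentifier() and not keyword.iskeyword(name)
    print(f"{name!r}: valid attribute -> {ok_as_attr}")
# 'companies@spark': valid attribute -> False
# 'namespace.my_dataset': valid attribute -> False
# 'class': valid attribute -> False

# Dict-style access has no such restriction: any string works as a key.
datasets = {"companies@spark": "<dataset>"}
assert datasets["companies@spark"] == "<dataset>"
```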

@astrojuanlu
Member

What is missing from this issue?

@merelcht
Member

Closed because this was done in #3981
