
Render concrete parameter/return types for dataset save/load #2199

Closed
deepyaman opened this issue Jan 12, 2023 · 9 comments · Fixed by #3920
@deepyaman (Member) commented Jan 12, 2023

Description

The parameter type for a dataset's save is _DI, and the return type for load is _DO. This doesn't make any sense to a user (and barely makes sense to me, as a Kedro developer, because I saw the typing PR go in and have that context). I'd rather it not display any types, if getting the correct concrete types is too difficult.

[screenshot: rendered API docs showing save(data: _DI) and load() -> _DO]
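For context, these placeholders are the TypeVars on the generic dataset base class. A minimal sketch of the assumed shape (illustrative only; this mirrors, but is not, the actual kedro.io.core source):

```python
# Why Sphinx renders "_DI"/"_DO": the base class is generic over the save
# (data in) and load (data out) types. Sketch only; not the real kedro.io.core.
from typing import Generic, TypeVar

_DI = TypeVar("_DI")  # type of the data passed to save()
_DO = TypeVar("_DO")  # type of the data returned by load()


class AbstractDataset(Generic[_DI, _DO]):
    def save(self, data: _DI) -> None:
        """Save data to the dataset."""
        self._save(data)

    def load(self) -> _DO:
        """Load data from the dataset."""
        return self._load()


class InMemoryDataset(AbstractDataset[int, int]):
    """Concrete subclass; the inherited save/load still advertise _DI/_DO."""

    def _save(self, data: int) -> None:
        self._data = data

    def _load(self) -> int:
        return self._data
```

Because save and load are defined only on the base class, autodoc picks up the unresolved TypeVars rather than each subclass's concrete types.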

@deepyaman deepyaman added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Jan 12, 2023
@antonymilne (Contributor) commented:
This would indeed be great to fix, but I think only someone very brave should attempt it, as I can see it being very frustrating and difficult to achieve... Do you have any idea how you could actually get Sphinx to do this without us having to manually override every single page of automatically generated dataset docs? Possibly we could edit a higher-level rst template to do it?

@merelcht (Member) commented:
To do: try updating to the newest version of Sphinx and see if the rendering has improved.

@stichbury (Contributor) commented:
We are using a later version of Sphinx and this is still not resolved.

It may be something @astrojuanlu is familiar with, but if it's a non-trivial fix, I suggest we treat it as very low priority. Could we potentially explain somewhere what is going on, so that people who want to know what it means can search and find the reasoning?

@astrojuanlu (Member) commented:
Upgrading sphinx or sphinx-autodoc-typehints proved to create lots of headaches in the past. I agree the current result is less than ideal but I don't think this is trivial to fix at all. It might be easier to actually redesign the datasets...

@astrojuanlu (Member) commented Aug 22, 2023

It doesn't look much better on the latest version with the spun-off kedro_datasets: https://docs.kedro.org/en/stable/kedro_datasets.pandas.CSVDataSet.html

[screenshot: kedro_datasets.pandas.CSVDataSet docs showing the same placeholder types]

The root cause is that the heavy lifting is done by _load, which is a private method. Conversation about this is scattered around different issues (#1778, #1936, #2409 to name a few) but essentially I think the only sane way to fix this is to come up with a design that favours composition over inheritance, something that looks more similar to the proposal in #1936 (comment).
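As a rough illustration of the composition-over-inheritance direction, here is an entirely hypothetical sketch (the names below are illustrative and are not from the #1936 proposal or any kedro API):

```python
# Hypothetical composition-based dataset: load/save are plain public callables
# with concrete signatures, so there is nothing for Sphinx to mis-render.
# Illustrative only; not a proposed kedro design.
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class Dataset(Generic[T]):
    load: Callable[[], T]
    save: Callable[[T], None]


# Example: a dict-backed "dataset" with concrete int annotations.
_store: dict[str, int] = {}


def load_answer() -> int:
    """Load the stored value."""
    return _store["answer"]


def save_answer(data: int) -> None:
    """Save the value."""
    _store["answer"] = data


answer_ds = Dataset(load=load_answer, save=save_answer)
```

With this shape, the documented functions carry their own concrete annotations, so autodoc would have nothing generic to resolve.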

@astrojuanlu (Member) commented:
I think it's easier to just wait on a redesign of our catalog and datasets #1778.

@deepyaman (Member, Author) commented May 8, 2024

Desired output

We can redefine load and save in each class to produce our desired result:

import pandas as pd
from pandas.util._decorators import doc  # We could roll our own, simpler version

from kedro.io import AbstractDataset


class LambdaDataset(AbstractDataset):
    ...

    @doc(AbstractDataset.save.__doc__)
    def save(self, data: int) -> None:
        super().save(data)

    @doc(AbstractDataset.load.__doc__)
    def load(self) -> pd.DataFrame:
        return super().load()
[screenshot of the resulting rendered docs]
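As for the "roll our own, simpler version" aside in the snippet above, a minimal stand-in could look like this (sketch; the real pandas decorator also does template substitution, which we don't need here):

```python
# Minimal stand-in for pandas.util._decorators.doc (sketch): it just copies
# the given docstring onto the decorated function.
def doc(docstring):
    def decorator(func):
        func.__doc__ = docstring
        return func
    return decorator


@doc("Saves data to the underlying storage.")
def save(data):
    ...
```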

Note that clicking through to the source from the docs will show the generated code, so maybe we should state explicitly in the comments that the function is generated:

[screenshot: the generated function as shown via the docs' source link]

Possible implementations

Now, the question is: how do we accomplish this for all datasets? There are a few possibilities:

  1. Actually add the load and save methods to each dataset implementation. The extra function call probably carries a negligible performance hit (I/O is almost certainly the dominant cost of any reasonable dataset), but it's not the prettiest from a code perspective, where you'd rather rely on inheritance.
  2. Generate the additional methods only for docs generation.
  3. Is there another way to modify classes parsed by Sphinx before docs generation?

Notes

  • I tried adding a .pyi file, but that didn't seem to have any effect. I can't tell for certain whether sphinx_autodoc_typehints looks at stub files (I don't think so).
  • This doesn't interact correctly with typing.overload. tox-dev/sphinx-autodoc-typehints#296 fixed typing.overload support, but only from sphinx_autodoc_typehints==1.21.2 onwards. We probably shouldn't hard-pin sphinx_autodoc_typehints in kedro-sphinx-theme, and I'm personally not sure what regression was the issue or whether it has since been fixed (sphinx_autodoc_typehints is on 2.x now); in any case, the specific issue resolved there re typing.overload (duplicate entries) does not seem to be the problem here.
  • Upon further investigation, maybe typing.overload doesn't have any value here to begin with, as you must specify the implementation after the typing.overload declarations. At that point, we may as well just provide the implementation only in our case.

@astrojuanlu (Member) commented:
We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:

  • Given that this would mainly be the presentation layer, maybe it would be worth exploring a Sphinx extension instead.
    • However, creating Sphinx extensions can be an interesting exercise.
    • @deepyaman wasn't particularly motivated to pursue this approach but didn't close the door on somebody else doing it.
    • This is probably something that wouldn't be useful to the broader community, given that this is a very specific situation caused by the design.
  • We discussed the datasets design (xref Re-design io.core and io.data_catalog #1778)
    • Of course, we acknowledged that it will take us ~months to get there, so it would be nice to have an interim solution.
    • We started discussing whether we could get rid of the custom error handling in AbstractDataSet (see below), given that it directly affects this issue and makes it difficult for users to debug their dataset errors (Easier CustomDataset Creation #1936 (comment)), but we didn't have enough time to finish the discussion.

kedro/kedro/io/core.py, lines 190 to 202 at c8a0e22:

    self._logger.debug("Loading %s", str(self))
    try:
        return self._load()
    except DatasetError:
        raise
    except Exception as exc:
        # This exception handling is by design as the composed data sets
        # can throw any type of exception.
        message = (
            f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
        )
        raise DatasetError(message) from exc

@deepyaman (Member, Author) commented:


Thanks @astrojuanlu! Just copying in my summary, but I think you covered it:

It sounds like overriding load/save may be workable, there’s a potential weekend project if somebody wants to write their own Sphinx plugin, but maybe the first followup is to see whether we can get rid of _load and _save altogether?

I have some ideas on the latter, and I will investigate this.
