
Render concrete parameter/return types for dataset save/load #2199

Closed
deepyaman opened this issue Jan 12, 2023 · 9 comments · Fixed by #3920
@deepyaman (Member) commented Jan 12, 2023

Description

The parameter type for a dataset's save is _DI, and the return type for load is _DO. This doesn't make any sense to a user (and barely makes sense to me, as a Kedro developer, because I saw the typing PR go in and have that context). I'd rather it not display any types, if getting the correct concrete types is too difficult.

[screenshot: rendered API docs showing save(data: _DI) and load() -> _DO]
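For context, these placeholders are the TypeVars on the generic dataset base class. A minimal sketch of the assumed shape (illustrative only; this mirrors, but is not, the actual kedro.io.core source):

```python
# Why Sphinx renders "_DI"/"_DO": the base class is generic over the save
# (data in) and load (data out) types. Sketch only; not the real kedro.io.core.
from typing import Generic, TypeVar

_DI = TypeVar("_DI")  # type of the data passed to save()
_DO = TypeVar("_DO")  # type of the data returned by load()


class AbstractDataset(Generic[_DI, _DO]):
    def save(self, data: _DI) -> None:
        """Save data to the dataset."""
        self._save(data)

    def load(self) -> _DO:
        """Load data from the dataset."""
        return self._load()


class InMemoryDataset(AbstractDataset[int, int]):
    """Concrete subclass; the inherited save/load still advertise _DI/_DO."""

    def _save(self, data: int) -> None:
        self._data = data

    def _load(self) -> int:
        return self._data
```

Because save and load are defined only on the base class, autodoc picks up the unresolved TypeVars rather than each subclass's concrete types.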

@deepyaman deepyaman added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Jan 12, 2023
@antonymilne (Contributor) commented:
This would indeed be great to fix, but I think only someone very brave should attempt it, as I can see it being very frustrating and difficult to achieve... Do you have any idea how you could actually get Sphinx to do this without us having to manually override every single page of automatically generated dataset docs? Possibly we could edit a higher-level rst template to do it?

@merelcht (Member) commented:
To do: try updating to the newest version of Sphinx and see if the rendering has improved.

@stichbury (Contributor) commented:
We are using a later version of Sphinx and this is still not resolved.

It may be something @astrojuanlu is familiar with, but if it's a non-trivial fix, I suggest we treat it as very low priority. Could we potentially explain somewhere what is going on, so that people who want to know what it means can search and find the reasoning?

@astrojuanlu (Member) commented:
Upgrading sphinx or sphinx-autodoc-typehints proved to create lots of headaches in the past. I agree the current result is less than ideal but I don't think this is trivial to fix at all. It might be easier to actually redesign the datasets...

@astrojuanlu (Member) commented Aug 22, 2023

It doesn't look much better on the latest version with the spun-off kedro_datasets: https://docs.kedro.org/en/stable/kedro_datasets.pandas.CSVDataSet.html

[screenshot: kedro_datasets.pandas.CSVDataSet docs showing the same placeholder types]

The root cause is that the heavy lifting is done by _load, which is a private method. Conversation about this is scattered around different issues (#1778, #1936, #2409 to name a few) but essentially I think the only sane way to fix this is to come up with a design that favours composition over inheritance, something that looks more similar to the proposal in #1936 (comment).
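As a rough illustration of the composition-over-inheritance direction, here is an entirely hypothetical sketch (the names below are illustrative and are not from the #1936 proposal or any kedro API):

```python
# Hypothetical composition-based dataset: load/save are plain public callables
# with concrete signatures, so there is nothing for Sphinx to mis-render.
# Illustrative only; not a proposed kedro design.
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class Dataset(Generic[T]):
    load: Callable[[], T]
    save: Callable[[T], None]


# Example: a dict-backed "dataset" with concrete int annotations.
_store: dict[str, int] = {}


def load_answer() -> int:
    """Load the stored value."""
    return _store["answer"]


def save_answer(data: int) -> None:
    """Save the value."""
    _store["answer"] = data


answer_ds = Dataset(load=load_answer, save=save_answer)
```

With this shape, the documented functions carry their own concrete annotations, so autodoc would have nothing generic to resolve.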

@astrojuanlu (Member) commented:
I think it's easier to just wait on a redesign of our catalog and datasets #1778.

@deepyaman (Member, Author) commented May 8, 2024

Desired output

We can redefine load and save in each class to produce our desired result:

import pandas as pd
from pandas.util._decorators import doc  # We could roll our own, simpler version

from kedro.io import AbstractDataset


class LambdaDataset(AbstractDataset):
    ...

    @doc(AbstractDataset.save.__doc__)
    def save(self, data: int) -> None:
        super().save(data)

    @doc(AbstractDataset.load.__doc__)
    def load(self) -> pd.DataFrame:
        return super().load()
[screenshot of the resulting rendered docs]
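As for the "roll our own, simpler version" aside in the snippet above, a minimal stand-in could look like this (sketch; the real pandas decorator also does template substitution, which we don't need here):

```python
# Minimal stand-in for pandas.util._decorators.doc (sketch): it just copies
# the given docstring onto the decorated function.
def doc(docstring):
    def decorator(func):
        func.__doc__ = docstring
        return func
    return decorator


@doc("Saves data to the underlying storage.")
def save(data):
    ...
```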

Note that clicking through to the source from the docs will show the generated code, so maybe we should state explicitly in the comments that the function is generated:

[screenshot: the generated function as shown via the docs' source link]

Possible implementations

Now, the question is: how do we accomplish this for all datasets? There are a few possibilities:

  1. Actually add the load and save methods to each dataset implementation. The extra function call probably carries a negligible performance hit (I/O is almost certainly the dominant cost of any reasonable dataset), but it's not the prettiest from a code perspective, where you'd rather rely on inheritance.
  2. Generate the additional methods only for docs generation.
  3. Is there another way to modify classes parsed by Sphinx before docs generation?

Notes

  • I tried adding a .pyi file, but that didn't seem to have any effect. I can't tell for certain whether sphinx_autodoc_typehints looks at stub files (I don't think so).
  • This doesn't interact correctly with typing.overload. tox-dev/sphinx-autodoc-typehints#296 fixed typing.overload support, but only from sphinx_autodoc_typehints==1.21.2 onwards. We probably shouldn't hard-pin sphinx_autodoc_typehints in kedro-sphinx-theme, and I'm personally not sure what regression was the issue or whether it has since been fixed (sphinx_autodoc_typehints is on 2.x now); in any case, the specific issue resolved there re typing.overload (duplicate entries) does not seem to be the problem here.
  • Upon further investigation, maybe typing.overload doesn't have any value here to begin with, as you must specify the implementation after the typing.overload declarations. At that point, we may as well just provide the implementation only in our case.

@astrojuanlu (Member) commented:
We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:

  • Given that this would mainly be the presentation layer, maybe it would be worth exploring a Sphinx extension instead.
    • However, creating Sphinx extensions can be an interesting exercise.
    • @deepyaman wasn't particularly motivated to pursue this approach but didn't close the door on somebody else doing it.
    • This is probably something that wouldn't be useful to the broader community, given that this is a very specific situation caused by the design.
  • We discussed the datasets design (xref Re-design io.core and io.data_catalog #1778)
    • Of course, we acknowledged that it will take us ~months to get there, so it would be nice to have an interim solution.
    • We started discussing whether we could get rid of the custom error handling in AbstractDataSet (see below), given that it directly affects this issue and makes it difficult for users to debug their dataset errors (Easier CustomDataset Creation #1936 (comment)), but we didn't have enough time to finish the discussion.

kedro/kedro/io/core.py, lines 190 to 202 at c8a0e22:

    self._logger.debug("Loading %s", str(self))
    try:
        return self._load()
    except DatasetError:
        raise
    except Exception as exc:
        # This exception handling is by design as the composed data sets
        # can throw any type of exception.
        message = (
            f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
        )
        raise DatasetError(message) from exc

@deepyaman (Member, Author) commented:


Thanks @astrojuanlu! Just copying in my summary, but I think you covered it:

It sounds like overriding load/save may be workable, there’s a potential weekend project if somebody wants to write their own Sphinx plugin, but maybe the first followup is to see whether we can get rid of _load and _save altogether?

I have some ideas on the latter, and I will investigate this.
