
Expose load and save publicly for each dataset #3920

Merged

deepyaman merged 21 commits into main from feat/render-concrete-types on Jul 29, 2024

Conversation

deepyaman
Member

@deepyaman deepyaman commented Jun 4, 2024

Description

Resolves #2199

In discussing #2199, the question arose of why _load and _save need to be private methods. This PR exposes them publicly, choosing to "decorate" or wrap the load and save functionality for each dataset, rather than calling them as private methods from the public base implementation.

The changes are designed to require minimal modification of existing datasets and to be backwards-compatible with custom datasets users may have written.

Development notes

Update (6/27): It's no longer necessary to write load = _load, save = _save; this is handled in __init_subclass__. That said, we can still roll out enforcement of users providing load and save as described below, as desired. It's actually much less pressing, but may help simplify the code a bit. :)

Previously on this PR...

I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:

class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save

If we want to take things slow, in the next minor release of Kedro (0.20.0), we can start raising DeprecationWarnings for datasets that don't implement save. In 0.21.0, we can drop support for datasets that don't define save (and just have the older _save).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@deepyaman deepyaman requested a review from merelcht as a code owner June 4, 2024 06:20
@deepyaman deepyaman marked this pull request as draft June 4, 2024 06:20
@deepyaman deepyaman changed the title Expose load and save publicly for each dataset [WIP] Expose load and save publicly for each dataset Jun 4, 2024
@datajoely
Contributor

Love this

@Galileo-Galilei
Member

Galileo-Galilei commented Jun 5, 2024

This is a brilliant idea 😍 , I am jealous of not having had it myself ;) I always have to explain to users who create a custom dataset that they should implement _load and _save instead of load and save; some less advanced users struggle to get the point, and this would likely help.

One small caveat is that I don't know if this would have unintended consequences on MlflowArtifactDataset which also wraps the methods (I think it shouldn't, but I'd rather check)

@deepyaman
Member Author

One small caveat is that I don't know if this would have unintended consequences on MlflowArtifactDataset which also wraps the methods (I think it shouldn't, but I'd rather check)

It will be great if you can check that! I did guard against wrapping datasets that inherit load or save multiple times, but the MlflowArtifactDataset implementation looks quite different, so I'll defer to you. :)
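As a sketch of how such a "wrap only once" guard can work (illustrative only — the actual PR inspects the method's __qualname__ rather than a flag; all names here are hypothetical):

```python
import functools

# Hypothetical sketch: mark wrapped functions with a flag so a subclass
# that merely inherits an already-wrapped load is not wrapped twice.

def _wrap_load(func):
    if getattr(func, "__loadwrapped__", False):
        return func  # already wrapped by a parent class: leave as-is

    @functools.wraps(func)
    def load(self):
        # ...shared behaviour (error handling, versioning) would go here...
        return func(self)

    load.__loadwrapped__ = True  # mark so we never wrap twice
    return load


class Base:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if hasattr(cls, "load"):
            cls.load = _wrap_load(cls.load)


class Parent(Base):
    def load(self):
        return "data"


class Child(Parent):  # inherits the already-wrapped load
    pass


# Child.load is the very same wrapper object, not a wrapper of a wrapper.
print(Child.load is Parent.load)  # True
```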

@deepyaman deepyaman changed the title [WIP] Expose load and save publicly for each dataset Expose load and save publicly for each dataset Jun 6, 2024
@deepyaman deepyaman marked this pull request as ready for review June 6, 2024 02:25
@deepyaman deepyaman enabled auto-merge (squash) June 17, 2024 16:40
@deepyaman deepyaman force-pushed the feat/render-concrete-types branch 2 times, most recently from ce48310 to c25bab3 on June 17, 2024 17:40
@astrojuanlu
Member

I can't comment much on the implementation for now but I've been a long advocate of making load and save public, so big +1 from me!

We should be mindful how this affects the way people define custom datasets. Will they still need to implement _load and _save? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which _load and _save are actually no longer necessary?

@deepyaman
Member Author

deepyaman commented Jun 18, 2024

We should be mindful how this affects the way people define custom datasets. Will they still need to implement _load and _save? If so, do we need to adjust our docs? And finally, is it too early to think of a world in which _load and _save are actually no longer necessary?

I'd included some information in the development notes above, but let me copy them below:

I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:

class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save

If we want to take things slow, in the next minor release of Kedro (0.20.0), we can start raising DeprecationWarnings for datasets that don't implement load or save. In 0.21.0, we can drop support for datasets that don't define load or save (and just have the older _load or _save).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.

@rashidakanchwala
Contributor

This change should not affect Kedro-Viz, as we don't use any private _load and _save methods anymore. For tests, we use the public methods, and for previews, we now use the public preview method.

@Galileo-Galilei
Member

Hi, sorry, there is something I don't get. I am trying to do this (inside the spaceflights-pandas starter; this code is copy-pasted from the example in the docstring):

from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataset


class MyOwnDataset(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds = MyOwnDataset(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)

ds.load()

And I am getting the error :

TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save

I've double-checked and I have the correct development version installed. While debugging, I am entering the __init_subclass__ method though 😕, so I don't really get if this is how it is intended to be used. I don't see anywhere in the code where you monkeypatch _load and _save in case they are not defined in the custom dataset, but maybe I am missing something.

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)

        if hasattr(cls, "load") and not cls.load.__qualname__.startswith("Abstract"):
Member

The test on __qualname__.startswith exists in MlflowArtifactDataset, but I think there is no test on MlflowAbstractDataset which could conflict, so hopefully we are fine 🤞

Member Author

If there is an issue, we can try to make it more robust (see #3920 (comment)); I think for now, if it works, it's a reasonable implementation until we see more cases.

Member

It's good for now, let's go for it!

@deepyaman
Member Author

deepyaman commented Jun 26, 2024

TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save

I've double-checked and I have the correct development version installed. While debugging, I am entering the __init_subclass__ method though 😕, so I don't really get if this is how it is intended to be used. I don't see anywhere in the code where you monkeypatch _load and _save in case they are not defined in the custom dataset, but maybe I am missing something.

Ah, this is probably right... let me try to take a look and patch that. Can probably just throw in some dummy method, since it shouldn't actually be hit.

Edit: Actually, thinking about it more, we should just make load and save abstract instead of _load and _save, to make the new interface clear. Even if just _load and _save are implemented, you mentioned it's going into __init_subclass__, and load and save will get created there.
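The edit above relies on a subtle ordering: __init_subclass__ runs while the subclass is being created, before ABCMeta computes the set of abstract methods, so aliasing a legacy _load onto load is enough to make the subclass concrete. A minimal sketch (simplified and illustrative, not the actual Kedro code):

```python
from abc import ABC, abstractmethod

class AbstractDatasetSketch(ABC):
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs inside type.__new__, before ABCMeta collects
        # __abstractmethods__, so this alias makes the subclass concrete.
        if "_load" in cls.__dict__ and "load" not in cls.__dict__:
            cls.load = cls._load

    @abstractmethod
    def load(self):
        ...


class LegacyDataset(AbstractDatasetSketch):
    # Only the legacy private method is defined...
    def _load(self):
        return "data"


# ...yet the class is instantiable, and the public load works:
print(LegacyDataset().load())  # data
```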

@deepyaman deepyaman force-pushed the feat/render-concrete-types branch 4 times, most recently from c425cdb to ba8255a on June 27, 2024 22:18
@merelcht merelcht requested review from noklam and removed request for rashidakanchwala, yetudada and ravi-kumar-pilla July 3, 2024 14:19
@Galileo-Galilei
Member

Galileo-Galilei commented Jul 9, 2024

Still need to deep-dive into the implementation, but unfortunately this breaks kedro-mlflow's MlflowArtifactDataset 😕 because of this line:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L65C17-L65C36

I still need to think about the best workaround and check if I can just call super().save, but this generates double logging, so I am not sure this would be backward compatible.

EDIT 1: Just testing in a notebook seems to work fine. However, running the full test suite gives a Windows fatal exception: stack overflow, and I can't even locate the exact failing test. Still investigating.

@Galileo-Galilei
Member

I am very annoyed about this.

I understand what is going on, and I can't find a way to make the change backward compatible with kedro-mlflow, unless I duplicate a huge amount of code to define MlflowArtifactDataset with a condition on the Kedro version once this is released, and I really don't like it.

Context

Anyone who has read the code of MlflowArtifactDataset may have been very surprised, because it is very different from a standard dataset. I'll try to explain the rationale below.

To create a dataset that logs automatically to mlflow on save, several designs were envisioned:

  • naive: create a dataset for each underlying format and ask users to replace their original code with my implementation: MLflowArtifact<CSV/Parquet/Whatever>Dataset. This is definitely a no-go because the maintenance burden would be increasingly high, I could never ensure full consistency with kedro-datasets, and it would not support custom datasets.

  • hook based: something similar to this

    my_data:
        type: pandas.CSVDataset
        filepath: /path/to/data.csv
        metadata:
            kedro-mlflow:
                log_as_artifact: true

    I never really considered this, for an obvious reason: the metadata key did not exist back then :). However, I don't like this
    design because it ties the dataset closely to the catalog and the session: the logging would be triggered in an
    after_node_run hook. As a consequence:

    • it does not work interactively (e.g. in a notebook), or if you use the catalog as a standalone component
    • it makes testing much harder
    • it is less backward compatible and stable over time because it is impacted by session object modifications
  • method decorators: in old Kedro versions (<=0.15), I had the possibility to decorate dataset methods with something like this (I don't remember the exact syntax, and I can't find docs before 0.18, which is likely something we should fix):

    my_data:
        type: pandas.CSVDataset
        filepath: /path/to/data.csv
        decorator: 
        - save: kedro_mlflow.log_in_mlflow  # a custom function

    This was not very flexible and was replaced with hooks, but it is very useful in this specific situation.
    I decided to emulate this behaviour with a class:

    my_data:
        type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
        dataset:
            type: pandas.CSVDataset
            filepath: /path/to/data.csv

    The syntax is very simple and compatible with all kedro versions (until this PR ;-) ) because it only assumes the existence of _load and _save.

How does it work

The key idea of MlflowArtifactDataset is to use the __new__ method to fake inheritance: we create, on the fly, a new class that inherits from the target dataset (CSVDataset in the above example) and decorates its _save method.

Finally, I create an instance of this new class which will be the one accessible for the end user.

If I modify the _load and _save methods of this fake class to load and save, it works smoothly with the above PR; however, this is not backward compatible, because old dataset implementations without the changes in this PR will break (I'll modify their original load and save, which is not what I want to do).
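The "fake inheritance via __new__" pattern described above can be sketched as follows (all names here are illustrative stand-ins, not the actual kedro-mlflow implementation):

```python
# Sketch: build a class on the fly that inherits from the target dataset
# and decorates its _save so the artifact is also "logged".

logged_artifacts = []  # stand-in for logging to the MLflow server


class CSVDatasetSketch:  # stand-in for a real dataset like CSVDataset
    def __init__(self):
        self.stored = None

    def _save(self, data):
        self.stored = data

    def save(self, data):
        return self._save(data)


class ArtifactWrapperSketch:
    def __new__(cls, dataset_type):
        # Create a subclass of the target dataset whose _save also logs.
        class WrappedDataset(dataset_type):
            def _save(self, data):
                super()._save(data)            # original save behaviour
                logged_artifacts.append(data)  # then log the artifact

        # Return an *instance* of the new class: that is what the user gets.
        return WrappedDataset()


ds = ArtifactWrapperSketch(CSVDatasetSketch)
ds.save("companies.csv")
# ds.stored is now "companies.csv", and the artifact was "logged"
```

Because __new__ returns an instance of a different class, the wrapper object behaves exactly like the wrapped dataset from the user's point of view.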

@deepyaman
Member Author

If I modify the _load and _save methods of this fake class to load and save, it works smoothly with the above PR; however, this is not backward compatible, because old dataset implementations without the changes in this PR will break (I'll modify their original load and save, which is not what I want to do).

If I understand your comments and the line of code linked above correctly, the issue is that:

  1. It's looking for the _load and _save methods on the dataset.
  2. It's implementing its own wrappers on _load and _save.

Ideally, your wrapper class should implement _load_wrapper and _save_wrapper (e.g. see how AbstractVersionedDataset has redefined it).

Would it be sufficient if, for now, we didn't remove the _load or _save method on existing datasets? In this PR, we alias existing _load and _save to load and save automatically, so any datasets that still define the "old" versions will work fine:

        if hasattr(cls, "_load") and not cls._load.__qualname__.startswith("Abstract"):
            cls.load = cls._load  # type: ignore[method-assign]

        if hasattr(cls, "_save") and not cls._save.__qualname__.startswith("Abstract"):
            cls.save = cls._save  # type: ignore[method-assign]

As a result, we can hold off on making these renames to a future Kedro version (0.20.0?), as well as maintain the existing guidance of defining _load and _save (as opposed to load and save) in the docs.

@deepyaman
Member Author

I understand what is going on, and I can't find a way to make the change backward compatible with kedro-mlflow, unless I duplicate a huge amount of code to define MlflowArtifactDataset with a condition on the Kedro version once this is released, and I really don't like it.

@Galileo-Galilei With this change, you can detect whether the logic is wrapped (like in the current load) or not wrapped (as in the current _load) from the __loadwrapped__ or __savewrapped__ attribute. For example:

                cls.load
                if not getattr(cls.load, "__loadwrapped__", False)
                else cls.load.__wrapped__

can be used to get the underlying load logic. This can be used to help make things backwards-compatible, I think.
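A runnable sketch of what that detection looks like from a plugin's side, assuming (as described above) that the wrapping uses functools.wraps — so __wrapped__ points at the original function — and sets a __loadwrapped__ flag; the wrapper below is a simulation, not the Kedro source:

```python
import functools

# Simulate the base-class wrapping: the public load is a wrapper that
# carries __loadwrapped__ = True and, via functools.wraps, __wrapped__.

def _wrap(func):
    @functools.wraps(func)  # sets wrapper.__wrapped__ = func
    def wrapper(self):
        return func(self)

    wrapper.__loadwrapped__ = True
    return wrapper


class MyDataset:
    def load(self):
        return "raw data"

    load = _wrap(load)


# Plugin-side: recover the underlying load logic whether or not it is wrapped.
underlying = (
    MyDataset.load
    if not getattr(MyDataset.load, "__loadwrapped__", False)
    else MyDataset.load.__wrapped__
)

print(underlying(MyDataset()))  # raw data
```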

@deepyaman deepyaman mentioned this pull request Jul 22, 2024
@deepyaman
Member Author

@Galileo-Galilei I went ahead and reverted the _load -> load and _save -> save renames for core datasets (the backwards-compatible "legacy" path still exposes them publicly). I think this should prevent breaking Kedro-MLflow's use of the internal attributes for now. If that works, I think we should be able to get this in.

In the meantime, we can find a solution for Kedro-MLFlow to look at load.__wrapped__ (if defined), and start removing the legacy _load/_save methods.

@Galileo-Galilei
Member

Hi @deepyaman, I finally got a chance to take a stab at it, sorry for the delay and thanks for your patience 🙏

✅ It's good to go ...

I've tested interactively and with the CLI, for old kedro versions and this branch, with and without underscored methods, and it seems to work in all cases 🥳 I need to make a release to support the new version though. Basically, I just need to replace super()._load() with super().load() if getattr(cls.load, "__loadwrapped__", False) else super()._load() in the wrapper class.

🚚 ... but we still need to agree about migration strategy

I'd be inclined to do the following:

  • Add a DeprecationWarning to _load and _save to suggest the new behaviour to users
  • Update the documentation to remove the _load and _save examples, but keep a warning admonition somewhere (to avoid blowing users' minds if they find working legacy code with underscored methods and no mention anywhere in the docs nor in the code 🤯)
  • Definitely remove it in 0.20, and I'll break kedro-mlflow in the meantime so users have to pay the migration cost once and for all
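The first migration step could look something like this — a hedged sketch under the assumption that the warning is emitted from __init_subclass__ at class-definition time; this is an illustration, not the actual Kedro code:

```python
import warnings

class AbstractDatasetSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Warn once, at class-definition time, for each legacy method.
        for old_name in ("_load", "_save"):
            if old_name in cls.__dict__:
                warnings.warn(
                    f"{cls.__name__}.{old_name} is deprecated; "
                    f"define {old_name.lstrip('_')} instead.",
                    DeprecationWarning,
                    stacklevel=3,
                )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class LegacyDataset(AbstractDatasetSketch):
        def _load(self):
            return 42

# Exactly one DeprecationWarning was recorded, for the legacy _load.
```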

Bonus: code for interactive tests

from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataset
from kedro_mlflow.io.artifacts import MlflowArtifactDataset

# test 1 : new way, without underscored methods


class MyOwnDataset_without_underscore_methods(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    # _load = load
    # _save = save

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds1 = MyOwnDataset_without_underscore_methods(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)

companies = ds1.load()


mlflow_ds1 = MlflowArtifactDataset(
    dataset=dict(
        type=MyOwnDataset_without_underscore_methods,
        filepath=(Path(__file__).parents[1] / "data/01_raw/truc.csv").as_posix(),
    ),
    artifact_path="truc",
)


mlflow_ds1.save(companies)


# test 2 : old way


class MyOwnDataset_with_underscore_methods(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    # _load = load
    # _save = save

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds = MyOwnDataset_with_underscore_methods(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)

companies = ds.load()


mlflow_ds2 = MlflowArtifactDataset(
    dataset=dict(
        type=MyOwnDataset_with_underscore_methods,
        filepath=(Path(__file__).parents[1] / "data/01_raw/truc.csv").as_posix(),
    ),
    artifact_path="truc",
)


mlflow_ds2.save(companies)

@Galileo-Galilei Galileo-Galilei self-requested a review July 25, 2024 21:07
@deepyaman
Member Author

@Galileo-Galilei Thanks for taking another look!

I need to make a release to support the new version though. Basically, I just need to replace super()._load() with super().load() if getattr(cls.load, "__loadwrapped__", False) else super()._load() in the wrapper class.

Do you need this even if nobody defines a public load? I thought _load should always be available for you after my most-recent commits, until people start defining load and not _load.

🚚 ... but we still need to agree about migration strategy

I'd be inclined to do the following:

  • Add a DeprecationWarning to _load and _save to suggest the new behaviour to users
  • Update the documentation to remove the _load and _save examples, but keep a warning admonition somewhere (to avoid blowing users' minds if they find working legacy code with underscored methods and no mention anywhere in the docs nor in the code 🤯)
  • Definitely remove it in 0.20, and I'll break kedro-mlflow in the meantime so users have to pay the migration cost once and for all

Since I've reverted the guidance to use _load and _save in the docs, I think we can figure out migration strategy before you break Kedro-MLflow; for now, anybody looking at existing datasets as an example, or reading the docs, should not get the idea to define public load and save themselves.

I think we need to wait to issue a deprecation warning until Kedro-Datasets is updated; don't want people getting deprecation warnings they can't do anything about.

I'm good with the goal of removing it in 0.20, though!

@Galileo-Galilei
Member

Galileo-Galilei commented Jul 27, 2024

Do you need this even if nobody defines a public load? I thought _load should always be available for you after my most-recent commits, until people start defining load and not _load

My first code example shows that people can now define a public load instead of _load, and hasattr(ds1, "_load") returns False. I need this line of code in case people start defining only the public methods.

for now, anybody looking at existing datasets as an example, or reading the docs, should not get the idea to define public load and save themselves.

Once again, I don't understand: suggesting people define load and save is the very goal of this PR, isn't it? We want people to start defining the public methods themselves and not the private ones; all the above discussion is about backward compatibility, not about what we want users to lean towards.

I think we need to wait to issue a deprecation warning until Kedro-Datasets is updated; don't want people getting deprecation warnings they can't do anything about.

Very valid point, totally agree 👍

@deepyaman
Member Author

Once again, I don't understand: suggesting people define load and save is the very goal of this PR, isn't it? We want people to start defining the public methods themselves and not the private ones; all the above discussion is about backward compatibility, not about what we want users to lean towards.

The reason I started looking into this was actually to resolve #2199, which is achieved even without explicitly defining load and save. This is achieved already by the "legacy" path handling in this PR.

Allowing people to define public load and save was a nice result of the solution I landed on. :) I 100% think we should move towards that; I just want to cause the fewest breakages possible. If this means giving Kedro-MLflow some more time before the breaking change, I'm open to it. (I'm also happy if you want to rip off the bandaid and do the breaking change sooner, up to you!)

@Galileo-Galilei
Member

Galileo-Galilei commented Jul 29, 2024

I don't consider the current solution a breaking change from the kedro-mlflow point of view: when kedro adds a feature, I have to release a minor version to support it. Since we already enable people to define only load and save, and this is the desirable behaviour, I think we should already emphasize in the documentation that it is "the kedro way" to define a custom dataset, just to ensure "a path to the least breakage" in 0.20 when we enforce it. Documenting it immediately is also a good way to inform people, since we don't want to add the deprecation warning yet.

Completely happy to merge it as is 🚀, and postpone removing _load and _save from the abstract class to 0.20, so I'll break kedro-mlflow simultaneously. We should open an issue to keep track of it!

Member

@Galileo-Galilei Galileo-Galilei left a comment

I really think we should already document that users should define only load and save; otherwise, it's good to go 🚀


@deepyaman deepyaman merged commit 52458c2 into main Jul 29, 2024
41 checks passed
@deepyaman deepyaman deleted the feat/render-concrete-types branch July 29, 2024 21:40
@deepyaman
Member Author

I really think we should already document that users should define only load and save; otherwise, it's good to go 🚀

Sorry, this was on auto-merge. 😅

I will raise a PR for this. Let me also see if I can't take a quick look at Kedro-MLflow later.

rashidakanchwala added a commit to kedro-org/kedro-viz that referenced this pull request Sep 19, 2024
This is to get Kedro-Viz (mainly kedro viz --lite) to work with Kedro 0.18 and some Kedro 0.19 versions, based on changes this PR (kedro-org/kedro#3920) introduced.

In July, Kedro made the _save and _load methods public. At that time, Kedro-Viz did not rely on these methods. However, when we recently merged kedro viz --lite, we introduced an UnavailableDataset class, which is an AbstractDataset. This class now uses the public load and save methods. To maintain backward compatibility with older versions of the dataset, we followed a suggestion made by @deepyaman:

class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save
Originally posted by @deepyaman in kedro-org/kedro#3920 (comment)

Successfully merging this pull request may close these issues.

Render concrete parameter/return types for dataset save/load