Expose `load` and `save` publicly for each dataset #3920
Conversation
Love this
This is a brilliant idea 😍, I am jealous of not having it myself ;) I always have to explain to users who create a custom dataset that they should implement `_load` and `_save`. One small caveat is that I don't know if this would have unintended consequences downstream.
Will be great if you can check that! I did guard against wrapping datasets that inherit
I can't comment much on the implementation for now, but I've been a long advocate of making `load` and `save` public.

We should be mindful of how this affects the way people define custom datasets. Will they still need to implement `_load` and `_save`?
I'd included some information in the development notes above, but let me copy them below:

I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the following pattern to work across older Kedro versions:

```python
class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save
```

If we want to take things slow, in the next minor release of Kedro (`0.20.0`), we can start raising `DeprecationWarning`s for datasets that don't implement `save`. In `0.21.0`, we can drop support for datasets that don't define `save` (and just have the older `_save`).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.
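For illustration, a hedged sketch of how such a `DeprecationWarning` could be raised from `__init_subclass__` in a future release (a hypothetical check with toy class names, not what this PR implements):

```python
import warnings


class AbstractDataset:
    """Toy stand-in for the base class; not Kedro's actual code."""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Hypothetical 0.20.0-style check: warn when a subclass still
        # defines only the legacy private hook instead of a public save.
        if "_save" in cls.__dict__ and "save" not in cls.__dict__:
            warnings.warn(
                f"{cls.__name__} defines `_save`; define a public `save` instead.",
                DeprecationWarning,
                stacklevel=2,
            )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class LegacyDataset(AbstractDataset):
        def _save(self, data):
            pass

# A legacy-style subclass triggers the warning at class-creation time.
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Because `__init_subclass__` runs when the subclass is created, the warning fires once per legacy dataset class rather than on every `save` call.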
This change should not affect Kedro-Viz, as we don't use any private dataset methods.
Hi, sorry, there is something I don't get. I am trying to do this:

```python
from pathlib import Path

import pandas as pd

from kedro.io import AbstractDataset


class MyOwnDataset(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds = MyOwnDataset(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)
ds.load()
```

And I am getting the error:

```
TypeError: Can't instantiate abstract class MyOwnDataset with abstract methods _load, _save
```

I've double-checked and I have the correct development version installed. While debugging, I am entering the `__init_subclass__` hook.
```python
def __init_subclass__(cls, **kwargs: Any) -> None:
    super().__init_subclass__(**kwargs)

    if hasattr(cls, "load") and not cls.load.__qualname__.startswith("Abstract"):
```
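The `__qualname__` guard on this line can be illustrated in isolation: it distinguishes a method inherited from the abstract base from one the subclass actually defined (a minimal sketch with toy classes, not Kedro's real hierarchy):

```python
class AbstractDataset:
    def load(self):
        raise NotImplementedError


class MyDataset(AbstractDataset):
    def load(self):
        return 42


class Untouched(AbstractDataset):
    pass


# Inherited methods keep the defining class's __qualname__, so the
# "Abstract" prefix reveals whether a subclass overrode `load`.
print(MyDataset.load.__qualname__)   # MyDataset.load
print(Untouched.load.__qualname__)   # AbstractDataset.load
print(Untouched.load.__qualname__.startswith("Abstract"))  # True
```

This is why a dataset named, say, `AbstractishDataset` could in principle confuse the check, which is the edge case discussed in the review comments below.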
The test on `__qualname__.startswith` exists in `MlflowArtifactDataset`, but I think there is no test on `MlflowAbstractDataset` that could conflict, so hopefully we are fine 🤞
If there is an issue, we can try to make it more robust (see #3920 (comment)); I think for now, if it works, it's a reasonable implementation until we see more cases.
It's good for now, let's go for it!
Edit: Actually, thinking about it more, we should just make
Signed-off-by: Deepyaman Datta <[email protected]>
Still need to deep-dive into the implementation, but unfortunately this breaks kedro-mlflow. I still need to think about the best workaround and check if I can just call the public methods instead.

EDIT 1: Just testing in a notebook seems to work fine. However, running the full test suite gives an error.
I am very annoyed about this. I understand what is going on, and I can't find a way to make the change backward compatible with kedro-mlflow, unless I make a huge code duplication to define `MlflowArtifactDataset` with a condition on the Kedro version once this is released, and I really don't like it.

Context

If one has read the code of `MlflowArtifactDataset`, the goal is to create a dataset that logs automatically in mlflow on `save`.
How does it work

The key idea of `MlflowArtifactDataset` is to dynamically create a new class that subclasses the user's dataset and overrides its `_save` method to also log the artifact to mlflow. Finally, I create an instance of this new class, which will be the one accessible to the end user. If I modify the private `_save` in this dynamically created subclass, the new public `save` aliasing no longer picks up my override.
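The wrapping idea can be sketched with a toy example; the names below are illustrative stand-ins, and the real `MlflowArtifactDataset` factory is considerably more involved (and calls mlflow rather than appending to a list):

```python
class CsvDataset:
    """Toy user dataset: public save delegates to the private hook."""

    def __init__(self):
        self.store = None

    def _save(self, data):
        self.store = data

    def save(self, data):
        self._save(data)


def make_logged_dataset(dataset_cls):
    """Build a subclass at runtime whose _save also 'logs' the artifact."""
    log = []

    def _save(self, data):
        super(wrapped, self)._save(data)
        log.append(data)  # stand-in for mlflow.log_artifact(...)

    wrapped = type(f"Mlflow{dataset_cls.__name__}", (dataset_cls,), {"_save": _save})
    instance = wrapped()
    instance.log = log  # expose the fake log so we can inspect it
    return instance


ds = make_logged_dataset(CsvDataset)
ds.save("rows")
print(ds.store, ds.log)  # rows ['rows']
```

The breakage described above happens when the base class stops routing `save` through `_save`: the dynamically injected `_save` override is then silently bypassed.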
If I understand your comments and the line of code linked above correctly, the issue is that your dynamically created subclass overrides the private `_save`, but the public `save` no longer routes through it.

Ideally, your wrapper class should implement the public `save`. Would it be sufficient if, for now, we didn't remove the `_load`/`_save` fallback?

```python
if hasattr(cls, "_load") and not cls._load.__qualname__.startswith("Abstract"):
    cls.load = cls._load  # type: ignore[method-assign]
if hasattr(cls, "_save") and not cls._save.__qualname__.startswith("Abstract"):
    cls.save = cls._save  # type: ignore[method-assign]
```

As a result, we can hold off on making these renames until a future Kedro version (0.20.0?), as well as maintain the existing guidance of defining `_load` and `_save`.
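To illustrate why keeping this fallback preserves backward compatibility, here is a minimal runnable sketch (a toy stand-in for `AbstractDataset`, not Kedro's actual code):

```python
from typing import Any


class AbstractDataset:
    """Simplified stand-in for the base class."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Alias old-style private hooks to the new public names so that
        # datasets written against older Kedro versions keep working.
        if hasattr(cls, "_load") and not cls._load.__qualname__.startswith("Abstract"):
            cls.load = cls._load
        if hasattr(cls, "_save") and not cls._save.__qualname__.startswith("Abstract"):
            cls.save = cls._save


class OldStyleDataset(AbstractDataset):
    def _load(self) -> str:
        return "data"

    def _save(self, data: str) -> None:
        self.last_saved = data


ds = OldStyleDataset()
print(ds.load())  # data -- the public name resolves to the old private hook
ds.save("rows")
print(ds.last_saved)  # rows
```

A dataset that only defines `_load`/`_save` still exposes working public `load`/`save`, so existing custom datasets need no changes.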
@Galileo-Galilei With this change, you can detect whether the logic is wrapped (like in the current implementation), and

```python
cls.load
if not getattr(cls.load, "__loadwrapped__", False)
else cls.load.__wrapped__
```

can be used to get the underlying load logic. This can be used to help make things backwards-compatible, I think.
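A runnable sketch of that unwrapping idiom; the `Dataset` class and `_wrap_load` decorator here are illustrative stand-ins, with `functools.wraps` providing the `__wrapped__` attribute:

```python
import functools


def _wrap_load(func):
    @functools.wraps(func)  # sets wrapper.__wrapped__ = func
    def wrapper(self):
        return func(self)

    wrapper.__loadwrapped__ = True  # marker checked by downstream code
    return wrapper


class Dataset:
    @_wrap_load
    def load(self):
        return "payload"


# A plugin can recover the user's original load logic regardless of wrapping:
raw = (
    Dataset.load
    if not getattr(Dataset.load, "__loadwrapped__", False)
    else Dataset.load.__wrapped__
)
print(raw(Dataset()))  # payload
```

The `getattr` default means unwrapped methods pass through untouched, so the same expression works on both wrapped and plain datasets.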
@Galileo-Galilei I went ahead and reverted the documentation changes recommending public `load` and `save` for now. In the meantime, we can find a solution for Kedro-MLflow to look at the unwrapped methods.
Hi @deepyaman, I finally got a chance to take a stab at it, sorry for the delay and thanks for your patience 🙏

✅ It's good to go

I've tested interactively and with the CLI, for old Kedro versions and this branch, with and without underscored methods, and it seems to work in all cases 🥳 I need to make a release to support the new version though. Basically, I just need to replace the private `_load`/`_save` calls with their public counterparts.

🚚 ... but we still need to agree about a migration strategy

I'd be inclined to do the following:
Bonus: code for interactive tests

```python
from pathlib import Path

import pandas as pd

from kedro.io import AbstractDataset
from kedro_mlflow.io.artifacts import MlflowArtifactDataset


# test 1 : new way without save and load
class MyOwnDataset_without_underscore_methods(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    # _load = load
    # _save = save

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds1 = MyOwnDataset_without_underscore_methods(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)
companies = ds1.load()

mlflow_ds1 = MlflowArtifactDataset(
    dataset=dict(
        type=MyOwnDataset_without_underscore_methods,
        filepath=(Path(__file__).parents[1] / "data/01_raw/truc.csv").as_posix(),
    ),
    artifact_path="truc",
)
mlflow_ds1.save(companies)


# test 2 : old way
class MyOwnDataset_with_underscore_methods(AbstractDataset):
    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    # _load = load
    # _save = save

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(param1=self._filepath)


ds = MyOwnDataset_with_underscore_methods(
    filepath=(Path(__file__).parents[1] / "data/01_raw/companies.csv").as_posix(),
)
companies = ds.load()

mlflow_ds2 = MlflowArtifactDataset(
    dataset=dict(
        type=MyOwnDataset_with_underscore_methods,
        filepath=(Path(__file__).parents[1] / "data/01_raw/truc.csv").as_posix(),
    ),
    artifact_path="truc",
)
mlflow_ds2.save(companies)
```
@Galileo-Galilei Thanks for taking another look!
Do you need this even if nobody defines a public `load` or `save`?
Since I've reverted the guidance to use public `load` and `save`, I think we need to wait to issue a deprecation warning until Kedro-Datasets is updated; I don't want people getting deprecation warnings they can't do anything about. I'm good with the goal of removing it in 0.20, though!
My first code example shows that people now can define a public `load` and `save`.
Once again, I don't understand: suggesting people define `_load` and `_save` when the public `load` and `save` are what gets called seems confusing.
Very valid point, totally agree 👍
The reason I started looking into this was actually to resolve #2199, which is achieved even without explicitly defining public methods. Allowing people to define public `load` and `save` is a bonus on top.
I don't consider the current solution to be a breaking change from the user's perspective. Completely happy to merge it as is 🚀, and postpone removing the `_load`/`_save` fallback.
I really think we should already document that users should define only `load` and `save`, otherwise it's good to go 🚀
```python
def __init_subclass__(cls, **kwargs: Any) -> None:
    super().__init_subclass__(**kwargs)

    if hasattr(cls, "load") and not cls.load.__qualname__.startswith("Abstract"):
```
It's good for now, let's go for it!
Sorry, this was on auto-merge. 😅 I will raise a PR for this. Let me also see if I can't take a quick look at Kedro-MLflow later.
This is to get Kedro-Viz (mainly `kedro viz --lite`) to work with Kedro 0.18 and some Kedro 0.19 versions, based on changes this PR (kedro-org/kedro#3920) introduced. In July, Kedro made the `_save` and `_load` methods public. At that time, Kedro-Viz did not rely on these methods. However, when we recently merged `kedro viz --lite`, we introduced an `UnavailableDataset` class, which is an `AbstractDataset`. This class now uses the public `load` and `save` methods. To maintain backward compatibility with older versions of the dataset, we followed a suggestion made by @deepyaman:

```python
class MyDataset(...):
    def _load(...) -> ...:
        ...

    load = _load

    def _save(...) -> ...:
        ...

    save = _save
```

Originally posted by @deepyaman in kedro-org/kedro#3920 (comment)
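Reproduced as a runnable sketch (with toy base classes standing in for the old and new `AbstractDataset` behavior), the class-level alias makes the same dataset work under either base:

```python
class OldBase:
    """Pre-change base class: public load() delegates to the private hook."""

    def load(self):
        return self._load()


class NewBase:
    """Post-change base class: load() is expected to be public."""


class MyDataset(OldBase):  # works identically with NewBase as the parent
    def _load(self):
        return "ok"

    load = _load  # class-level alias: both names point at the same function


print(MyDataset().load())  # ok under either base class
```

With `OldBase`, the base's `load` is shadowed by the alias; with `NewBase`, the alias itself provides the public method, so no version check is needed.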
Description
Resolves #2199
In discussing #2199, the question arose why `_load` and `_save` need to be private methods. This PR exposes them publicly, choosing to "decorate" or wrap the load and save functionality for each dataset, rather than calling them as private methods from the public base implementation.

The changes are made so as to require minimal changes to existing datasets and be backwards-compatible with custom datasets users may have written.
Development notes
Update (6/27): It's no longer necessary to write `load = _load`, `save = _save`; this is handled in `__init_subclass__`. That said, we can still roll out enforcement of users providing `load` and `save` as described below, as desired. It's actually much less pressing, but may help simplify the code a bit. :)

Previously on this PR...
I have updated the documentation on extending a dataset, but I have intentionally not explained that "core" datasets (in Kedro and Kedro-Datasets) should use the `load = _load` / `save = _save` aliasing pattern (shown at the top of this thread) to work across older Kedro versions.
If we want to take things slow, in the next minor release of Kedro (`0.20.0`), we can start raising `DeprecationWarning`s for datasets that don't implement `save`. In `0.21.0`, we can drop support for datasets that don't define `save` (and just have the older `_save`).

The reason I don't want to introduce this in the docs is that it will confuse new users who want to write local datasets, and it's a very small change to make before maintainers add datasets to the core.
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- Added a description of this change to the `RELEASE.md` file