Re-design io.core and io.data_catalog #1778
@noklam also commented that we should consider what actually belongs to [...], and the following is repeated 37 times:

Is there anything we can do to make it easier to define a custom dataset? e.g. why is [...]?

Adding this while I am looking at #1768 and some object store related issues. Currently, we actually put this into `exists_function=self._fs.exists, glob_function=self._fs.glob` (Line 537 in f491420). A potential solution for #1768 may be passing some arguments into this line, but there is no easy way to pass in any arguments. I am also not sure how [...]
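A minimal sketch of what defining a custom versioned dataset involves today, to make the repeated boilerplate concrete. The class name `MyCustomDataSet` and the raw-bytes `_load`/`_save` bodies are invented for illustration; only the `exists_function`/`glob_function` wiring mirrors the snippet quoted above:

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec

from kedro.io.core import (
    AbstractVersionedDataSet,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)


class MyCustomDataSet(AbstractVersionedDataSet):
    """Illustrative dataset that just reads and writes raw bytes."""

    def __init__(self, filepath: str, version: Version = None):
        protocol, path = get_protocol_and_path(filepath, version)
        self._protocol = protocol
        # There is no easy way to pass storage_options/credentials through here
        # without adding yet another constructor argument by hand.
        self._fs = fsspec.filesystem(protocol)
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,  # the repeated boilerplate
            glob_function=self._fs.glob,      # discussed in the comment above
        )

    def _load(self) -> bytes:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, "rb") as f:
            return f.read()

    def _save(self, data: bytes) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, "wb") as f:
            f.write(data)

    def _describe(self) -> Dict[str, Any]:
        return {
            "filepath": self._filepath,
            "version": self._version,
            "protocol": self._protocol,
        }
```

Almost all of this is ceremony around the base class rather than logic specific to the dataset, which is presumably why it ends up copy-pasted across implementations.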
A really important issue IMO and a great write-up. I think the class [...]
Related: #1936
Added "Expose load version information when load_version=None". Another symptom here is: [...]
Notes from Technical Design:

- Catalog API
- Datasets API

A related exercise is to completely re-design how the catalog and datasets work: #1981
One more thing (yes, I think about this issue and #1936 several times a day, every single day):

I really love this design, and I clearly see how [...]. But it turns out users want the underlying datasets for all sorts of things. Just today I had two users ask me how they could access the underlying dataset object for various "wicked" uses of Kedro (one was related to dynamic pipelines and the other to kedro-mlflow). I seem to always forget about the [...]. But if you think only users who should know better are using this protected method, hold your horses, because our own kedro-viz does, too!

So I think we should definitely explore the possibility of "opening" this abstraction.
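A hedged illustration of that tension (the dataset name is invented, and I am going from memory that namespaced names get their dots replaced with double underscores in `catalog.datasets`):

```python
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog({"france.companies": MemoryDataSet(data=[1, 2, 3])})

# The protected route that users (and kedro-viz) end up reaching for:
ds_protected = catalog._get_dataset("france.companies")

# The "public" route: the namespaced name is exposed with "." munged to "__",
# so plain attribute access doesn't work and you fall back to getattr:
ds_public = getattr(catalog.datasets, "france__companies")

print(ds_protected is ds_public)  # same underlying object, two very different APIs
```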
I am glad this is still on the radar because it also still pops into my head on a daily basis 😀 Just to add some more context and another couple of data points...

Hiding datasets does indeed make sense when using kedro as a framework, but it makes life extremely difficult when it comes to writing plugins/extensions/integrations with kedro. There are IMO many non-nefarious reasons to want to pull dataset information from the catalog (especially now there's the [...]). I always recommend [...]
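As one more concrete data point, this is roughly what a plugin or integration ends up writing today to pull dataset information out of the catalog; everything past `catalog.list()` relies on protected members, and the dataset name is made up:

```python
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog({"companies": MemoryDataSet(data=[1, 2, 3])})

for name in catalog.list():
    if name == "parameters" or name.startswith("params:"):
        continue  # parameters live in the catalog too (added via add_feed_dict)
    dataset = catalog._get_dataset(name)   # protected
    details = dataset._describe()          # protected
    print(name, type(dataset).__name__, details)
```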
Another flaw of the current inheritance model is that if users want to configure a different versioning strategy for their datasets, they have to create a custom dataset class.
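A hedged sketch of what that looks like in practice, assuming one overrides the protected version-resolution hooks inherited from `AbstractVersionedDataSet` (the mixin name and the integer scheme are invented, and the real implementation caches these lookups, which this sketch ignores):

```python
from pathlib import PurePosixPath


class SequentialVersionMixin:
    """Resolve versions as incrementing integers instead of timestamps.

    Intended to be mixed into an AbstractVersionedDataSet subclass, which
    provides self._glob_function and self._filepath.
    """

    def _fetch_latest_load_version(self) -> str:
        # Versioned datasets store data under <filepath>/<version>/...,
        # so glob the version directories and pick the highest integer.
        candidates = self._glob_function(str(self._filepath / "*"))
        numbers = [
            int(PurePosixPath(c).name)
            for c in candidates
            if PurePosixPath(c).name.isdigit()
        ]
        if not numbers:
            raise ValueError(f"No versions found for {self._filepath}")
        return str(max(numbers))

    def _fetch_latest_save_version(self) -> str:
        try:
            return str(int(self._fetch_latest_load_version()) + 1)
        except ValueError:
            return "1"


# And this has to be repeated for every dataset type you want to version this way:
# class SequentialCSVDataSet(SequentialVersionMixin, CSVDataSet): ...
```

The versioning strategy is baked into the class hierarchy rather than being configurable on the catalog, which is the flaw described above.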
Spun out of #1691 (comment)... Let's collect ideas here on what the current problems are with `io`. To me it feels like we've neglected it and it's ripe for a re-design.

#1691 and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of `io.core` and `io.data_catalog` are very confusing and should be rethought in general. These are very old components in kedro and maybe some of the decisions that were originally made about their design should be revised. I think there are also very likely to be old bits of code there that could now be removed or renamed (e.g. who would guess that something named `add_feed_dict` is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.

I don't think they're massively wrong as it stands, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:

- `catalog.datasets` is presumably the "official" route to get a dataset rather than `_get_dataset`, but `catalog.datasets` wouldn't allow a namespaced dataset to be accessed without doing `getattr`. There are some very subtle and non-obvious differences between `datasets` and `_get_dataset`, and then there's also `catalog._data_sets` (which I think might just be a historical leftover... but not sure). In Improve resume pipeline suggestion for SequentialRunner #1795 @jmholzer used `vars(catalog.datasets)[dataset_name]`.
- `._filepath` is only defined for versioned datasets (? seems weird).
- `get_filepath_str(self._get_load_path(), self._protocol)`, which is pretty obscure. Similar to Refactor load version logic #1654.

So I think we should look holistically at the structures involved here and work out what the API should look like so there's one, clear way to access the things that people need to access. I actually don't think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a `catalog.dumps`) or whether it's just a case of making what we already have better organised, documented and clearer.
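To make the "one clear way" point concrete, here is a hedged sketch of the routes that exist today. The dataset name and filepath are invented, and the `CSVDataSet` import path is the `kedro.extras` one from around the time of this issue, so it may differ in newer releases:

```python
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog
from kedro.io.core import get_filepath_str

catalog = DataCatalog({"companies": CSVDataSet(filepath="data/01_raw/companies.csv")})
catalog.add_feed_dict({"parameters": {"test_size": 0.2}})  # this is how parameters get in

# At least three ways to reach the same dataset object:
ds = catalog._get_dataset("companies")       # protected
ds = catalog.datasets.companies              # attribute access; breaks for namespaced names
ds = vars(catalog.datasets)["companies"]     # the workaround used in #1795

# And the filepath needs either a protected attribute or the obscure
# incantation quoted above:
filepath = str(ds._filepath)
filepath = get_filepath_str(ds._get_load_path(), ds._protocol)
print(filepath)
```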