The road to DataSets 1.0 #43

Open
1 of 15 tasks
c42f opened this issue May 6, 2022 · 1 comment

Comments


c42f commented May 6, 2022

Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.

  • Streamline access for small datasets by providing a "high level" API for use when working with a fully in-memory representation of the data which doesn't require the management of separate resources. ("Separate resources" would be things like managing an on-disk cache of the data or incremental async download/upload.) Perhaps we can use the verbs load() / save() for this; thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
  • Somehow allow load() and save() to return some "default type the user cares about" for convenience. For example, returning a DataFrame for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17
  • Consider the fate of dataset() and open() — currently the open(dataset(...)) idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose dataset(name) to mean what open(dataset(name)) currently does?
  • Perhaps unexport DataSet? Users should rarely need to use this directly.
  • Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset: cache management, etc. We could adopt the approach from ResourceContexts.jl, for example using ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx). Or from ContextManagers.jl in the style ctx = dataset("name"); x = value(ctx); close(ctx). (Both of these have macros for syntactic shortcuts.)
  • Improve and formalize the BlobTree API
  • Figure out how we can integrate with FilePathsBase and whether there's a type which can implement the AbstractPath interface well enough to allow things like CSV.read(x) to work for some x. Perhaps we need a DataSpecification type for the URI-like concept currently called "dataspec" in the codebase? We could have CSV.read(data"foo?version=2#a/b")?
  • Consider deprecating and removing the "data entry point" stuff @datarun and @datafunc. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
  • Fix some issues with Data.toml
    • Consider representing [datasets] section as a dictionary mapping names to configs, not as an array with name properties. This is safe because TOML syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given DataSet is specifically tied to a project.)
    • Move data storage driver type outside of the storage section?
    • Fix up the mess with @__DIR__ templating somehow (fixed in DataSet configuration #46)
  • Dataset resolution
    • Rename DataSets.PROJECT to DataSets.PROJECTS if this is always a StackedDataProject.
    • Consider whether we really want a data stack vs how "data authorities" could perhaps work (ie, the authority section in the URI; eg, juliahub.com)
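
To make the Data.toml item above concrete, here is a rough sketch of the two layouts for the [datasets] section. The first block follows the current array-of-tables style, where each entry carries a name property. The second is only an illustration of the dictionary idea and is not an implemented format; the dataset name, UUID, and path are placeholders.

```toml
# Current style: an array of tables, each with a `name` property.
data_config_version = 0

[[datasets]]
name = "a_text_file"
uuid = "00000000-0000-0000-0000-000000000001"  # placeholder UUID
description = "A text file"

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "@__DIR__/data/file.txt"
```

```toml
# Hypothetical dictionary style: the dataset name becomes the table key.
# Quoted keys allow arbitrary strings as names.
data_config_version = 0

[datasets."a_text_file"]
uuid = "00000000-0000-0000-0000-000000000001"  # placeholder UUID
description = "A text file"

    [datasets."a_text_file".storage]
    driver = "FileSystem"
    type = "Blob"
    path = "@__DIR__/data/file.txt"
```

In the dictionary form, duplicate names would be rejected by the TOML parser itself, since a table may only be defined once.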
@CarloLucibello

For dynamically loading Julia modules, in MLDatasets.jl we now use the @lazy import macro from LazyModules and our own @require import macro (similar to @lazy, but requiring the user to add the import to their code).
