Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.
Streamline access for small datasets by providing a "high level" API for working with a fully in-memory representation of the data, one which doesn't require managing separate resources. ("Separate resources" would be things like an on-disk cache of the data or incremental async download/upload.) Perhaps we can use the verbs load() / save() for this. Thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
Somehow allow load() and save() to return some "default type the user cares about" for convenience. For example, returning a DataFrame for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17.
Consider the fate of dataset() and open() — currently the open(dataset(...)) idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose dataset(name) to mean what open(dataset(name)) currently does?
Perhaps unexport DataSet? Users should rarely need to use this directly.
Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset (cache management, etc.). We could adopt the approach from ResourceContexts.jl, for example using ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx). Or the one from ContextManagers.jl, in the style ctx = dataset("name"); x = value(ctx); close(ctx). (Both of these have macros for syntactic shortcuts.)
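To make the first style concrete, here is a minimal self-contained sketch of the "explicit context" idea. It does not use ResourceContexts.jl or DataSets.jl: ToyContext, defer! and toy_dataset are hypothetical stand-ins, and the "download" is just a temporary file. It only illustrates how a context object can own the cleanup of resources that back a dataset.

```julia
# Hypothetical stand-in for a resource context (NOT the ResourceContexts.jl type):
# it records cleanup callbacks and runs them, last-in first-out, when closed.
struct ToyContext
    cleanups::Vector{Any}
end
ToyContext() = ToyContext(Any[])

# Register a cleanup action to run when the context is closed.
defer!(ctx::ToyContext, f) = (push!(ctx.cleanups, f); nothing)

function Base.close(ctx::ToyContext)
    while !isempty(ctx.cleanups)
        pop!(ctx.cleanups)()   # run the most recently registered cleanup first
    end
end

# Hypothetical analogue of `dataset(ctx, "name")`: pretend to download the data
# into a cache file whose lifetime is tied to the context.
function toy_dataset(ctx::ToyContext, name::AbstractString)
    path = tempname()
    write(path, "contents of $name")
    defer!(ctx, () -> rm(path; force=true))
    return path
end

ctx = ToyContext()
path = toy_dataset(ctx, "name")
text = read(path, String)   # use the resource while the context is open
close(ctx)                  # the cached file is removed here
```

In real code the close(ctx) call would normally sit in a finally block (or behind a macro) so that cleanup also runs when an error is thrown.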
Figure out how we can integrate with FilePathsBase and whether there's a type which can implement the AbstractPath interface well enough to allow things like CSV.read(x) to work for some x. Perhaps we need a DataSpecification type for the URI-like concept currently called "dataspec" in the codebase? We could have CSV.read(data"foo?version=2#a/b")?
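As a rough illustration of what a DataSpecification type and a data"..." string macro might look like, here is a self-contained sketch. Everything in it is an assumption: the struct fields, the parse_dataspec and split_once helpers, and the URI-like reading of the string as name?query#fragment are hypothetical and may not match how "dataspec" is handled in the codebase.

```julia
# Hypothetical type; not part of DataSets.jl.
struct DataSpecification
    name::String
    query::Dict{String,String}
    fragment::Union{Nothing,String}
end

# Split `s` at the first occurrence of `c`.
# The second value is `nothing` when `c` does not occur.
function split_once(s::AbstractString, c::Char)
    i = findfirst(==(c), s)
    i === nothing && return (String(s), nothing)
    return (String(s[1:prevind(s, i)]), String(s[nextind(s, i):end]))
end

# Parse an assumed "name?key=value&...#fragment" layout.
function parse_dataspec(spec::AbstractString)
    # Split off the optional "#fragment" first, then the optional "?query".
    rest, fragment = split_once(spec, '#')
    name, query_str = split_once(rest, '?')
    query = Dict{String,String}()
    if query_str !== nothing
        for kv in split(query_str, '&'; keepempty=false)
            k, v = split_once(kv, '=')
            query[k] = something(v, "")
        end
    end
    return DataSpecification(name, query, fragment)
end

# Non-standard string literal, so that data"..." builds a DataSpecification.
macro data_str(s)
    parse_dataspec(s)
end

spec = data"foo?version=2#a/b"
```

Making something like CSV.read(spec) work would additionally need the type to satisfy whatever interface CSV.jl expects of its input, which this sketch does not attempt.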
Consider deprecating and removing the "data entry point" stuff, @datarun and @datafunc. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
Fix some issues with Data.toml
Consider representing the [datasets] section as a dictionary mapping names to configs, not as an array with name properties. This is safe because TOML syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given DataSet is specifically tied to a project.)
Move data storage driver type outside of the storage section?
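The two [datasets] layouts mentioned above can be compared with Julia's TOML standard library. This is only a sketch: the array form assumes the current Data.toml looks roughly like this, the dictionary form is one possible shape of the proposal, and the dataset name, UUID and path are placeholders.

```julia
using TOML

# Array style: each dataset is an element of `datasets` and carries a `name` key.
array_style = TOML.parse("""
[[datasets]]
name = "a_text_file"
uuid = "00000000-0000-0000-0000-000000000000"

[datasets.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Dictionary style (assumed shape of the proposal): the name becomes the table key.
dict_style = TOML.parse("""
[datasets.a_text_file]
uuid = "00000000-0000-0000-0000-000000000000"

[datasets.a_text_file.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Converting array style to dictionary style: key each config by its name.
by_name = Dict(d["name"] => delete!(copy(d), "name") for d in array_style["datasets"])

# Quoted keys allow names which are not bare TOML keys.
quoted = TOML.parse("""
[datasets."name with spaces/and-slashes"]
uuid = "00000000-0000-0000-0000-000000000000"
""")
```

One consequence of the dictionary form is that duplicate names become a TOML parse error rather than something the package has to check for.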
For dynamically loading Julia modules, in MLDatasets.jl we now use the @lazy import macro from LazyModules and our own @require import macro (similar to @lazy, but requiring users to add the import to their own code).
BlobTree API
BlobTree -> FileTree #41
BlobTree: pairs or values?
basename? #42
@__DIR__ templating somehow (fixed in DataSet configuration #46)
DataSets.PROJECT to DataSets.PROJECTS if this is always a StackedDataProject.