
Flexible Backend - AbstractDataStore definition #4309

Closed
aurghs opened this issue Aug 4, 2020 · 6 comments · Fixed by #4989

aurghs commented Aug 4, 2020

I just want to do a small recap of the current proposals for the AbstractDataStore class refactor discussed with @shoyer, @jhamman, and @alexamici.

Proposal 1: Store returns:

  • xr.Variables with the list of filters to apply to every variable
  • dataset attributes
  • encodings

Xarray applies to each variable only the filters selected by the backend, then builds the xr.Dataset.

Proposal 2: Store returns:

  • xr.Variables with all needed filters applied (configured by xarray),
  • dataset attributes
  • encodings

Xarray builds the xr.Dataset.

Proposal 3: Store returns:

  • xr.Dataset
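Schematically, proposal 3 reduces the backend entry point to a single function returning a dataset. In the sketch below the function name and the stand-in class are assumptions for illustration only; a real backend would return an actual xr.Dataset:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetStandIn:
    """Placeholder for xr.Dataset, used only to keep this sketch self-contained."""
    variables: dict
    attrs: dict = field(default_factory=dict)


def open_backend_dataset(filename_or_obj, **kwargs):
    # A real backend would open the file, decode the variables,
    # and return a fully built xr.Dataset.
    return DatasetStandIn(
        variables={"dummy": [0, 1]},
        attrs={"source": str(filename_or_obj)},
    )


ds = open_backend_dataset("sample.nc")
print(ds.attrs["source"])  # sample.nc
```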

Before going on I'd like to collect pros and cons. As I understand them:

Proposal 1

pros:

  • the backend is free to decide which representation to provide.
  • more control for the backend (not necessarily true: the backend can decide to apply all the filters internally and hand xarray an empty list of filters to apply)
  • enable / disable filters logic would be in xarray.
  • all the filters (applied by xarray) should have a similar interface.
  • maybe registered filters could be used by other backends

cons:

  • confusing backend-xarray interface.
  • interfaces are more difficult to define, with more room for conflicts (registered filters with the same name...)
  • need more structure to define this interface, more code to maintain.

Proposal 2

pros:

  • the backend-xarray interface is clearer: backend and xarray have distinct, well-defined tasks.
  • interface would be minimal and easier to implement
  • no intermediate representations
  • less code to maintain

cons:

  • less control on filters.
  • more complex explicit definition of the interface (every filter must understand what decode_times means in its case)
  • more complexity inside the filters

The minimal interface would be something like this:

class AbstractBackEnd:
    def __init__(self, path, encode_times=True, ..., **kwargs):  # signature of open_dataset
        raise NotImplementedError

    def get_variables(self):
        """Return a dictionary mapping variable names to xr.Variable objects."""
        raise NotImplementedError

    def get_attrs(self):
        """Return the dataset attributes as a dictionary."""
        raise NotImplementedError

    def get_encoding(self):
        """Return the dataset encoding as a dictionary."""
        raise NotImplementedError

    def close(self):
        pass
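To make the sketch concrete, here is a toy in-memory subclass of that interface. The abstract class is restated (without the open_dataset signature) so the snippet is self-contained; returning plain numpy arrays in place of xr.Variable objects is an illustrative simplification, not part of the proposal:

```python
import numpy as np


class AbstractBackEnd:
    """Minimal interface sketch from the proposal above."""

    def get_variables(self):
        raise NotImplementedError

    def get_attrs(self):
        raise NotImplementedError

    def get_encoding(self):
        raise NotImplementedError

    def close(self):
        pass


class InMemoryBackEnd(AbstractBackEnd):
    """Toy backend serving variables from a dict; a real backend would
    return xr.Variable objects backed by lazily loaded arrays."""

    def __init__(self, variables, attrs=None):
        self._variables = variables
        self._attrs = attrs or {}

    def get_variables(self):
        return dict(self._variables)

    def get_attrs(self):
        return dict(self._attrs)

    def get_encoding(self):
        return {}


store = InMemoryBackEnd({"t2m": np.zeros((2, 3))}, attrs={"source": "toy"})
print(sorted(store.get_variables()))  # ['t2m']
```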

Proposal 3

pros w.r.t. proposal 2:

  • decode_coordinates is done by the backend, like the other filters.

cons?

Any suggestions?

@alexamici

Note that the above proposals only address the store / encoding part of the backend API. We will address the BackendArray part later, and I expect it to be trickier.


max-sixty commented Aug 5, 2020

Thanks @alexamici

I'm a bit behind here: what's an example of a filter? A selection of the data?

Edit: as discussed filter == encoding

@alexamici

We agreed with @shoyer to go with proposal 3 for reading: the backends will return a fully built xr.Dataset.

Among other advantages, this reduces the amount of backend-specific documentation, as the documentation of xr.Dataset and xr.Variable already contains almost everything backend developers need.

The one bit of documentation that needs addressing is the use of BackendArray as the data argument for xr.Variable.
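The real BackendArray lives in xarray's backends machinery; as a rough illustration of the idea (expose shape and dtype up front, defer actual reads until indexed), a stand-in class might look like the following. The class name and the in-memory "file" are assumptions for this sketch:

```python
import numpy as np


class LazyFileArray:
    """Stand-in for a BackendArray subclass: exposes shape/dtype and
    defers data access until indexed. Here an in-memory numpy array
    simulates the file; a real backend would hold a file handle."""

    def __init__(self, backing):
        self._backing = backing  # pretend this is an open file
        self.shape = backing.shape
        self.dtype = backing.dtype

    def __getitem__(self, key):
        # A real backend would read only the requested slice from disk,
        # which is what makes lazy loading and chunked access possible.
        return self._backing[key]


arr = LazyFileArray(np.arange(12).reshape(3, 4))
print(arr.shape, arr[1, 2])  # (3, 4) 6
```

An instance like this could then be passed as the data argument when the backend constructs each xr.Variable, which is exactly the documentation gap mentioned above.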


alexamici commented Sep 2, 2020

@shoyer and @jhamman, I'm looking into the write support, and if we let a backend return an xr.Dataset, as agreed, we lose the ability to support in-place changes to a file (updating attributes, mode='a'), unless the backend has an unambiguous way to identify a persistent xr.Dataset.

I'm not sure what options for in-place change are supported, but I see at least mode='a' for zarr. Let's discuss this tomorrow.


alexamici commented Sep 23, 2020

@shoyer & @jhamman just to give you an idea, I aim to see open_dataset reduced to the following:

def open_dataset(filename_or_obj, *, engine=None, chunks=None, cache=None, backend_kwargs=None, **kwargs):
    filename_or_obj = normalize_filename_or_obj(filename_or_obj)
    if engine is None:
        engine = autodetect_engine(filename_or_obj)
    open_backend_dataset = get_opener(engine)

    backend_ds = open_backend_dataset(filename_or_obj, **backend_kwargs, **kwargs)
    ds = dataset_from_backend_dataset(
        backend_ds, chunks, cache, filename_or_obj=filename_or_obj, **kwargs
    )
    return ds

Where the key observation is that the data of the backend_ds variables must be either np.ndarray or subclasses of BackendArray. That is, backends should not be concerned with the in-memory representation of the variables, so they know nothing about dask, cache behaviour, etc. (@shoyer, this was addressed briefly today)
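The helpers in the sketch above (get_opener, autodetect_engine) are hypothetical names from this comment, not xarray's actual plugin machinery; a minimal registry along those lines could be:

```python
# Hypothetical engine registry; names follow the open_dataset sketch
# above, not xarray's real implementation.
ENGINES = {}


def register_engine(name, opener):
    ENGINES[name] = opener


def get_opener(engine):
    try:
        return ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None


def autodetect_engine(filename_or_obj):
    # Naive guess by file extension; a real implementation would ask
    # each registered backend whether it can open the target.
    for name, ext in [("netcdf4", ".nc"), ("zarr", ".zarr")]:
        if str(filename_or_obj).endswith(ext) and name in ENGINES:
            return name
    raise ValueError("cannot guess the engine, please specify one")


register_engine("netcdf4", lambda path, **kwargs: {"path": path})
print(autodetect_engine("data.nc"))  # netcdf4
```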

@alexamici

I'm looking at other bits of knowledge of how backends work that are still present in the generic parts of open_dataset.

We see _autodetect_engine and _normalize_path.

We aim to remove _autodetect_engine in favour of a new can_open(filename_or_obj, ...) backend function declared via the plugin interface.

On the other hand, _normalize_path can be removed entirely once ZarrStore.open_group accepts os.PathLike objects.
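A per-backend can_open hook of the kind proposed above could be as simple as an extension check. The signature follows the comment; the extension list and the body are illustrative assumptions:

```python
import os


def can_open(filename_or_obj):
    """Hypothetical per-backend hook: return True if this backend
    recognises the target, here judged purely by file extension."""
    if isinstance(filename_or_obj, (str, os.PathLike)):
        return str(filename_or_obj).endswith((".nc", ".nc4", ".cdf"))
    return False


print(can_open("obs.nc"), can_open("store.zarr"))  # True False
```

With a hook like this registered per engine, the generic open_dataset no longer needs any built-in knowledge of which backend handles which file type.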
