
Flexible Backend - AbstractDataStore definition #4309

Closed
aurghs opened this issue Aug 4, 2020 · 6 comments · Fixed by #4989

aurghs commented Aug 4, 2020

I just want to do a small recap of the current proposals for the AbstractDataStore class refactor discussed with @shoyer, @jhamman, and @alexamici.

Proposal 1: Store returns:

  • xr.Variables with the list of filters to apply to every variable
  • dataset attributes
  • encodings

Xarray applies to each variable only the filters selected by the backend, then builds the xr.Dataset.

Proposal 2: Store returns:

  • xr.Variables with all needed filters applied (configured by xarray),
  • dataset attributes
  • encodings

Xarray builds the xr.Dataset.

Proposal 3: Store returns:

  • xr.Dataset
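Schematically, proposal 3 reduces the backend entry point to a single function returning a dataset. In the sketch below the function name and the stand-in class are assumptions for illustration only; a real backend would return an actual xr.Dataset:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetStandIn:
    """Placeholder for xr.Dataset, used only to keep this sketch self-contained."""
    variables: dict
    attrs: dict = field(default_factory=dict)


def open_backend_dataset(filename_or_obj, **kwargs):
    # A real backend would open the file, decode the variables,
    # and return a fully built xr.Dataset.
    return DatasetStandIn(
        variables={"dummy": [0, 1]},
        attrs={"source": str(filename_or_obj)},
    )


ds = open_backend_dataset("sample.nc")
print(ds.attrs["source"])  # sample.nc
```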

Before going on I'd like to collect pros and cons. As I understand them:

Proposal 1

pros:

  • the backend is free to decide which representation to provide.
  • more control for the backend (not necessarily true: the backend can decide to apply all the filters internally and hand xarray an empty list of filters to apply)
  • enable / disable filters logic would be in xarray.
  • all the filters (applied by xarray) should have a similar interface.
  • maybe registered filters could be used by other backends

cons:

  • confusing backend-xarray interface.
  • interfaces are more difficult to define, with more room for conflicts (registered filters with the same name...)
  • need more structure to define this interface, more code to maintain.

Proposal 2

pros:

  • the backend-xarray interface is clearer: backend and xarray have distinct, well-defined tasks.
  • interface would be minimal and easier to implement
  • no intermediate representations
  • less code to maintain

cons:

  • less control on filters.
  • more complex explicit definition of the interface (every filter must understand what decode_times means in its case)
  • more complexity inside the filters

The minimal interface would be something like this:

class AbstractBackEnd:
    def __init__(self, path, encode_times=True, ..., **kwargs):  # signature of open_dataset
        raise NotImplementedError

    def get_variables(self):
        """Return a dictionary mapping variable names to xr.Variable objects."""
        raise NotImplementedError

    def get_attrs(self):
        """Return the dataset attributes as a dictionary."""
        raise NotImplementedError

    def get_encoding(self):
        """Return the dataset encoding as a dictionary."""
        raise NotImplementedError

    def close(self):
        pass
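To make the sketch concrete, here is a toy in-memory subclass of that interface. The abstract class is restated (without the open_dataset signature) so the snippet is self-contained; returning plain numpy arrays in place of xr.Variable objects is an illustrative simplification, not part of the proposal:

```python
import numpy as np


class AbstractBackEnd:
    """Minimal interface sketch from the proposal above."""

    def get_variables(self):
        raise NotImplementedError

    def get_attrs(self):
        raise NotImplementedError

    def get_encoding(self):
        raise NotImplementedError

    def close(self):
        pass


class InMemoryBackEnd(AbstractBackEnd):
    """Toy backend serving variables from a dict; a real backend would
    return xr.Variable objects backed by lazily loaded arrays."""

    def __init__(self, variables, attrs=None):
        self._variables = variables
        self._attrs = attrs or {}

    def get_variables(self):
        return dict(self._variables)

    def get_attrs(self):
        return dict(self._attrs)

    def get_encoding(self):
        return {}


store = InMemoryBackEnd({"t2m": np.zeros((2, 3))}, attrs={"source": "toy"})
print(sorted(store.get_variables()))  # ['t2m']
```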

Proposal 3

pros w.r.t. proposal 2:

  • decode_coordinates is done by the backend, like the other filters.

cons?

Any suggestions?

@alexamici

Note that the above proposals only address the store / encoding part of the backend API. We will address the BackendArray part later, and I expect it to be trickier.


max-sixty commented Aug 5, 2020

Thanks @alexamici

I'm a bit behind here: what's an example of a filter? A selection of the data?

Edit: as discussed filter == encoding

@alexamici

We agreed with @shoyer to go with proposal 3 for reading: the backends will return a fully built xr.Dataset.

Among other advantages, this reduces the amount of backend-specific documentation, as the documentation of xr.Dataset and xr.Variable already contains almost everything backend developers need.

The one bit of documentation that needs addressing is the use of BackendArray as the data argument for xr.Variable.
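The real BackendArray lives in xarray's backends machinery; as a rough illustration of the idea (expose shape and dtype up front, defer actual reads until indexed), a stand-in class might look like the following. The class name and the in-memory "file" are assumptions for this sketch:

```python
import numpy as np


class LazyFileArray:
    """Stand-in for a BackendArray subclass: exposes shape/dtype and
    defers data access until indexed. Here an in-memory numpy array
    simulates the file; a real backend would hold a file handle."""

    def __init__(self, backing):
        self._backing = backing  # pretend this is an open file
        self.shape = backing.shape
        self.dtype = backing.dtype

    def __getitem__(self, key):
        # A real backend would read only the requested slice from disk,
        # which is what makes lazy loading and chunked access possible.
        return self._backing[key]


arr = LazyFileArray(np.arange(12).reshape(3, 4))
print(arr.shape, arr[1, 2])  # (3, 4) 6
```

An instance like this could then be passed as the data argument when the backend constructs each xr.Variable, which is exactly the documentation gap mentioned above.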


alexamici commented Sep 2, 2020

@shoyer and @jhamman, I'm looking into the write support, and if we let a backend return an xr.Dataset, as agreed, we lose the ability to support in-place changes to a file (updating attributes, mode='a'), unless the backend has an unambiguous way to identify a persistent xr.Dataset.

I'm not sure what options for in-place change are supported, but I see at least mode='a' for zarr. Let's discuss this tomorrow.


alexamici commented Sep 23, 2020

@shoyer & @jhamman just to give you an idea, I aim to see open_dataset reduced to the following:

def open_dataset(filename_or_obj, *, engine=None, chunks=None, cache=None, backend_kwargs=None, **kwargs):
    filename_or_obj = normalize_filename_or_obj(filename_or_obj)
    if engine is None:
        engine = autodetect_engine(filename_or_obj)
    open_backend_dataset = get_opener(engine)

    backend_ds = open_backend_dataset(filename_or_obj, **backend_kwargs, **kwargs)
    ds = dataset_from_backend_dataset(
        backend_ds, chunks, cache, filename_or_obj=filename_or_obj, **kwargs
    )
    return ds

Where the key observation is that the data of the backend_ds variables must be either np.ndarray or subclasses of BackendArray. That is, backends should not be concerned with the in-memory representation of the variables, so they know nothing about dask, cache behaviour, etc. (@shoyer, this was addressed briefly today)
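The helpers in the sketch above (get_opener, autodetect_engine) are hypothetical names from this comment, not xarray's actual plugin machinery; a minimal registry along those lines could be:

```python
# Hypothetical engine registry; names follow the open_dataset sketch
# above, not xarray's real implementation.
ENGINES = {}


def register_engine(name, opener):
    ENGINES[name] = opener


def get_opener(engine):
    try:
        return ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None


def autodetect_engine(filename_or_obj):
    # Naive guess by file extension; a real implementation would ask
    # each registered backend whether it can open the target.
    for name, ext in [("netcdf4", ".nc"), ("zarr", ".zarr")]:
        if str(filename_or_obj).endswith(ext) and name in ENGINES:
            return name
    raise ValueError("cannot guess the engine, please specify one")


register_engine("netcdf4", lambda path, **kwargs: {"path": path})
print(autodetect_engine("data.nc"))  # netcdf4
```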

@alexamici

I'm looking at other bits of knowledge of how backends work that are still present in the generic parts of open_dataset.

We see _autodetect_engine and _normalize_path.

We aim to remove _autodetect_engine in favour of a new can_open(filename_or_obj, ...) backend function declared via the plugin interface.

On the other hand, _normalize_path can be removed entirely once ZarrStore.open_group accepts os.PathLike objects.
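A per-backend can_open hook of the kind proposed above could be as simple as an extension check. The signature follows the comment; the extension list and the body are illustrative assumptions:

```python
import os


def can_open(filename_or_obj):
    """Hypothetical per-backend hook: return True if this backend
    recognises the target, here judged purely by file extension."""
    if isinstance(filename_or_obj, (str, os.PathLike)):
        return str(filename_or_obj).endswith((".nc", ".nc4", ".cdf"))
    return False


print(can_open("obs.nc"), can_open("store.zarr"))  # True False
```

With a hook like this registered per engine, the generic open_dataset no longer needs any built-in knowledge of which backend handles which file type.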
