This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Revision of Cate's Data Management

Norman Fomferra edited this page Mar 15, 2017 · 4 revisions

This is the current DataSource API (of Cate v0.7):

class DataSource(metaclass=ABCMeta):

    @abstractmethod
    def open_dataset(self,
                     time_range: Tuple[datetime, datetime]=None,
                     protocol: str=None) -> xr.Dataset:
        pass

    def sync(self,
             time_range: Tuple[datetime, datetime]=None,
             protocol: str=None,
             monitor: Monitor=Monitor.NONE) -> Tuple[int, int]:
        pass

Here are some suggestions for improving the API and making it better suited to our current needs.

The most important conceptual change is that we would like to replace the implicit synchronization of data sources with the explicit creation of local data sources.

1. Replace sync by make_local

For this, the DataSource method sync shall be replaced by make_local, which shall receive a new parameter local_name. The general contract is to generate a new local data source from an existing one, which is usually remote but may also be another local one. The new local data source is implicitly added to the data store named local.
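
The contract can be illustrated with a minimal, self-contained sketch. ToyDataStore, ToyDataSource and LOCAL_STORE are hypothetical stand-ins, not Cate classes, and the data source and file names are made up. The parameter name local_name follows the signature proposed in section 6.

```python
from typing import Dict, List, Optional


class ToyDataStore:
    """Hypothetical stand-in for a data store: a registry of sources by name."""

    def __init__(self, name: str):
        self.name = name
        self._sources: Dict[str, "ToyDataSource"] = {}

    def add(self, source: "ToyDataSource") -> None:
        if source.name in self._sources:
            raise ValueError(f"data source {source.name!r} already exists")
        self._sources[source.name] = source

    def query(self, name: str) -> Optional["ToyDataSource"]:
        return self._sources.get(name)


# Assumption: a single well-known store named "local" receives all new sources.
LOCAL_STORE = ToyDataStore("local")


class ToyDataSource:
    """Hypothetical stand-in for a data source, remote or local."""

    def __init__(self, name: str, files: List[str], is_local: bool = False):
        self.name = name
        self.files = list(files)
        self.is_local = is_local

    def make_local(self, local_name: str) -> "ToyDataSource":
        # Create a new local source from this one (remote or local) ...
        local = ToyDataSource(local_name, self.files, is_local=True)
        # ... and register it implicitly in the "local" store.
        LOCAL_STORE.add(local)
        return local


remote = ToyDataSource("remote.example", ["2010.nc", "2011.nc"])
local = remote.make_local("my_local_copy")
# A local source can itself be the origin of another local source.
copy_of_copy = local.make_local("my_second_copy")
```

In this sketch the caller never touches the store directly: registration is a side effect of make_local, which is what "implicitly added" is taken to mean here.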

2. open_dataset without protocol but with more subset parameters

Specifying a protocol with open_dataset is no longer necessary once there is no implicit synchronization. However, we'd like to add two new subset parameters:

  • region - a polygon-like geometry for spatial subsets. If None, we mean global.
  • var_names - a list of variable names. If None, we mean all.

3. make_local with subset parameters

The new DataSource method make_local should also have all three subset parameters:

  • time_range - a date range for temporal subsets. If None, we mean all.
  • region - a polygon-like geometry for spatial subsets. If None, we mean global.
  • var_names - a list of variable names. If None, we mean all.
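
The "None means no restriction" rule for the three subset parameters can be sketched on toy data. This is only an illustration under simplifying assumptions: the function subset and the sample data are hypothetical, a dataset is modelled as a dict of point samples rather than an xarray dataset, and region is reduced to a bounding box instead of a general polygon.

```python
from datetime import date
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[date, float, float, float]   # (time, lon, lat, value)
BBox = Tuple[float, float, float, float]    # (lon_min, lat_min, lon_max, lat_max)


def subset(dataset: Dict[str, List[Sample]],
           time_range: Optional[Tuple[date, date]] = None,
           region: Optional[BBox] = None,
           var_names: Optional[Sequence[str]] = None) -> Dict[str, List[Sample]]:
    """Apply the three subset parameters; None always means 'no restriction'."""
    names = list(dataset) if var_names is None else list(var_names)
    result = {}
    for name in names:
        samples = dataset[name]
        if time_range is not None:
            start, end = time_range
            samples = [s for s in samples if start <= s[0] <= end]
        if region is not None:
            lon_min, lat_min, lon_max, lat_max = region
            samples = [s for s in samples
                       if lon_min <= s[1] <= lon_max and lat_min <= s[2] <= lat_max]
        result[name] = list(samples)
    return result


data = {
    "sst": [(date(2010, 1, 1), 10.0, 50.0, 281.5),
            (date(2011, 1, 1), -120.0, -30.0, 290.1)],
    "ice": [(date(2010, 6, 1), 15.0, 80.0, 0.9)],
}

# No parameters given: all times, global coverage, all variables.
everything = subset(data)

# All three parameters given: one variable, one year, one bounding box.
europe_sst_2010 = subset(data,
                         time_range=(date(2010, 1, 1), date(2010, 12, 31)),
                         region=(-10.0, 35.0, 30.0, 70.0),
                         var_names=["sst"])
```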

4. Use new types

We'd also like to use the new Like types for the subset parameters time_range, region, var_names.
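
The idea behind a Like type is that a parameter accepts several convenient representations and normalizes them to one canonical value. The following sketch shows this for variable names only. The function convert_var_names is a hypothetical stand-in and not the actual VarNamesLike implementation; in particular, treating an empty string as "all" is an assumption.

```python
from typing import List, Optional, Sequence, Union

VarNamesValue = Union[None, str, Sequence[str]]


def convert_var_names(value: VarNamesValue) -> Optional[List[str]]:
    """Normalize a 'variable-names-like' value to a list, or None for 'all'."""
    if value is None:
        return None
    if isinstance(value, str):
        # Accept a comma-separated string such as "sst, ice".
        names = [name.strip() for name in value.split(",")]
        names = [name for name in names if name]
        # Assumption: an empty string means the same as None.
        return names or None
    try:
        names = list(value)
    except TypeError:
        names = None
    if names is None or not all(isinstance(name, str) for name in names):
        raise ValueError(f"cannot convert {value!r} to a list of variable names")
    return names
```

With such a converter at the top of open_dataset and make_local, callers could pass "sst, ice", ["sst", "ice"] or None interchangeably, and the method body would only ever deal with a list or None.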

5. Generalize toward CDM

open_dataset and make_local should also work with vector data, not only xarray datasets and netCDF files. This may be implemented later.

6. Resulting API and implementation changes

Considering points 1 to 5, here is the new DataSource API:

class DataSource(metaclass=ABCMeta):

    @abstractmethod
    def open_dataset(self,
                     time_range: DateRangeLike = None,
                     region: PolygonLike = None,
                     var_names: VarNamesLike = None) -> Any:
        pass

    @abstractmethod
    def make_local(self,
                   local_name: str,
                   time_range: DateRangeLike = None,
                   region: PolygonLike = None,
                   var_names: VarNamesLike = None,
                   protocol: Optional[str] = None,
                   monitor: Monitor = Monitor.NONE) -> 'DataSource':
        pass

EsaOdpDataSource implementation

  • open_dataset always uses OPeNDAP.
  • make_local can use either OPeNDAP or HTTP file download. If global coverage is requested, HTTP file download is much faster. If subset parameters are given, it should use OPeNDAP.
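
The protocol choice for make_local could be sketched as follows. The function choose_protocol and the identifiers "http" and "opendap" are assumptions made for illustration. The sketch also assumes that an explicitly passed protocol takes precedence and that a time_range alone already counts as a subset.

```python
from typing import Any, Optional

# Assumption: illustrative protocol identifiers.
PROTOCOL_HTTP = "http"
PROTOCOL_OPENDAP = "opendap"


def choose_protocol(time_range: Any = None,
                    region: Any = None,
                    var_names: Any = None,
                    protocol: Optional[str] = None) -> str:
    """Pick a transfer protocol for make_local on a remote data source."""
    if protocol is not None:
        # Assumption: an explicit request by the caller always wins.
        return protocol
    if time_range is None and region is None and var_names is None:
        # Nothing to cut away: downloading whole files is the fast path.
        return PROTOCOL_HTTP
    # Any subset parameter given: let the server do the subsetting.
    return PROTOCOL_OPENDAP
```

Whether a purely temporal subset should instead be served by downloading only the matching files over HTTP is left open here.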

LocalDataSource implementation

  • open_dataset opens the dataset from local files as usual. xarray/geopandas operations may be performed if subset parameters are given.
  • make_local ignores the protocol parameter. xarray/geopandas operations may be performed if subset parameters are given, and new files are written. (Note that make_local for a local data source without subset parameters is just a file copy.)
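
The file-copy case, make_local on a local data source without subset parameters, can be sketched with the standard library alone. The helper copy_local_files and the directory layout are hypothetical; how Cate actually lays out local data sources on disk is not specified here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_local_files(source_files: List[Path], target_dir: Path) -> List[Path]:
    """Copy every file of a local data source into a new directory, unchanged."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_files:
        dst = target_dir / src.name
        shutil.copy2(src, dst)  # copies content and file metadata
        copied.append(dst)
    return copied


# Demonstration with a throw-away file; the bytes are not real netCDF.
workdir = Path(tempfile.mkdtemp())
src_dir = workdir / "source"
src_dir.mkdir()
src_file = src_dir / "2010.nc"
src_file.write_bytes(b"demo bytes")

new_files = copy_local_files([src_file], workdir / "my_local_copy")
```

As soon as any subset parameter is given, a plain copy no longer suffices, because the files have to be opened, subset and written anew.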

Rename LocalFilePatternDataStore and LocalFilePatternDataSource to LocalDataStore and LocalDataSource

Our local data store and its data sources are not bound to file patterns; they simply comprise local files. The shorter names reflect this.