
DataSource.open_dataset() performance and code review #645

Closed
forman opened this issue May 10, 2018 · 8 comments

@forman
Member

forman commented May 10, 2018

I have several questions regarding the current ODP implementation with respect to OPeNDAP performance and efficiency. The background is that I am about to address #644.

They are:

  • L746-L748: why are these variables not included in dropped_variables from the beginning?
  • L756-L759: Note that adjust_spatial_attrs_impl does not and can not guarantee that spatial attributes are present!
  • L766: It seems datasets read from remote are written without external chunking set, so either we have no compression or compression is only done on entire datasets, with a bad impact on performance/memory consumption.

@papesci and @kbernat, could you please investigate?

Specifications

Cate master as of 2018-05-10

@papesci
Contributor

papesci commented May 10, 2018

@forman @kbernat It appears that the OPeNDAP query feature is not used. I mean the remote URL should be enhanced with query constraints, e.g. ?var1[start:step:stop], before asking xarray to open it.
However, to build the query string we need to know the data structure, so we need to parse the DAS and DDS responses, which can be accessed by appending .das to the URL [e.g. http://../dataset.nc.das]. I didn't find an xarray method to access only the metadata unless we read the full dataset [with data].

I think we should find a way to get the structure information and then build an OPeNDAP query, which considerably reduces the transfer size when the user asks for only a spatial/time subset.

The response of an OPeNDAP server to a DAS request is something like this Das Response. Does anyone know a library we can use to access/parse it?
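For illustration, a library like pydap can fetch and parse the DDS/DAS without pulling any data; a rough sketch (the URL and the variable name var1 are placeholders, and I have not verified pydap against the CCI ODP servers):

```python
# Sketch only: pydap is not a stated Cate dependency and the endpoint is made up.
from pydap.client import open_url

ds = open_url("http://example.org/thredds/dodsC/dataset.nc")  # hypothetical OPeNDAP endpoint
print(list(ds.keys()))          # variable names, parsed from the DDS
print(ds["var1"].attributes)    # per-variable attributes, parsed from the DAS
print(ds["var1"].shape)         # dimensions are known without transferring any data
```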

@kbernat
Collaborator

kbernat commented May 11, 2018

@papesci That was also my first thought when I started using the OPeNDAP protocol - in fact xarray (and the netcdf4 driver underneath) builds the OPeNDAP query automatically, and that feature works efficiently.
The best way to verify that is to change the ESGF connection protocol in the code from HTTPS to HTTP

_ESGF_CEDA_URL = "https://esgf-index1.ceda.ac.uk/esg-search/search/"
and use Wireshark to analyse plain-text network traffic.

For test purposes I used the data source esacci.CLOUD.mon.L3C.CLD_PRODUCTS.MODIS.Aqua.MODIS_AQUA.2-0.r1 and a single variable cfc, with the time range constraint 2003-01-01,2003-01-01 and the spatial region 0,0,1,1:

cate ds copy esacci.CLOUD.mon.L3C.CLD_PRODUCTS.MODIS.Aqua.MODIS_AQUA.2-0.r1 --vars cfc --time 2003-01-01,2003-01-01 --region 0,0,1,1

(screenshot of the captured network traffic)
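Under the hood this corresponds roughly to the following xarray usage (the URL is a placeholder; opening is lazy and only the selected slices are transferred when the data is loaded):

```python
import xarray as xr

# Hypothetical OPeNDAP endpoint; open_dataset only reads metadata at this point.
ds = xr.open_dataset("http://example.org/thredds/dodsC/cloud_products.nc")

# Select the variable and the spatio-temporal subset (assuming ascending lat/lon).
subset = ds["cfc"].sel(time="2003-01-01", lat=slice(0, 1), lon=slice(0, 1))

# Only now does the netcdf4/OPeNDAP layer request the constrained index ranges.
subset.load()
```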

@forman

L746-L748:

While dropped_variables excludes specified variables and does that internally (e.g. broken or unsupported variables),
the var_names list contains the exhaustive list of included variables, i.e. those specified by the user to be opened and/or made local.

L756-L759:
That's true, there is no check whether the attribute exists. Is making it optional and setting the default value to 0-360 a sufficient solution?

L766:
I see your point, it requires some internal chunking strategy, or a way to get that info from the remote source. @papesci any ideas?

@papesci
Contributor

papesci commented May 16, 2018

@forman @kbernat

Thanks Chris, I have checked it also on Linux, you are right. I wonder where and when exactly xarray accesses the remote data and builds the query. It doesn't look to be the open_dataset method.

L766:
There is no general rule for data distribution strategies, but we can assume that the user, in this particular application field, is interested in investigating data that are close to each other. So a regular cube should be accepted as the default chunking strategy; I already have an algorithm able to do that. By the way, the dataset will be saved locally, so we cannot make assumptions about the target machine's resources, especially in terms of memory. I therefore propose to let the user configure some parameters like a threshold and/or select a chunking strategy from a list [spatial, temporal, or linear on one dimension]; see the sketch below.
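As a sketch of the idea (the chunk sizes and compression level below are example values only, not a proposal for the actual defaults):

```python
import xarray as xr

ds = xr.open_dataset("http://example.org/thredds/dodsC/dataset.nc")  # placeholder source

# "Regular cube" chunking: write each 3-D variable compressed in small
# (time, lat, lon) blocks instead of deflating it as one big chunk.
# Note: chunk sizes must not exceed the actual dimension sizes.
encoding = {
    name: {"zlib": True, "complevel": 4, "chunksizes": (1, 256, 256)}
    for name, var in ds.data_vars.items()
    if var.ndim == 3
}
ds.to_netcdf("local_copy.nc", encoding=encoding)
```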

@JanisGailis
Member

JanisGailis commented May 17, 2018

L756-L759:

That comes from my changes to that code at some point. Initially the values were taken directly from the dataset, in which case it could not be guaranteed that they are correct. The whole point of having adjust_spatial_attrs_impl is to find these attributes by introspection and then either alter existing ones or add them if they don't exist. See:

In case the determined attributes do not exist in the dataset, these will

Note that we drop spatial attributes that can't be reliably found through introspection.

EDIT:

Yes, the current iteration of adjust_spatial_attrs_impl doesn't guarantee that spatial boundary attributes are added, as it can silently do nothing. That was not the original intention of that operation, as is evident from the docstring, because operations working on gridded geospatial datasets later down the road may depend on these attributes being present.

@forman
Member Author

forman commented May 22, 2018

@JanisGailis we already had a lengthy discussion on this. The current implementation is tolerant if the spatial attributes cannot be derived. The former implementation just raised exceptions and frustrated users. I thought we already agreed that Cate should be data-tolerant in general. (However, I agree the adjust_spatial_attrs_impl API docs must be updated.)

Normalisation of datasets should make sure that the coordinate variables time, lat, lon are (a) set and (b) set correctly and CF-conform, and we should not rely on the existence of individual attributes, which can, by the way, disappear very easily in a processing chain when xarray or dask operations are used. However, normalisation may rely on geo-spatial attributes to construct missing coordinate variables, as is done in normalize_missing_time; roughly the idea sketched below.
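For illustration only, not the actual Cate code, and assuming the dataset carries time_coverage_start/time_coverage_end attributes:

```python
import pandas as pd
import xarray as xr

def add_missing_time_coord(ds: xr.Dataset) -> xr.Dataset:
    """Illustrative sketch: construct a time coordinate from global attributes."""
    if "time" in ds.coords:
        return ds
    start = ds.attrs.get("time_coverage_start")
    end = ds.attrs.get("time_coverage_end")
    if start is None or end is None:
        return ds  # stay data-tolerant: leave the dataset unchanged
    t0, t1 = pd.Timestamp(start), pd.Timestamp(end)
    return ds.expand_dims(time=[t0 + (t1 - t0) / 2])  # centre of the coverage period
```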

Just look at #655, where Cate fails because the geo-spatial resolution attribute also contains the units.

@JanisGailis
Member

Sure, sure. I just explained how those lines came to be, as neither @papesci nor @kbernat would know.

The reasoning was that we shouldn't do the same thing in many ways and places all over the codebase. So, if there is code that determines spatial attributes of a dataset by introspection, it should be used whenever we need to find out things like spatial extents and resolution. Especially because it can then be updated to deal with weird datasets and corner cases in a single place.

So in this case, in my opinion, it should check for None or use try/except to catch cases where we apparently have a non-geospatial dataset, a dataset on an irregular grid, or any other case where introspection might have failed; something along the lines of the sketch below.
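Something like this (the geospatial_* attribute names follow the ACDD convention used in this discussion; the handling itself is only a sketch):

```python
def get_lon_extent(ds):
    """Sketch: fall back gracefully when spatial introspection has failed."""
    lon_min = ds.attrs.get("geospatial_lon_min")
    lon_max = ds.attrs.get("geospatial_lon_max")
    if lon_min is None or lon_max is None:
        # non-geospatial dataset, irregular grid, or failed introspection
        raise ValueError("dataset has no usable longitude extent attributes")
    return float(lon_min), float(lon_max)
```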

@forman
Member Author

forman commented May 25, 2018

I agree. And rather than relying on specific attributes and/or coordinates, we should consistently use a single Cate API function that fetches the spatio-temporal ranges and resolutions in the most flexible and tolerant way, covering CF conventions and, later on, other frequently seen encodings, e.g. HDF-EOS or GeoTIFF. A possible shape of such a function is sketched below.
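Purely illustrative; nothing like this exists in the Cate API yet, and a real version would fall back to attributes and other encodings when coordinate variables are missing:

```python
from typing import NamedTuple, Optional, Tuple

import pandas as pd
import xarray as xr


class SpatioTemporalInfo(NamedTuple):
    lon_range: Optional[Tuple[float, float]]
    lat_range: Optional[Tuple[float, float]]
    time_range: Optional[Tuple[pd.Timestamp, pd.Timestamp]]
    spatial_res: Optional[float]


def get_spatiotemporal_info(ds: xr.Dataset) -> SpatioTemporalInfo:
    """Derive ranges and resolution from coordinate variables where present."""
    lon = ds.coords.get("lon")
    lat = ds.coords.get("lat")
    time = ds.coords.get("time")
    lon_range = (float(lon.min()), float(lon.max())) if lon is not None else None
    lat_range = (float(lat.min()), float(lat.max())) if lat is not None else None
    time_range = ((pd.Timestamp(time.values[0]), pd.Timestamp(time.values[-1]))
                  if time is not None and time.size > 0 else None)
    spatial_res = (abs(float(lon[1] - lon[0]))
                   if lon is not None and lon.size > 1 else None)
    return SpatioTemporalInfo(lon_range, lat_range, time_range, spatial_res)
```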

@forman
Member Author

forman commented Jun 13, 2018

I'm closing this as the initial questions are all answered. All software issues should be further discussed in #644.

@forman forman closed this as completed Jun 13, 2018