
DataSource.open_dataset() performance and code review #645

Closed
forman opened this issue May 10, 2018 · 8 comments

@forman
Member

forman commented May 10, 2018

I have several questions regarding the current ODP implementation with respect to OPeNDAP performance and efficiency. The background is that I am about to address #644.

They are:

  • L746-L748: why are these variables not included in dropped_variables from the beginning?
  • L756-L759: Note that adjust_spatial_attrs_impl does not and can not guarantee that spatial attributes are present!
  • L766: It seems datasets read from remote are written without external chunking set, so either we have no compression or compression is only done on entire datasets, with a bad impact on performance/memory consumption.

@papesci and @kbernat, could you please investigate?

Specifications

Cate master as of 2018-05-10

@papesci
Contributor

papesci commented May 10, 2018

@forman @kbernat It appears that the OPeNDAP query feature is not used. I mean the remote URL should be enhanced with query constraints, e.g. ?var1[start:step:stop], before asking xarray to open it.
However, to build the query string we need to know the data structure, so we need to parse the DAS and DDS responses, which can be accessed by appending .das to the URL [e.g. http://../dataset.nc.das]. I didn't find an xarray method to access only the metadata unless we read the full dataset [with data].

I think we should find a way to get the structure information and then build an OPeNDAP query, which considerably reduces the transfer size when the user asks for only a spatial/time subset.

The response of an OPeNDAP server to a DAS request is something like this Das Response. Does anyone know a library we can use to access/parse it?
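For illustration, a library like pydap can fetch and parse the DDS/DAS without pulling any data; a rough sketch (the URL and the variable name var1 are placeholders, and I have not verified pydap against the CCI ODP servers):

```python
# Sketch only: pydap is not a stated Cate dependency and the endpoint is made up.
from pydap.client import open_url

ds = open_url("http://example.org/thredds/dodsC/dataset.nc")  # hypothetical OPeNDAP endpoint
print(list(ds.keys()))          # variable names, parsed from the DDS
print(ds["var1"].attributes)    # per-variable attributes, parsed from the DAS
print(ds["var1"].shape)         # dimensions are known without transferring any data
```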

@kbernat
Collaborator

kbernat commented May 11, 2018

@papesci That was also my first thought when I started using the OPeNDAP protocol - in fact xarray (and the netcdf4 driver underneath) builds the OPeNDAP query automatically, and that feature works efficiently.
The best way to verify that is to change the ESGF connection protocol in the code from HTTPS to HTTP

_ESGF_CEDA_URL = "https://esgf-index1.ceda.ac.uk/esg-search/search/"
and use Wireshark to analyse plain-text network traffic.

For test purposes I used the data source esacci.CLOUD.mon.L3C.CLD_PRODUCTS.MODIS.Aqua.MODIS_AQUA.2-0.r1 and a single variable cfc, with the time range constraint 2003-01-01,2003-01-01 and the spatial region 0,0,1,1:

cate ds copy esacci.CLOUD.mon.L3C.CLD_PRODUCTS.MODIS.Aqua.MODIS_AQUA.2-0.r1 --vars cfc --time 2003-01-01,2003-01-01 --region 0,0,1,1

(screenshot of the captured network traffic)
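Under the hood this corresponds roughly to the following xarray usage (the URL is a placeholder; opening is lazy and only the selected slices are transferred when the data is loaded):

```python
import xarray as xr

# Hypothetical OPeNDAP endpoint; open_dataset only reads metadata at this point.
ds = xr.open_dataset("http://example.org/thredds/dodsC/cloud_products.nc")

# Select the variable and the spatio-temporal subset (assuming ascending lat/lon).
subset = ds["cfc"].sel(time="2003-01-01", lat=slice(0, 1), lon=slice(0, 1))

# Only now does the netcdf4/OPeNDAP layer request the constrained index ranges.
subset.load()
```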

@forman

L746-L748:

While dropped_variables excludes specified variables and does that internally (e.g. broken or unsupported variables),
the var_names list contains the exhaustive list of included variables, i.e. those specified by the user to be opened and/or made local.

L756-L759:
That's true, there is no check whether the attribute exists. Is making it optional and setting the default value to 0-360 a sufficient solution?

L766:
I see your point, it requires some internal chunking strategy, or a way to get that info from the remote source. @papesci any ideas?

@papesci
Contributor

papesci commented May 16, 2018

@forman @kbernat

Thanks Chris, I have checked it also on Linux, you are right. I wonder where and when exactly xarray accesses the remote data and builds the query. It doesn't look to be the open_dataset method.

L766:
There is no general rule for data distribution strategies, but we can assume that the user, in this particular application field, is interested in investigating data that are close to each other. So a regular cube should be accepted as the default chunking strategy; I already have an algorithm able to do that. By the way, the dataset will be saved locally, so we cannot make assumptions about the target machine's resources, especially in terms of memory. I therefore propose to let the user configure some parameters like a threshold and/or select a chunking strategy from a list [spatial, temporal, or linear on one dimension]; see the sketch below.
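As a sketch of the idea (the chunk sizes and compression level below are example values only, not a proposal for the actual defaults):

```python
import xarray as xr

ds = xr.open_dataset("http://example.org/thredds/dodsC/dataset.nc")  # placeholder source

# "Regular cube" chunking: write each 3-D variable compressed in small
# (time, lat, lon) blocks instead of deflating it as one big chunk.
# Note: chunk sizes must not exceed the actual dimension sizes.
encoding = {
    name: {"zlib": True, "complevel": 4, "chunksizes": (1, 256, 256)}
    for name, var in ds.data_vars.items()
    if var.ndim == 3
}
ds.to_netcdf("local_copy.nc", encoding=encoding)
```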

@JanisGailis
Member

JanisGailis commented May 17, 2018

L756-L759:

That comes from my changes to that code at some point. Initially the values were taken directly from the dataset, in which case it could not be guaranteed that they are correct. The whole point of having adjust_spatial_attrs_impl is to find these attributes by introspection and then either alter existing ones or add them if they don't exist. See:

In case the determined attributes do not exist in the dataset, these will

Note that we drop spatial attributes that can't be reliably found through introspection.

EDIT:

Yes, the current iteration of adjust_spatial_attrs_impl doesn't guarantee that spatial boundary attributes are added, as it can silently do nothing. That was not the original intention of that operation, as is evident from the docstring, because operations working on gridded geospatial datasets later down the road may depend on these attributes being present.

@forman
Member Author

forman commented May 22, 2018

@JanisGailis we already had a lengthy discussion on this. The current implementation is tolerant if the spatial attributes cannot be derived. The former implementation just raised exceptions and frustrated users. I thought we already agreed that Cate should be data-tolerant in general. (However, I agree the adjust_spatial_attrs_impl API docs must be updated.)

Normalisation of datasets should make sure that the coordinate variables time, lat, lon are (a) set and (b) set correctly and CF-conform, and we should not rely on the existence of individual attributes, which can, by the way, disappear very easily in a processing chain when xarray or dask operations are used. However, normalisation may rely on geo-spatial attributes to construct missing coordinate variables, as is done in normalize_missing_time; roughly the idea sketched below.
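For illustration only, not the actual Cate code, and assuming the dataset carries time_coverage_start/time_coverage_end attributes:

```python
import pandas as pd
import xarray as xr

def add_missing_time_coord(ds: xr.Dataset) -> xr.Dataset:
    """Illustrative sketch: construct a time coordinate from global attributes."""
    if "time" in ds.coords:
        return ds
    start = ds.attrs.get("time_coverage_start")
    end = ds.attrs.get("time_coverage_end")
    if start is None or end is None:
        return ds  # stay data-tolerant: leave the dataset unchanged
    t0, t1 = pd.Timestamp(start), pd.Timestamp(end)
    return ds.expand_dims(time=[t0 + (t1 - t0) / 2])  # centre of the coverage period
```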

Just look at #655, where Cate fails because the geo-spatial resolution attribute also contains the units.

@JanisGailis
Member

Sure, sure. I just explained how those lines came to be, as neither @papesci nor @kbernat would know.

The reasoning was that we shouldn't do the same thing in many ways and places all over the codebase. So, if there is code that determines spatial attributes of a dataset by introspection, it should be used whenever we need to find out things like spatial extents and resolution. Especially because it can then be updated to deal with weird datasets and corner cases in a single place.

So in this case, in my opinion, it should check for None or use try/except to catch cases where we apparently have a non-geospatial dataset, a dataset on an irregular grid, or any other case where introspection might have failed; something along the lines of the sketch below.
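Something like this (the geospatial_* attribute names follow the ACDD convention used in this discussion; the handling itself is only a sketch):

```python
def get_lon_extent(ds):
    """Sketch: fall back gracefully when spatial introspection has failed."""
    lon_min = ds.attrs.get("geospatial_lon_min")
    lon_max = ds.attrs.get("geospatial_lon_max")
    if lon_min is None or lon_max is None:
        # non-geospatial dataset, irregular grid, or failed introspection
        raise ValueError("dataset has no usable longitude extent attributes")
    return float(lon_min), float(lon_max)
```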

@forman
Member Author

forman commented May 25, 2018

I agree. And rather than relying on specific attributes and/or coordinates, we should consistently use a single Cate API function that fetches the spatio-temporal ranges and resolutions in the most flexible and tolerant way, covering CF conventions and, later on, other frequently seen encodings, e.g. HDF-EOS or GeoTIFF. A possible shape of such a function is sketched below.
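Purely illustrative; nothing like this exists in the Cate API yet, and a real version would fall back to attributes and other encodings when coordinate variables are missing:

```python
from typing import NamedTuple, Optional, Tuple

import pandas as pd
import xarray as xr


class SpatioTemporalInfo(NamedTuple):
    lon_range: Optional[Tuple[float, float]]
    lat_range: Optional[Tuple[float, float]]
    time_range: Optional[Tuple[pd.Timestamp, pd.Timestamp]]
    spatial_res: Optional[float]


def get_spatiotemporal_info(ds: xr.Dataset) -> SpatioTemporalInfo:
    """Derive ranges and resolution from coordinate variables where present."""
    lon = ds.coords.get("lon")
    lat = ds.coords.get("lat")
    time = ds.coords.get("time")
    lon_range = (float(lon.min()), float(lon.max())) if lon is not None else None
    lat_range = (float(lat.min()), float(lat.max())) if lat is not None else None
    time_range = ((pd.Timestamp(time.values[0]), pd.Timestamp(time.values[-1]))
                  if time is not None and time.size > 0 else None)
    spatial_res = (abs(float(lon[1] - lon[0]))
                   if lon is not None and lon.size > 1 else None)
    return SpatioTemporalInfo(lon_range, lat_range, time_range, spatial_res)
```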

@forman
Member Author

forman commented Jun 13, 2018

I'm closing this as the initial questions are all answered. All software issues should be further discussed in #644.

@forman forman closed this as completed Jun 13, 2018