Dask chunk sizes and different data access methods #350

Closed

wachsylon opened this issue Aug 3, 2021 · 6 comments

@wachsylon
Hi,
I appreciate intake a lot and use it to provide catalogs for the climate data of the German Climate Computing Center (DKRZ).
I have two questions:

  1. Is it possible to set, in the catalog descriptor, the chunk sizes that dask will use in to_dataset_dict? The data our catalogs describe is in netCDF format. When chunks is not specified in the function arguments, dask sets the chunk size to the full dimension size of each file instead of using the original file chunking. This leads to memory issues when working with the data afterwards.
  2. Can I define more than one access method, and therefore more than one asset column, in the descriptor? The CMIP6 netCDF data can be accessed either locally on the file system, where users need to be logged in, or, once published, remotely via OPeNDAP. But I don't see how to specify which column to_dataset_dict should use for opening the data.

Thanks a lot for any help.
Fabi

@andersy005
Member

@wachsylon,

> Is it possible to set, in the catalog descriptor, the chunk sizes that dask will use in to_dataset_dict? The data our catalogs describe is in netCDF format. When chunks is not specified in the function arguments, dask sets the chunk size to the full dimension size of each file instead of using the original file chunking. This leads to memory issues when working with the data afterwards.

Have you tried passing cdf_kwargs to to_dataset_dict()? Something along these lines:

dsets = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 10, 'lat': 90}})
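
cdf_kwargs is forwarded to xarray's open function for each netCDF asset, so the call above is, per file, roughly equivalent to this sketch (the path is hypothetical):

import xarray as xr

# roughly what intake-esm does for each netCDF asset (path hypothetical)
ds = xr.open_dataset("/path/to/file.nc", chunks={"time": 10, "lat": 90})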

> Can I define more than one access method, and therefore more than one asset column, in the descriptor? The CMIP6 netCDF data can be accessed either locally on the file system, where users need to be logged in, or, once published, remotely via OPeNDAP. But I don't see how to specify which column to_dataset_dict should use for opening the data.

I expect this to work. Under the hood, intake-esm figures out whether a file is on a regular POSIX filesystem or some remote filesystem. Here's a catalog example that exposes both the OPeNDAP URLs and the local paths:

In [1]: import intake

In [2]: url = "http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json"

In [3]: cat = intake.open_esm_datastore(url)

In [4]: cat.df.head()
Out[4]: 
  activity_id institution_id  ...                                        OPENDAP_url                                               path
0        CMIP           CAMS  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
1        CMIP           CAMS  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
2        CMIP      NOAA-GFDL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
3        CMIP      NOAA-GFDL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
4       CFMIP           IPSL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CFMIP/IPSL/IPSL-CM6A-LR/abr...

[5 rows x 13 columns]

In [5]: cat.path_column_name
Out[5]: 'OPENDAP_url'

If you take a peek at the JSON file, you'll see this section:

],
  "assets": {
    "column_name": "OPENDAP_url",
    "format": "netcdf"
  },

and it's this section of the catalog that tells intake-esm which column to use when loading files...

A simpler solution would be to maintain two JSON files: one that points to the local files and another that points to the OPeNDAP URLs. This way, whenever someone loads the catalog (JSON), they know exactly what type of access they are working with.

An alternative is to modify the path_column_name via:

cat.esmcol_data['assets']['column_name'] = 'path'

This approach is hacky and fragile, so I wouldn't recommend it; however, it may work for your use case.
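
Putting the workaround together, here is a minimal sketch using the catalog from above:

import intake

cat = intake.open_esm_datastore("http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json")

# switch the asset column from the OPeNDAP URLs to the local paths
# before loading (only sensible on the machine hosting those paths)
cat.esmcol_data["assets"]["column_name"] = "path"
dsets = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": 10}})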

@wachsylon
Author

@andersy005 Thanks for the quick answer.

> dsets = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 10, 'lat': 90}})

Yes, this works, but I would like to set the chunks kwarg as a default in the catalog's .json. However, if I try to do so like this:

    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {"cdf_kwargs":{"chunks":{"time":1}}}
      },

then xarray receives the kwargs in concat or merge, where chunks is not a valid argument.
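
You can check that the combine step has no such parameter; a quick, intake-agnostic way to see this:

import inspect
import xarray as xr

# neither xr.concat nor xr.merge accepts 'chunks' or 'cdf_kwargs',
# which is why chunk settings placed in aggregation options fail there
print(inspect.signature(xr.concat))
print(inspect.signature(xr.merge))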

> An alternative is to modify the path_column_name via

Thanks for this idea; that is a good hack.

> A simpler solution would be to maintain two JSON files

This is what I did. Is it somehow possible to have a master .json that contains pointers to both JSON files?

@wachsylon
Author

For the second point, I figured it out:

cat master.yaml

description: 'DKRZ master catalog'
sources:

  CMIP6:
    args:
      esmcol_obj: "/mnt/lustre02/work/ik1017/Catalogs/mistral-cmip6.json"
    description: 'CMIP6 data on mistral'
    driver: intake.open_esm_datastore

  CORDEX:
    args:
      esmcol_obj: "/mnt/lustre01/work/kd0956/Catalogs/mistral-cordex.json"
    description: 'CORDEX data on mistral'
    driver: intake.open_esm_datastore
    metadata: {}
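
The sub-catalogs can then be opened through the master catalog; a brief usage sketch (the variable_id in the search is just a made-up example):

import intake

# open the YAML master catalog and pick one of the ESM datastores it points to
master = intake.open_catalog("master.yaml")
cat = master["CMIP6"]  # yields the same esm_datastore as intake.open_esm_datastore

dsets = cat.search(variable_id="tas").to_dataset_dict(
    cdf_kwargs={"chunks": {"time": 1}}
)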

@wachsylon
Author

@andersy005

Is it possible to add
"cdf_kwargs": {"chunks": {"time": 1}}
somewhere in the esm_collection .json file?

@andersy005
Member

andersy005 commented Aug 5, 2021

Allowing chunks to be defined globally in the catalog JSON file would cause issues, because a catalog can point to files whose variables have different dimension names. For instance, using a global chunking scheme of {'time': 10, 'lat': 90} when the catalog contains files that don't have a time dimension, or files whose lat dimension is named latitude, would just break the data loading...

I think it's reasonable to offload the responsibility for specifying chunks to the end-user, since they know which datasets they are loading. The simplest compromise we came up with was to load datasets as a single chunk (via cdf_kwargs={"chunks": {}}) when the user hasn't specified the chunks.
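
To see what that default actually gives you, open a file with chunks={} and inspect the result; a minimal sketch (file and variable names are hypothetical):

import xarray as xr

# chunks={} enables dask without prescribing chunk sizes;
# inspect the chunking you actually got
ds = xr.open_dataset("/path/to/file.nc", chunks={})
print(ds["tas"].chunks)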

@wachsylon
Author

> would just break the data loading...

Oops, I missed that.

> I think it's reasonable to offload the responsibility for specifying chunks to the end-user, since they know which datasets they are loading

I want to have your end-users... :)

> The simplest compromise we came up with was to load datasets as a single chunk (via cdf_kwargs={"chunks": {}})

I see, it's not an intake issue. I found this xarray issue, which relates to my problem:
pydata/xarray#1440

Using only one chunk for a variable's array over an entire file is too large. I think it would be more reasonable to use the chunk sizes given in the original netCDF file.
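
For what it's worth, the on-disk chunking of a netCDF-4 file can be read with the netCDF4 library and fed back through cdf_kwargs; a rough sketch (the helper name and paths are hypothetical, and it assumes the files being combined share the same chunking):

import netCDF4

def native_chunks(path, varname):
    # return the on-disk chunk sizes of a variable as a dict usable
    # as xarray's 'chunks' argument
    with netCDF4.Dataset(path) as nc:
        var = nc.variables[varname]
        chunking = var.chunking()  # per-dimension sizes, or 'contiguous'
        if chunking == "contiguous":
            return {}
        return dict(zip(var.dimensions, chunking))

# for example:
# dsets = cat.to_dataset_dict(
#     cdf_kwargs={"chunks": native_chunks("/path/to/file.nc", "tas")})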
