Dask chunk sizes and different data access methods #350

Closed

wachsylon opened this issue Aug 3, 2021 · 6 comments

@wachsylon
Hi,
I appreciate intake a lot and use it to provide catalogs for the climate data of the German Climate Computing Center (DKRZ).
I have two questions:

  1. Is it possible to set, in the catalog descriptor, the chunk sizes that dask will use in to_dataset_dict? The data our catalogs describe is in netCDF format. When chunks is not specified in the function arguments, dask sets the chunk size to the full dimension size of each file instead of using the original file chunking. This leads to memory issues when working with the data afterwards.
  2. Can I define more than one access method, and therefore more than one asset column, in the descriptor? The CMIP6 netCDF data can be accessed either locally on the file system, where users need to be logged in, or, once published, remotely via OPeNDAP. But I don't see how to specify which column to_dataset_dict should use for opening the data.

Thanks a lot for any help.
Fabi

@andersy005
Member

@wachsylon,

> Is it possible to set, in the catalog descriptor, the chunk sizes that dask will use in to_dataset_dict? The data our catalogs describe is in netCDF format. When chunks is not specified in the function arguments, dask sets the chunk size to the full dimension size of each file instead of using the original file chunking. This leads to memory issues when working with the data afterwards.

Have you tried passing cdf_kwargs to to_dataset_dict()? Something along these lines:

dsets = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 10, 'lat': 90}})
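
cdf_kwargs is forwarded to xarray's open function for each netCDF asset, so the call above is, per file, roughly equivalent to this sketch (the path is hypothetical):

import xarray as xr

# roughly what intake-esm does for each netCDF asset (path hypothetical)
ds = xr.open_dataset("/path/to/file.nc", chunks={"time": 10, "lat": 90})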

> Can I define more than one access method, and therefore more than one asset column, in the descriptor? The CMIP6 netCDF data can be accessed either locally on the file system, where users need to be logged in, or, once published, remotely via OPeNDAP. But I don't see how to specify which column to_dataset_dict should use for opening the data.

I expect this to work. Under the hood, intake-esm figures out whether a file is on a regular POSIX filesystem or some remote filesystem. Here's a catalog example that exposes both the OPeNDAP URLs and the local paths:

In [1]: import intake

In [2]: url = "http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json"

In [3]: cat = intake.open_esm_datastore(url)

In [4]: cat.df.head()
Out[4]: 
  activity_id institution_id  ...                                        OPENDAP_url                                               path
0        CMIP           CAMS  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
1        CMIP           CAMS  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
2        CMIP      NOAA-GFDL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
3        CMIP      NOAA-GFDL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
4       CFMIP           IPSL  ...  http://mary.ldeo.columbia.edu:8080/opendap/CMI...  /m2/haibo/CMIP6mon/CFMIP/IPSL/IPSL-CM6A-LR/abr...

[5 rows x 13 columns]

In [5]: cat.path_column_name
Out[5]: 'OPENDAP_url'

If you take a peek at the JSON file, you'll see this section:

],
  "assets": {
    "column_name": "OPENDAP_url",
    "format": "netcdf"
  },

and it's this section of the catalog that tells intake-esm which column to use when loading files...

A simpler solution would be to maintain two JSON files: one that points to the local files and another that points to the OPeNDAP URLs. This way, whenever someone loads the catalog (JSON), they know exactly what type of access they are working with.

An alternative is to modify the path_column_name via:

cat.esmcol_data['assets']['column_name'] = 'path'

This approach is hacky and fragile, so I wouldn't recommend it; however, it may work for your use case.
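
Putting the workaround together, here is a minimal sketch using the catalog from above:

import intake

cat = intake.open_esm_datastore("http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json")

# switch the asset column from the OPeNDAP URLs to the local paths
# before loading (only sensible on the machine hosting those paths)
cat.esmcol_data["assets"]["column_name"] = "path"
dsets = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": 10}})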

@wachsylon
Author

@andersy005 Thanks for the quick answer.

> dsets = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 10, 'lat': 90}})

Yes, this works, but I would like to set the chunks kwarg as a default in the catalog's .json. However, if I try to do so like this:

    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id",
        "options": {"cdf_kwargs":{"chunks":{"time":1}}}
      },

then xarray receives the kwargs in concat or merge, where chunks is not a valid argument.
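
You can check that the combine step has no such parameter; a quick, intake-agnostic way to see this:

import inspect
import xarray as xr

# neither xr.concat nor xr.merge accepts 'chunks' or 'cdf_kwargs',
# which is why chunk settings placed in aggregation options fail there
print(inspect.signature(xr.concat))
print(inspect.signature(xr.merge))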

> An alternative is to modify the path_column_name via

Thanks for this idea; that is a good hack.

> A simpler solution would be to maintain two JSON files

This is what I did. Is it somehow possible to have a master .json that contains pointers to both JSON files?

@wachsylon
Author

For the second point, I figured it out:

cat master.yaml

description: 'DKRZ master catalog'
sources:

  CMIP6:
    args:
      esmcol_obj: "/mnt/lustre02/work/ik1017/Catalogs/mistral-cmip6.json"
    description: 'CMIP6 data on mistral'
    driver: intake.open_esm_datastore

  CORDEX:
    args:
      esmcol_obj: "/mnt/lustre01/work/kd0956/Catalogs/mistral-cordex.json"
    description: 'CORDEX data on mistral'
    driver: intake.open_esm_datastore
    metadata: {}
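
The sub-catalogs can then be opened through the master catalog; a brief usage sketch (the variable_id in the search is just a made-up example):

import intake

# open the YAML master catalog and pick one of the ESM datastores it points to
master = intake.open_catalog("master.yaml")
cat = master["CMIP6"]  # yields the same esm_datastore as intake.open_esm_datastore

dsets = cat.search(variable_id="tas").to_dataset_dict(
    cdf_kwargs={"chunks": {"time": 1}}
)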

@wachsylon
Author

@andersy005

Is it possible to add
"cdf_kwargs": {"chunks": {"time": 1}}
somewhere in the esm_collection .json file?

@andersy005
Member

andersy005 commented Aug 5, 2021

Allowing chunks to be defined globally in the catalog JSON file would cause issues, because a catalog can point to files whose variables have different dimension names. For instance, using a global chunking scheme of {'time': 10, 'lat': 90} when the catalog contains files that don't have a time dimension, or files whose lat dimension is named latitude, would just break the data loading...

I think it's reasonable to offload the responsibility for specifying chunks to the end-user, since they know which datasets they are loading. The simplest compromise we came up with was to load datasets as a single chunk (via cdf_kwargs={"chunks": {}}) when the user hasn't specified the chunks.
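
To see what that default actually gives you, open a file with chunks={} and inspect the result; a minimal sketch (file and variable names are hypothetical):

import xarray as xr

# chunks={} enables dask without prescribing chunk sizes;
# inspect the chunking you actually got
ds = xr.open_dataset("/path/to/file.nc", chunks={})
print(ds["tas"].chunks)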

@wachsylon
Author

> would just break the data loading...

Oops, I missed that.

> I think it's reasonable to offload the responsibility for specifying chunks to the end-user, since they know which datasets they are loading

I want to have your end-users... :)

> The simplest compromise we came up with was to load datasets as a single chunk (via cdf_kwargs={"chunks": {}})

I see, it's not an intake issue. I found this xarray issue, which relates to my problem:
pydata/xarray#1440

Using only one chunk for a variable's array over an entire file is too large. I think it would be more reasonable to use the chunk sizes given in the original netCDF file.
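
For what it's worth, the on-disk chunking of a netCDF-4 file can be read with the netCDF4 library and fed back through cdf_kwargs; a rough sketch (the helper name and paths are hypothetical, and it assumes the files being combined share the same chunking):

import netCDF4

def native_chunks(path, varname):
    # return the on-disk chunk sizes of a variable as a dict usable
    # as xarray's 'chunks' argument
    with netCDF4.Dataset(path) as nc:
        var = nc.variables[varname]
        chunking = var.chunking()  # per-dimension sizes, or 'contiguous'
        if chunking == "contiguous":
            return {}
        return dict(zip(var.dimensions, chunking))

# for example:
# dsets = cat.to_dataset_dict(
#     cdf_kwargs={"chunks": native_chunks("/path/to/file.nc", "tas")})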
