Dask chunk sizes and different data access methods #350
Comments
Have you tried passing cdf_kwargs?
dsets = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time': 10, 'lat': 90}})
I expect this to work. Under the hood, intake-esm figures out whether a file is on a regular POSIX filesystem or on some remote filesystem. Here's a catalog example that exposes both the openDAP URLs and the local paths:
In [1]: import intake
In [2]: url = "http://haden.ldeo.columbia.edu/catalogs/hyrax_cmip6.json"
In [3]: cat = intake.open_esm_datastore(url)
In [4]: cat.df.head()
Out[4]:
activity_id institution_id ... OPENDAP_url path
0 CMIP CAMS ... http://mary.ldeo.columbia.edu:8080/opendap/CMI... /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
1 CMIP CAMS ... http://mary.ldeo.columbia.edu:8080/opendap/CMI... /m2/haibo/CMIP6mon/CMIP/CAMS/CAMS-CSM1-0/histo...
2 CMIP NOAA-GFDL ... http://mary.ldeo.columbia.edu:8080/opendap/CMI... /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
3 CMIP NOAA-GFDL ... http://mary.ldeo.columbia.edu:8080/opendap/CMI... /m2/haibo/CMIP6mon/CMIP/NOAA-GFDL/GFDL-CM4/his...
4 CFMIP IPSL ... http://mary.ldeo.columbia.edu:8080/opendap/CMI... /m2/haibo/CMIP6mon/CFMIP/IPSL/IPSL-CM6A-LR/abr...
[5 rows x 13 columns]
In [5]: cat.path_column_name
Out[5]: 'OPENDAP_url'
If you take a peek at the JSON file, you see this section:
],
"assets": {
"column_name": "OPENDAP_url",
"format": "netcdf"
},
and it's this section of the catalog that tells intake-esm which column to use when loading files...
A simpler solution would be to maintain two JSON files, one that points to the local files and another that points to the openDAP URLs. This way, whenever someone loads the catalog (JSON), they know exactly what type of access they are working with.
An alternative is to modify the asset column on the loaded catalog:
cat.esmcol_data['assets']['column_name'] = 'path'
This approach is hacky and fragile, so I wouldn't recommend it. However, it may work for your use case.
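The two-JSON-file suggestion can be sketched with the standard library alone. This is a minimal illustration, not part of intake-esm: the helper name `with_asset_column` is hypothetical, and the descriptor below is a stand-in that contains only the "assets" section quoted in this thread (a real catalog JSON has more fields, which the copy would carry over unchanged).

```python
import copy
import json


def with_asset_column(descriptor: dict, column_name: str) -> dict:
    """Return a copy of a catalog descriptor whose assets point at `column_name`.

    Hypothetical helper: it only rewrites descriptor["assets"]["column_name"]
    and leaves the original descriptor untouched.
    """
    new_descriptor = copy.deepcopy(descriptor)
    new_descriptor["assets"]["column_name"] = column_name
    return new_descriptor


# Stand-in for the parsed catalog JSON; only "assets" is taken from the thread.
opendap_descriptor = {
    "assets": {"column_name": "OPENDAP_url", "format": "netcdf"},
}

# Variant that loads from the local paths instead of the openDAP URLs.
local_descriptor = with_asset_column(opendap_descriptor, "path")

# Serialize the variant; in practice this would be saved as a second JSON file.
print(json.dumps(local_descriptor, indent=2))
```

Keeping the two descriptors as separate files means the access method is fixed by which catalog a user opens, rather than by mutating the catalog object after loading.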
@andersy005 Thanks for the quick answer.
Yes, this works but I would like to set the
,
Thanks for this idea, that is a good hack.
This is what I did. Is it possible somehow to have a master
For the second point, I figured it out:
Is it possible to add
Allowing chunks to be defined globally in the catalog JSON file would cause problems, because a catalog can point to files whose variables have different dimension names. For instance, using a global chunking scheme of
I think it's reasonable to offload the responsibility for specifying chunks to the end user, since they know which datasets they are loading. The simplest compromise we came up with was to load datasets as a single chunk (via
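To illustrate the dimension-name concern, here is a small user-side sketch that restricts one requested chunking scheme to the dimensions a given dataset actually has. The helper `chunks_for_dims` is hypothetical and not part of intake-esm, and the dimension names `j` and `i` are assumptions chosen only to stand for a grid that does not use `lat`/`lon`.

```python
def chunks_for_dims(requested: dict, dims) -> dict:
    """Keep only the chunk entries whose dimension name occurs in `dims`."""
    return {dim: size for dim, size in requested.items() if dim in dims}


# The chunking scheme from the first reply in this thread.
requested = {"time": 10, "lat": 90}

# Two hypothetical datasets with different dimension names.
atmos_dims = ("time", "lat", "lon")
ocean_dims = ("time", "j", "i")

atmos_chunks = chunks_for_dims(requested, atmos_dims)
ocean_chunks = chunks_for_dims(requested, ocean_dims)

print(atmos_chunks)
print(ocean_chunks)
```

For the second dataset only the `time` entry survives, which shows why a single catalog-wide scheme cannot describe every file the catalog points to.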
I want to have your end-users... :)
I see, it's not an intake issue. I found this issue which relates to my problem:
A single chunk covering the variable array of an entire file is too large. I think it would be more reasonable to use the chunk sizes that were given in the original netCDF file.
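One way to approximate "use the file's own chunk sizes" on the user side is to turn the on-disk chunk shape into a chunks mapping. The sketch below assumes the chunk shape has already been read separately, for example with netCDF4's `Variable.chunking()`, which reports either the string "contiguous" or a list of chunk sizes. The helper `native_chunks` and the example numbers are hypothetical.

```python
def native_chunks(dims, chunksizes) -> dict:
    """Map dimension names to on-disk chunk sizes.

    Returns an empty dict when the variable is stored contiguously
    (no chunking information), so the caller can fall back to a default.
    """
    if chunksizes is None or chunksizes == "contiguous":
        return {}
    if len(dims) != len(chunksizes):
        raise ValueError("dims and chunksizes must have the same length")
    return dict(zip(dims, chunksizes))


# Hypothetical variable with dimensions (time, lat, lon) stored in
# chunks of one time step each.
chunks = native_chunks(("time", "lat", "lon"), [1, 96, 192])
print(chunks)
```

The resulting mapping could then be supplied as the `chunks` entry of `cdf_kwargs`. On-disk chunks are often small, so multiplying them by an integer factor per dimension may be a better fit for dask than using them unchanged.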
Hi,
I appreciate intake a lot and provide catalogs for climate data of the German Climate Computing Center.
I have two questions:
1. [...] chunks that dask will be using for the function to_dataset_dict in the catalog descriptor? The data that our catalogs describe is in netCDF format. When chunks is not specified in the function arguments, dask sets the chunk size to the dimension size of an entire file instead of the original file chunks. This leads to memory issues when working on the data afterwards.
2. [...] OpenDAP from remote. But I am missing how I can specify the column to use for opening the data in to_dataset_dict.
Thanks a lot for any help.
Fabi