Wrong chunk size does not allow importing some datasets #631
Comments
The mentioned datasets do not satisfy some assumptions made to find the chunk size when the DS is converted into an xarray Dataset. A more general formula should be applied in order to allow their use while maintaining the balance between performance and memory used.
@papesci I agree. In my branch https://github.com/CCI-Tools/cate/tree/623-nf-add_time_dim_on_open_dataset, I just removed that chunk size computation (and its failure) and replaced it with the chunk sizes effectively used in the NetCDF files. I primarily did this for image display speed, because the computed chunk sizes were often multiples of and/or not aligned with the chunk sizes used in the files. We may use the computed chunk sizes later, either controlled by a keyword arg or by default, if the file chunk sizes would require too much memory. @JanisGailis, what do you think?
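A minimal sketch of that approach (the file and variable names below are placeholders, not taken from this issue): read the chunk sizes stored in the NetCDF file and hand them to xarray as dask chunks.

```python
import netCDF4
import xarray as xr

PATH = "sst_cci_file.nc"            # placeholder file name
VAR = "analysed_sst"                # placeholder variable name

with netCDF4.Dataset(PATH) as nc:
    var = nc.variables[VAR]
    file_chunks = var.chunking()    # list of chunk sizes per dimension, or 'contiguous'
    dims = var.dimensions           # e.g. ('time', 'lat', 'lon')

if file_chunks == "contiguous":
    chunks = None                   # variable is not chunked in the file; open eagerly
else:
    chunks = dict(zip(dims, file_chunks))

ds = xr.open_dataset(PATH, chunks=chunks)
```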
Here be dragons!!! The assumption that
I would like to have something implemented that would let Cate open all of those datasets. But I would say that the default should be the current chunking implementation on
For context, please see the discussions I've had about this with
https://groups.google.com/forum/#!searchin/xarray/gailis|sort:date/xarray/MoVrG_792dg/Dam-7jrEBgAJ
And related:
I think we should apply a chunking strategy to the DS that doesn't fit in memory. The algorithm tries to find the optimum chunk distribution; nevertheless, sometimes
here is the Python code
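That snippet is not reproduced above. As a rough, illustrative sketch of such a memory-bounded heuristic (the 16 MiB per-chunk target, the dimension names, and the halving scheme are assumptions, not the actual Cate algorithm):

```python
import math

def compute_chunk_sizes(time_len, lat_len, lon_len, dtype_size,
                        target_bytes=16 * 1024 * 1024):
    """Halve the larger spatial dimension until one (time=1) chunk fits the target."""
    lat_chunk, lon_chunk = lat_len, lon_len
    while (lat_chunk * lon_chunk * dtype_size > target_bytes
           and (lat_chunk > 1 or lon_chunk > 1)):
        if lat_chunk >= lon_chunk:
            lat_chunk = math.ceil(lat_chunk / 2)
        else:
            lon_chunk = math.ceil(lon_chunk / 2)
    return {"time": 1, "lat": lat_chunk, "lon": lon_chunk}

# e.g. a global 0.05-degree float32 grid:
print(compute_chunk_sizes(365, 3600, 7200, dtype_size=4))
# -> {'time': 1, 'lat': 1800, 'lon': 1800}
```

Unlike a formula that assumes the spatial dimensions divide evenly, a heuristic of this kind always terminates and returns integer chunk sizes for arbitrary grid shapes.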
@JanisGailis I'm fully aware of what you wrote. But for CCI data as our primary data source, chunk sizes have often been selected carefully with spatial access performance in mind. Therefore the NetCDF chunk sizes are a good first guess, and @papesci is also right; however, there is never a single best chunking strategy. It depends on the use case. That's why we should make the dask chunking strategy a parameter, so we can use different settings for GUI image display (including fast on-the-fly tile pyramid generation) and, e.g., long-term time-series, batch-mode processing using the API and CLI.
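For illustration, the two use cases could simply pass different chunk specifications to xarray.open_dataset; the file name and chunk sizes below are made-up examples, not recommendations:

```python
import xarray as xr

PATH = "sst_cci_file.nc"   # placeholder file name

# GUI image display / tile pyramid generation: spatially aligned chunks, one time step at a time
ds_display = xr.open_dataset(PATH, chunks={"time": 1, "lat": 1024, "lon": 2048})

# Long-term time-series extraction in batch mode: favour long runs along the time axis
ds_series = xr.open_dataset(PATH, chunks={"time": 512, "lat": 64, "lon": 64})
```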
Bear in mind that netCDF chunks can be, and often are, different between variables in a dataset (see netCDF chunking). Either way, I completely agree that we need to make this better so that all those datasets can be opened. And really, for most datasets this is a non-issue, as a single time slice of global data fits in memory quite well. This matters mostly for some exceptions such as SST. So, whatever gets implemented, just make sure the UC6 happy path from
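A quick way to see this per-variable difference (the file name is a placeholder) is to ask the netCDF4 library directly:

```python
import netCDF4

with netCDF4.Dataset("sst_cci_file.nc") as nc:   # placeholder file name
    for name, var in nc.variables.items():
        # chunking() returns a list of chunk sizes per dimension, or 'contiguous'
        print(name, var.dimensions, var.chunking())
```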
@forman @JanisGailis
Should be fixed now. |
Expected behavior
Cate should allow using the following dataset
Actual behavior
At the moment, an exception is raised when a dataset of this collection larger than 250 MB is imported, with this error message:
The number reported could be different for different datasets.