
Wrong chunk size does not allow importing some datasets #631

Closed · papesci opened this issue May 1, 2018 · 8 comments
papesci (Contributor) commented May 1, 2018

Expected behavior

Cate should allow the use of the following datasets:

ID Dataset Name
36 esacci.LC.13-yrs.L4.WB.asar.Envisat.Map.4-0.r1
37 esacci.LC.5-yrs.L4.LCCS.multi-sensor.multi-platform.Map.1-6-1.r1
44 esacci.OC.5-days.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.r1
46 esacci.OC.5-days.L3S.RRS.multi-sensor.multi-platform.MERGED.2-0.r1
54 esacci.OC.8-days.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.r1
56 esacci.OC.8-days.L3S.RRS.multi-sensor.multi-platform.MERGED.2-0.r1
84 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1997-r1
86 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1998-r1
88 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1999-r1
90 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2000-r1
92 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2001-r1
94 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2002-r1
96 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2003-r1
98 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2004-r1
100 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2005-r1
102 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2006-r1
104 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2007-r1
106 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2008-r1
108 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2009-r1
110 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2010-r1
112 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2011-r1

Actual behavior

At the moment, an exception is raised when a dataset from this collection larger than 250 MB is imported, with the following error message:

[screenshot of the error message]

The reported number may differ between datasets.

papesci self-assigned this May 1, 2018
papesci (Contributor, Author) commented May 1, 2018

The datasets mentioned above do not satisfy some of the assumptions made when computing the chunk size as the dataset is converted into an xarray dataset. A more general formula should be applied so that these datasets can be used while maintaining the balance between performance and memory usage.

forman (Member) commented May 1, 2018

@papesci I agree. In my branch https://github.com/CCI-Tools/cate/tree/623-nf-add_time_dim_on_open_dataset, I just removed that chunk size computation (and its failure) and replaced it with the chunk sizes actually used in the NetCDF files. I did this primarily for image display speed, because the computed chunk sizes were often multiples of, and/or not aligned with, the chunk sizes used in the files. We may still use the computed chunk sizes later, either controlled by a keyword argument or by default, if the file chunk sizes would require too much memory. @JanisGailis, what do you think?
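
For illustration, a minimal sketch of reusing the chunk sizes stored in a NetCDF file as dask chunks when opening a dataset with xarray (this is not the actual code on that branch; the function name is a placeholder):

import xarray as xr

def open_with_file_chunks(path):
    # Open lazily first, only to read the on-disk chunking from each variable's encoding.
    with xr.open_dataset(path) as ds:
        chunks = {}
        for var in ds.data_vars.values():
            sizes = var.encoding.get('chunksizes')
            if sizes:
                # Map the NetCDF chunk sizes onto the variable's dimensions,
                # keeping the largest chunk seen per dimension.
                for dim, size in zip(var.dims, sizes):
                    chunks[dim] = max(chunks.get(dim, 0), size)
    # Re-open with dask chunks aligned to the file chunks.
    return xr.open_dataset(path, chunks=chunks) if chunks else xr.open_dataset(path)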

JanisGailis (Member) commented May 1, 2018

Here be dragons!!!

The assumption that netCDF chunks are fine to use as Dask chunks is not always correct. This was the main thing that tripped up SST data and the SST operations we needed for Use Case 6. So, whatever is being done, please be careful to make sure that Use Case 6 still works on entry-level machines with 4 GB of memory.

I would like to have something implemented that lets Cate open all of those datasets. But I would say the default should be the current chunking implementation on master, and only if that fails should something else be tried.

For context, please see the discussions I have had about this with the xarray developers:

https://groups.google.com/forum/#!searchin/xarray/gailis|sort:date/xarray/MoVrG_792dg/Dam-7jrEBgAJ
pydata/xarray#1440

And related:
pydata/xarray#1832

papesci (Contributor, Author) commented May 2, 2018

I think we should apply a chunking strategy to datasets that do not fit in memory.
There is one case in which finding a chunk size is easy: when the dataset is regular in the size of its dimensions (a square, a cube, ...). So we could, in principle, reduce the problem to a regular shape by scaling each dimension to match the largest one, find the chunk size there, and then rescale it back to the original size.

The algorithm tries to find the optimal chunk distribution; when that is not possible, it looks for an approximation, eventually falling back to a single chunk.

Here is the Python code:

  • dims: dictionary of dimension sizes, e.g. {'lat': 12000, 'lon': 800}
  • ds_size: size of the entire dataset in bytes
  • threshold: maximum size in bytes of each chunk

import math

def compute_chunk(dims, ds_size, threshold):
    # An empty dictionary means no chunking is required.
    if ds_size <= threshold:
        return dict()

    # Dimensions of size one need no further chunking.
    chunks = {k: v for k, v in dims.items() if v == 1}

    # Dimensions to process.
    c_dims = {k: v for k, v in dims.items() if v > 1}

    # Normalize the problem to a cube: scale every dimension up to the largest one.
    scale = dict()
    pivot = max(c_dims.values())
    normalization_factor = 1
    for d, v in c_dims.items():
        scale[d] = pivot / v
        normalization_factor *= scale[d]

    # k_factor is the number of chunks needed for the scaled problem.
    k_factor = math.ceil(ds_size * normalization_factor / threshold)

    # Now that the problem has a regular shape, take the n-th root to find
    # the number of chunks per side.
    bucket = math.ceil(k_factor ** (1 / len(c_dims)))

    # Find the correct number of scaled buckets for each dimension,
    # falling back to a single chunk when no suitable divisor is found.
    for d, v in c_dims.items():
        chunks[d] = v
        j = math.ceil(bucket / scale[d])
        for i in range(j, 0, -1):
            if v % i == 0:
                chunks[d] = int(v / i)
                break
    return chunks
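
For illustration (hypothetical dimension sizes, 250 MB threshold), the function above would yield:

compute_chunk({'time': 1, 'lat': 4320, 'lon': 8640}, 1_200_000_000, 250_000_000)
# -> {'time': 1, 'lat': 2160, 'lon': 2160}, i.e. 8 chunks of roughly 150 MB each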

forman (Member) commented May 3, 2018

@JanisGailis I'm fully aware of what you wrote. But for CCI data, our primary data source, chunk sizes have often been selected carefully with spatial access performance in mind. The NetCDF chunk sizes are therefore a good first guess, and @papesci is also right; however, there is never a single best chunking strategy. It depends on the use case.

That's why we should make the dask chunking strategy a parameter, so we can use different settings for GUI image display (including fast on-the-fly tile pyramid generation) and for, e.g., long-term time series or batch-mode processing via the API and CLI.
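
A rough sketch of that idea (the parameter name, strategy names, and signature are assumptions, not Cate's actual API), reusing compute_chunk from the comment above:

import xarray as xr

def open_dataset(path, chunking='file', threshold=250_000_000):
    ds = xr.open_dataset(path)
    if chunking == 'file':
        # Reuse the chunk sizes stored in the NetCDF file (good for tiled image display).
        chunks = {}
        for var in ds.data_vars.values():
            sizes = var.encoding.get('chunksizes')
            if sizes:
                for dim, size in zip(var.dims, sizes):
                    chunks[dim] = max(chunks.get(dim, 0), size)
        return ds.chunk(chunks) if chunks else ds
    if chunking == 'computed':
        # Derive chunk sizes from a memory threshold, e.g. with compute_chunk above.
        chunks = compute_chunk(dict(ds.sizes), ds.nbytes, threshold)
        return ds.chunk(chunks) if chunks else ds
    # chunking == 'none': keep the dataset unchunked (plain in-memory arrays).
    return ds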

JanisGailis (Member) commented

Bear in mind that netCDF chunks can be, and often are, different between variables in a dataset (see netCDF chunking).

Either way, I completely agree that we need to improve this so that all those datasets can be opened. And really, for most datasets this is a non-issue, as a single time slice of global data fits in memory quite well. This matters mostly for a few exceptions such as SST. So, whatever gets implemented, just make sure the UC6 happy path from cate-e2e still works.

papesci (Contributor, Author) commented May 4, 2018

@forman @JanisGailis
I like the option of letting the user select their own chunking strategy. It is indeed hard to find a general rule, but an expert user could take advantage of an optimization facility like chunking. However, we need to store that preference so it can be reused the next time the dataset is opened.

forman (Member) commented May 8, 2018

Should be fixed now.
