
Wrong chunk size does not allow importing some datasets #631

Closed · papesci opened this issue May 1, 2018 · 8 comments
papesci (Contributor) commented May 1, 2018

Expected behavior

Cate should allow the use of the following datasets:

ID Dataset Name
36 esacci.LC.13-yrs.L4.WB.asar.Envisat.Map.4-0.r1
37 esacci.LC.5-yrs.L4.LCCS.multi-sensor.multi-platform.Map.1-6-1.r1
44 esacci.OC.5-days.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.r1
46 esacci.OC.5-days.L3S.RRS.multi-sensor.multi-platform.MERGED.2-0.r1
54 esacci.OC.8-days.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.r1
56 esacci.OC.8-days.L3S.RRS.multi-sensor.multi-platform.MERGED.2-0.r1
84 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1997-r1
86 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1998-r1
88 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.1999-r1
90 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2000-r1
92 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2001-r1
94 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2002-r1
96 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2003-r1
98 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2004-r1
100 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2005-r1
102 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2006-r1
104 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2007-r1
106 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2008-r1
108 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2009-r1
110 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2010-r1
112 esacci.OC.day.L3S.OC_PRODUCTS.multi-sensor.multi-platform.MERGED.2-0.2011-r1

Actual behavior

At the moment, an exception is raised when a dataset from this collection larger than 250 MB is imported, with the following error message:

[screenshot of the error message]

The reported number may differ between datasets.

papesci self-assigned this May 1, 2018
papesci (Contributor, Author) commented May 1, 2018

The datasets mentioned above do not satisfy some of the assumptions made when computing the chunk size as the dataset is converted into an xarray dataset. A more general formula should be applied so that these datasets can be used while maintaining the balance between performance and memory usage.

forman (Member) commented May 1, 2018

@papesci I agree. In my branch https://github.com/CCI-Tools/cate/tree/623-nf-add_time_dim_on_open_dataset, I just removed that chunk size computation (and its failure) and replaced it with the chunk sizes actually used in the NetCDF files. I did this primarily for image display speed, because the computed chunk sizes were often multiples of, and/or not aligned with, the chunk sizes used in the files. We may still use the computed chunk sizes later, either controlled by a keyword argument or by default, if the file chunk sizes would require too much memory. @JanisGailis, what do you think?
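
For illustration, a minimal sketch of reusing the chunk sizes stored in a NetCDF file as dask chunks when opening a dataset with xarray (this is not the actual code on that branch; the function name is a placeholder):

import xarray as xr

def open_with_file_chunks(path):
    # Open lazily first, only to read the on-disk chunking from each variable's encoding.
    with xr.open_dataset(path) as ds:
        chunks = {}
        for var in ds.data_vars.values():
            sizes = var.encoding.get('chunksizes')
            if sizes:
                # Map the NetCDF chunk sizes onto the variable's dimensions,
                # keeping the largest chunk seen per dimension.
                for dim, size in zip(var.dims, sizes):
                    chunks[dim] = max(chunks.get(dim, 0), size)
    # Re-open with dask chunks aligned to the file chunks.
    return xr.open_dataset(path, chunks=chunks) if chunks else xr.open_dataset(path)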

JanisGailis (Member) commented May 1, 2018

Here be dragons!!!

The assumption that netCDF chunks are fine to use as Dask chunks is not always correct. This was the main thing that tripped up SST data and the SST operations we needed for Use Case 6. So, whatever is being done, please be careful to make sure that Use Case 6 still works on entry-level machines with 4 GB of memory.

I would like to have something implemented that lets Cate open all of those datasets. But I would say the default should be the current chunking implementation on master, and only if that fails should something else be tried.

For context, please see the discussions I have had about this with the xarray developers:

https://groups.google.com/forum/#!searchin/xarray/gailis|sort:date/xarray/MoVrG_792dg/Dam-7jrEBgAJ
pydata/xarray#1440

And related:
pydata/xarray#1832

papesci (Contributor, Author) commented May 2, 2018

I think we should apply a chunking strategy to datasets that do not fit in memory.
There is one case in which finding a chunk size is easy: when the dataset is regular in the size of its dimensions (a square, a cube, ...). So we could, in principle, reduce the problem to a regular shape by scaling each dimension to match the largest one, find the chunk size there, and then rescale it back to the original size.

The algorithm tries to find the optimal chunk distribution; when that is not possible, it looks for an approximation, eventually falling back to a single chunk.

Here is the Python code:

  • dims: dictionary of dimension sizes, e.g. {'lat': 12000, 'lon': 800}
  • ds_size: size of the entire dataset in bytes
  • threshold: maximum size in bytes of each chunk

import math

def compute_chunk(dims, ds_size, threshold):
    # An empty dictionary means no chunking is required.
    if ds_size <= threshold:
        return dict()

    # Dimensions of size one need no further chunking.
    chunks = {k: v for k, v in dims.items() if v == 1}

    # Dimensions to process.
    c_dims = {k: v for k, v in dims.items() if v > 1}

    # Normalize the problem to a cube: scale every dimension up to the largest one.
    scale = dict()
    pivot = max(c_dims.values())
    normalization_factor = 1
    for d, v in c_dims.items():
        scale[d] = pivot / v
        normalization_factor *= scale[d]

    # k_factor is the number of chunks needed for the scaled problem.
    k_factor = math.ceil(ds_size * normalization_factor / threshold)

    # Now that the problem has a regular shape, take the n-th root to find
    # the number of chunks per side.
    bucket = math.ceil(k_factor ** (1 / len(c_dims)))

    # Find the correct number of scaled buckets for each dimension,
    # falling back to a single chunk when no suitable divisor is found.
    for d, v in c_dims.items():
        chunks[d] = v
        j = math.ceil(bucket / scale[d])
        for i in range(j, 0, -1):
            if v % i == 0:
                chunks[d] = int(v / i)
                break
    return chunks
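
For illustration (hypothetical dimension sizes, 250 MB threshold), the function above would yield:

compute_chunk({'time': 1, 'lat': 4320, 'lon': 8640}, 1_200_000_000, 250_000_000)
# -> {'time': 1, 'lat': 2160, 'lon': 2160}, i.e. 8 chunks of roughly 150 MB each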

forman (Member) commented May 3, 2018

@JanisGailis I'm fully aware of what you wrote. But for CCI data, our primary data source, chunk sizes have often been selected carefully with spatial access performance in mind. The NetCDF chunk sizes are therefore a good first guess, and @papesci is also right; however, there is never a single best chunking strategy. It depends on the use case.

That's why we should make the dask chunking strategy a parameter, so we can use different settings for GUI image display (including fast on-the-fly tile pyramid generation) and for, e.g., long-term time series or batch-mode processing via the API and CLI.
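
A rough sketch of that idea (the parameter name, strategy names, and signature are assumptions, not Cate's actual API), reusing compute_chunk from the comment above:

import xarray as xr

def open_dataset(path, chunking='file', threshold=250_000_000):
    ds = xr.open_dataset(path)
    if chunking == 'file':
        # Reuse the chunk sizes stored in the NetCDF file (good for tiled image display).
        chunks = {}
        for var in ds.data_vars.values():
            sizes = var.encoding.get('chunksizes')
            if sizes:
                for dim, size in zip(var.dims, sizes):
                    chunks[dim] = max(chunks.get(dim, 0), size)
        return ds.chunk(chunks) if chunks else ds
    if chunking == 'computed':
        # Derive chunk sizes from a memory threshold, e.g. with compute_chunk above.
        chunks = compute_chunk(dict(ds.sizes), ds.nbytes, threshold)
        return ds.chunk(chunks) if chunks else ds
    # chunking == 'none': keep the dataset unchunked (plain in-memory arrays).
    return ds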

JanisGailis (Member) commented

Bear in mind that netCDF chunks can be, and often are, different between variables in a dataset (see netCDF chunking).

Either way, I completely agree that we need to improve this so that all those datasets can be opened. And really, for most datasets this is a non-issue, as a single time slice of global data fits in memory quite well. This matters mostly for a few exceptions such as SST. So, whatever gets implemented, just make sure the UC6 happy path from cate-e2e still works.

papesci (Contributor, Author) commented May 4, 2018

@forman @JanisGailis
I like the option of letting the user select their own chunking strategy. It is indeed hard to find a general rule, but an expert user could take advantage of an optimization facility like chunking. However, we need to store that preference so it can be reused the next time the dataset is opened.

forman (Member) commented May 8, 2018

Should be fixed now.
