xarray with and without dask

An investigation of how xarray handles access to a large number of files.

xarray.open_dataset and xarray.open_mfdataset can use dask as a backend
to trigger the use of dask, the chunks parameter must be passed
the chunks parameter is not taken from the _ChunkSizes netCDF variable attribute, so chunk sizes have to be given explicitly
with dask, a very large number of netCDF files can be opened as one dataset
tested with
- 1 year of daily SST (366 files)
- 10 years of monthly Aerosol (117 files)
for a very large number of files see the xarray issue "open_mfdataset too many files" (#463)
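
As a minimal sketch of the point above, the following opens all files matching a pattern as one dask-backed dataset. The helper name open_lazily and the chunk sizes are assumptions made for illustration; they are not taken from the original test script. The dimension names (time, lat, lon) follow the SST dimensions reported further down.

```python
import glob

import xarray as xr


def open_lazily(pattern, chunks):
    """Open every file matching `pattern` as one dask-backed dataset.

    `open_lazily` is a hypothetical helper, not part of xarray.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Passing chunks= is what makes xarray wrap the variables in dask
    # arrays; the data is only read when it is actually needed.
    return xr.open_mfdataset(paths, chunks=chunks, combine="by_coords")


# Example call (chunk sizes are illustrative, not tuned):
# ds = open_lazily(
#     "*-ESACCI-L4_GHRSST-SSTdepth-OSTIA-GLOB_LT-v02.0-fv01.1.nc",
#     chunks={"time": 1, "lat": 900, "lon": 1800},
# )
```

Because _ChunkSizes is not read from the files, the chunks dictionary has to be chosen by hand; one time step per chunk is a simple starting point when each file holds one day.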

AEROSOL: "*-ESACCI-L3C_AEROSOL-AER_PRODUCTS-AATSR-ENVISAT-ADV_MOTNHLY-v2.30.nc"

===================================================
using xarray
num datasets:  116
TIME for open        :  0:00:00.916795
TIME for combine     :  0:00:02.849429
===================================================
using xarray + dask
num datasets:  116
TIME for open        :  0:00:01.523071
TIME for combine     :  0:00:02.081897
===================================================
dimensions:  {'time': 116, 'longitude': 360, 'latitude': 180}
===================================================

SST: "*-ESACCI-L4_GHRSST-SSTdepth-OSTIA-GLOB_LT-v02.0-fv01.1.nc"

===================================================
using xarray (3 days)
num datasets:  3
TIME for open        :  0:00:00.202182
TIME for combine     :  0:00:04.096742
===================================================
using xarray + dask (3 days)
num datasets:  3
TIME for open        :  0:00:00.119849
TIME for combine     :  0:00:00.060847
===================================================
using xarray + dask (1 year)
num datasets:  366
TIME for open        :  0:00:15.474201
TIME for combine     :  0:00:07.600017
===================================================
dimensions:  {'lat': 3600, 'lon': 7200, 'bnds': 2, 'time': 366}
===================================================

without dask, combining 10 files leads to a memory usage above 15 GB,
so there is no comparison for a full year of data
the on-disk size of 1 year is 6.6 GB
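
A rough back-of-the-envelope calculation shows why eager loading is expensive at this resolution. It assumes that a variable is held in memory as 64-bit floats once decoded; the number of variables per file is not stated here, so the sketch only gives the size of a single 3600 x 7200 field and does not try to reproduce the 15 GB figure.

```python
def field_bytes(n_lat, n_lon, itemsize):
    """In-memory size in bytes of one 2-D field for one time step."""
    return n_lat * n_lon * itemsize


# Grid size taken from the SST dimensions reported above.
N_LAT, N_LON = 3600, 7200

# Assumption: values are decoded to float64 (8 bytes per value).
per_step = field_bytes(N_LAT, N_LON, 8)
per_year = per_step * 366

print(f"one field, one day : {per_step / 1e9:.2f} GB")  # 0.21 GB
print(f"one field, 366 days: {per_year / 1e9:.1f} GB")  # 75.9 GB
```

Under that assumption, a single variable for a full year is already far larger than the 6.6 GB the files occupy on disk, which is why only the dask-backed variant was run for the whole year.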

time series / subsets

the combined xarray dataset can be used to extract time series or space-time subsets

time series (lat point / lon point)

TIME for time_series:  0:00:00.575770
TIME for ts_load     :  0:00:09.502877

subset (lat slice / lon slice / time slice)

TIME for subset      :  0:00:00.015677
TIME for subset load :  0:00:00.187682
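dimensions of subset:  {'lat': 300, 'lon': 300, 'bnds': 2, 'time': 0}
(a time dimension of 0 means the time slice matched no time steps, so the subset timings above are for an empty selection)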
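
The two access patterns timed above can be sketched on a small synthetic dataset. The data, the grid, the variable name analysed_sst and the selected coordinates are all made up for illustration; only the pattern (select lazily, then load) reflects what the timings measure.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the combined SST dataset (1 degree grid,
# 6 days); the real data is 3600 x 7200 per day.
times = pd.date_range("2010-01-01", periods=6, freq="D")
lat = np.linspace(-89.5, 89.5, 180)
lon = np.linspace(-179.5, 179.5, 360)
data = np.random.default_rng(0).random((times.size, lat.size, lon.size))

ds = xr.Dataset(
    {"analysed_sst": (("time", "lat", "lon"), data)},
    coords={"time": times, "lat": lat, "lon": lon},
).chunk({"time": 1})  # dask-backed, one chunk per time step

# Time series at one point: the selection itself is lazy and cheap ...
ts = ds["analysed_sst"].sel(lat=53.0, lon=10.0, method="nearest")
# ... and load() does the actual reading, touching every time step.
ts_values = ts.load()

# Space-time subset: slices along lat, lon and time, then load.
subset = ds.sel(
    lat=slice(40, 60),
    lon=slice(0, 20),
    time=slice("2010-01-02", "2010-01-04"),
).load()

print(dict(ts_values.sizes))
print(dict(subset.sizes))
```

A point time series has to touch every time step, and with one file per day that means every file, which fits the much longer ts_load time compared with the selection step. Checking the sizes of a subset before loading it is a cheap way to catch a slice that matches nothing.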
