This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

xarray with and without dask

Marco Zühlke edited this page May 19, 2016 · 2 revisions

An investigation into how xarray handles access to a large number of files.

  • xarray.open_dataset and xarray.open_mfdataset can use dask as a backend
  • to trigger the use of dask, the chunks parameter is required
  • the chunk sizes are not taken from the _ChunkSizes netCDF variable attribute, so they have to be given explicitly
  • with dask, a huge number of netCDF files can be opened
  • tested with
    • 1 year of daily SST (366 files)
    • 10 years of monthly Aerosol (117 files)
  • for a very large number of files, see "open_mfdataset too many files" (#463)
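
The points above can be illustrated with a minimal sketch. It is not the script used for this investigation: the helper names `write_daily_files` and `open_lazily`, the variable name `sst`, and the tiny synthetic files are all assumptions made so that the example is self-contained. It requires xarray, dask and a netCDF backend to be installed.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr


def write_daily_files(directory, num_days=3):
    """Write tiny synthetic daily files (hypothetical stand-ins for real products)."""
    times = pd.date_range("2010-01-01", periods=num_days, freq="D")
    paths = []
    for i in range(num_days):
        ds = xr.Dataset(
            {"sst": (("time", "lat", "lon"), np.full((1, 4, 8), float(i)))},
            coords={
                "time": times[i : i + 1],
                "lat": np.linspace(-60.0, 60.0, 4),
                "lon": np.linspace(-140.0, 140.0, 8),
            },
        )
        path = str(Path(directory) / f"sst_{i:03d}.nc")
        ds.to_netcdf(path)
        paths.append(path)
    return paths


def open_lazily(paths):
    # Passing `chunks` asks xarray to back each variable with a dask array,
    # so the values are only read when a computation needs them.
    return xr.open_mfdataset(paths, chunks={"time": 1})


tmp_dir = tempfile.mkdtemp()
paths = write_daily_files(tmp_dir)
ds = open_lazily(paths)

print(dict(ds.sizes))    # dimension sizes of the combined dataset
print(ds["sst"].chunks)  # a tuple of chunk sizes means the variable is dask-backed
```

Here the chunk size of one time step per file is an arbitrary choice for the sketch; a suitable value depends on the data layout and the available memory.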

AEROSOL: "*-ESACCI-L3C_AEROSOL-AER_PRODUCTS-AATSR-ENVISAT-ADV_MOTNHLY-v2.30.nc"

===================================================
using xarray
num datasets:  116
TIME for open        :  0:00:00.916795
TIME for combine     :  0:00:02.849429
===================================================
using xarray + dask
num datasets:  116
TIME for open        :  0:00:01.523071
TIME for combine     :  0:00:02.081897
===================================================
dimensions:  {'time': 116, 'longitude': 360, 'latitude': 180}
===================================================

SST: "*-ESACCI-L4_GHRSST-SSTdepth-OSTIA-GLOB_LT-v02.0-fv01.1.nc"

===================================================
using xarray (3 days)
num datasets:  3
TIME for open        :  0:00:00.202182
TIME for combine     :  0:00:04.096742
===================================================
using xarray + dask (3 days)
num datasets:  3
TIME for open        :  0:00:00.119849
TIME for combine     :  0:00:00.060847
===================================================
using xarray + dask (1 year)
num datasets:  366
TIME for open        :  0:00:15.474201
TIME for combine     :  0:00:07.600017
===================================================
dimensions:  {'lat': 3600, 'lon': 7200, 'bnds': 2, 'time': 366}
===================================================
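
The measurement script itself is not part of this page. As a rough sketch of how "open" and "combine" timings like the ones above could be collected, the hypothetical helper `timed` below records wall-clock time around a block of code. The in-memory datasets and the use of `xarray.concat` as the combine step are assumptions for illustration only.

```python
from contextlib import contextmanager
from datetime import datetime, timedelta

import numpy as np
import xarray as xr

timings = {}


@contextmanager
def timed(label):
    """Record and print the wall-clock time spent inside the `with` block."""
    start = datetime.now()
    try:
        yield
    finally:
        timings[label] = datetime.now() - start
        print(f"TIME for {label:<12}: ", timings[label])


# Three tiny in-memory datasets stand in for files that were already opened.
datasets = [
    xr.Dataset({"sst": (("time", "lat", "lon"), np.full((1, 4, 8), float(i)))})
    for i in range(3)
]

# Assumption: "combine" means concatenating the datasets along the time dimension.
with timed("combine"):
    combined = xr.concat(datasets, dim="time")

print("num datasets: ", len(datasets))
```

With dask-backed datasets the combine step mostly builds a task graph, which would be one plausible reason for the short combine times in the dask runs.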

  • without dask, opening 10 files leads to memory usage above 15 GB
  • so there is no comparison without dask for a full year of data
  • the on-disk size of 1 year is 6.6 GB
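
A back-of-the-envelope calculation suggests why eager loading gets expensive so quickly. The grid size comes from the dimensions printed above. The 8 bytes per value is an assumption (values decoded to float64), and the number of variables per file is not stated here, so the result is only a per-variable estimate.

```python
LAT, LON = 3600, 7200  # grid size from the SST dimensions
BYTES_PER_VALUE = 8    # assumption: values are decoded to float64
NUM_FILES = 10         # one time step per daily file

bytes_per_field = LAT * LON * BYTES_PER_VALUE
gb_per_variable = NUM_FILES * bytes_per_field / 1e9

print(f"one field: {bytes_per_field / 1e6:.0f} MB")
print(f"{NUM_FILES} files, one variable: {gb_per_variable:.2f} GB")
```

Several variables per file, plus temporary copies made while combining, would multiply this figure.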

time series / subsets

  • the combined xarray dataset can be used to create time series or space-time subsets

time series (lat point / lon point)

TIME for time_series:  0:00:00.575770
TIME for ts_load     :  0:00:09.502877

subset (lat slice / lon slice / time slice)

TIME for subset      :  0:00:00.015677
TIME for subset load :  0:00:00.187682
dimensions of subset:  {'lat': 300, 'lon': 300, 'bnds': 2, 'time': 0}
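
The two access patterns timed above can be sketched as follows. This is an illustration on a small synthetic dask-backed dataset, not the original test code. The variable name `sst`, the 1-degree grid and the chosen coordinates are assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data: every grid cell holds the index of its day (0..9).
times = pd.date_range("2010-01-01", periods=10, freq="D")
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(-179.5, 180.0, 1.0)
data = np.arange(10, dtype="float32")[:, None, None] * np.ones(
    (1, lat.size, lon.size), dtype="float32"
)
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), data)},
    coords={"time": times, "lat": lat, "lon": lon},
).chunk({"time": 1})

# Time series at the grid point nearest to a lat/lon position.
# Selecting is lazy; `load()` is the step that actually reads the values.
ts = ds["sst"].sel(lat=53.5, lon=10.2, method="nearest")
ts_loaded = ts.load()

# Space-time subset using lat, lon and time slices.
subset = ds.sel(
    lat=slice(40, 60),
    lon=slice(0, 20),
    time=slice("2010-01-03", "2010-01-05"),
)
subset_loaded = subset.load()
print("dimensions of subset: ", dict(subset_loaded.sizes))

# A time slice that matches no timestamps yields a time dimension of size 0.
empty = ds.sel(time=slice("2011-01-01", "2011-01-31"))
print("dimensions of empty selection: ", dict(empty.sizes))
```

Selecting is cheap and loading is where the time goes, which matches the split between the selection and load timings above. A `time` size of 0, as in the printed subset, is what xarray returns when the time slice matches no timestamps. Whether that was intended in the original run is not stated.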