# xarray with and without dask
An investigation into how xarray can handle access to a large number of files.
- `xarray.open_dataset` and `xarray.open_mfdataset` can use dask as a backend
  - to trigger usage of dask, the parameter `chunks` is required (see the sketch below this list)
  - the `chunks` parameter is not taken from the `_ChunkSizes` netCDF variable attribute
- allows opening huge amounts of netCDF files
  - tested with
    - 1 year of daily SST (366 files)
    - 10 years of monthly Aerosol (117 files)
  - for a very large number of files, see xarray issue #463 (open_mfdataset too many files)
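A minimal sketch of the two opening paths, assuming a set of daily SST files matching the hypothetical pattern `sst_daily_*.nc`; the chunk sizes are illustrative and chosen here, not read from the files:

```python
import glob

import xarray as xr

files = sorted(glob.glob("sst_daily_*.nc"))  # hypothetical file pattern

# Without chunks: open_dataset returns plain (non-dask) arrays that are
# read fully into memory on access.
ds_eager = xr.open_dataset(files[0])

# With chunks: the variables become lazy dask arrays; the chunk sizes
# must be given explicitly and are not taken from _ChunkSizes.
ds_lazy = xr.open_dataset(files[0], chunks={"time": 1})

# Many files combined along the time dimension, dask-backed.
ds_all = xr.open_mfdataset(files, chunks={"time": 1})
print(ds_all)  # prints metadata only; nothing is computed yet
```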
Timings for the monthly Aerosol data:

```
===================================================
using xarray
num datasets: 116
TIME for open    : 0:00:00.916795
TIME for combine : 0:00:02.849429
===================================================
using xarray + dask
num datasets: 116
TIME for open    : 0:00:01.523071
TIME for combine : 0:00:02.081897
===================================================
dimensions: {'time': 116, 'longitude': 360, 'latitude': 180}
===================================================
```
Timings for the daily SST data (3 days and 1 full year):

```
===================================================
using xarray (3 days)
num datasets: 3
TIME for open    : 0:00:00.202182
TIME for combine : 0:00:04.096742
===================================================
using xarray + dask (3 days)
num datasets: 3
TIME for open    : 0:00:00.119849
TIME for combine : 0:00:00.060847
===================================================
using xarray + dask (1 year)
num datasets: 366
TIME for open    : 0:00:15.474201
TIME for combine : 0:00:07.600017
===================================================
dimensions: {'lat': 3600, 'lon': 7200, 'bnds': 2, 'time': 366}
===================================================
```
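The "open" and "combine" timings above suggest a harness that opens every file individually and then combines the results along the time dimension. The actual benchmark script is not shown on this page; the following is a minimal sketch under that assumption, using `xr.concat` and a hypothetical file pattern:

```python
import glob
from datetime import datetime

import xarray as xr

files = sorted(glob.glob("sst_daily_*.nc"))  # hypothetical file pattern

def benchmark(chunks=None):
    """Open every file, then combine; print how long each step takes."""
    t0 = datetime.now()
    datasets = [xr.open_dataset(f, chunks=chunks) for f in files]
    t1 = datetime.now()
    combined = xr.concat(datasets, dim="time")
    t2 = datetime.now()
    print("num datasets:", len(datasets))
    print("TIME for open    :", t1 - t0)
    print("TIME for combine :", t2 - t1)
    return combined

ds_plain = benchmark()                    # "using xarray"
ds_dask  = benchmark(chunks={"time": 1})  # "using xarray + dask"
print("dimensions:", dict(ds_dask.sizes))
```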
- without dask, opening 10 files already leads to memory usage above 15 GB
  - so no comparison with a full year's worth of data was possible
  - the on-disk size of 1 year is 6.6 GB
- this xarray dataset can be used to create time series or space-time based subsets (see the sketch after the timings below)
```
TIME for time_series : 0:00:00.575770
TIME for ts_load     : 0:00:09.502877
TIME for subset      : 0:00:00.015677
TIME for subset load : 0:00:00.187682
dimensions of subset: {'lat': 300, 'lon': 300, 'bnds': 2, 'time': 0}
```
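A minimal sketch of such time-series and subset access, assuming the dask-backed `ds_all` from the opening sketch above; the variable name `analysed_sst` and the selected coordinates are hypothetical:

```python
# Point time series at one location: the selection itself is lazy and
# cheap; the subsequent .load() triggers the actual file reads.
ts = ds_all["analysed_sst"].sel(lat=45.0, lon=10.0, method="nearest")
ts_values = ts.load()

# Space-time subset: a 300 x 300 pixel box, again loaded explicitly.
subset = ds_all.isel(lat=slice(0, 300), lon=slice(0, 300))
subset_loaded = subset.load()
print("dimensions of subset:", dict(subset_loaded.sizes))
```

This matches the pattern in the timings: selecting is nearly free, while the `load` steps account for almost all of the elapsed time.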