
to_netcdf(compute=False) can be slow #2242

Closed
neishm opened this issue Jun 21, 2018 · 5 comments
Comments

@neishm (Contributor) commented Jun 21, 2018

Code Sample

import xarray as xr
from dask.array import ones
import dask
from dask.diagnostics import ProgressBar
ProgressBar().register()

# Define a mock DataSet
dset = {}
for i in range(5):
  name = 'var'+str(i)
  data = i*ones((8,79,200,401),dtype='f4',chunks=(1,1,200,401))
  var = xr.DataArray(data=data, dims=('time','level','lat','lon'), name=name)
  dset[name] = var
dset = xr.Dataset(dset)

# Single thread to facilitate debugging.
# (may require dask < 0.18)
with dask.set_options(get=dask.get):

  # This works fine.
  print ("Testing immediate netCDF4 writing")
  dset.to_netcdf("test1.nc")

  # This can be twice as slow as the version above.
  # Can be even slower (like 10x slower) on a shared filesystem.
  print ("Testing delayed netCDF4 writing")
  dset.to_netcdf("test2.nc",compute=False).compute()

Problem description

Using the delayed version of to_netcdf can cause a slowdown in writing the file. Running through cProfile, I see _open_netcdf4_group is called many times, suggesting the file is opened and closed for each chunk written. In my scripts (which dump to an NFS filesystem), writes can take 10 times longer than they should.

Is there a reason for the repeated open/close cycles (e.g. #1198?), or can this behaviour be fixed so the file stays open for the duration of the compute() call?
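For reference, a minimal sketch of how the repeated opens can be observed with cProfile, reusing dset from the code sample above; the internal name _open_netcdf4_group comes from the profile output described here and may differ in other xarray versions.

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
dset.to_netcdf("test2.nc", compute=False).compute()
profiler.disable()

# Restrict the report to netCDF4-related calls and check the call count
# of _open_netcdf4_group.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("netcdf4")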

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-135-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.0
numpy: 1.14.4
scipy: None
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.17.5
distributed: None
matplotlib: 1.3.1
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: None
conda: None
pytest: None
IPython: None
sphinx: None

@shoyer (Member) commented Jun 21, 2018

I suspect this can be improved. Looking at the code, it appears that we only intentionally use autoclose=True for writes when using multiprocessing or the distributed dask scheduler.

autoclose = (dataset.chunks and
             scheduler in ['distributed', 'multiprocessing'])
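For illustration only (not part of the comment): hand-evaluating that condition for the example above, with a stand-in scheduler name, shows it should come out False, so this code path alone does not explain the repeated opens.

scheduler = 'threading'  # stand-in for the non-distributed scheduler used in the example
autoclose = (bool(dset.chunks) and
             scheduler in ['distributed', 'multiprocessing'])
print(autoclose)  # False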

@jhamman (Member) commented Jun 22, 2018

I think, at least to some extent, the performance hit is to be expected. I don't think we should be opening the file more than once when using the serial or threaded schedulers, so that may be a place where you can find some improvement. There will always be a performance hit when writing dask arrays to netCDF files chunk-by-chunk. For one, there is a threading lock that limits parallel throughput. More importantly, chunked writes are always going to be slower than larger writes coming directly from numpy arrays.

In your example above, the snippet @shoyer mentions should evaluate to autoclose=False. However, the profiling you mention seems to indicate the opposite. Perhaps we should start by digging deeper on that point.
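As an aside (not from the thread), a rough way to measure the baseline cost of chunk-by-chunk dask writes versus writes from in-memory numpy arrays, reusing dset from the code sample above; file names are arbitrary and timings depend heavily on the filesystem and chunk sizes.

import time

# Chunk-by-chunk write straight from the dask-backed dataset.
t0 = time.time()
dset.to_netcdf("dask_chunked.nc")
t1 = time.time()

# Load everything into memory first, then write from numpy arrays.
loaded = dset.compute()
t2 = time.time()
loaded.to_netcdf("numpy_direct.nc")
t3 = time.time()

print("dask chunked write:", t1 - t0)
print("numpy direct write:", t3 - t2)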

@neishm (Contributor, Author) commented Jun 22, 2018

True, I would expect some performance hit due to writing chunk-by-chunk; however, that same performance hit is present in both of the test cases.

In addition to the snippet @shoyer mentioned, I found that xarray also intentionally uses autoclose=True when writing chunks to netCDF:

def __setitem__(self, key, value):
    with self.datastore.ensure_open(autoclose=True):
        data = self.get_array()
        data[key] = value

However, ensure_open only uses autoclose if the file isn't already open:

if not self._isopen:
    try:
        self._ds = self._opener()
        self._isopen = True
        yield
    finally:
        if autoclose:
            self.close()

So if the file is already open before reaching BaseNetCDF4Array.__setitem__, it will remain open. If the file isn't yet open, it will be opened, but then immediately closed after writing the chunk. I suspect this is what's happening in the delayed version: the starting state of NetCDF4DataStore._isopen is False for some reason, so the file is doomed to re-close itself after each chunk is written.

If I remove the autoclose=True from BaseNetCDF4Array.__setitem__, the file remains open and performance is comparable between the two tests.
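A sketch of the change described above, for illustration only; it is shown here as autoclose=False rather than literally dropping the argument, which has the same effect of keeping the file open between chunk writes.

def __setitem__(self, key, value):
    # Keep the file open across chunk writes instead of closing it each time.
    with self.datastore.ensure_open(autoclose=False):
        data = self.get_array()
        data[key] = value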

@shoyer (Member) commented Jun 22, 2018

This autoclose business is really hard to reason about in its current version, as part of the backend class. I'm hoping that refactoring it out into a separate object that we can use with composition instead of inheritance will help (e.g., alongside PickleByReconstructionWrapper).
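A loose sketch of that composition idea (class and method names here are hypothetical, not the eventual xarray implementation): the backend store holds a small manager object that owns the open/close lifecycle, so the handle can stay open across chunk writes.

class FileManager(object):
    """Owns the open file handle; a backend store would hold one of these."""

    def __init__(self, opener):
        self._opener = opener  # callable returning an open file, e.g. a netCDF4.Dataset
        self._file = None

    def acquire(self):
        # Open lazily and keep the handle alive until close() is called.
        if self._file is None:
            self._file = self._opener()
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

# e.g. manager = FileManager(lambda: netCDF4.Dataset("test2.nc", mode="a"))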

@jhamman (Member) commented Jan 13, 2019

I just reran the example above and things seem to be resolved now. The write step for the two test cases is basically identical.

@jhamman jhamman closed this as completed Jan 13, 2019