
to_netcdf(compute=False) can be slow #2242

Closed
neishm opened this issue Jun 21, 2018 · 5 comments
Comments

@neishm (Contributor) commented Jun 21, 2018

Code Sample

import xarray as xr
from dask.array import ones
import dask
from dask.diagnostics import ProgressBar
ProgressBar().register()

# Define a mock DataSet
dset = {}
for i in range(5):
  name = 'var'+str(i)
  data = i*ones((8,79,200,401),dtype='f4',chunks=(1,1,200,401))
  var = xr.DataArray(data=data, dims=('time','level','lat','lon'), name=name)
  dset[name] = var
dset = xr.Dataset(dset)

# Single thread to facilitate debugging.
# (may require dask < 0.18)
with dask.set_options(get=dask.get):

  # This works fine.
  print ("Testing immediate netCDF4 writing")
  dset.to_netcdf("test1.nc")

  # This can be twice as slow as the version above.
  # Can be even slower (like 10x slower) on a shared filesystem.
  print ("Testing delayed netCDF4 writing")
  dset.to_netcdf("test2.nc",compute=False).compute()

Problem description

Using the delayed version of to_netcdf can cause a slowdown in writing the file. Running through cProfile, I see _open_netcdf4_group is called many times, suggesting the file is opened and closed for each chunk written. In my scripts (which dump to an NFS filesystem), writes can take 10 times longer than they should.

Is there a reason for the repeated open/close cycles (e.g. #1198?), or can this behaviour be fixed so the file stays open for the duration of the compute() call?
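For reference, a minimal sketch of how the repeated opens can be observed with cProfile, reusing dset from the code sample above; the internal name _open_netcdf4_group comes from the profile output described here and may differ in other xarray versions.

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
dset.to_netcdf("test2.nc", compute=False).compute()
profiler.disable()

# Restrict the report to netCDF4-related calls and check the call count
# of _open_netcdf4_group.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("netcdf4")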

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-135-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.0
numpy: 1.14.4
scipy: None
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.17.5
distributed: None
matplotlib: 1.3.1
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: None
conda: None
pytest: None
IPython: None
sphinx: None

@shoyer (Member) commented Jun 21, 2018

I suspect this can be improved. Looking at the code, it appears that we only intentionally use autoclose=True for writes when using multiprocessing or the distributed dask scheduler.

autoclose = (dataset.chunks and
             scheduler in ['distributed', 'multiprocessing'])
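For illustration only (not part of the comment): hand-evaluating that condition for the example above, with a stand-in scheduler name, shows it should come out False, so this code path alone does not explain the repeated opens.

scheduler = 'threading'  # stand-in for the non-distributed scheduler used in the example
autoclose = (bool(dset.chunks) and
             scheduler in ['distributed', 'multiprocessing'])
print(autoclose)  # False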

@jhamman (Member) commented Jun 22, 2018

I think, at least to some extent, the performance hit is to be expected. I don't think we should be opening the file more than once when using the serial or threaded schedulers, so that may be a place where you can find some improvement. There will always be a performance hit when writing dask arrays to netCDF files chunk-by-chunk. For one, there is a threading lock that limits parallel throughput. More importantly, chunked writes are always going to be slower than larger writes coming directly from numpy arrays.

In your example above, the snippet @shoyer mentions should evaluate to autoclose=False. However, the profiling you mention seems to indicate the opposite. Perhaps we should start by digging deeper on that point.
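As an aside (not from the thread), a rough way to measure the baseline cost of chunk-by-chunk dask writes versus writes from in-memory numpy arrays, reusing dset from the code sample above; file names are arbitrary and timings depend heavily on the filesystem and chunk sizes.

import time

# Chunk-by-chunk write straight from the dask-backed dataset.
t0 = time.time()
dset.to_netcdf("dask_chunked.nc")
t1 = time.time()

# Load everything into memory first, then write from numpy arrays.
loaded = dset.compute()
t2 = time.time()
loaded.to_netcdf("numpy_direct.nc")
t3 = time.time()

print("dask chunked write:", t1 - t0)
print("numpy direct write:", t3 - t2)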

@neishm (Contributor, Author) commented Jun 22, 2018

True, I would expect some performance hit due to writing chunk-by-chunk; however, that same performance hit is present in both of the test cases.

In addition to the snippet @shoyer mentioned, I found that xarray also intentionally uses autoclose=True when writing chunks to netCDF:

def __setitem__(self, key, value):
    with self.datastore.ensure_open(autoclose=True):
        data = self.get_array()
        data[key] = value

However, ensure_open only uses autoclose if the file isn't already open:

if not self._isopen:
    try:
        self._ds = self._opener()
        self._isopen = True
        yield
    finally:
        if autoclose:
            self.close()

So if the file is already open before reaching BaseNetCDF4Array.__setitem__, it will remain open. If the file isn't yet open, it will be opened, but then immediately closed after writing the chunk. I suspect this is what's happening in the delayed version: the starting state of NetCDF4DataStore._isopen is False for some reason, so the file is doomed to re-close itself after each chunk is written.

If I remove the autoclose=True from BaseNetCDF4Array.__setitem__, the file remains open and performance is comparable between the two tests.
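A sketch of the change described above, for illustration only; it is shown here as autoclose=False rather than literally dropping the argument, which has the same effect of keeping the file open between chunk writes.

def __setitem__(self, key, value):
    # Keep the file open across chunk writes instead of closing it each time.
    with self.datastore.ensure_open(autoclose=False):
        data = self.get_array()
        data[key] = value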

@shoyer (Member) commented Jun 22, 2018

This autoclose business is really hard to reason about in its current version, as part of the backend class. I'm hoping that refactoring it out into a separate object that we can use with composition instead of inheritance will help (e.g., alongside PickleByReconstructionWrapper).
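A loose sketch of that composition idea (class and method names here are hypothetical, not the eventual xarray implementation): the backend store holds a small manager object that owns the open/close lifecycle, so the handle can stay open across chunk writes.

class FileManager(object):
    """Owns the open file handle; a backend store would hold one of these."""

    def __init__(self, opener):
        self._opener = opener  # callable returning an open file, e.g. a netCDF4.Dataset
        self._file = None

    def acquire(self):
        # Open lazily and keep the handle alive until close() is called.
        if self._file is None:
            self._file = self._opener()
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

# e.g. manager = FileManager(lambda: netCDF4.Dataset("test2.nc", mode="a"))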

@jhamman (Member) commented Jan 13, 2019

I just reran the example above and things seem to be resolved now. The write step for the two test cases is basically identical.

@jhamman jhamman closed this as completed Jan 13, 2019