`to_netcdf(compute=False)` can be slow #2242
Comments
I suspect this can be improved. Looking at the code, it appears that we only intentionally use the following: [code embed: lines 709–710 at commit 73b476e]
I think, at least to some extent, the performance hit is to be expected. I don't think we should be opening the file more than once when using the serial or threaded schedulers, so that may be a place where you can find some improvement. There will always be a performance hit when writing dask arrays to netCDF files chunk-by-chunk. For one, there is a threading lock that limits parallel throughput. More importantly, the chunked writes are always going to be slower than larger reads coming directly from numpy arrays. In your example above, the snippet @shoyer mentions should evaluate to [value not captured].
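To check the scheduler-dependent behaviour described here, pinning the scheduler explicitly helps; a small sketch using a recent dask API (the dataset and file name are illustrative):

```python
import numpy as np
import xarray as xr

# Any chunked dataset will do for this check.
ds = xr.Dataset({"x": ("t", np.random.rand(1000000))}).chunk({"t": 10000})

delayed = ds.to_netcdf("out.nc", compute=False)

# Pin the scheduler so serial vs. threaded behaviour can be compared:
delayed.compute(scheduler="synchronous")  # serial
# delayed.compute(scheduler="threads")    # threaded
```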
True, I would expect some performance hit due to writing chunk-by-chunk; however, that same performance hit is present in both of the test cases. In addition to the snippet @shoyer mentioned, I found that xarray also intentionally uses the following: [code embed: xarray/xarray/backends/netCDF4_.py, lines 45–48 at commit 73b476e]
However, [code embed: xarray/xarray/backends/common.py, lines 496–503 at commit 73b476e]. So if the file is already open before getting to [reference not captured], [rest of sentence not captured]. If I remove the [rest of comment not captured]
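To make the cost of the open/close-per-chunk pattern concrete, here is a toy comparison in plain netCDF4 (not xarray internals; the sizes, paths, and variable name are made up):

```python
import numpy as np
import netCDF4

NCHUNKS, CHUNKSIZE = 100, 10000
chunks = [np.random.rand(CHUNKSIZE) for _ in range(NCHUNKS)]

def make_file(path):
    with netCDF4.Dataset(path, "w") as nc:
        nc.createDimension("t", NCHUNKS * CHUNKSIZE)
        nc.createVariable("x", "f8", ("t",))

def write_reopening(path):
    # Mimics an autoclose-style write: open and close the file for every chunk.
    for i, chunk in enumerate(chunks):
        with netCDF4.Dataset(path, "a") as nc:
            nc.variables["x"][i * CHUNKSIZE:(i + 1) * CHUNKSIZE] = chunk

def write_once(path):
    # Keeps a single handle open for the whole write.
    with netCDF4.Dataset(path, "a") as nc:
        for i, chunk in enumerate(chunks):
            nc.variables["x"][i * CHUNKSIZE:(i + 1) * CHUNKSIZE] = chunk
```

On a local disk the difference is modest, but on a network filesystem the repeated opens dominate, which is consistent with the large NFS slowdowns reported in this issue.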
This autoclose business is really hard to reason about in its current version, as part of the backend class. I'm hoping that refactoring it out into a separate object that we can use with composition instead of inheritance will help (e.g., alongside PickleByReconstructionWrapper).
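A hypothetical sketch of that composition idea (not actual xarray code; the class name and API are invented for illustration):

```python
class FileHandleManager:
    """Owns the open/close lifecycle of a backend's file handle.

    A backend would hold one of these (composition) instead of inheriting
    autoclose behaviour from a base class, keeping the open/close policy
    in one place.
    """

    def __init__(self, opener):
        self._opener = opener  # zero-argument callable returning an open file
        self._file = None

    def acquire(self):
        # Open lazily and reuse the same handle until close() is called.
        if self._file is None:
            self._file = self._opener()
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

A backend would call `manager.acquire()` for every chunk write and get the same open handle back, instead of reopening the file each time.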
I just reran the example above and things seem to be resolved now. The write step for the two datasets is basically identical.
Code Sample
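A minimal sketch of the comparison described in the problem report below (the dataset shape, chunking, and file names are illustrative, not the author's original code):

```python
import numpy as np
import xarray as xr

# A chunked (dask-backed) dataset.
data = np.random.rand(1000, 1000)
ds = xr.Dataset({"x": (("a", "b"), data)}).chunk({"a": 100})

# Eager write: compute and write in one call.
ds.to_netcdf("eager.nc")

# Delayed write: build the graph first, then compute it.
delayed = ds.to_netcdf("delayed.nc", compute=False)
delayed.compute()
```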
Problem description
Using the delayed version of `to_netcdf` can cause a slowdown in writing the file. Running through cProfile, I see `_open_netcdf4_group` is called many times, suggesting the file is opened and closed for each chunk written. In my scripts (which dump to an NFS filesystem), writes can take 10 times longer than they should. Is there a reason for the repeated open/close cycles (e.g. #1198?), or can this behaviour be fixed so the file stays open for the duration of the `compute()` call?
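One way to see the repeated opens is to profile just the compute step; a sketch, assuming `delayed` is the object returned by `to_netcdf(..., compute=False)` as in the sample above:

```python
import cProfile

# In the slow case, _open_netcdf4_group appears once per chunk written.
cProfile.run("delayed.compute()", sort="cumulative")
```

Note that `cProfile.run` executes the statement in the `__main__` namespace, so `delayed` needs to be defined there.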
Output of `xr.show_versions()`
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-135-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
xarray: 0.10.7
pandas: 0.23.0
numpy: 1.14.4
scipy: None
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.17.5
distributed: None
matplotlib: 1.3.1
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: None
conda: None
pytest: None
IPython: None
sphinx: None