
open_mfdataset in v.0.11.1 is very slow #2662

Closed
malmans2 opened this issue Jan 8, 2019 · 6 comments

@malmans2
Contributor

malmans2 commented Jan 8, 2019

I have several directories corresponding to different time periods.
Each directory contains several netCDF files with different variables.
Here is a simplified example: path/period1/ contains varsABC_period1.nc and varsDEF_period1.nc, and path/period2/ contains varsABC_period2.nc and varsDEF_period2.nc. All variables have the dimension T (time), but the other dimensions can differ (e.g., A has dimensions [T,X1,Y1], B has dimensions [T,X1,Y2], ...).
Before v0.11.1 I was able to easily open a dataset as follows:

ds = xr.open_mfdataset('path/period*/*.nc', concat_dim='T')

However, open_mfdataset now fails: it runs forever without returning any output, so my guess is that it's just very, very slow.
I've noticed that I'm able to create the dataset by concatenating the files with the same variables first, then merging them:

dsABC = xr.open_mfdataset('path/period*/varsABC*.nc', concat_dim='T')
dsDEF = xr.open_mfdataset('path/period*/varsDEF*.nc', concat_dim='T')
ds = xr.merge([dsABC, dsDEF])

I can't figure out which change in v0.11.1 is causing this behavior. Any ideas?

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.11.2
pandas: 0.23.4
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 1.0.0
distributed: 1.25.2
matplotlib: 3.0.0
cartopy: None
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.1
conda: 4.5.12
pytest: 3.8.2
IPython: 7.0.1
sphinx: 1.8.1

@TomNicholas
Member

I'm not sure what might be causing this, but I wonder if you could help narrow it down a bit?

Can you for example see if it's making it past here? That would at least tell us if it is opening each of the datasets okay.

(Or even better: post some example datasets which will cause this problem?)

@malmans2
Contributor Author

I've created a little script to reproduce the problem.
@TomNicholas it looks like the datasets are opened correctly. The problem arises when open_mfdataset calls _auto_combine. Indeed, _auto_combine was introduced in v0.11.1.

import os

import numpy as np
import xarray as xr

Tsize = 100; T = np.arange(Tsize)
Xsize = 900; X = np.arange(Xsize)
Ysize = 800; Y = np.arange(Ysize)
data = np.random.randn(Tsize, Xsize, Ysize)

for i in range(2):
    # Create 2 datasets with different variables
    dsA = xr.Dataset({'A': xr.DataArray(data, coords={'T': T + i * Tsize}, dims=('T', 'X', 'Y'))})
    dsB = xr.Dataset({'B': xr.DataArray(data, coords={'T': T + i * Tsize}, dims=('T', 'X', 'Y'))})

    # Save datasets in one folder
    dsA.to_netcdf('dsA' + str(i) + '.nc')
    dsB.to_netcdf('dsB' + str(i) + '.nc')

    # Save datasets in two folders
    dirname = 'rep' + str(i)
    os.mkdir(dirname)
    dsA.to_netcdf(dirname + '/dsA' + str(i) + '.nc')
    dsB.to_netcdf(dirname + '/dsB' + str(i) + '.nc')

Fast if netCDFs are stored in one folder:

%%time
ds_1folder = xr.open_mfdataset('*.nc', concat_dim='T')
CPU times: user 49.9 ms, sys: 5.06 ms, total: 55 ms
Wall time: 59.7 ms

Slow if netCDFs are stored in several folders:

%%time
ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')
CPU times: user 8.6 s, sys: 5.95 s, total: 14.6 s
Wall time: 10.3 s

Fast if files containing different variables are opened separately, then merged:

%%time
ds_A = xr.open_mfdataset('rep*/dsA*.nc', concat_dim='T')
ds_B = xr.open_mfdataset('rep*/dsB*.nc', concat_dim='T')
ds_merged = xr.merge([ds_A, ds_B])
CPU times: user 33.8 ms, sys: 3.7 ms, total: 37.5 ms
Wall time: 34.5 ms

@shoyer
Member

shoyer commented Jan 15, 2019

@malmans2 thanks for this reproducible test case!

From xarray's perspective, the difference is the order in which the arrays are concatenated/processed. This is determined by sorting the (globbed) file names:

In [16]: sorted(glob.glob('rep*/*.nc'))
Out[16]: ['rep0/dsA0.nc', 'rep0/dsB0.nc', 'rep1/dsA1.nc', 'rep1/dsB1.nc']

In [17]: sorted(glob.glob('*.nc'))
Out[17]: ['dsA0.nc', 'dsA1.nc', 'dsB0.nc', 'dsB1.nc']

It appears that the slow case [A0, B0, A1, B1] now requires computing data with dask, whereas [A0, A1, B0, B1] does not.

I suspect the issue is that we're now using some different combination of merge/concat. In particular, it looks like the compute is being triggered from within merge. This sort of makes sense: if we're using merge instead of concat for joining along the dimension T, that is super slow, because merge goes through a path that checks arrays for conflicting values by loading data into memory (even though in this case no conflict is possible, because the original coordinates don't overlap).

We could (and should) optimize this path in merge to avoid eagerly loading data, but the immediate fix here is probably to make sure we're using concat instead of merge.
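
To make the two code paths concrete, here is a minimal sketch (illustrative stand-in datasets, not xarray internals; the names and sizes are hypothetical) of why joining along 'T' via concat stays lazy while routing the join through merge can be far slower:

import numpy as np
import xarray as xr

# Two dask-backed datasets holding the same variable over non-overlapping 'T' ranges,
# mimicking varsABC_period1.nc and varsABC_period2.nc
ds0 = xr.Dataset({'A': (('T', 'X'), np.random.randn(100, 1000))},
                 coords={'T': np.arange(100)}).chunk({'T': 50})
ds1 = xr.Dataset({'A': (('T', 'X'), np.random.randn(100, 1000))},
                 coords={'T': np.arange(100, 200)}).chunk({'T': 50})

# Fast path: concat simply stacks the lazy arrays along 'T' without reading them
ds_concat = xr.concat([ds0, ds1], dim='T')

# Slow path (as of v0.11.1): merge also combines the two datasets, but via the
# conflict-checking route described above, which could pull data into memory
ds_merge = xr.merge([ds0, ds1])

In the issue's example, the 'T' join ended up on the merge route, which is why opening files from several folders took seconds instead of milliseconds.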

@TomNicholas
Member

Yes, thank you @malmans2, this is very helpful!

I suspect the issue is that we're now using some different combination of merge/concat.

This was very puzzling, because the code is supposed to split the datasets up according to their data variables, which means merge shouldn't be used for the concatenation and this should be fast, as before.

But I found the problem! In _auto_combine_1d I should have sorted the datasets before attempting to group them by data variable, i.e. I needed a line

sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))

before

grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))

With this change then I get

# No longer slow if netCDFs are stored in several folders:    
%timeit ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')
9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Without this pre-sorting, itertools.groupby isn't guaranteed to do the grouping properly, because it only groups adjacent items with equal keys (in contrast to toolz.itertoolz.groupby, which groups across the whole iterable), and as a result it wasn't necessarily grouping the datasets by their variables (illustrated in the sketch below). Then it wouldn't have finished concatenating along the dimension 'T' before it tried to merge everything back together.

Whether or not groupby grouped properly depended on the order of the datasets passed to it, which ultimately depended on the way they were loaded (as the example in this issue makes clear).
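
As a small standalone illustration of that pitfall (the strings below are just stand-ins for datasets keyed by their variable names, not xarray objects): itertools.groupby only merges runs of adjacent items with equal keys, so interleaved input yields fragmented groups unless it is sorted first.

import itertools

datasets = ['A0', 'B0', 'A1', 'B1']   # stand-ins for datasets, interleaved as in rep*/*.nc
key = lambda name: name[0]            # "group by data variable"

fragmented = [(k, list(g)) for k, g in itertools.groupby(datasets, key)]
# -> [('A', ['A0']), ('B', ['B0']), ('A', ['A1']), ('B', ['B1'])]  four partial groups

complete = [(k, list(g)) for k, g in itertools.groupby(sorted(datasets, key=key), key)]
# -> [('A', ['A0', 'A1']), ('B', ['B0', 'B1'])]  two complete groups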

The reason this mistake got past the unit tests is that auto_combine still gives the correct result in every case! Merge will still combine these datasets, but it loads their values first to check that they agree, which is why it was ~1000 times slower. None of the unit tests checked performance, though, and the tests I wrote were all supposed to be very fast, so the slowdown wasn't noticeable in any of them.

TomNicholas mentioned this issue Jan 15, 2019
@malmans2
Contributor Author

I checked PR #2678 with the data that originated the issue and it fixes the problem!

@dcherian
Contributor

We have airspeed velocity (asv) performance tests. I don't know if there's one for auto_combine, but maybe you can add one, @TomNicholas.
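
For reference, a minimal sketch of what such an asv benchmark could look like, assuming it would live under xarray's asv_bench/benchmarks/ directory; the class name, layout, and dataset sizes here are hypothetical, not the benchmark that was eventually added:

import numpy as np
import xarray as xr


class CombineDifferentVars:
    """Hypothetical asv benchmark: combine interleaved datasets holding different variables."""

    def setup(self):
        t = np.arange(100)
        data = np.random.randn(100, 50, 50)
        # Interleave datasets with different variables, mimicking the rep*/*.nc glob order
        self.datasets = [
            xr.Dataset({'A': (('T', 'X', 'Y'), data)}, coords={'T': t}),
            xr.Dataset({'B': (('T', 'X', 'Y'), data)}, coords={'T': t}),
            xr.Dataset({'A': (('T', 'X', 'Y'), data)}, coords={'T': t + 100}),
            xr.Dataset({'B': (('T', 'X', 'Y'), data)}, coords={'T': t + 100}),
        ]

    def time_auto_combine(self):
        # auto_combine was the public entry point at the time of this issue
        xr.auto_combine(self.datasets, concat_dim='T')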

shoyer pushed a commit that referenced this issue Jan 17, 2019
* Hotfix for #2662

* Separate keyfunc

* Added a test, and made others slightly stricter

* Comment explaining test better
dcherian pushed a commit to dcherian/xarray that referenced this issue Jan 24, 2019
* master:
  stale requires a label (pydata#2701)
  Update indexing.rst (pydata#2700)
  add line break to message posted (pydata#2698)
  Config for closing stale issues (pydata#2684)
  to_dict without data (pydata#2659)
  Update asv.conf.json (pydata#2693)
  try no rasterio in py36 env (pydata#2691)
  Detailed report for testing.assert_equal and testing.assert_identical (pydata#1507)
  Hotfix for pydata#2662 (pydata#2678)
  Update README.rst (pydata#2682)
  Fix test failures with numpy=1.16 (pydata#2675)
dcherian pushed a commit to yohai/xarray that referenced this issue Jan 24, 2019
* refactor-plot-utils: (22 commits)
  review comment.
  small rename
  stale requires a label (pydata#2701)
  Update indexing.rst (pydata#2700)
  add line break to message posted (pydata#2698)
  Config for closing stale issues (pydata#2684)
  to_dict without data (pydata#2659)
  Update asv.conf.json (pydata#2693)
  try no rasterio in py36 env (pydata#2691)
  Detailed report for testing.assert_equal and testing.assert_identical (pydata#1507)
  Hotfix for pydata#2662 (pydata#2678)
  Update README.rst (pydata#2682)
  Fix test failures with numpy=1.16 (pydata#2675)
  lint
  Back to map_dataarray_line
  Refactor out cmap_params, cbar_kwargs processing
  Refactor out colorbar making to plot.utils._add_colorbar
  flake8
  facetgrid refactor
  Refactor out utility functions.
  ...
shoyer pushed a commit that referenced this issue Jun 25, 2019
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting