open_mfdataset reads coords from disk multiple times #1521
This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data with e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass those chunks to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element each, which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or similar, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?
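A minimal sketch of that chunking behaviour, with an illustrative file name, dimension name and coordinate name (none of them from the original report):

```python
import xarray as xr

# The data variable has shape (50000, 2**30) and gets chunks of (1, 2**30),
# but the same chunks dict is also applied to the coordinate variables, so a
# 50000-element coord on the first dim becomes 50000 one-element dask chunks.
ds = xr.open_dataset("scenario.nc", chunks={"scenario": 1})
print(ds["some_coord"].chunks)   # ((1, 1, 1, ...),) -- one chunk per element
```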
In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
@rabernat is interested in this use case. See #1385 and #1413 for discussion.
Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
As suspected, the problem is caused specifically by non-index coords.
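A minimal sketch along these lines, with illustrative names and shapes (not the original reproduction):

```python
import dask.array as da
import xarray as xr

ds = xr.Dataset(
    {"data": (("scenario", "x"), da.zeros((1, 1000), chunks=(1, 1000)))},
    coords={
        "scenario": ["s0"],                           # index coord
        "meta": ("x", da.arange(1000, chunks=1000)),  # non-index dask coord
    },
)
pieces = [ds.assign_coords(scenario=[f"s{i}"]) for i in range(200)]

# With the non-index dask coord present, concat keeps computing "meta";
# with it dropped, only the index coords are touched and the call is fast.
slow = xr.concat(pieces, dim="scenario")
fast = xr.concat([p.drop_vars("meta") for p in pieces], dim="scenario")
```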
Getting closer. The problem is in xarray.concat, which resolves non-index dask coords twice, even though it should not resolve them at all (alignment should be done on index coords only, shouldn't it?).
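One rough way to observe this, using a counter around a lazily loaded non-index coord (a sketch, not the script originally posted):

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

n_loads = {"count": 0}

def expensive_load():
    # Stand-in for reading the coord from disk.
    n_loads["count"] += 1
    return np.arange(1000)

def make_ds(i):
    lazy = da.from_delayed(dask.delayed(expensive_load)(), shape=(1000,), dtype=int)
    return xr.Dataset(
        {"data": (("scenario", "x"), da.zeros((1, 1000), chunks=(1, 1000)))},
        coords={"scenario": [i], "meta": ("x", lazy)},
    )

xr.concat([make_ds(i) for i in range(3)], dim="scenario")
# One load per file would give 3; anything higher means the non-index
# coord is being resolved repeatedly during the concat.
print(n_loads["count"])
```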
The problem is these lines in xarray/core/combine.py (lines 158 to 168 at commit 78ca20a):
We compare coordinates for equality in order to decide whether to ignore redundant coordinates or stack them up. This happens with the 'different' option, which is the default for coords.
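Roughly, the shape of that logic is as follows (a paraphrase for illustration, not the actual xarray source):

```python
def vars_to_concat(datasets, candidate_names):
    # For the 'different' option: concatenate a variable only if its values
    # actually differ between datasets.
    # Note: equals() on dask-backed variables computes both sides, so
    # datasets[0].variables[k] gets resolved once per comparison.
    return {
        k
        for k in candidate_names
        if any(
            not ds.variables[k].equals(datasets[0].variables[k])
            for ds in datasets[1:]
        )
    }
```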
So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison) inside the comparison logic of _calc_concat_over. A very simple fix, slightly more conservative than loading every coordinate into memory, is to simply compute these variables on the first dataset up front, e.g. along the lines sketched below.
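A sketch of that idea (illustrative only, not the actual patch): compute the first dataset's variable once, then compare everything else against the in-memory copy.

```python
def vars_to_concat(datasets, candidate_names):
    concat_over = set()
    for k in candidate_names:
        v0 = datasets[0].variables[k].compute()   # hit the disk once
        if any(not ds.variables[k].equals(v0) for ds in datasets[1:]):
            concat_over.add(k)
    return concat_over
```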
Just realised that you can concat(data_vars='different') and have the exact same problem on data_vars :| Also, with 'different' I realised that you're comparing the variable contents twice, once in _calc_concat_over and another time at the end of _dataset_concat. This also slows down concat on a pure-numpy backend. I need more time to work on this... I'll be on holiday for the next week with no access to my PC; I should be able to continue after 12 Sept.
P.S. I need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails.
Enjoy your holiday!
Back to banging my head on it. Expect a heavy rewrite of combine.py. I can't give an ETA, but it's going to be a fair number of hours.
Thanks @crusaderky for looking into this; I think this is very important.
@crusaderky - happy to help with this. Maybe you can get a PR open and then I can provide some ASV benchmarking.
I have 200 structurally identical datasets, split on the 'scenario' axis.
I individually dump them to disk with Dataset.to_netcdf(fname, engine='h5netcdf').
Then I try loading them back up with open_mfdataset, but it's mortally slow.
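A sketch of the workflow being described, with placeholder sizes and names standing in for the real datasets:

```python
import numpy as np
import xarray as xr

paths = []
for i in range(200):
    # Placeholder contents; the real datasets are much larger.
    ds = xr.Dataset(
        {"data": (("scenario", "t"), np.zeros((1, 1000)))},
        coords={
            "scenario": [i],
            "t": np.arange(1000),
            "label": ("t", np.arange(1000)),   # a non-index coord
        },
    )
    path = f"scenario_{i:04d}.nc"
    ds.to_netcdf(path, engine="h5netcdf")      # one file per scenario
    paths.append(path)

# This is the step that turns out to be extremely slow:
combined = xr.open_mfdataset(paths, concat_dim="scenario")
# (newer xarray versions also require combine="nested" alongside concat_dim)
```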
The problem is caused by the coords being read from disk multiple times.
Workaround:
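One possible workaround in this spirit, assuming all files share identical non-index coords (a sketch, not necessarily the snippet originally posted): load the non-index coords eagerly, once per file, before concatenating.

```python
import glob
import xarray as xr

paths = sorted(glob.glob("scenario_*.nc"))       # the per-scenario files
datasets = [xr.open_dataset(p, chunks={}) for p in paths]

# Non-index coords: those whose name is not also a dimension.
non_index = [name for name in datasets[0].coords if name not in datasets[0].dims]

# Force them into memory once per file, so the equality checks inside
# concat no longer go back to disk over and over.
datasets = [
    ds.assign_coords({k: ds[k].load() for k in non_index}) for ds in datasets
]
combined = xr.concat(datasets, dim="scenario")
```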
Proposed solutions:
An additional, more radical observation: very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly ask xarray to blindly trust this assumption, and thus skip loading the coords that are not based on concat_dim in all datasets beyond the first.
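Purely as an illustration of the kind of switch being suggested (the option value below is hypothetical, not an existing xarray argument):

```python
import glob
import xarray as xr

combined = xr.open_mfdataset(
    sorted(glob.glob("scenario_*.nc")),
    concat_dim="scenario",
    coords="trust_aligned",   # hypothetical: blindly reuse the first file's coords
)
```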