open_mfdataset reads coords from disk multiple times #1521
This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data with e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass those chunks to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element each, which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or similar, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?
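A minimal sketch of that chunking behaviour, with an illustrative file name, dimension name and coordinate name (none of them from the original report):

```python
import xarray as xr

# The data variable has shape (50000, 2**30) and gets chunks of (1, 2**30),
# but the same chunks dict is also applied to the coordinate variables, so a
# 50000-element coord on the first dim becomes 50000 one-element dask chunks.
ds = xr.open_dataset("scenario.nc", chunks={"scenario": 1})
print(ds["some_coord"].chunks)   # ((1, 1, 1, ...),) -- one chunk per element
```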
In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
@rabernat is interested in this use case. See #1385 and #1413 for discussion.
Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
As suspected, the problem is caused specifically by non-index coords.
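A minimal sketch along these lines, with illustrative names and shapes (not the original reproduction):

```python
import dask.array as da
import xarray as xr

ds = xr.Dataset(
    {"data": (("scenario", "x"), da.zeros((1, 1000), chunks=(1, 1000)))},
    coords={
        "scenario": ["s0"],                           # index coord
        "meta": ("x", da.arange(1000, chunks=1000)),  # non-index dask coord
    },
)
pieces = [ds.assign_coords(scenario=[f"s{i}"]) for i in range(200)]

# With the non-index dask coord present, concat keeps computing "meta";
# with it dropped, only the index coords are touched and the call is fast.
slow = xr.concat(pieces, dim="scenario")
fast = xr.concat([p.drop_vars("meta") for p in pieces], dim="scenario")
```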
Getting closer. The problem is in xarray.concat, which resolves non-index dask coords twice, even though it should not resolve them at all (alignment should be done on index coords only, shouldn't it?).
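One rough way to observe this, using a counter around a lazily loaded non-index coord (a sketch, not the script originally posted):

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

n_loads = {"count": 0}

def expensive_load():
    # Stand-in for reading the coord from disk.
    n_loads["count"] += 1
    return np.arange(1000)

def make_ds(i):
    lazy = da.from_delayed(dask.delayed(expensive_load)(), shape=(1000,), dtype=int)
    return xr.Dataset(
        {"data": (("scenario", "x"), da.zeros((1, 1000), chunks=(1, 1000)))},
        coords={"scenario": [i], "meta": ("x", lazy)},
    )

xr.concat([make_ds(i) for i in range(3)], dim="scenario")
# One load per file would give 3; anything higher means the non-index
# coord is being resolved repeatedly during the concat.
print(n_loads["count"])
```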
The problem is these lines in xarray/core/combine.py (lines 158 to 168 at commit 78ca20a):
We compare coordinates for equality in order to decide whether to ignore redundant coordinates or stack them up. This happens with the 'different' option, which is the default for coords.
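Roughly, the shape of that logic is as follows (a paraphrase for illustration, not the actual xarray source):

```python
def vars_to_concat(datasets, candidate_names):
    # For the 'different' option: concatenate a variable only if its values
    # actually differ between datasets.
    # Note: equals() on dask-backed variables computes both sides, so
    # datasets[0].variables[k] gets resolved once per comparison.
    return {
        k
        for k in candidate_names
        if any(
            not ds.variables[k].equals(datasets[0].variables[k])
            for ds in datasets[1:]
        )
    }
```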
So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison) inside the comparison logic of _calc_concat_over. A very simple fix, slightly more conservative than loading every coordinate into memory, is to simply compute these variables on the first dataset up front, e.g. along the lines sketched below.
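A sketch of that idea (illustrative only, not the actual patch): compute the first dataset's variable once, then compare everything else against the in-memory copy.

```python
def vars_to_concat(datasets, candidate_names):
    concat_over = set()
    for k in candidate_names:
        v0 = datasets[0].variables[k].compute()   # hit the disk once
        if any(not ds.variables[k].equals(v0) for ds in datasets[1:]):
            concat_over.add(k)
    return concat_over
```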
Just realised that you can concat(data_vars='different') and have the exact same problem on data_vars :| Also, with 'different' I realised that you're comparing the variable contents twice, once in _calc_concat_over and another time at the end of _dataset_concat. This also slows down concat on a pure-numpy backend. I need more time to work on this... I'll be on holiday for the next week with no access to my PC; I should be able to continue after 12 Sept.
P.S. I need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails.
Enjoy your holiday!
Back to banging my head on it. Expect a heavy rewrite of combine.py. I can't give an ETA, but it's going to be a fair number of hours.
Thanks @crusaderky for looking into this; I think this is very important.
@crusaderky - happy to help with this. Maybe you can get a PR open and then I can provide some ASV benchmarking.
I have 200 structurally identical datasets, split on the 'scenario' axis.
I individually dump them to disk with Dataset.to_netcdf(fname, engine='h5netcdf').
Then I try loading them back up with open_mfdataset, but it's mortally slow.
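A sketch of the workflow being described, with placeholder sizes and names standing in for the real datasets:

```python
import numpy as np
import xarray as xr

paths = []
for i in range(200):
    # Placeholder contents; the real datasets are much larger.
    ds = xr.Dataset(
        {"data": (("scenario", "t"), np.zeros((1, 1000)))},
        coords={
            "scenario": [i],
            "t": np.arange(1000),
            "label": ("t", np.arange(1000)),   # a non-index coord
        },
    )
    path = f"scenario_{i:04d}.nc"
    ds.to_netcdf(path, engine="h5netcdf")      # one file per scenario
    paths.append(path)

# This is the step that turns out to be extremely slow:
combined = xr.open_mfdataset(paths, concat_dim="scenario")
# (newer xarray versions also require combine="nested" alongside concat_dim)
```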
The problem is caused by the coords being read from disk multiple times.
Workaround:
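One possible workaround in this spirit, assuming all files share identical non-index coords (a sketch, not necessarily the snippet originally posted): load the non-index coords eagerly, once per file, before concatenating.

```python
import glob
import xarray as xr

paths = sorted(glob.glob("scenario_*.nc"))       # the per-scenario files
datasets = [xr.open_dataset(p, chunks={}) for p in paths]

# Non-index coords: those whose name is not also a dimension.
non_index = [name for name in datasets[0].coords if name not in datasets[0].dims]

# Force them into memory once per file, so the equality checks inside
# concat no longer go back to disk over and over.
datasets = [
    ds.assign_coords({k: ds[k].load() for k in non_index}) for ds in datasets
]
combined = xr.concat(datasets, dim="scenario")
```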
Proposed solutions:
An additional, more radical observation: very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly ask xarray to blindly trust this assumption, and thus skip loading the coords that are not based on concat_dim in all datasets beyond the first.
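Purely as an illustration of the kind of switch being suggested (the option value below is hypothetical, not an existing xarray argument):

```python
import glob
import xarray as xr

combined = xr.open_mfdataset(
    sorted(glob.glob("scenario_*.nc")),
    concat_dim="scenario",
    coords="trust_aligned",   # hypothetical: blindly reuse the first file's coords
)
```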