Add defaults during concat 508 #3545
This nested loop through datasets concerns me here. It means that concat will run in quadratic time with respect to the number of datasets being concatenated. That would probably make xarray.concat very slow on 1,000 datasets and outrageously slow on 10,000 datasets, both of which happen with some regularity. It would be best to write this using a separate pass to create dummy versions of each Variable, which could then be reused where appropriate.
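A rough sketch of the kind of separate-pass approach being suggested, assuming NaN as the fill value; the helper name and details below are hypothetical, not the PR's actual code:

```python
import numpy as np
import xarray as xr


def build_missing_variable_templates(datasets, fill_value=np.nan):
    # One pass over all datasets: remember the dims for every data variable
    # name the first time it is seen.
    templates = {}
    for ds in datasets:
        for name, var in ds.data_vars.items():
            templates.setdefault(name, var.dims)

    # Second pass: build the dummy variables each dataset is missing by
    # reusing the templates, instead of re-scanning every other dataset.
    missing_per_dataset = []
    for ds in datasets:
        missing = {}
        for name, dims in templates.items():
            if name not in ds.data_vars:
                # Default unknown dimensions to length 1 here; real code would
                # also need to promote the dtype so the fill value fits.
                shape = tuple(ds.sizes.get(dim, 1) for dim in dims)
                missing[name] = xr.Variable(dims, np.full(shape, fill_value))
        missing_per_dataset.append(missing)
    return missing_per_dataset
```

With this shape of solution, each dataset is visited a fixed number of times, so the cost grows linearly with the number of datasets rather than quadratically.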
This could happen in calc_concat_over.
The new PR contains improved logic but still requires going through the list of datasets a few times. I think the new worst-case runtime is O(DN^2), where D is the number of datasets and N is the number of variables in the final list. If no fill values are required, it will be O(DN).
I did some perf testing with the new logic versus the old and I don't really see a significant difference, but I would love additional feedback if there is a better way.
Perf result for concatenating 720 files via open_mfdataset with parallel=False for the PR:
58.7 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Result with the original implementation:
58.1 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 4359 files via open_mfdataset with parallel=False for the PR:
5min 54s ± 840 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorry, I don't really have a good real-world dataset this large without missing values to test the original implementation on. But this dataset, ~6x larger, took ~6x more time even with the penalty of caching and filling the missing values.
I don't currently have good data larger than this without missing variables (hence the PR :) ).
I was also not sure whether I should overload the logic in calc_concat_over to do more, but I could revisit this if the logic in the new PR looks like it should be refactored that way.
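For reference, a benchmark of this kind can be reproduced roughly as follows; the file pattern is a placeholder and timings will vary by machine and data:

```python
import glob

import xarray as xr

# Hypothetical list of NetCDF files to concatenate via open_mfdataset.
files = sorted(glob.glob("/data/run/*.nc"))

# In IPython/Jupyter, %timeit reports the "mean ± std. dev. of 7 runs" figures
# quoted above:
#   %timeit xr.open_mfdataset(files, parallel=False)
ds = xr.open_mfdataset(files, parallel=False)
print(ds)
```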
This pattern is starting to look a little familiar now; I think there are at least a handful of existing uses in variable.py already. Maybe factor it out into a helper function in xarray.core.dtypes?
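For illustration, such a helper might look something like the sketch below; the function name and promotion rules are assumptions, not the actual contents of xarray.core.dtypes:

```python
import numpy as np


def maybe_promote_for_fill(dtype):
    """Return a dtype that can represent a missing value, plus the fill value
    to use. Sketch of the pattern that keeps reappearing in variable.py."""
    if np.issubdtype(dtype, np.floating):
        return dtype, np.nan
    if np.issubdtype(dtype, np.integer):
        # Integer arrays cannot hold NaN, so promote to a float dtype.
        return np.promote_types(dtype, np.float64), np.nan
    if np.issubdtype(dtype, np.datetime64):
        return dtype, np.datetime64("NaT")
    if np.issubdtype(dtype, np.timedelta64):
        return dtype, np.timedelta64("NaT")
    # Strings, bools, and everything else fall back to object dtype.
    return np.dtype(object), np.nan
```

Call sites would then ask this helper for the promoted dtype and fill value instead of repeating the branching inline.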
Ok this is in the new updated PR.
I am concerned that this dummy variable may not always be the right size.
For example, suppose we are concatenating two Datasets along the existing dimension 'x'. The first dataset has size x=1 and the second has size x=2. If a variable is missing from one but not the other, the "dummy" variable would always have the wrong size, resulting in a total length of 2 or 4, but not 3.
To properly handle this, I think you will need to index out the concatenated dimension from the dummy variable (wherever it is found), and then use expand_dims to add it back with the appropriate size for the current dataset.
Ok, I'm not really sure I understand this case. Any chance you can provide a test I could use? That would help.
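To make the scenario concrete, here is a sketch of a test that exercises it; the variable names and values are illustrative, not taken from the PR:

```python
import numpy as np
import xarray as xr


def test_concat_missing_var_along_existing_dim():
    # First dataset: x has length 1 and lacks the variable "bar".
    ds1 = xr.Dataset({"foo": ("x", [1.0])}, coords={"x": [0]})
    # Second dataset: x has length 2 and defines both variables.
    ds2 = xr.Dataset(
        {"foo": ("x", [2.0, 3.0]), "bar": ("x", [20.0, 30.0])},
        coords={"x": [1, 2]},
    )

    actual = xr.concat([ds1, ds2], dim="x")

    # The concatenated dimension must end up with length 1 + 2 == 3, so the
    # dummy standing in for "bar" in ds1 has to match ds1's size along x.
    assert actual.sizes["x"] == 3
    np.testing.assert_array_equal(
        actual["bar"].values, np.array([np.nan, 20.0, 30.0])
    )
```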