Fill missing data_vars during concat by reindexing #7400

kmuehlbauer · 2022-12-22T14:41:56Z

Closes #508 (Ignore missing variables when concatenating datasets?)
Closes #3545 (Add defaults during concat 508)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

This is another attempt to solve #508. Took inspiration from #3545 by @scottcha.

This follows @dcherian's comment in that PR (#3545 (comment)).

Update:

After review, the variable order is now determined by order of appearance across the list of datasets. This keeps full backwards compatibility and is deterministic. Thanks @dcherian and @keewis for the suggestions.
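To illustrate what "order of appearance" means here, the following is a minimal pure-Python sketch and not the code in xarray/core/concat.py. Datasets are modelled as plain dicts, and `variable_order` is a hypothetical helper name. It relies on the fact that a `dict` preserves insertion order, whereas a `set` gives no ordering guarantee.

```python
from __future__ import annotations

from typing import Hashable, Iterable, Mapping


def variable_order(datasets: Iterable[Mapping[Hashable, object]]) -> list[Hashable]:
    """Return variable names in order of first appearance (hypothetical helper)."""
    seen: dict[Hashable, None] = {}
    for ds in datasets:
        for name in ds:
            # setdefault keeps the position of names that were already seen
            seen.setdefault(name, None)
    return list(seen)


# Plain dicts stand in for datasets; only the keys matter for the ordering.
datasets = [
    {"temperature": [1.0], "pressure": [2.0]},
    {"humidity": [3.0], "temperature": [4.0]},
    {"wind": [5.0], "pressure": [6.0]},
]
order = variable_order(datasets)
print(order)  # ['temperature', 'pressure', 'humidity', 'wind']
```

The variables of the first dataset keep their positions, so inputs that already share all variables get the same order as before.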

kmuehlbauer · 2022-12-22T14:53:04Z

No tests have been added yet. I'm just wondering whether this approach would work in general.

The current assumptions here:

- It works only on data variables, not coordinates.
- The order is taken from the first dataset; variables missing in the first dataset are appended at the end.
- Missing variables are filled with fill_value by reindexing (thanks @dcherian for the inspiration).
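As a rough illustration of these assumptions, here is a toy model in plain Python. It is not xarray's implementation and does no real reindexing: datasets are dicts mapping a variable name to a list of values along the concat dimension, and `concat_fill` is a hypothetical name. It assumes every dataset has at least one variable and that all variables within a dataset have the same length.

```python
import math


def concat_fill(datasets, fill_value=math.nan):
    """Concatenate dict-of-lists 'datasets', padding missing variables (toy model)."""
    # Order from the first dataset, later variables appended at the end.
    names = list(dict.fromkeys(name for ds in datasets for name in ds))
    result = {name: [] for name in names}
    for ds in datasets:
        # Assumption: all variables in one dataset share the same length.
        length = len(next(iter(ds.values())))
        for name in names:
            # A variable absent from this dataset contributes only fill values.
            result[name].extend(ds.get(name, [fill_value] * length))
    return result


datasets = [
    {"a": [1, 2], "b": [10, 20]},
    {"a": [3]},  # "b" is missing here
]
result = concat_fill(datasets)
print(result)
```

Every variable ends up with the full length along the concat dimension, and the gap left by the second dataset in "b" holds the fill value.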

kmuehlbauer · 2023-01-04T11:33:49Z

Thanks @Illviljan for activating the benchmark runs. Are the benchmark errors related to the changes? I'm also not up to date with mypy: are those errors caused by the changes here?

Illviljan · 2023-01-04T12:25:45Z

Benchmark is a numba issue, probably #7306.

The mypy error is real: you cannot index (getitem) an object. Try using isinstance instead of the try/except to narrow the type.
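A small sketch of that suggestion, under assumptions: the thread does not show which value was typed as `object`, so `pick_fill_value` and its parameters are hypothetical and only illustrate the pattern. Indexing a value typed as `object` inside a try/except still fails type checking, while an `isinstance` check lets mypy narrow the type in that branch.

```python
from __future__ import annotations

from typing import Hashable


def pick_fill_value(fill_value: object, name: Hashable, default: object = None) -> object:
    """Return the fill value for one variable (hypothetical helper)."""
    # A try/except around `fill_value[name]` runs fine, but mypy rejects it
    # because `object` does not define __getitem__.
    if isinstance(fill_value, dict):
        # Inside this branch mypy treats fill_value as a dict, so .get is allowed.
        return fill_value.get(name, default)
    # Anything else is treated as a single fill value for all variables.
    return fill_value


print(pick_fill_value({"a": 0}, "a"))
print(pick_fill_value(-999, "a"))
```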

kmuehlbauer · 2023-01-04T13:58:15Z

OK, green light's now also on mypy. Looks like the approach would work in general. Trying to add some tests now.

kmuehlbauer · 2023-01-05T08:28:36Z

This is ready for a first round of review. Thanks!

scottcha · 2023-01-08T00:54:31Z

Hi, thanks for doing this @kmuehlbauer. FWIW, with the released xarray v2022.12.0 I no longer see the issue I was hitting when I submitted #3545. I haven't gone back further to see when it went away, so I'm not sure whether the old error has just been suppressed, whether the single case I saw back then was resolved in a previous PR, or whether something changed in my data over that long time period.

That being said, I also applied this PR to my workflow and reran the concat code, and from what I've seen so far it continues to pass correctly.

@kmuehlbauer did you see the tests I created here? https://github.com/pydata/xarray/blob/03f9b3b85aee039f47dd693322492ab09f57fb73/xarray/tests/test_concat.py
Not all of them got to a passing state but there were several cases I tried to document with the tests there.

kmuehlbauer · 2023-01-08T08:03:20Z

Thanks @scottcha for taking the time to test things.

I'll have a look at your tests in more detail now. I've concentrated on my use case in the first place and hoped to get away with it 😀.

Illviljan

Some typing suggestions.

xarray/tests/test_concat.py

kmuehlbauer · 2023-01-08T09:20:25Z

> Some typing suggestions.

Thanks @Illviljan, your suggestions and help are much appreciated.

kmuehlbauer · 2023-01-08T15:09:45Z

@scottcha I've found a glitch in the code thanks to your tests. The fix is already pushed here.

I'm going to cherry pick your tests here next.

kmuehlbauer · 2023-01-08T15:32:36Z

@scottcha I've tried to cherry-pick, but ended up copy/pasting and adding your authorship to the commit.

I think the final problem is the order in:

test_concat_missing_multiple_consecutive_var
test_multiple_datasets_with_multiple_missing_variables

These tests are flaky. Sometimes the order is correct and sometimes not. Can't immediately see the root cause here.

@Illviljan I'll try to add typing to these additional tests. Should be good for learning that.

kmuehlbauer · 2023-01-08T16:28:06Z

@Illviljan OK, I'm stuck now. I can't make anything out of the remaining mypy errors. Would be great if you could have another look here, thanks!

xarray/tests/test_concat.py

kmuehlbauer · 2023-01-09T12:21:03Z

@scottcha I think I've managed to get your tests working. It looks like everything is running now.

One thing is still unresolved:

The order of data variables which are not available in the first dataset is not deterministic, because a set is used to gather all variables. But maybe that can be neglected for now.

@Illviljan @dcherian This is ready for another round of review. Thanks for considering.

xarray/core/concat.py

xarray/tests/test_concat.py

kmuehlbauer · 2023-01-10T14:33:41Z

I was hoping to gain something by merging the variable order code with _parse_datasets, to only have to traverse the datasets once.

The current behaviour, and the best I've come up with so far in terms of performance:

1. Count the number of variables while iterating over the datasets (_parse_datasets).
2. Check if the first dataset contains all wanted variables.
   2a. If that's the case, take the order from the first dataset.
3. Check if the dataset with the maximum variable count contains all wanted variables.
   3a. If that's the case, take the order from that dataset.
4. If neither 2a nor 3a applies, take the order from the first dataset and append the missing variables to the end.
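The steps above can be sketched roughly as follows. This is an illustration in plain Python, not the code from this PR: datasets are modelled as dicts, `estimate_variable_order` is a hypothetical name, and the input is assumed to be a non-empty list. As a further assumption, the wanted variables are gathered in a dict rather than a set, so that the fallback in step 4 appends missing variables in a reproducible order.

```python
def estimate_variable_order(datasets):
    """Estimate the variable order with a fast lane (hypothetical sketch)."""
    wanted = {}  # dict used as an ordered set of all variable names
    largest = datasets[0]
    # Step 1: one pass to gather names and find the dataset with most variables.
    for ds in datasets:
        wanted.update(dict.fromkeys(ds))
        if len(ds) > len(largest):
            largest = ds
    first = datasets[0]
    # Steps 2/2a: the first dataset already contains everything.
    if wanted.keys() <= first.keys():
        return list(first)
    # Steps 3/3a: the dataset with the most variables contains everything.
    if wanted.keys() <= largest.keys():
        return list(largest)
    # Step 4: order of the first dataset, missing variables appended.
    return list(first) + [name for name in wanted if name not in first]


print(estimate_variable_order([{"a": 1, "b": 2}, {"b": 3}]))
print(estimate_variable_order([{"b": 1}, {"a": 1, "b": 2, "c": 3}]))
print(estimate_variable_order([{"a": 1}, {"b": 1, "c": 2}, {"c": 1, "d": 1}]))
```

Steps 2 and 3 are only shortcuts: when one dataset already holds every variable, its order can be reused without assembling a combined one.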

xarray/core/concat.py

kmuehlbauer · 2023-01-10T14:54:48Z

Finally, this is as far as I could get with it. I'll leave it as is now. Looking forward to reviews and suggestions. Thanks @Illviljan for the great support!

kmuehlbauer · 2023-01-19T14:06:57Z

@dcherian rebased on latest main and fixed whats-new.rst. Should be good for another review.

dcherian

Thanks @kmuehlbauer and @scottcha . This is great to see finished! Very clean implementation in the end.

kmuehlbauer · 2023-01-20T07:05:43Z

@dcherian An old item from whats-new.rst had slipped back in. I've removed it. Should be OK now.

Great to see this functionality coming to the next xarray version.

kmuehlbauer force-pushed the fix-issue-508 branch from 3f6206f to a9d7ac5 Compare December 22, 2022 14:42

This was referenced Dec 22, 2022

Add defaults during concat 508 #3545

Closed

Ignore missing variables when concatenating datasets? #508

Closed

Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 27, 2022

github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Jan 4, 2023

kmuehlbauer force-pushed the fix-issue-508 branch from 0d2dd18 to 3da2ded Compare January 5, 2023 07:51

Illviljan reviewed Jan 8, 2023

Illviljan reviewed Jan 8, 2023

kmuehlbauer force-pushed the fix-issue-508 branch from 79b0054 to 12fac20 Compare January 9, 2023 12:15

Illviljan reviewed Jan 9, 2023

Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 10, 2023

kmuehlbauer closed this Jan 10, 2023

kmuehlbauer reopened this Jan 10, 2023

github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Jan 10, 2023

Illviljan mentioned this pull request Jan 10, 2023

Pull Request Labeler - Workaround sync-labels bug #7431

Merged

kmuehlbauer and others added 16 commits January 19, 2023 15:04

change np.random to use Generator

c41419f

move code for variable order into dedicated function, merge with _parse_datasets, provide fast lane for variable order estimation

b2b0b18

fix comment

bb0a8ae

Use order from first dataset, append missing variables to the end

e266fe5

ensure fill_value is dict

6120796

ensure fill_value in align

5733d93

simplify combined_var, fix test

9891439

revert fill_value for alignment.py

94b9ba9

derive variable order in order of appearance as suggested per review

70be70f

remove unneeded enumerate

c3eda8f

Use alignment.reindex_variables instead.

70f38ab

This also removes the need to handle fill_value

small cleanup

4825c94

Update doc/whats-new.rst

a71f633

Co-authored-by: Deepak Cherian <[email protected]>

adapt tests as per review request, fix ensure_common_dims

92e6108

adapt tests as per review request

8610397

fix whats-new.rst

6cb163e

kmuehlbauer force-pushed the fix-issue-508 branch from 95c03e0 to 6cb163e Compare January 19, 2023 14:06

dcherian approved these changes Jan 19, 2023

dcherian added the plan to merge Final call for comments label Jan 19, 2023

kmuehlbauer and others added 3 commits January 19, 2023 10:01

add whats-new.rst entry

070c0fb

Add additional test with scalar data_var

de60890

remove erroneous content from whats-new.rst

b14c8f6

dcherian changed the title ~~ENH: fill missing variables during concat by reindexing~~ Fill missing data_vars during concat by reindexing Jan 20, 2023

dcherian merged commit b4e3cbc into pydata:main Jan 20, 2023

This was referenced May 11, 2023

Slow performance of concat() #7833

Closed

Improve concat performance #7824

Merged

kmuehlbauer deleted the fix-issue-508 branch May 25, 2023 07:21
