Add defaults during concat 508 #3545

scottcha · 2019-11-17T20:47:07Z

Closes Ignore missing variables when concatenating datasets? #508
Tests added
Passes black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

Contined on issue #508 by removing exception when concat two datasets with disjoint variables and instead add the missing variable with np.nan.

dcherian

Thanks @scottcha! I've left a few small comments.

Let's add a test for concatenating integer variables with a specific fill-value is provided (e.g. 0 or -1). Concatenating should not change the integer dtype in the result.

xarray/core/concat.py

doc/whats-new.rst

Co-Authored-By: Deepak Cherian <[email protected]>

…cottcha/xarray into add-defaults-during-concat-508 Merge doc change

dcherian

Thanks @scottcha . I only have one comment. I'm a little unsure of whether we should fill in missing coordinate values. Let's see what @shoyer thinks of this approach.

xarray/tests/test_combine.py

doc/whats-new.rst

shoyer · 2019-11-19T01:58:09Z

xarray/core/concat.py

+                            if fill_value is dtypes.NA:
+                                dtype, fill_value = dtypes.maybe_promote(
+                                    ds.variables[k].dtype
+                                )
+                            else:
+                                dtype = ds.variables[k].dtype


This pattern is starting to look a little familiar now, I think there are at least a handful of existing uses in variable.py already. Maybe factor it out into a helper function in xarray.core.dtypes?

Ok this is in the new updated PR.

shoyer · 2019-11-19T02:01:57Z

xarray/core/concat.py

+                # if one of the variables doesn't exist find one which does
+                # and use it to create a fill value
+                if k not in ds.variables:
+                    for ds in datasets:


This nested loop through datasets concerns me here. It means that concat will run in quadratic time with respect to the number of datasets being concatenated. This probably make xarray.concat very slow on 1,000 datasets and outrageously slow on 10,000 datasets, both of which happen with some regularity.

it would be best to write this using a separate pass to create dummy versions of each Variable, which could be reused when appropriate.

it would be best to write this using a separate pass to create dummy versions of each Variable, which could be reused when appropriate.

This could happen in calc_concat_over

The new PR contains improved logic but still required me to go through the list of data_sets a few times. I think the new worst case runtime is O(DN^2) where D is num of datasets and N is number of variables in final list. If no fill value are required then it will be O(DN).
I did some perf testing with the new logic versus the old and I don't really see a significant difference but would love addition feedback if there is a better way.

Perf result for concat 720 files via open_mfdataset Parallel=False for PR:
58.7 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original result
58.1 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 4359 files via open_mfdataset Parallel=False for PR:
5min 54s ± 840 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sorry I don't really have a good real-world dataset this large w/out missing values to test the original implementation. But this dataset ~6x larger took ~6x more time even with the penalty to cache and fill the missing values.

I don't currently have good data without missing variables larger than this (hence the PR :) )

I was also not sure I should overload the logic in calc_concat_over to do more but I could re-review this if the logic in the new PR looks like it should be refactored that way.

shoyer · 2019-11-19T02:06:34Z

xarray/core/concat.py

+                            filled = full_like(
+                                ds.variables[k], fill_value=fill_value, dtype=dtype
+                            )


I am concerned that this dummy variable may not always be the right size.

For example, supposing we are concatenating two Dataset along the existing dimension 'x'. The first dataset has size x=1 and the second has size x=2. If a variable is missing from one but not the other, the "dummy" variable would always have the wrong size, resulting in a total length of 2 or 4 but not 3.

To properly handle this, I think you will need to index out the concatenated dimension from the dummy variable (where-ever it is found), and then use expand_dims to add it back in the appropriate size for the current dataset.

Ok, i'm not really sure I understand this case. Any chance you can provide a test which I can use which would help?

xarray/tests/test_concat.py

xarray/tests/test_combine.py

shoyer · 2019-11-19T02:12:45Z

Thanks for working on this important issue!

There are a lot of edge cases that can come up in concat, so I think it would be very helpful to try to enumerate a broader set of unit tests for thoroughly testing this. For example:

Pre-existing vs non-pre-existing dimension
Pre-existing dimensions of different sizes
Missing data variables vs coordinates vs indexed coordinates

Co-Authored-By: keewis <[email protected]>

scottcha · 2019-11-20T01:22:51Z

Ok, I'll work on extending the updates with the feedback and additional tests. Thanks!

…ring and variable types when variables are missing

scottcha · 2019-12-31T01:18:35Z

Hi, I've provided a new update to this PR (sorry it took me awhile both to get more familiar with the code and find the time to update the PR). I improved the logic to be a bit more performant and handle more edge cases as well as updated the test suite. I have a few questions:

The tests I wrote are a bit more verbose than the tests previously. I can tighten them down but I found it was easier for me to read the logic in this form. Please let me know what you prefer.
I'm still not quite sure I've captured all the scenarios as I'm a pretty basic xarray user so please let me know if there is still something I'm missing.

shoyer · 2019-12-31T01:40:02Z

I'll take a look at this more carefully soon. But I do think it is a hard requirement that concat runs in linear time (with respect to the total number of variables across all datasets).

…

On Mon, Dec 30, 2019 at 5:18 PM Scott Chamberlin ***@***.***> wrote: Hi, I've provided a new update to this PR (sorry it took me awhile both to get more familiar with the code and find the time to update the PR). I improved the logic to be a bit more performant and handle more edge cases as well as updated the test suite. I have a few questions: 1. The tests I wrote are a bit more verbose than the tests previously. I can tighten them down but I found it was easier for me to read the logic in this form. Please let me know what you prefer. 2. I'm still not quite sure I've captured all the scenarios as I'm a pretty basic xarray user so please let me know if there is still something I'm missing. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3545?email_source=notifications&email_token=AAJJFVVSKN5ZWD4FQHPIJG3Q3KMWXA5CNFSM4JOLVICKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH3RY5Q#issuecomment-569842806>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJJFVUPKZ7Q3UFVSH7D2STQ3KMWXANCNFSM4JOLVICA> .

kmuehlbauer · 2020-01-23T06:52:06Z

@scottcha If found this while searching. Have the same requirements, means missing DataArrays in some Datasets of a timeseries to be concatenated. I've already some hacks and workarounds in place for my specific use cases, but it would be really great if this could be handled by xarray.

I'll try to test your current implementation against my source data and will report my findings here.

Update: I've rebased locally on latest master and this works smoothly with my data (which uses packed data). I'll now look into performance.

kmuehlbauer · 2020-01-23T14:08:57Z

@scottcha @shoyer For one of my use cases (240 datasets, 1 with missing variables) I do not see any performance penalties using this implementation compared to the current. But this might be due to the fact, that the most time consuming part is the expand_dims for every dataset, which accounts for roughly 80% overall concat runtime.

If I can be of any help to push this over the line, please ping me.

dcherian · 2020-01-23T15:41:33Z

the most time consuming part is the expand_dims for every dataset, which accounts for roughly 80% overall concat runtime.

Hmmm... maybe we need a short-circuit version of expand_dims?

kmuehlbauer · 2020-01-23T15:59:47Z

@dcherian Just to clarify, the concatenation is done along a new dimension (which has to be created by expand_dims). What do you mean by short-clrcuit in this context?

scottcha · 2020-01-23T17:00:06Z

@kmuehlbauer @dcherian @shoyer If it would be easier it could abandon this PR and resubmit a new one as the code has drastically changed since the original comments were provided? Essentially I'm waiting for feedback or approval of this PR.

shoyer · 2020-01-23T17:10:48Z

Can you explain why you think you need the nested iteration over dataset variables? What ordering are you trying to achieve?

scottcha · 2020-01-23T19:15:53Z

xarray/core/concat.py

+    # Find union of all data variables (preserving order)
+    # assumes all datasets are relatively in the same order
+    # and missing variables are inserted in the correct position
+    # if datasets have variables in drastically different orders
+    # the resulting order will be dependent on the order they are in the list
+    # passed to concat
+    union_of_variables = OrderedDict()
+    union_of_coordinates = OrderedDict()
+    for ds in datasets:
+        var_list = list(ds.variables.keys())
+        # this logic maintains the order of the variable list and runs in
+        # O(n^2) where n is number of variables in the uncommon worst case
+        # where there are no missing variables this will be O(n)
+        for i in range(0, len(var_list)):
+            if var_list[i] not in union_of_variables:
+                # need to determine the correct place
+                # first add the new item which will be at the end
+                union_of_variables[var_list[i]] = None
+                union_of_variables.move_to_end(var_list[i])
+                # move any items after this in the variables list to the end
+                # this will only happen for missing variables
+                for j in range(i + 1, len(var_list)):
+                    if var_list[j] in union_of_variables:
+                        union_of_variables.move_to_end(var_list[j])


@shoyer if this is the code you are referring to this have two purposes:

Find a complete set of variables even if the first datatset in the concat list has a missing variable (the previous implementation assums the first datatset has all variables)

Maintains the order of those variables (which is essentially the sorting operations happening when a missing variable is encountered) which was documented as a requirement for groupby in the previous implementation.

I'm not sure if preserving ordering is really essential, though I guess it would be nice to have.

The fundamental problem here is efficiently determining a consistent merge order between lists. This is pretty similar to some code I once wrote in TensorFlow. It only handles merging two lists efficiently, but hopefully is a good model. The fundamental idea is to simultaneously consume elements across all the lists at once.

I think there is no reason why it could not be extended to N-lists (though it would also need to be changed to fall-back to order of appearance rather than raising an error):
https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/contrib/labeled_tensor/python/ops/core.py#L919

Either way, the logic should definitely live in a separate helper function, which makes it easier to test.

Regarding ordering I was going off the previous comment which said
# stack up each variable to fill-out the dataset (in order) # n.b. this loop preserves variable order, needed for groupby.
I believe one of the groupby test was also checking for that but I can't really recall at this point (regardless all existing groupby tests are currently passing in the pr).

I liked the code you linked and took a little detour to to try to incorporate a version of it in to my PR. I pulled it back out once I realized two things:

The conversion of the input list to a set I thought was a bit risky since the order isn't necessarily guaranteed (esp < python 3.7 where dicts weren't ordered by default) and it's why my implementation was relying on ordered dicts. I'm sure the code you linked is likely ok I just was unsure about taking a dependency on what seemed an undocumented assumption.

The case where no consistent ordering was possible returned None while I didn't necessarily think that was appropriate for this code since there isn't really a strict necessity for variable ordering and I'm not sure you want to go deeper down that path. Removing this assumption was forcing me in to more complex code.

I did spend a bit of time trying to write the generalized n dimension version of the consistent_ordering code but it was getting quite complex and was potentially hiding some complexity under some syntactic sugar. I ended up refactoring the piece of code in question to an internal method (as its still fairly tied to the implementation of the public method) and put a note that its a potential candidate for a refactor.

The PR is updated with these changes.

@scottcha @shoyer I've tested again the different approaches. If there are only occasional misses I workes quite well. But in corner cases (two neighboring variables miss in consecutive datasets) it can have unwanted results. I'll add some code after the weekend.

From what I read this problem is closely related to the shortest common supersequence problem. I've checked on the implementations and it works very well in terms of result, but is (currently) quite slow.

There should be some checks to possibly find one Dataset which contains all variables and can be used for output sorting. If none such is available then...

If there is a correct solution possible, the code should find it. Just my 2c.

You are right that this is a special case of shortest common supersequence though since there shouldn't be repeated values in any sequence it might be easier to solve.

@kmuehlbauer can you provide a case where you think the ordering determined by the current algorighm isn't providing the expected results? I just updated the PR with a test case for the multiple neighboring missing variables (as well as explicit asserts on the data_var ordering) and I'm still getting expected results. It would be great to see what you observed.

It may be time to actually ask what you want the behavior to really be in this case before introducing additional complexity. I just read through some of the pandas issues and looked like they dealt with this as well pandas-dev/pandas#4588. Is that the behavior you would like in xarray? I like the alignment at least for the default behavior with the pandas behavior but I think its really up to the xarray owners? Pandas allows a sort option which is also something to consider for an explicit alphabetical ordering.

(edited as I think the statement in the linked article about sql behvior was incorrect, also more clear about the pandas behavior)

@scottcha This is from the top of my head, so bear with me, if this isn't creating the unwanted effects.

ds1 = ['d1', 'd3' , 'd4' , 'd5' , 'd6' ] ds2 = ['d1', 'd2' , 'd4' , 'd5' , 'd6' ] ds3 = ['d1', 'd2' , 'd3' , 'd5' , 'd6' ] ds4 = ['d1', 'd2' , 'd3' , 'd4' , 'd6' ]

This is an example where one variable is missing in each Dataset, but the correct ordering is obvious. I hope I got it right. If not, I have to look it up on Monday earliest.

I'll test your additions/changes next week, currently travelling.

kmuehlbauer · 2020-01-28T11:33:41Z

@scottcha @shoyer below is a minimal example where one variable is missing in each file.

import random
random.seed(123)
random.randint(0, 10)

# create var names list with one missing value
orig = [f'd{i:02}' for i in range(10)]
datasets = []
for i in range(1, 9):
    l1 = orig.copy()
    l1.remove(f'd{i:02}')
    datasets.append(l1)

# create files
for i, dsl in enumerate(datasets):
    foo_data = np.arange(24).reshape(2, 3, 4)
    with nc.Dataset(f'test{i:02}.nc', 'w') as ds:
        ds.createDimension('x', size=2)
        ds.createDimension('y', size=3)
        ds.createDimension('z', size=4)
        for k in dsl:
            ds.createVariable(k, int, ('x', 'y', 'z'))
            ds.variables[k][:] = foo_data

flist = glob.glob('test*.nc')
dslist = []
for f in flist:
    dslist.append(xr.open_dataset(f))

ds2 = xr.concat(dslist, dim='time')
ds2

Output:

<xarray.Dataset>
Dimensions:  (time: 8, x: 2, y: 3, z: 4)
Dimensions without coordinates: time, x, y, z
Data variables:
    d01      (x, y, z) int64 0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23
    d00      (time, x, y, z) int64 0 1 2 3 4 5 6 7 8 ... 16 17 18 19 20 21 22 23
    d02      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d03      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d04      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d05      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d06      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d07      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
    d08      (time, x, y, z) float64 0.0 1.0 2.0 3.0 4.0 ... nan nan nan nan nan
    d09      (time, x, y, z) int64 0 1 2 3 4 5 6 7 8 ... 16 17 18 19 20 21 22 23

Three cases here:

d00 and d09 are available in all datasets, and they are concatenated correctly (keeping dtype)
d02 to d08 are missing in one dataset and are filled with the created dummy variable, but the dtype is converted to float64
d01 is not handled properly, because it is missing in the first dataset, this is due to checking only variables of first dataset in _calc_concat_over

            elif opt == "all":
                concat_over.update(
                    set(getattr(datasets[0], subset)) - set(datasets[0].dims)
                )

and from putting d01 in result_vars before iterating to find missing variables.

kmuehlbauer · 2020-01-28T11:34:56Z

xarray/core/concat.py

+            )
+
+            union_of_variables[variable_key] = full_like(
+                ds[variable_key], fill_value=v_fill_value, dtype=dtype


This need to be ds.variables[variable_key], fill_value=v_fill_value, dtype=dtype, otherwise it will fail later (DataArray has no set_dim)

Thanks for the feedback and the above test. I'll try to incorporate your suggested test as well as the rest of the pending comments in the next update.

dcherian

I added the comment about preserving order because of this test:

xarray/xarray/tests/test_dataset.py

Lines 3631 to 3643 in d63888c

    
           def test_groupby_order(self): 
        
               # groupby should preserve variables order 
        
               ds = Dataset() 
        
               for vn in ["a", "b", "c"]: 
        
                   ds[vn] = DataArray(np.arange(10), dims=["t"]) 
        
               data_vars_ref = list(ds.data_vars.keys()) 
        
               ds = ds.groupby("t").mean(...) 
        
               data_vars = list(ds.data_vars.keys()) 
        
               assert data_vars == data_vars_ref 
        
               # coords are now at the end of the list, so the test below fails 
        
               # all_vars = list(ds.variables.keys()) 
        
               # all_vars_ref = list(ds.variables.keys()) 
        
               # self.assertEqual(all_vars, all_vars_ref)

I wonder if we can drastically simplify this PR with

data_var_order = list(datasets[0].data_vars) 
data_var_order += list(data_names - set(data_var_order))

i.e. take the order from the first dataset. Ordering for any variables not in the first dataset is not guaranteed. This should make that groupby test pass.

Unless I'm missing something, I don't think xarray cares about variable order anywhere else.

dcherian · 2020-01-29T15:52:25Z

xarray/core/concat.py

@@ -1,7 +1,9 @@
 import pandas as pd
+from collections import OrderedDict


just plain dict should be fine now since we are python 3.6+

Ok, I didn't realize that it was 3.6+ only. Will change to dict.

dcherian · 2020-01-29T15:53:07Z

xarray/tests/test_concat.py

+    datasets = create_concat_datasets(2, 123)
+    vars_to_drop = ["humidity", "precipitation", "cloud cover"]
+    datasets[0] = datasets[0].drop_vars(vars_to_drop)
+    datasets[1] = datasets[1].drop_vars(vars_to_drop + ["pressure"])


Suggested change

datasets[1] = datasets[1].drop_vars(vars_to_drop + ["pressure"])

datasets[1] = datasets[1].drop_vars(vars_to_drop + ["pressure"]).isel(day=0)

Tests start failing with this change.

I'll submit an update with the suggested changes. I agree that I'm not sure where order should matter as long as the result is deterministic.

dcherian · 2020-02-14T15:52:21Z

I am now wondering if we can use align or reindex to do the filling for us.

Example: goal is concat along 'x' with result dataset having x=[1,2,3,4]

Loop through datasets and assign coordinate values as appropriate.
Break datasets up into mappings collected = {"variable": [var1_at_x=1, var2_at_x=2, var4_at_x=4]} -> there's some stuff in merge.py that could be reused for this
concatenate these lists to get a new mapping concatenated = {"variable": [var_at_x=[1,2,4]]}
apply reindexed = {concatenated[var].reindex(x=[1,2,3,4], fill_value=...) for var in concatenated}
create dataset Dataset(reindexed)

Step 1 would be where we deal with all the edge cases mentioned in @shoyer's comment viz

For example:

Pre-existing vs non-pre-existing dimension
Pre-existing dimensions of different sizes
Missing data variables vs coordinates vs indexed coordinates

scottcha · 2020-02-15T18:38:25Z

I just pushed an incomplete set of changes as @kmuehlbauer tests have demonstrated there was some incomplete cases the PR still isn't handling.
Here is a summary:

I've simplified the logic based on @dcherian comments but in order to keep the result deterministic needed to use list logic instead of set logic. I also kept the OrderedDict instead of going with the default dict as the built in ordering methods as of py 3.6 were still insufficient for keeping the ordering consistent (I needed to pop FIFO) which doesn't seem possible until py 3.8.
I did add a failing test to capture the cases @kmuehlbauer pointed out.

I'm not sure I have my head wrapped around xarray enough to address @dcherian's latest comments though which is why i'm sharing the code at this point. All tests are passing except the new cases which were pointed out.

I'll try to continue to get time to update this but wanted to at least provide this status update at this point as its been awhile.

mullenkamp · 2021-06-30T08:41:29Z

Has this been implemented? Or is it still failing the tests?

lukelbd · 2022-06-24T05:08:38Z

Cool PR - looks like it's stale? Maybe someone should copy the work to a new one? Have been coming across this issue a lot in my work recently.

kmuehlbauer · 2022-11-07T20:26:57Z

@scottcha Are you still around and interested to bring this along? If not I could try to dive again into this.

scottcha · 2022-11-07T20:33:02Z

I'm still along and yes I do still need this functionality (I still sync back to this PR when I have data missing vars). The issue was that the technical requirements got beyond what I was able to account for with the time I had available. If you or someone else was interested in picking it up I'd be happy to evaluate against my use cases.

kmuehlbauer · 2022-11-07T20:38:21Z

Great @scottcha, I was coming back here too every once in an while to just refresh my mind with the ideas pursued here. I can try to rebase the PR onto latest main, if I can free some cycles in the following days for starters.

keewis · 2022-11-07T21:02:35Z

I can try to rebase the PR onto latest main

I did try that a few months ago, but a lot has changed since the PR was opened so it might actually be easier to reimplement the PR?

kmuehlbauer · 2022-11-07T21:37:11Z

Thanks @keewis for the heads up. I'll have a look and if things get too complicated a reimplementation might be our best option.

kmuehlbauer · 2022-12-22T14:56:44Z

@scottcha @keewis I've tried hard, but finally decided to start from scratch, see #7400.

@scottcha

* Fill missing data variables during concat by reindexing * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * FIX: use `Any` for type of `fill_value` as this seems consistent with other places * ENH: add tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typing Co-authored-by: Illviljan <[email protected]> * typing Co-authored-by: Illviljan <[email protected]> * typing Co-authored-by: Illviljan <[email protected]> * use None instead of False Co-authored-by: Illviljan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * concatenate variable in any case if variable has concat_dim * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add tests from @scottcha #3545 * typing * fix typing * fix tests with, finalize typing * add whats-new.rst entry * Update xarray/tests/test_concat.py Co-authored-by: Illviljan <[email protected]> * Update xarray/tests/test_concat.py Co-authored-by: Illviljan <[email protected]> * add TODO, fix numpy.random.default_rng * change np.random to use Generator * move code for variable order into dedicated function, merge with _parse_datasets, provide fast lane for variable order estimation * fix comment * Use order from first dataset, append missing variables to the end * ensure fill_value is dict * ensure fill_value in align * simplify combined_var, fix test * revert fill_value for alignment.py * derive variable order in order of appearance as suggested per review * remove unneeded enumerate * Use alignment.reindex_variables instead. This also removes the need to handle fill_value * small cleanup * Update doc/whats-new.rst Co-authored-by: Deepak Cherian <[email protected]> * adapt tests as per review request, fix ensure_common_dims * adapt tests as per review request * fix whats-new.rst * add whats-new.rst entry * Add additional test with scalar data_var * remove erroneous content from whats-new.rst Co-authored-by: Scott Chamberlin <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Illviljan <[email protected]> Co-authored-by: Deepak Cherian <[email protected]>

scottcha added 2 commits November 17, 2019 12:35

Concat adds default variables when missing from a dataset

baeebed

Update whats-new.rst with change

a96583b

dcherian requested changes Nov 18, 2019

View reviewed changes

xarray/core/concat.py Outdated Show resolved Hide resolved

doc/whats-new.rst Outdated Show resolved Hide resolved

dcherian mentioned this pull request Nov 18, 2019

Concatenate datasets when some variables are present in one dataset and not present in other dataset intake/intake-esm#144

Open

scottcha and others added 4 commits November 18, 2019 09:27

Update doc/whats-new.rst

af347e7

Co-Authored-By: Deepak Cherian <[email protected]>

Update test to check dtypes for fill values on concat

418c538

Merge branch 'add-defaults-during-concat-508' of https://github.com/s…

9e35c84

…cottcha/xarray into add-defaults-during-concat-508 Merge doc change

update docstring to explain new fill_value behavior

f7124a3

dcherian approved these changes Nov 19, 2019

View reviewed changes

xarray/tests/test_combine.py Outdated Show resolved Hide resolved

keewis reviewed Nov 19, 2019

View reviewed changes

doc/whats-new.rst Outdated Show resolved Hide resolved

shoyer reviewed Nov 19, 2019

View reviewed changes

scottcha and others added 2 commits November 19, 2019 17:01

Update doc/whats-new.rst

47f7e4d

Co-Authored-By: keewis <[email protected]>

Remove test_combine.py update from pr

df3693e

scottcha added 2 commits December 30, 2019 14:59

Improved implementation of concat which respects type promotion, orde…

c21dcd4

…ring and variable types when variables are missing

refactor variable promotion logic to dtypes.py

4e01bd9

scottcha commented Jan 23, 2020

View reviewed changes

scottcha added 2 commits January 23, 2020 19:51

refactor variable order code to its own internal method

515b9c1

Add test for multiple consecutive missing variables

cf5b8bd

kmuehlbauer reviewed Jan 28, 2020

View reviewed changes

dcherian reviewed Jan 29, 2020

View reviewed changes

Simplify variable ordering in concat for missing variable case

3bf3931

remove unused helper method

03f9b3b

AnnaKwa mentioned this pull request Mar 26, 2020

Feature/save intermediate ml diag data ai2cm/fv3net#200

Merged

Filip-K mentioned this pull request Jun 2, 2022

Ignore missing variables when concatenating datasets? #508

Closed

kmuehlbauer mentioned this pull request Dec 22, 2022

Fill missing data_vars during concat by reindexing #7400

Merged

5 tasks

kmuehlbauer pushed a commit to kmuehlbauer/xarray that referenced this pull request Jan 8, 2023

add tests from @scottcha pydata#3545

de46929

kmuehlbauer pushed a commit to kmuehlbauer/xarray that referenced this pull request Jan 9, 2023

add tests from @scottcha pydata#3545

031f0f8

kmuehlbauer pushed a commit to kmuehlbauer/xarray that referenced this pull request Jan 16, 2023

add tests from @scottcha pydata#3545

7aeb283

kmuehlbauer pushed a commit to kmuehlbauer/xarray that referenced this pull request Jan 19, 2023

add tests from @scottcha pydata#3545

b9d0d02

dcherian closed this in #7400 Jan 20, 2023

	def test_groupby_order(self):
	# groupby should preserve variables order
	ds = Dataset()
	for vn in ["a", "b", "c"]:
	ds[vn] = DataArray(np.arange(10), dims=["t"])
	data_vars_ref = list(ds.data_vars.keys())
	ds = ds.groupby("t").mean(...)
	data_vars = list(ds.data_vars.keys())
	assert data_vars == data_vars_ref
	# coords are now at the end of the list, so the test below fails
	# all_vars = list(ds.variables.keys())
	# all_vars_ref = list(ds.variables.keys())
	# self.assertEqual(all_vars, all_vars_ref)

		@@ -1,7 +1,9 @@
		import pandas as pd
		from collections import OrderedDict

	datasets[1] = datasets[1].drop_vars(vars_to_drop + ["pressure"])
	datasets[1] = datasets[1].drop_vars(vars_to_drop + ["pressure"]).isel(day=0)

Add defaults during concat 508 #3545

Add defaults during concat 508 #3545

Conversation

scottcha commented Nov 17, 2019

dcherian left a comment

Choose a reason for hiding this comment

dcherian left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Nov 19, 2019

scottcha commented Nov 20, 2019

scottcha commented Dec 31, 2019

shoyer commented Dec 31, 2019 via email

kmuehlbauer commented Jan 23, 2020 • edited Loading

kmuehlbauer commented Jan 23, 2020

dcherian commented Jan 23, 2020

kmuehlbauer commented Jan 23, 2020

scottcha commented Jan 23, 2020 • edited Loading

shoyer commented Jan 23, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scottcha Jan 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scottcha Jan 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kmuehlbauer commented Jan 28, 2020 • edited Loading

kmuehlbauer Jan 28, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcherian left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcherian commented Feb 14, 2020 • edited Loading

scottcha commented Feb 15, 2020

mullenkamp commented Jun 30, 2021

lukelbd commented Jun 24, 2022

kmuehlbauer commented Nov 7, 2022

scottcha commented Nov 7, 2022

kmuehlbauer commented Nov 7, 2022

keewis commented Nov 7, 2022 • edited Loading

kmuehlbauer commented Nov 7, 2022

kmuehlbauer commented Dec 22, 2022

kmuehlbauer commented Jan 23, 2020 •

edited

Loading

scottcha commented Jan 23, 2020 •

edited

Loading

scottcha Jan 24, 2020 •

edited

Loading

scottcha Jan 24, 2020 •

edited

Loading

kmuehlbauer commented Jan 28, 2020 •

edited

Loading

kmuehlbauer Jan 28, 2020 •

edited

Loading

dcherian commented Feb 14, 2020 •

edited

Loading

keewis commented Nov 7, 2022 •

edited

Loading