Ignore missing variables when concatenating datasets? #508

shoyer · 2015-08-02T06:03:57Z

Several users (@raj-kesavan, @richardotis, now myself) have wondered about how to concatenate xray Datasets with different variables.

With the current xray.concat, you need to awkwardly create dummy variables filled with NaN in datasets that don't have them (or drop mismatched variables entirely). Neither of these is a great option: concat should have an option (the default?) to take care of this for the user.

This would also be more consistent with pd.concat, which takes a more relaxed approach to matching dataframes with different variables (it does an outer join).
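
For comparison, here is a small sketch of the pd.concat behaviour referred to above. The frames, column names, and values are made up for illustration; the only thing it relies on is that pd.concat defaults to join="outer", so a column missing from one frame is kept and filled with NaN.

```python
import pandas as pd

# Two illustrative frames that share column "a" but otherwise differ.
df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7.0, 8.0]})

# join="outer" is the default: the result has the union of the columns,
# and rows coming from a frame that lacked a column hold NaN there.
combined = pd.concat([df1, df2], ignore_index=True)

print(combined)
```

The request in this issue is for the equivalent on Datasets: variables missing from some inputs would be filled rather than causing an error.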

max-sixty · 2019-01-15T20:10:58Z

Closing as stale; please reopen if still relevant.

scottcha · 2019-11-14T03:37:59Z

I just ran into this issue. While the previous fix seems to handle one case, it doesn't handle all the cases. Before I clean this up and open a new PR, does this look like it's on the right track? (It worked for my issue, where I was concatenating multiple datasets which always had the same dims and coordinates but sometimes were missing variables.)

This starts at line 353 of concat.py:

for k in datasets[0].variables:
    if k in concat_over:
        try:
            # new code
            for ds in datasets:
                if k not in ds.variables:
                    # make a new array with the same dimensions and coordinates;
                    # by default this will be initialized to np.nan, which is what we want
                    from .dataarray import DataArray
                    new_array = DataArray(coords=ds.coords, dims=ds.dims)
                    ds[k] = new_array
            # end new code
            vars = ensure_common_dims([ds.variables[k] for ds in datasets])
        except KeyError:
            # this can likely be removed then
            raise ValueError("%r is not present in all datasets." % k)
        combined = concat_vars(vars, dim, positions)
        assert isinstance(combined, Variable)
        result_vars[k] = combined

dcherian · 2019-11-14T15:33:52Z

Thanks for tackling this very important issue, @scottcha!

from .dataarray import DataArray
new_array = DataArray(coords=ds.coords, dims=ds.dims)
ds[k] = new_array

Instead of creating a DataArray we only need to create a Variable (https://xarray.pydata.org/en/stable/internals.html#variable-objects).

I would instead try full_like(example_variable, fill_value=np.nan) (import full_like from the appropriate file). The trick would be figuring out what example_variable is. Maybe like this? (there may be some clever way to avoid the two loops)

variables = []
for ds in datasets:
    if k in ds.variables:
        filled = full_like(ds.variables[k], fill_value=np.nan)
        break

for ds in datasets:
    if k not in ds.variables:
        variables.append(filled)
    else:
        variables.append(ds.variables[k])

vars = ensure_common_dims(variables)
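
On the aside about avoiding the two loops, one possible shape is sketched below in plain Python. This is not xarray code: the datasets are ordinary dicts mapping names to lists of floats, standing in for Dataset and Variable, and collect_variables is a hypothetical helper named only for this illustration. The idea is to find the template with next(...) and then build the list in a single comprehension.

```python
import math


def collect_variables(datasets, name):
    """Return one entry per dataset for `name`, using a NaN-filled
    placeholder wherever a dataset lacks that variable.

    `datasets` is a list of plain dicts (stand-ins for xarray Datasets).
    """
    # The first dataset that has the variable serves as the template.
    template = next((ds[name] for ds in datasets if name in ds), None)
    if template is None:
        raise ValueError(f"{name!r} is not present in any dataset.")

    # Placeholder with the same length as the template, filled with NaN.
    filled = [math.nan] * len(template)

    # Single pass: take the real variable if present, else the placeholder.
    return [ds.get(name, filled) for ds in datasets]


datasets = [
    {"t": [1.0, 2.0]},                   # lacks "p"
    {"t": [3.0, 4.0], "p": [5.0, 6.0]},  # has "p"
]
result = collect_variables(datasets, "p")
```

As in the snippet above, the same placeholder object is reused for every dataset that lacks the variable. A real implementation would also have to decide what to do for dtypes that cannot hold NaN.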

Please send in a PR with any progress you make. We are happy to help out. We have some documentation on contributing and testing here: https://xarray.pydata.org/en/stable/contributing.html

scottcha · 2019-11-14T17:30:09Z

Ok got it, I'll take a look and spin up a PR.
Thanks

Filip-K · 2022-06-02T10:59:59Z

Hi guys! Just to clarify, this is not fixed by #3769 (which only concerns coordinates, not variables) nor by #3364 (which concerns merge not concat). It would be fixed by #3545, but this one is not merged yet. Right?

dcherian · 2022-06-02T13:10:04Z

Yes that is correct

zoj613 · 2022-11-09T12:16:13Z

Any plans to support this?

kmuehlbauer · 2022-12-22T15:04:31Z

There is another attempt to get this resolved in #7400. Any input appreciated over there.

shoyer mentioned this issue Jul 27, 2016

ValueError: encountered unexpected variable nbnds #919

Closed

shoyer mentioned this issue Sep 22, 2016

Fixes for compat='no_conflicts' and open_mfdataset #1007

Merged

max-sixty closed this as completed Jan 15, 2019

dcherian mentioned this issue Oct 1, 2019

Make concat more forgiving with variables that are being merged. #3364

Merged

dcherian reopened this Oct 15, 2019

dcherian mentioned this issue Oct 15, 2019

Concatenate datasets when some variables are present in one dataset and not present in other dataset intake/intake-esm#144

Open

scottcha mentioned this issue Nov 17, 2019

Add defaults during concat 508 #3545

Closed

dcherian mentioned this issue Feb 14, 2020

concat now handles non-dim coordinates only present in one dataset #3769

Merged

dcherian added the topic-combine combine/concat/merge label Jul 8, 2021

kmuehlbauer mentioned this issue Dec 22, 2022

Fill missing data_vars during concat by reindexing #7400

Merged

dcherian closed this as completed in #7400 Jan 20, 2023
