Add defaults during concat 508 #3545

Closed · wants to merge 14 commits
Changes from 6 commits
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -100,6 +100,8 @@ Bug fixes
(:issue:`3402`). By `Deepak Cherian <https://github.com/dcherian/>`_
- Allow appending datetime and bool data variables to zarr stores.
(:issue:`3480`). By `Akihiro Matsukawa <https://github.com/amatsukawa/>`_.
- Make :py:func:`~xarray.concat` more robust when concatenating variables present in some datasets but
not others (:issue:`508`). By `Scott Chamberlin <http://github.com/scottcha>`_.

Documentation
~~~~~~~~~~~~~
31 changes: 26 additions & 5 deletions xarray/core/concat.py
@@ -2,6 +2,7 @@

from . import dtypes, utils
from .alignment import align
from .common import full_like
from .duck_array_ops import lazy_array_equiv
from .merge import _VALID_COMPAT, unique_variable
from .variable import IndexVariable, Variable, as_variable
@@ -77,7 +78,8 @@ def concat(
to assign each dataset along the concatenated dimension. If not
supplied, objects are concatenated in the provided order.
fill_value : scalar, optional
Value to use for newly missing values
Value to use for newly missing values as well as to fill values where the
variable is not present in all datasets.
join : {'outer', 'inner', 'left', 'right', 'exact'}, optional
String indicating how to combine differing indexes
(excluding dim) in objects
@@ -370,10 +372,29 @@ def ensure_common_dims(vars):
# n.b. this loop preserves variable order, needed for groupby.
for k in datasets[0].variables:
if k in concat_over:
try:
vars = ensure_common_dims([ds.variables[k] for ds in datasets])
except KeyError:
raise ValueError("%r is not present in all datasets." % k)
variables = []
for ds in datasets:
# if one of the variables doesn't exist find one which does
# and use it to create a fill value
if k not in ds.variables:
for ds in datasets:
Member:

This nested loop through datasets concerns me. It means that concat will run in quadratic time with respect to the number of datasets being concatenated. This would probably make xarray.concat very slow on 1,000 datasets and outrageously slow on 10,000 datasets, both of which happen with some regularity.

It would be best to write this using a separate pass to create dummy versions of each Variable, which could be reused when appropriate.

Contributor:

> It would be best to write this using a separate pass to create dummy versions of each Variable, which could be reused when appropriate.

This could happen in calc_concat_over.
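
A minimal sketch of that separate-pass idea (the helper name and its placement are illustrative assumptions, not the merged implementation):

```python
import xarray as xr
from xarray.core import dtypes


def build_missing_variable_fills(datasets, fill_value=dtypes.NA):
    # One linear pass to record a template Variable for every name seen.
    templates = {}
    for ds in datasets:
        for name, var in ds.variables.items():
            templates.setdefault(name, var)
    # Build each dummy once and reuse it wherever the variable is missing,
    # instead of re-scanning the dataset list per missing variable (the
    # nested loop that is quadratic in the number of datasets).
    fills = {}
    for name, var in templates.items():
        if fill_value is dtypes.NA:
            dtype, value = dtypes.maybe_promote(var.dtype)
        else:
            dtype, value = var.dtype, fill_value
        fills[name] = xr.full_like(var, fill_value=value, dtype=dtype)
    return fills
```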

Contributor Author:

The new PR contains improved logic but still required me to go through the list of datasets a few times. I think the new worst-case runtime is O(DN^2), where D is the number of datasets and N is the number of variables in the final list. If no fill values are required, it will be O(DN).
I did some perf testing with the new logic versus the old and I don't really see a significant difference, but I would love additional feedback if there is a better way.

Perf result for concatenating 720 files via open_mfdataset (parallel=False) for this PR:
58.7 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original result:
58.1 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 4359 files via open_mfdataset (parallel=False) for this PR:
5min 54s ± 840 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorry, I don't have a good real-world dataset this large without missing values to test the original implementation, but this dataset, ~6x larger, took ~6x longer even with the penalty to cache and fill the missing values.

I don't currently have good data without missing variables larger than this (hence the PR :) ).

I was also not sure I should overload the logic in calc_concat_over to do more, but I can re-review this if the logic in the new PR looks like it should be refactored that way.

if k in ds.variables:
# found one to use as a fill value, fill with fill_value
if fill_value is dtypes.NA:
dtype, fill_value = dtypes.maybe_promote(
ds.variables[k].dtype
)
else:
dtype = ds.variables[k].dtype
Member:

This pattern is starting to look a little familiar now; I think there are at least a handful of existing uses in variable.py already. Maybe factor it out into a helper function in xarray.core.dtypes?
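
Something like the following helper would capture that repeated pattern (the name and signature are assumptions, not the code that was eventually merged):

```python
from xarray.core import dtypes


def get_fill_value_and_dtype(dtype, fill_value=dtypes.NA):
    # Hypothetical helper for xarray.core.dtypes: when no explicit fill
    # value is given, promote the dtype so it can hold a missing value
    # (e.g. int64 -> float64 for NaN) and use that sentinel; otherwise
    # keep the dtype and pass the user's fill value through unchanged.
    if fill_value is dtypes.NA:
        dtype, fill_value = dtypes.maybe_promote(dtype)
    return dtype, fill_value
```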

Contributor Author:

OK, this is in the new updated PR.


filled = full_like(
ds.variables[k], fill_value=fill_value, dtype=dtype
)
Member:

I am concerned that this dummy variable may not always be the right size.

For example, suppose we are concatenating two Datasets along the existing dimension 'x'. The first dataset has size x=1 and the second has size x=2. If a variable is missing from one but not the other, the "dummy" variable would always have the wrong size, resulting in a total length of 2 or 4, but not 3.

To properly handle this, I think you will need to index out the concatenated dimension from the dummy variable (wherever it is found), and then use expand_dims to add it back in the appropriate size for the current dataset.

Contributor Author:

OK, I'm not really sure I understand this case. Any chance you can provide a test I can use that would help?
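
For concreteness, a sketch of the case being described, with illustrative data that is not from the PR ("a" is missing from the second dataset, and the concat dimension has different sizes in the two inputs):

```python
import numpy as np
import xarray as xr

ds1 = xr.Dataset({"a": ("x", [1.0]), "b": ("x", [2.0])})  # x has size 1
ds2 = xr.Dataset({"b": ("x", [3.0, 4.0])})                # x has size 2; "a" is missing

# concat([ds1, ds2], dim="x") must give "b" length 3, so "a" also needs
# length 3: one real value plus two fill values. A dummy copied whole from
# ds1["a"] has length 1, so the total would be 2, not 3. Per the review
# suggestion: index out the concat dimension from the template, then
# expand it back to the size of the dataset being filled.
template = ds1["a"].isel(x=0, drop=True)             # drop the concat dim
dummy = template.expand_dims({"x": ds2.sizes["x"]})  # re-add it at size 2
filled = xr.full_like(dummy, fill_value=np.nan)      # all-NaN, shape (2,)
```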

break
variables.append(filled)
else:
variables.append(ds.variables[k])
vars = ensure_common_dims(variables)
combined = concat_vars(vars, dim, positions)
assert isinstance(combined, Variable)
result_vars[k] = combined
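
For reference, a minimal example of the behavior this hunk enables, mirroring the updated tests below (before this change, concat raised ValueError: "'y' is not present in all datasets"):

```python
import xarray as xr

ds1 = xr.Dataset({"x": ("a", [0]), "y": ("a", [0])})
ds2 = xr.Dataset({"x": ("a", [1])})  # "y" is absent here

result = xr.concat([ds1, ds2], dim="a")
# "y" is promoted to float and filled with NaN where it was missing:
# result["y"].values -> array([ 0., nan])
```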
10 changes: 6 additions & 4 deletions xarray/tests/test_combine.py
@@ -742,10 +742,16 @@ def test_auto_combine(self):
Dataset({"x": ("a", [0]), "y": ("a", [0])}),
Dataset({"y": ("a", [1]), "x": ("a", [1])}),
]

actual = auto_combine(objs)
expected = Dataset({"x": ("a", [0, 1]), "y": ("a", [0, 1])})
assert_identical(expected, actual)

objs = [Dataset({"x": [0], "y": [0]}), Dataset({"x": [0]})]
actual = auto_combine(objs)
expected = Dataset({"x": [0], "y": [0, np.nan]})
assert_identical(expected, actual)

objs = [Dataset({"x": [0], "y": [0]}), Dataset({"y": [1], "x": [1]})]
with raises_regex(ValueError, "too many .* dimensions"):
auto_combine(objs)
@@ -754,10 +760,6 @@
with raises_regex(ValueError, "cannot infer dimension"):
auto_combine(objs)

objs = [Dataset({"x": [0], "y": [0]}), Dataset({"x": [0]})]
with raises_regex(ValueError, "'y' is not present in all datasets"):
auto_combine(objs)

def test_auto_combine_previously_failed(self):
# In the above scenario, one file is missing, containing the data for
# one year's data for one variable.
33 changes: 28 additions & 5 deletions xarray/tests/test_concat.py
@@ -35,17 +35,28 @@ def test_concat_compat():
},
coords={"x": [0, 1], "y": [1], "z": [-1, -2], "q": [0]},
)

ds_concat = Dataset(
{
"has_x_y": (
("q", "y", "x"),
[[[np.nan, np.nan], [3, 4]], [[1, 2], [np.nan, np.nan]]],
),
"has_x": (("q", "x"), [[1, 2], [1, 2]]),
"no_x_y": (("q", "z"), [[1, 2], [1, 2]]),
},
coords={"x": [0, 1], "y": [0, 1], "z": [-1, -2], "q": [0, np.nan]},
)
result = concat([ds1, ds2], dim="y", data_vars="minimal", compat="broadcast_equals")
assert_equal(ds2.no_x_y, result.no_x_y.transpose())

for var in ["has_x", "no_x_y"]:
assert "y" not in result[var]

result2 = concat([ds2, ds1], dim="q")
assert_equal(ds_concat, result2)

with raises_regex(ValueError, "coordinates in some datasets but not others"):
concat([ds1, ds2], dim="q")
with raises_regex(ValueError, "'q' is not present in all datasets"):
concat([ds2, ds1], dim="q")


class TestConcatDataset:
@@ -327,17 +338,29 @@ def test_concat_fill_value(self, fill_value):
Dataset({"a": ("x", [2, 3]), "x": [1, 2]}),
Dataset({"a": ("x", [1, 2]), "x": [0, 1]}),
]

if fill_value == dtypes.NA:
# if we supply the default, we expect the missing value for a
# float array
fill_value = np.nan
fill_value_expected = np.nan
else:
fill_value_expected = fill_value

expected = Dataset(
{"a": (("t", "x"), [[fill_value, 2, 3], [1, 2, fill_value]])},
{
"a": (
("t", "x"),
[[fill_value_expected, 2, 3], [1, 2, fill_value_expected]],
)
},
{"x": [0, 1, 2]},
)
actual = concat(datasets, dim="t", fill_value=fill_value)
assert_identical(actual, expected)

# check that the dtype is as expected
assert expected.a.dtype == type(fill_value_expected)


class TestConcatDataArray:
def test_concat(self):