Avoid auto creation of indexes in concat #8872

Merged · 81 commits · May 8, 2024

Commits
22995e9
test not creating indexes on concatenation
TomNicholas Mar 25, 2024
7142c9f
construct result dataset using Coordinates object with indexes passed…
TomNicholas Mar 25, 2024
7fb075a
remove unnecessary overwriting of indexes
TomNicholas Mar 25, 2024
285c1de
ConcatenatableArray class
TomNicholas Mar 25, 2024
cc24757
use ConcatenableArray in tests
TomNicholas Mar 28, 2024
90a2592
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Mar 28, 2024
beb665a
add regression tests
TomNicholas Mar 28, 2024
22f361d
fix by performing check
TomNicholas Mar 28, 2024
55166fc
refactor assert_valid_explicit_coords and rename dims->sizes
TomNicholas Mar 28, 2024
322b76e
Merge branch 'forbid_invalid_coordinates' into concat-avoid-index-aut…
TomNicholas Mar 28, 2024
da6692b
Revert "add regression tests"
TomNicholas Mar 28, 2024
35dfb67
Revert "fix by performing check"
TomNicholas Mar 28, 2024
fd3de2b
Revert "refactor assert_valid_explicit_coords and rename dims->sizes"
TomNicholas Mar 28, 2024
0a60172
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Mar 28, 2024
21afbb1
fix failing test
TomNicholas Mar 28, 2024
6e9ead6
possible fix for failing groupby test
TomNicholas Mar 28, 2024
2534712
Revert "possible fix for failing groupby test"
TomNicholas Mar 29, 2024
3e848eb
test expand_dims doesn't create Index
TomNicholas Apr 19, 2024
95d453c
add option to not create 1D index in expand_dims
TomNicholas Apr 19, 2024
ba5627e
refactor tests to consider data variables and coordinate variables se…
TomNicholas Apr 20, 2024
3719ba7
test expand_dims doesn't create Index
TomNicholas Apr 19, 2024
018e74b
add option to not create 1D index in expand_dims
TomNicholas Apr 19, 2024
f680505
refactor tests to consider data variables and coordinate variables se…
TomNicholas Apr 20, 2024
f10509a
fix bug causing new test to fail
TomNicholas Apr 20, 2024
8152c0a
test index auto-creation when iterable passed as new coordinate values
TomNicholas Apr 20, 2024
aa813cf
make test for iterable pass
TomNicholas Apr 20, 2024
e78de7d
added kwarg to dataarray
TomNicholas Apr 20, 2024
b1329cc
whatsnew
TomNicholas Apr 20, 2024
a9f7e0c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 20, 2024
2ce3dec
Revert "refactor tests to consider data variables and coordinate vari…
TomNicholas Apr 20, 2024
87a08b4
Revert "add option to not create 1D index in expand_dims"
TomNicholas Apr 20, 2024
e0c6db1
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 20, 2024
214ed7d
test that concat doesn't raise if create_1d_index=False
TomNicholas Apr 20, 2024
78d2798
make test pass by passing create_1d_index down through concat
TomNicholas Apr 20, 2024
fc206b0
assert that an UnexpectedDataAccess error is raised when create_1d_in…
TomNicholas Apr 20, 2024
ce797f1
eliminate possibility of xarray internals bypassing UnexpectedDataAcc…
TomNicholas Apr 20, 2024
62e750f
update tests to use private versions of assertions
TomNicholas Apr 26, 2024
f86c82f
create_1d_index->create_index
TomNicholas Apr 26, 2024
4dd8d3c
Merge branch 'main' into expand_dims_create_1d_index
TomNicholas Apr 26, 2024
d5d90fd
Update doc/whats-new.rst
TomNicholas Apr 26, 2024
e00dbab
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 26, 2024
5bb88b8
Rename create_1d_index -> create_index
TomNicholas Apr 26, 2024
1d471b1
fix ConcatenatableArray
TomNicholas Apr 26, 2024
766605d
formatting
TomNicholas Apr 26, 2024
971287f
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 26, 2024
10c0ed5
whatsnew
TomNicholas Apr 26, 2024
51eea5d
add new create_index kwarg to overloads
TomNicholas Apr 26, 2024
bde9f2b
split vars into data_vars and coord_vars in one loop
TomNicholas Apr 26, 2024
d5241ce
avoid mypy error by using new variable name
TomNicholas Apr 26, 2024
7e8f895
warn if create_index=True but no index created because dimension vari…
TomNicholas Apr 27, 2024
ed85446
add string marks in warning message
TomNicholas Apr 27, 2024
39571ba
Merge branch 'main' into expand_dims_create_1d_index
TomNicholas Apr 27, 2024
206985b
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 27, 2024
86998e4
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 27, 2024
5894724
regression test for dtype changing in to_stacked_array
TomNicholas Apr 29, 2024
dad9433
correct doctest
TomNicholas Apr 29, 2024
b235c09
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 29, 2024
36a2223
Remove outdated comment
TomNicholas Apr 29, 2024
e17c13f
test we can skip creation of indexes during shape promotion
TomNicholas Apr 29, 2024
e8fa857
make shape promotion test pass
TomNicholas Apr 29, 2024
648d5bc
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas Apr 29, 2024
deb292c
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 29, 2024
6dd57a9
point to issue in whatsnew
TomNicholas Apr 29, 2024
b0e3612
don't create dimension coordinates just to drop them at the end
TomNicholas May 1, 2024
b2f06a0
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas May 1, 2024
ff70fc7
Remove ToDo about not using Coordinates object to pass indexes
TomNicholas May 1, 2024
2f97a5c
get rid of unlabeled_dims variable entirely
TomNicholas May 1, 2024
6d825e5
move ConcatenatableArray and similar to new file
TomNicholas May 8, 2024
b88b5a6
formatting nit
TomNicholas May 8, 2024
30c7408
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas May 8, 2024
b243150
renamed create_index -> create_index_for_new_dim in concat
TomNicholas May 8, 2024
9e9e168
renamed create_index -> create_index_for_new_dim in expand_dims
TomNicholas May 8, 2024
dca2fb9
fix incorrect arg name
TomNicholas May 8, 2024
c979672
add example to docstring
TomNicholas May 8, 2024
ac27ce0
add example of using new kwarg to docstring of expand_dims
TomNicholas May 8, 2024
d73ac48
add example of using new kwarg to docstring of concat
TomNicholas May 8, 2024
9ebbb33
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas May 8, 2024
d1b656d
re-nit the nit
TomNicholas May 8, 2024
ac998e9
more instances of the nit
keewis May 8, 2024
0849b94
fix docstring doctest formatting nit
TomNicholas May 8, 2024
25764ca
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas May 8, 2024
5 changes: 4 additions & 1 deletion doc/whats-new.rst
@@ -32,7 +32,10 @@ New Features
- :py:func:`testing.assert_allclose`/:py:func:`testing.assert_equal` now accept a new argument `check_dims="transpose"`, controlling whether a transposed array is considered equal. (:issue:`5733`, :pull:`8991`)
By `Ignacio Martinez Vazquez <https://github.com/ignamv>`_.
- Added the option to avoid automatically creating 1D pandas indexes in :py:meth:`Dataset.expand_dims()`, by passing the new kwarg
`create_index=False`. (:pull:`8960`)
`create_index_for_new_dim=False`. (:pull:`8960`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Avoid automatically re-creating 1D pandas indexes in :py:func:`concat()`. Also added option to avoid creating 1D indexes for
new dimension coordinates by passing the new kwarg `create_index_for_new_dim=False`. (:issue:`8871`, :pull:`8872`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.

Breaking changes
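In practice the kwarg described in the entries above reads like this; a minimal sketch that mirrors the doctest examples added further down in this PR:

```python
import xarray as xr

# A dataset whose only content is a scalar coordinate "x".
ds = xr.Dataset(coords={"x": 0})

# Default behaviour: concatenating along "x" promotes it to a dimension
# coordinate backed by a pandas index.
print(xr.concat([ds, ds], dim="x").indexes)
# x  Index([0, 0], dtype='int64', name='x')

# With the new kwarg, no index is created for the new dimension, so lazy or
# duck-typed coordinate data never has to be loaded just to build one.
print(xr.concat([ds, ds], dim="x", create_index_for_new_dim=False).indexes)
# (empty)

# The matching kwarg on expand_dims, added in pull request 8960.
print(ds.expand_dims("x", create_index_for_new_dim=False).indexes)
# (empty)
```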
67 changes: 52 additions & 15 deletions xarray/core/concat.py
@@ -8,6 +8,7 @@

from xarray.core import dtypes, utils
from xarray.core.alignment import align, reindex_variables
from xarray.core.coordinates import Coordinates
from xarray.core.duck_array_ops import lazy_array_equiv
from xarray.core.indexes import Index, PandasIndex
from xarray.core.merge import (
@@ -42,6 +43,7 @@ def concat(
fill_value: object = dtypes.NA,
join: JoinOptions = "outer",
combine_attrs: CombineAttrsOptions = "override",
create_index_for_new_dim: bool = True,
) -> T_Dataset: ...


@@ -56,6 +58,7 @@ def concat(
fill_value: object = dtypes.NA,
join: JoinOptions = "outer",
combine_attrs: CombineAttrsOptions = "override",
create_index_for_new_dim: bool = True,
) -> T_DataArray: ...


@@ -69,6 +72,7 @@ def concat(
fill_value=dtypes.NA,
join: JoinOptions = "outer",
combine_attrs: CombineAttrsOptions = "override",
create_index_for_new_dim: bool = True,
):
"""Concatenate xarray objects along a new or existing dimension.

@@ -162,6 +166,8 @@ def concat(

If a callable, it must expect a sequence of ``attrs`` dicts and a context object
as its only parameters.
create_index_for_new_dim : bool, default: True
Whether to create a new ``PandasIndex`` object when the objects being concatenated contain scalar variables named ``dim``.

Returns
-------
@@ -217,6 +223,25 @@ def concat(
x (new_dim) <U1 8B 'a' 'b'
* y (y) int64 24B 10 20 30
* new_dim (new_dim) int64 16B -90 -100

# Concatenate a scalar variable along a new dimension of the same name with and without creating a new index

>>> ds = xr.Dataset(coords={"x": 0})
>>> xr.concat([ds, ds], dim="x")
<xarray.Dataset> Size: 16B
Dimensions: (x: 2)
Coordinates:
* x (x) int64 16B 0 0
Data variables:
*empty*

>>> xr.concat([ds, ds], dim="x").indexes
Indexes:
x Index([0, 0], dtype='int64', name='x')

>>> xr.concat([ds, ds], dim="x", create_index_for_new_dim=False).indexes
Indexes:
*empty*
"""
# TODO: add ignore_index arguments copied from pandas.concat
# TODO: support concatenating scalar coordinates even if the concatenated
@@ -245,6 +270,7 @@ def concat(
fill_value=fill_value,
join=join,
combine_attrs=combine_attrs,
create_index_for_new_dim=create_index_for_new_dim,
)
elif isinstance(first_obj, Dataset):
return _dataset_concat(
@@ -257,6 +283,7 @@ def concat(
fill_value=fill_value,
join=join,
combine_attrs=combine_attrs,
create_index_for_new_dim=create_index_for_new_dim,
)
else:
raise TypeError(
@@ -439,7 +466,7 @@ def _parse_datasets(
if dim in dims:
continue

if dim not in dim_coords:
if dim in ds.coords and dim not in dim_coords:
dim_coords[dim] = ds.coords[dim].variable
dims = dims | set(ds.dims)

@@ -456,6 +483,7 @@ def _dataset_concat(
fill_value: Any = dtypes.NA,
join: JoinOptions = "outer",
combine_attrs: CombineAttrsOptions = "override",
create_index_for_new_dim: bool = True,
) -> T_Dataset:
"""
Concatenate a sequence of datasets along a new or existing dimension
@@ -489,7 +517,6 @@ def _dataset_concat(
datasets
)
dim_names = set(dim_coords)
unlabeled_dims = dim_names - coord_names

both_data_and_coords = coord_names & data_names
if both_data_and_coords:
@@ -502,15 +529,18 @@

# case where concat dimension is a coordinate or data_var but not a dimension
if (dim in coord_names or dim in data_names) and dim not in dim_names:
datasets = [ds.expand_dims(dim) for ds in datasets]
datasets = [
ds.expand_dims(dim, create_index_for_new_dim=create_index_for_new_dim)
for ds in datasets
]

# determine which variables to concatenate
concat_over, equals, concat_dim_lengths = _calc_concat_over(
datasets, dim, dim_names, data_vars, coords, compat
)

# determine which variables to merge, and then merge them according to compat
variables_to_merge = (coord_names | data_names) - concat_over - unlabeled_dims
variables_to_merge = (coord_names | data_names) - concat_over

result_vars = {}
result_indexes = {}
@@ -567,7 +597,8 @@ def get_indexes(name):
var = ds._variables[name]
if not var.dims:
data = var.set_dims(dim).values
yield PandasIndex(data, dim, coord_dtype=var.dtype)
if create_index_for_new_dim:
yield PandasIndex(data, dim, coord_dtype=var.dtype)

# create concatenation index, needed for later reindexing
file_start_indexes = np.append(0, np.cumsum(concat_dim_lengths))
@@ -646,29 +677,33 @@ def get_indexes(name):
# preserves original variable order
result_vars[name] = result_vars.pop(name)

result = type(datasets[0])(result_vars, attrs=result_attrs)

absent_coord_names = coord_names - set(result.variables)
absent_coord_names = coord_names - set(result_vars)
if absent_coord_names:
raise ValueError(
f"Variables {absent_coord_names!r} are coordinates in some datasets but not others."
)
result = result.set_coords(coord_names)
result.encoding = result_encoding

result = result.drop_vars(unlabeled_dims, errors="ignore")
result_data_vars = {}
coord_vars = {}
for name, result_var in result_vars.items():
if name in coord_names:
coord_vars[name] = result_var
else:
result_data_vars[name] = result_var

if index is not None:
# add concat index / coordinate last to ensure that its in the final Dataset
if dim_var is not None:
index_vars = index.create_variables({dim: dim_var})
else:
index_vars = index.create_variables()
result[dim] = index_vars[dim]

coord_vars[dim] = index_vars[dim]
result_indexes[dim] = index

# TODO: add indexes at Dataset creation (when it is supported)
result = result._overwrite_indexes(result_indexes)
coords_obj = Coordinates(coord_vars, indexes=result_indexes)

result = type(datasets[0])(result_data_vars, coords=coords_obj, attrs=result_attrs)
result.encoding = result_encoding

return result

@@ -683,6 +718,7 @@ def _dataarray_concat(
fill_value: object = dtypes.NA,
join: JoinOptions = "outer",
combine_attrs: CombineAttrsOptions = "override",
create_index_for_new_dim: bool = True,
) -> T_DataArray:
from xarray.core.dataarray import DataArray

@@ -719,6 +755,7 @@ def _dataarray_concat(
fill_value=fill_value,
join=join,
combine_attrs=combine_attrs,
create_index_for_new_dim=create_index_for_new_dim,
)

merged_attrs = merge_attrs([da.attrs for da in arrays], combine_attrs)
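The central change in `_dataset_concat` above: instead of building the Dataset first and then overwriting or dropping indexes afterwards, the result is now assembled from a `Coordinates` object that carries exactly the indexes computed during concatenation, so no default `PandasIndex` is created behind the scenes. A rough sketch of the same pattern through the public API, assuming `xr.Coordinates(..., indexes={})` skips default index creation as documented:

```python
import numpy as np
import xarray as xr

# A dimension-coordinate variable for "x", kept as a plain Variable.
x_var = xr.Variable("x", np.array([10, 20, 30]))

# Passing indexes={} asks Coordinates not to build default pandas indexes,
# so the data in x_var is never touched just to construct one.
coords_without_indexes = xr.Coordinates({"x": x_var}, indexes={})

ds = xr.Dataset({"a": ("x", np.zeros(3))}, coords=coords_without_indexes)
print(ds.xindexes)  # empty: no index was auto-created for "x"
```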
12 changes: 7 additions & 5 deletions xarray/core/dataarray.py
@@ -2558,7 +2558,7 @@ def expand_dims(
self,
dim: None | Hashable | Sequence[Hashable] | Mapping[Any, Any] = None,
axis: None | int | Sequence[int] = None,
create_index: bool = True,
create_index_for_new_dim: bool = True,
**dim_kwargs: Any,
) -> Self:
"""Return a new object with an additional axis (or axes) inserted at
@@ -2569,7 +2569,7 @@ def expand_dims(
coordinate consisting of a single value.

The automatic creation of indexes to back new 1D coordinate variables
controlled by the create_index kwarg.
controlled by the create_index_for_new_dim kwarg.

Parameters
----------
@@ -2586,8 +2586,8 @@ def expand_dims(
multiple axes are inserted. In this case, dim arguments should be
same length list. If axis=None is passed, all the axes will be
inserted to the start of the result array.
create_index : bool, default is True
Whether to create new PandasIndex objects for any new 1D coordinate variables.
create_index_for_new_dim : bool, default: True
Whether to create new ``PandasIndex`` objects when the object being expanded contains scalar variables with names in ``dim``.
**dim_kwargs : int or sequence or ndarray
The keywords are arbitrary dimensions being inserted and the values
are either the lengths of the new dims (if int is given), or their
@@ -2651,7 +2651,9 @@ def expand_dims(
dim = {dim: 1}

dim = either_dict_or_kwargs(dim, dim_kwargs, "expand_dims")
ds = self._to_temp_dataset().expand_dims(dim, axis, create_index=create_index)
ds = self._to_temp_dataset().expand_dims(
dim, axis, create_index_for_new_dim=create_index_for_new_dim
)
return self._from_temp_dataset(ds)

def set_index(
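`DataArray.expand_dims` above simply forwards the renamed kwarg to the Dataset implementation via `_to_temp_dataset`. A small sketch of what that enables, assuming the kwarg is passed through exactly as the diff shows:

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims="y")

# Add a length-1 "time" dimension with a coordinate value but without
# backing it with a pandas index.
expanded = da.expand_dims(time=[0], create_index_for_new_dim=False)
print(expanded.dims)     # ('time', 'y')
print(expanded.indexes)  # empty: the new "time" coordinate has no index
```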
43 changes: 35 additions & 8 deletions xarray/core/dataset.py
@@ -4513,7 +4513,7 @@ def expand_dims(
self,
dim: None | Hashable | Sequence[Hashable] | Mapping[Any, Any] = None,
axis: None | int | Sequence[int] = None,
create_index: bool = True,
create_index_for_new_dim: bool = True,
**dim_kwargs: Any,
) -> Self:
"""Return a new object with an additional axis (or axes) inserted at
@@ -4524,7 +4524,7 @@ def expand_dims(
coordinate consisting of a single value.

The automatic creation of indexes to back new 1D coordinate variables
controlled by the create_index kwarg.
controlled by the create_index_for_new_dim kwarg.

Parameters
----------
@@ -4541,8 +4541,8 @@ def expand_dims(
multiple axes are inserted. In this case, dim arguments should be
same length list. If axis=None is passed, all the axes will be
inserted to the start of the result array.
create_index : bool, default is True
Whether to create new PandasIndex objects for any new 1D coordinate variables.
create_index_for_new_dim : bool, default: True
Whether to create new ``PandasIndex`` objects when the object being expanded contains scalar variables with names in ``dim``.
**dim_kwargs : int or sequence or ndarray
The keywords are arbitrary dimensions being inserted and the values
are either the lengths of the new dims (if int is given), or their
@@ -4612,6 +4612,33 @@ def expand_dims(
Data variables:
temperature (y, x, time) float64 96B 0.5488 0.7152 0.6028 ... 0.7917 0.5289

# Expand a scalar variable along a new dimension of the same name with and without creating a new index

>>> ds = xr.Dataset(coords={"x": 0})
>>> ds
<xarray.Dataset> Size: 8B
Dimensions: ()
Coordinates:
x int64 8B 0
Data variables:
*empty*

>>> ds.expand_dims("x")
<xarray.Dataset> Size: 8B
Dimensions: (x: 1)
Coordinates:
* x (x) int64 8B 0
Data variables:
*empty*

>>> ds.expand_dims("x").indexes
Indexes:
x Index([0], dtype='int64', name='x')

>>> ds.expand_dims("x", create_index_for_new_dim=False).indexes
Indexes:
*empty*

See Also
--------
DataArray.expand_dims
@@ -4663,7 +4690,7 @@ def expand_dims(
# value within the dim dict to the length of the iterable
# for later use.

if create_index:
if create_index_for_new_dim:
index = PandasIndex(v, k)
indexes[k] = index
name_and_new_1d_var = index.create_variables()
@@ -4705,14 +4732,14 @@ def expand_dims(
variables[k] = v.set_dims(dict(all_dims))
else:
if k not in variables:
if k in coord_names and create_index:
if k in coord_names and create_index_for_new_dim:
# If dims includes a label of a non-dimension coordinate,
# it will be promoted to a 1D coordinate with a single value.
index, index_vars = create_default_index_implicit(v.set_dims(k))
indexes[k] = index
variables.update(index_vars)
else:
if create_index:
if create_index_for_new_dim:
warnings.warn(
f"No index created for dimension {k} because variable {k} is not a coordinate. "
f"To create an index for {k}, please first call `.set_coords('{k}')` on this object.",
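The warning branch above covers the case where the expanded name refers to a scalar data variable rather than a coordinate: no index is created for the new dimension, and the message points at `.set_coords` as the remedy. A sketch of that scenario, assuming the behaviour matches the warning text (the names here are illustrative):

```python
import xarray as xr

# "x" is a scalar *data variable*, not a coordinate.
ds = xr.Dataset({"x": 0})

# Warns: "No index created for dimension x because variable x is not a
# coordinate. To create an index for x, please first call `.set_coords('x')`
# on this object."
expanded = ds.expand_dims("x")
print(expanded.indexes)  # empty

# Promoting "x" to a coordinate first restores the default index creation.
print(ds.set_coords("x").expand_dims("x").indexes)
# x  Index([0], dtype='int64', name='x')
```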
@@ -5400,7 +5427,7 @@ def to_stacked_array(
[3, 4, 5, 7]])
Coordinates:
* z (z) object 32B MultiIndex
* variable (z) object 32B 'a' 'a' 'a' 'b'
* variable (z) <U1 16B 'a' 'a' 'a' 'b'
* y (z) object 32B 'u' 'v' 'w' nan
Dimensions without coordinates: x
