Consistently report all dimensions in error messages if invalid dimensions are given #8079

mgunyho · 2023-08-17T16:03:53Z

Hello,

I noticed that arr.min("nonexistent") raises an error with a very helpful message

ValueError: 'nonexistent' not found in array dimensions ('x', 'y', 'z')

while arr.idxmin("nonexistent") raises

KeyError: 'Dimension "nonexistent" not in dimension' [sic]

IMO, the list of dimensions should always be shown in the error message for these kinds of errors, because it makes debugging much easier. With this PR, I have implemented this behavior for all such functions that I could find.

There is quite a consistent pattern which I think could be factored out into a function, but I didn't have a clear enough picture of the structure of the whole code to do it.
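For illustration, a factored-out check could look roughly like the sketch below. The name `assert_dims_exist` and its signature are hypothetical and not part of xarray; the wording of the message only mimics the `min()` error quoted above.

```python
from collections.abc import Hashable, Iterable


def assert_dims_exist(
    requested: Iterable[Hashable], available: Iterable[Hashable]
) -> None:
    """Raise ValueError naming the missing dims and listing all available ones.

    Hypothetical helper for illustration only; not xarray's API.
    """
    available = tuple(available)
    missing = tuple(d for d in requested if d not in available)
    if missing:
        raise ValueError(
            f"Dimensions {missing!r} not found in array dimensions {available!r}"
        )


try:
    assert_dims_exist(["nonexistent"], ("x", "y", "z"))
except ValueError as err:
    message = str(err)
    print(message)
```

Every call site would then produce the same wording, and any later change (such as limiting the length of the list) would only need to happen in one place.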

I didn't fix the tests yet, I'll do it if you think this can be merged.

Searched list of issues, couldn't find one related to this
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

TomNicholas · 2023-08-17T17:31:19Z

This is great, thank you @mgunyho !

kmuehlbauer · 2023-08-17T19:07:47Z

Not sure, but what to do if we have say tens or even hundreds of dimensions? Maybe that's not the majority of use cases but we should be prepared.

BTW, I've marked this as ready for review by accident, sorry.

mgunyho · 2023-08-18T08:40:34Z

what to do if we have say tens or even hundreds of dimensions?

The maximum number of dimensions for a numpy array is 32, and it seems that in the near future it's going to be increased to 64 at most: numpy/numpy#5744. The same limit seems to apply for dask.

But okay, a dataset where each variable has a different coordinate can have lots of dimensions, like #5546. Although there it is also mentioned that in reality usually the number of dimensions is on the order of 10.

IMO a 32-item list in the error message is a bit ugly but still acceptable. If a use case comes up where this is a problem, we could add logic to limit the number of items shown in the list (this would be a good reason to factor out a function).
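If such a limit were ever needed, it could be sketched as below. The helper name `format_dims`, the default cutoff of 10, and the wording of the elision are all assumptions made for this example; nothing like this is implemented in the PR.

```python
def format_dims(dims, max_items=10):
    """Render dims like a tuple, eliding everything beyond max_items.

    Hypothetical helper; the default cutoff of 10 is an arbitrary assumption.
    """
    dims = tuple(dims)
    if len(dims) <= max_items:
        return repr(dims)
    shown = ", ".join(repr(d) for d in dims[:max_items])
    return f"({shown}, ... and {len(dims) - max_items} more)"


short = format_dims(("x", "y", "z"))
long = format_dims([f"dim{i}" for i in range(32)], max_items=3)
print(short)  # ('x', 'y', 'z')
print(long)   # ('dim0', 'dim1', 'dim2', ... and 29 more)
```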

I'll fix the tests soon and then mark this as ready for review for real.

kmuehlbauer · 2023-08-18T08:48:30Z

@mgunyho I totally agree with your reasoning, but I just wanted to mention it as a possible problem source. Thanks for taking care!

mgunyho · 2023-08-19T13:19:42Z

I now went through all relevant ValueErrors and KeyErrors and updated the error messages where applicable.

I left out a couple of instances involving data_vars, because it's more likely to have lots of them (like in #5546).

For Dataset, {dataset.dims!r} shows up as Frozen({"dim1": 3, "dim2": 4}), so I used tuple(dataset.dims). I used tuple instead of list because that's what we have done earlier, see here and here. Personally I would maybe prefer list, because a single-element tuple ("dim",) looks a bit confusing compared to a list ["dim"].
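The formatting difference can be reproduced without xarray. In the sketch below, `types.MappingProxyType` is only a stand-in for the read-only mapping that `Dataset.dims` returns, so the wrapper name in the repr differs from `Frozen`, but the effect of wrapping in `tuple(...)` is the same.

```python
from types import MappingProxyType

# Stand-in for Dataset.dims: a read-only mapping from dimension name to size.
dims = MappingProxyType({"dim1": 3, "dim2": 4})

as_mapping = f"{dims!r}"       # includes the wrapper name and the sizes
as_tuple = f"{tuple(dims)!r}"  # iterating a mapping yields only its keys

single_tuple = repr(("dim",))  # note the trailing comma
single_list = repr(["dim"])

print(as_mapping)    # mappingproxy({'dim1': 3, 'dim2': 4})
print(as_tuple)      # ('dim1', 'dim2')
print(single_tuple)  # ('dim',)
print(single_list)   # ['dim']
```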

Note that arr.rolling(nonexistent=3) and arr.idxmin("nonexistent") raise KeyError, while coarsen() and min() raise ValueError. I also found MissingDimensionsError in variable.py which is a subclass of ValueError, but that's only used by one function.

doc/whats-new.rst

xarray/tests/test_dataset.py

mgunyho · 2023-08-19T13:29:07Z

xarray/tests/test_dataarray.py

+        with pytest.raises(
+            ValueError,
+            match=re.escape(
+                "Dimensions ('space',) not found in dataset dimensions ('time',)"


This now says "dataset" even though it's a DataArray, because DataArray.drop_duplicates just does self._to_temp_dataset().drop_duplicates(dim, keep=keep). Maybe the error message could say "data dimensions"? Should I change it everywhere to say just data instead of dataset?

Yes I think not mentioning Dataset would be better. There might be an example somewhere else of handling this same type of ambiguity.

I recall seeing an error message saying "data dimensions", and I found this: https://github.com/pydata/xarray/blob/main/xarray/core/variable.py#L677 (seems pretty closely related to this PR and #8089). I can change the wording to say just "data dimensions", I think it's a good way to put it.

Another option is "{self.__class__.__name__} dimensions", which is used in a couple of places.

I changed it now to "data dimensions".
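As a side note on the test above: the `re.escape` call matters because `pytest.raises(match=...)` treats its argument as a regular expression, and the parentheses of the printed tuples would otherwise be read as groups. A small sketch using only the standard library, with the message in the "data dimensions" wording settled on here:

```python
import re

message = "Dimensions ('space',) not found in data dimensions ('time',)"

# Unescaped, "(" and ")" are group delimiters, so the pattern expects the
# text WITHOUT literal parentheses and fails to match the real message.
unescaped = re.search(message, message)

# Escaped, every metacharacter is matched literally.
escaped = re.search(re.escape(message), message)

print(unescaped)            # None
print(escaped is not None)  # True
```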

mgunyho · 2023-08-19T13:29:53Z

Also, we should probably add the "error-reporting" label to this PR. I don't seem to be able to do it myself.

TomNicholas

Personally I would maybe prefer list

Given that we are not trying to communicate anything about the type (i.e. regardless of the error raised the dimensions are not stored as a list internally) then I think you can print them however you feel is neatest.

xarray/core/computation.py

TomNicholas · 2023-08-19T18:50:20Z

xarray/core/concat.py

-                        "some variables in data_vars are not data variables "
-                        f"on the first dataset: {invalid_vars}"
+                        f"the variables {invalid_vars} in data_vars are not "
+                        f"found in the data variables of the first dataset"


Suggested change

-                        f"found in the data variables of the first dataset"
+                        f"found in the data variables of the first dataset {valid_vars}"

Here I explicitly left out listing the data variables, because it's more likely to have many of them like in #5546 (see also the comment in the code right above this)

mgunyho · 2023-08-21T07:08:26Z

Personally I would maybe prefer list

Given that we are not trying to communicate anything about the type (i.e. regardless of the error raised the dimensions are not stored as a list internally) then I think you can print them however you feel is neatest.

@Illviljan, can you say why you preferred tuple over list here: #7821 (comment)?

Illviljan · 2023-08-21T12:56:38Z

I don't remember that suggestion but maybe here's the reason:

Tuples are faster than lists to initialize; if mutability is not needed, then use a tuple.
Dims are usually returned as Tuple[Hashable, ...], see DataArray().dims.

mgunyho · 2023-08-26T18:47:51Z

tuples are faster than list to initialize

I suppose here this doesn't make much of a difference, since the number of items is fairly small.

If mutability is not needed then use tuple, dims are usually returned as Tuple[Hashable, ...],

These are fair points. I've left it as tuple for now.

dcherian · 2023-09-08T15:45:14Z

Thanks @mgunyho this is a great improvement. Thank you!

dcherian · 2023-09-09T04:55:22Z

Note that arr.rolling(nonexistent=3) and arr.idxmin("nonexistent") raise KeyError, while coarsen() and min() raise ValueError. I also found MissingDimensionsError in variable.py which is a subclass of ValueError, but that's only used by one function.

I think we should harmonize all these to ValueError, but let's do that in a new issue & PR.
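One user-visible difference between the two exception types, shown here with plain Python and independent of xarray: `str()` of a `KeyError` with a single argument is the repr of that argument, which is where the extra quotes in the idxmin message at the top of this thread come from. The `MissingDimensionsError` class below is a local stand-in that only mirrors the subclass relationship described above.

```python
text = 'Dimension "nonexistent" not in dimension'

# KeyError renders its single argument with repr(), adding outer quotes.
key_error_text = str(KeyError(text))
# ValueError renders the message unchanged.
value_error_text = str(ValueError(text))


class MissingDimensionsError(ValueError):
    """Local stand-in; it only mirrors the ValueError subclassing."""


try:
    raise MissingDimensionsError(text)
except ValueError as err:  # a subclass is caught by its base class
    caught = type(err).__name__

print(key_error_text)    # 'Dimension "nonexistent" not in dimension'
print(value_error_text)  # Dimension "nonexistent" not in dimension
print(caught)            # MissingDimensionsError
```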

github-actions bot added the topic-indexing label Aug 17, 2023

mgunyho marked this pull request as draft August 17, 2023 16:18

kmuehlbauer marked this pull request as ready for review August 17, 2023 19:03

mgunyho marked this pull request as draft August 18, 2023 08:40

mgunyho force-pushed the report-dims-in-errors branch 2 times, most recently from acd1e31 to 96d8f68 Compare August 19, 2023 12:03

github-actions bot added the topic-rolling label Aug 19, 2023

mgunyho force-pushed the report-dims-in-errors branch from 96d8f68 to f149ee6 Compare August 19, 2023 12:13

mgunyho marked this pull request as ready for review August 19, 2023 13:18

mgunyho commented Aug 19, 2023

mgunyho commented Aug 19, 2023

mgunyho mentioned this pull request Aug 19, 2023

WIP: Factor out a function for checking dimension-related errors #8089

Draft

3 tasks

TomNicholas added the topic-error reporting label Aug 19, 2023

TomNicholas reviewed Aug 19, 2023

mgunyho force-pushed the report-dims-in-errors branch from 0835a3f to 96771c3 Compare August 26, 2023 18:43

mgunyho added 6 commits September 4, 2023 11:20

Show dims and coords in idxmin/idxmax error message if an invalid dim is given

e01bca0

Show data dims in error messages of Dataset and update tests

0ffc8bc

Remove _assert_empty, not used anymore

Update test for dataarray

dc6edcb

Show data dims in error messages of weighted and update test

d84ab2a

Show dimensions in error message of group_indexers_by_index

1388bf0

List coordinates in concat error message, update test

99ca40b

mgunyho added 6 commits September 4, 2023 11:24

List coordinates in coords __delitem__ error message, update tests

595c735

Show list of names in error message of PandasMultiIndex.sel, update test

a498ef5

Show list of dimensions in error messages of Rolling and Coarsen, update tests

67addd7

Show dims in Variable.concat error message as tuple for consistency

7fa5026

Change 'dataset' to 'data' in error messages

fdcad9a

Update whats-new

ee1ced1

mgunyho force-pushed the report-dims-in-errors branch from 96771c3 to ee1ced1 Compare September 4, 2023 08:30

Merge branch 'main' into report-dims-in-errors

5b6ebd2

dcherian added the plan to merge Final call for comments label Sep 8, 2023

dcherian changed the title from "Consistently report all data dimensions in error messages if invalid dimensions are given" to "Consistently report all dimensions in error messages if invalid dimensions are given" Sep 9, 2023

dcherian merged commit 0afbd45 into pydata:main Sep 9, 2023
25 of 28 checks passed
