-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently report all dimensions in error messages if invalid dimensions are given #8079
Conversation
This is great, thank you @mgunyho ! |
Not sure, but what to do if we have say tens or even hundreds of dimensions? Maybe that's not the majority of use cases but we should be prepared. BTW, I've marked this as ready for review by accident, sorry. |
The maximum number of dimensions for a numpy array is 32, and seems like in the near future it's going to be increased to 64 at most: numpy/numpy#5744. The same limit seems to apply for dask. But okay, a dataset where each variable has a different coordinate can have lots of dimensions, like #5546. Although there it is also mentioned that in reality usually the number of dimensions is on the order of 10. IMO a 32-item list in the error message is a bit ugly but still acceptable. If a use case comes up where this is a problem, we could add logic to limit the number of items shown in the list (this would be a good reason to factor out a function). I'll fix the tests soon and then mark this as ready for review for real. |
@mgunyho I totally agree with your reasoning, but I just wanted to mention it as a possible problem source. Thanks for taking care! |
acd1e31
to
96d8f68
Compare
96d8f68
to
f149ee6
Compare
I now went through all relevant ValueErrors and KeyErrors and updated the error messages where applicable. I left out a couple of instances involving For Dataset, Note that |
xarray/tests/test_dataarray.py
Outdated
with pytest.raises( | ||
ValueError, | ||
match=re.escape( | ||
"Dimensions ('space',) not found in dataset dimensions ('time',)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now says "dataset" even though it's a DataArray, because DataArray.drop_duplicates
just does self._to_temp_dataset().drop_duplicates(dim, keep=keep)
. Maybe the error message could say "data dimensions"? Should I change it everywhere to say just data instead of dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think not mentioning Dataset
would be better. There might be an example somewhere else of handling this same type of ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall seeing an error message saying "data dimensions", and I found this: https://github.com/pydata/xarray/blob/main/xarray/core/variable.py#L677 (seems pretty closely related to this PR and #8089). I can change the wording to say just "data dimensions", I think it's a good way to put it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is "{self.__class__.__name__} dimensions"
, which is used in a couple of places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it now to "data dimensions".
Also, we should probably add the "error-reporting" label to this PR, I can't seem to be able to do it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would maybe prefer list
Given that we are not trying to communicate anything about the type (i.e. regardless of the error raised the dimensions are not stored as a list internally) then I think you can print them however you feel is neatest.
"some variables in data_vars are not data variables " | ||
f"on the first dataset: {invalid_vars}" | ||
f"the variables {invalid_vars} in data_vars are not " | ||
f"found in the data variables of the first dataset" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"found in the data variables of the first dataset" | |
f"found in the data variables of the first dataset {valid_vars}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I explicitly left out listing the data variables, because it's more likely to have many of them like in #5546 (see also the comment in the code right above this)
Can you @Illviljan say why you preferred tuple over list here #7821 (comment) ? |
I don't remember that suggestion but maybe here's the reason:
|
0835a3f
to
96771c3
Compare
I suppose here this doesn't make much of a difference, since the number of items is fairly small.
These are fair points. I've left it as |
Remove _assert_empty, not used anymore
96771c3
to
ee1ced1
Compare
Thanks @mgunyho this is a great improvement. Thank you! |
I think we should harmonize all these to |
Hello,
I noticed that
arr.min("nonexistent")
raises an error with a very helpful messagewhile
arr.idxmin("nonexistent")
raisesIMO, the list of dimensions should always be shown in the error message for these kinds of errors, it makes debugging much easier. With this PR, I have implemented this behavior for all such functions that I could find.
There is quite a consistent pattern which I think could be factored out into a function, but I didn't have a clear enough picture of the structure of the whole code to do it.
I didn't fix the tests yet, I'll do it if you think this can be merged.
whats-new.rst