More annotations #3177
Conversation
            return self._drop_vars(labels, errors=errors)
        else:
            if utils.is_scalar(labels):
I don't think I fully understand the use case where dims is not None, where apparently you are allowed to feed a whole DataArray into labels. It also makes me nervous that there's a single unit test, separate from the other tests for Dataset.drop, that triggers the use case.
Isn't this for the case where you pass a single label? What's the case with passing a whole DataArray?
In [12]: da = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'], coords=dict(x=list('ab')))

In [13]: da
Out[13]:
<xarray.DataArray (x: 2, y: 3)>
array([[1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [14]: da.drop('a', dim='x')
Out[14]:
<xarray.DataArray (x: 1, y: 3)>
array([[1., 1., 1.]])
Coordinates:
  * x        (x) <U1 'b'
Dimensions without coordinates: y
This is the problem. The test is expecting ds.y to be cast to a generic array-like by drop.
xarray/xarray/tests/test_dataset.py, line 4245 in dba85bd:

    expected = 2 * ds.drop(ds.y, dim='y')
The fact is that iter(DataArray) yields scalar DataArrays, not the elements of the DataArray (unlike iter(numpy.ndarray)).
So the lines

    if isinstance(labels, str) or not isinstance(labels, Iterable):
        labels = {labels}
    else:
        labels = set(labels)

fall apart when labels is a DataArray.
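To make the failure mode concrete, here is a minimal sketch (not part of the PR, written just for illustration):

```python
import xarray as xr

labels = xr.DataArray(['a', 'b', 'c'], dims='x')

# Iterating a DataArray yields 0-dimensional DataArrays...
print([type(item).__name__ for item in labels])
# ['DataArray', 'DataArray', 'DataArray']

# ...whereas iterating the underlying numpy array yields plain scalars,
# which is what set(labels) and the subsequent label comparisons expect.
print([type(item).__name__ for item in labels.values])
# ['str_', 'str_', 'str_']
```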
To make it work I should change it to something like:

    if isinstance(labels, str) or not isinstance(labels, Iterable):
        labels = {labels}
    elif isinstance(labels, DataArray):
        labels = labels.values
    else:
        labels = set(labels)
The problem is that the logic above appears EVERYWHERE. None of the many, many functions that expect "one or more dimension names" as input cope with (or are tested against) a DataArray input.
I lean towards removing the option entirely, and changing the unit test from

    ds.drop(ds.y, dim='y')

to

    ds.drop(ds.y.values, dim='y')

This is also because passing a DataArray is incredibly counter-intuitive; it took me a long time to figure out that ds.drop(ds.y, dim='y') is not trying to drop the y variable.
Note that I wrote labels = labels.values and not labels = set(labels.values). That would cause a huge performance hit when values has many elements.
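A rough illustration of that cost (not from the PR; the numbers are machine-dependent):

```python
import numpy as np
from timeit import timeit

values = np.arange(2_000_000)

# set() boxes every numpy scalar into a Python object and hashes it,
# which adds substantial per-element overhead...
print(timeit(lambda: set(values), number=3))

# ...whereas keeping the numpy array as-is costs essentially nothing,
# and downstream index lookups can stay vectorised.
print(timeit(lambda: np.asarray(values), number=3))
```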
I think the problem is due to the fact that the same method drop() performs two fundamentally different things - dropping variables, or dropping elements from the variables - and in the latter case there are plenty of similar methods that DO accept a DataArray, e.g. sel(), where(), etc.
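For context, a small sketch of the two behaviours, assuming the drop() API as it existed at the time of this PR:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temp": (("x",), np.arange(3.0))},
    coords={"x": ["a", "b", "c"]},
)

# 1. Drop a whole variable by name (dim is None):
print(ds.drop("temp"))

# 2. Drop individual index labels along a dimension:
print(ds.drop(["a", "b"], dim="x"))
```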
I think I cracked it - please review
Yes, it would be ideal to have separate functions for dropping variables rather than elements along an array dimension.
We are also considering refactoring this function over in #3128; I would be very interested in your thoughts!
Like a machine!
Living up to the headshot; a rampage of typing... Thanks @crusaderky!
Hello @crusaderky! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-08-05 22:11:48 UTC
Ready for review and merge
Great! Thanks for being so thorough.
I added some small questions, e.g. around is_scalar. I also added a couple of comments we can elide for now.
Assuming we're good on is_scalar, I'll go ahead and merge.
        # Don't cast to set, as it would harm performance when labels
        # is a large numpy array
        if utils.is_scalar(labels):
            labels = [labels]
Minor, no need to change now: could we remove the identical operation from the DataArray wrapper of this at https://github.com/pydata/xarray/pull/3177/files#diff-ffd3597671590bab245b3193affa62b8R1798 ?
Done
@@ -3338,8 +3385,10 @@ def drop_dims(self, drop_dims, *, errors='raise'):
         if errors not in ['raise', 'ignore']:
             raise ValueError('errors must be either "raise" or "ignore"')

-        if utils.is_scalar(drop_dims):
+        if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable):
What's the difference between this and is_scalar? I had thought is_scalar was attempting to centralize this logic in a single function.
is_scalar doesn't cope with non-string Hashables, e.g. enum.Enum. I'm uncomfortable touching it without going through the exercise of thoroughly reviewing everything that invokes it.
Actually, rereading it, it does cope with Enum. However, most functions that today use if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable): won't transparently cope with 0-dimensional numpy arrays or, even worse, 0-dimensional DataArrays.
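A minimal sketch (not from the PR) of why the isinstance-based check misclassifies 0-dimensional arrays:

```python
from collections.abc import Iterable

import numpy as np

zero_d = np.array(42)

# ndarray defines __iter__, so the isinstance test treats a 0-d array as
# iterable (the expression below evaluates to False, i.e. "not a scalar")...
print(isinstance(zero_d, str) or not isinstance(zero_d, Iterable))

# ...but actually iterating a 0-d array raises, so any follow-up code that
# builds set(labels) or loops over the labels will blow up.
try:
    list(zero_d)
except TypeError as exc:
    print(exc)  # "iteration over a 0-d array"
```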
...

    # Drop index labels along dimension
    @overload  # noqa: F811
Great! You're leading our understanding of python typing here...
@@ -3351,15 +3400,15 @@ def drop_dims(self, drop_dims, *, errors='raise'):
                           for d in v.dims if d in drop_dims)
         return self._drop_vars(drop_vars)

-    def transpose(self, *dims):
+    def transpose(self, *dims: Hashable) -> 'Dataset':
One day (don't change now), we could have this as returning its class, for those who inherit from Dataset. And I'm not sure how to do that with Python typing; in other languages it's -> Self.
You can do this in Python by defining a type variable, T = TypeVar('T'), and using it to annotate self:

    def transpose(self: T, *dims: Hashable) -> T:
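For example, a self-contained sketch of that pattern (hypothetical class names, not from this PR):

```python
from typing import Hashable, TypeVar

T = TypeVar("T", bound="Base")

class Base:
    def transpose(self: T, *dims: Hashable) -> T:
        # a real implementation would reorder dimensions here
        return self

class Derived(Base):
    pass

# mypy infers the return type as Derived, not Base
d: Derived = Derived().transpose("x", "y")
```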
It is possible, yes, but IMHO it detracts from readability. The question here is, do people actually subclass DataArray and Dataset? I've always found xarray rather hostile to subclassing, chiefly due to the heavy interaction between DataArray and Dataset.
    dims = set(self.dims)
elif isinstance(dim, str) or not isinstance(dim, Iterable):
Same here re this vs is_scalar
@@ -1904,9 +1904,9 @@ def test_drop_coordinates(self):
         assert_identical(actual, expected)

         with raises_regex(ValueError, 'cannot be found'):
-            arr.drop(None)
+            arr.drop('w')
What happens if None is passed? My quick test suggests this test should still pass; is that right?
None technically works, but it's not a valid use case. hash(None) works and isinstance(None, Hashable) returns True, but mypy will complain if you pass None where a Hashable is required, as None is treated as a special case. Also, xarray uses None everywhere to mark optional parameters, and in this case it asks the user to explicitly provide a value.
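A quick runtime check of the two facts mentioned above (the mypy behaviour itself only shows up when running mypy on annotated code):

```python
from collections.abc import Hashable

print(hash(None))                  # succeeds: None is hashable at runtime
print(isinstance(None, Hashable))  # True
```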
@@ -2000,6 +2000,12 @@ def test_drop_index_labels(self):
         expected = data.isel(x=slice(0, 0))
         assert_identical(expected, actual)

+        # DataArrays as labels are a nasty corner case as they are not
+        # Iterable[Hashable] - DataArray.__iter__ yields scalar DataArrays.
+        actual = data.drop(DataArray(['a', 'b', 'c']), 'x', errors='ignore')
Great, it's good this now works
Ready for merge if everybody's happy with the current state
OK, let's merge and we can iterate if needed from here. Thanks as ever @crusaderky. @shoyer - do you have any thoughts on
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)