Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More annotations #3177

Merged
merged 9 commits into from
Aug 6, 2019
Merged

More annotations #3177

merged 9 commits into from
Aug 6, 2019

Conversation

crusaderky
Copy link
Contributor

No description provided.

return self._drop_vars(labels, errors=errors)
else:
if utils.is_scalar(labels):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I fully understand the use case where dims is not None, where apparently you are allowed to feed into labels a whole DataArray. It also makes me nervous that there's a single unit test, not together with the other tests for Dataset.drop, that triggers the use case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this for the case you pass a single label; what's the case with passing a whole DataArray?

In [12]: da = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'], coords=dict(x=list('ab')))

In [13]: da
Out[13]:
<xarray.DataArray (x: 2, y: 3)>
array([[1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * x        (x) <U1 'a' 'b'
Dimensions without coordinates: y

In [14]: da.drop('a', dim='x')
Out[14]:
<xarray.DataArray (x: 1, y: 3)>
array([[1., 1., 1.]])
Coordinates:
  * x        (x) <U1 'b'
Dimensions without coordinates: y

Copy link
Contributor Author

@crusaderky crusaderky Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the problem. The test is expecting ds.y to be cast to a generic array-like by drop.

expected = 2 * ds.drop(ds.y, dim='y')

Copy link
Contributor Author

@crusaderky crusaderky Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact is that iter(DataArray) yields scalar DataArrays, not the elements of the DataArray (unlike iter(numpy.ndarray))

So the lines

            if isinstance(labels, str) or not isinstance(labels, Iterable):
                labels = {labels}
            else:
                labels = set(labels)

fall apart when labels is a DataArray.
To make it work I should change it to something like:

            if isinstance(labels, str) or not isinstance(labels, Iterable):
                labels = {labels}
            elif isinstance(labels, DataArray):
                labels = labels.values
            else:
                labels = set(labels)

The problem is that the logic above is EVERYWHERE. None of the many, many functions that expect "one or more dimension names" as an input cope (or are tested against) a DataArray input.

I lean towards removing the option entirely, and changing the unit test from

ds.drop(ds.y, dim='y')

to

ds.drop(ds.y.values, dim='y')

This is also because passing a DataArray is incredibly counter-intuitive; it took me a long time to figure out that ds.drop(ds.y, dim='y') is not trying to drop the y variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I wrote labels = labels.values and not labels = set(labels.values). That would cause a huge performance hit when values has many elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is due to the fact that the same method drop() performs two fundamentally different things - drop variables or elements from the variables - and in the latter case there are plenty of similar methods that DO accept a DataArray, e.g. sel(), where(), etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I cracked it - please review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be ideal to have separate functions for dropping variables rather than elements along an array dimension.

We are also considering refactoring this function over in #3128, I would be very interested in your thoughts!

@dcherian
Copy link
Contributor

dcherian commented Aug 2, 2019

Like a machine!

@max-sixty
Copy link
Collaborator

Like a machine!

Living up to the headshot; a rampage of typing...

Thanks @crusaderky !

@pep8speaks
Copy link

pep8speaks commented Aug 2, 2019

Hello @crusaderky! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-05 22:11:48 UTC

@crusaderky crusaderky changed the title WIP: More annotations More annotations Aug 2, 2019
@crusaderky
Copy link
Contributor Author

Ready for review and merge

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thanks for being so thorough.

I put added some small questions, e.g. around is_scalar. I also added a couple of comments we can elide for now.

Assuming we're good on is_scalar, I'll go ahead and merge.

# Don't cast to set, as it would harm performance when labels
# is a large numpy array
if utils.is_scalar(labels):
labels = [labels]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, no need to change now: could we remove the identical operation from the DataArray wrapper of this at https://github.com/pydata/xarray/pull/3177/files#diff-ffd3597671590bab245b3193affa62b8R1798 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -3338,8 +3385,10 @@ def drop_dims(self, drop_dims, *, errors='raise'):
if errors not in ['raise', 'ignore']:
raise ValueError('errors must be either "raise" or "ignore"')

if utils.is_scalar(drop_dims):
if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and is_scalar? I had thought is_scalar was attempting to centralize this logic in a single function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_scalar doesn't cope with non-string Hashables, e.g. enum.Enum. I'm uncomfortable touching it without going through the exercise of thoroughly reviewing everything that invokes it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, rereading it, it does cope with Enum. However, most functions that today use if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable): won't transparently cope with 0-dimensional numpy arrays or, even worse, 0-dimensional DataArrays.

...

# Drop index labels along dimension
@overload # noqa: F811
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! You're leading our understanding of python typing here...

@@ -3351,15 +3400,15 @@ def drop_dims(self, drop_dims, *, errors='raise'):
for d in v.dims if d in drop_dims)
return self._drop_vars(drop_vars)

def transpose(self, *dims):
def transpose(self, *dims: Hashable) -> 'Dataset':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One day (don't change now), we could have this as returning its class, for those who inherit from Dataset. And I'm not sure how to do that with python typing; in other languages it's -> Self

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this in Python by defining a type variable, T = TypeVar('T') and using it to annotate self:

    def transpose(self: T, *dims: Hashable) -> T:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible, yes, but IMHO detracts from readibility. The question here is, do people actually subclass from DataArray and Dataset? I've always found xarray rather hostile to subclassing, chiefly due to the heavy interaction between DataArray and Dataset.

dims = set(self.dims)
elif isinstance(dim, str) or not isinstance(dim, Iterable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here re this vs is_scalar

@@ -1904,9 +1904,9 @@ def test_drop_coordinates(self):
assert_identical(actual, expected)

with raises_regex(ValueError, 'cannot be found'):
arr.drop(None)
arr.drop('w')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if None is passed? My quick test suggest this test should still pass, is that right?

Copy link
Contributor Author

@crusaderky crusaderky Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None technically works, but it's not a valid use case. hash(None) works and isinstance(None, Hashable) returns True, but mypy will complain if you pass None where a Hashable is required, as None is treated as a special case. Also, xarray uses None everywhere to mark optional parameters, and in this case it asks the user to explicitly provide a value.

@@ -2000,6 +2000,12 @@ def test_drop_index_labels(self):
expected = data.isel(x=slice(0, 0))
assert_identical(expected, actual)

# DataArrays as labels are a nasty corner case as they are not
# Iterable[Hashable] - DataArray.__iter__ yields scalar DataArrays.
actual = data.drop(DataArray(['a', 'b', 'c']), 'x', errors='ignore')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, it's good this now works

@crusaderky
Copy link
Contributor Author

Ready for merge if everybody's happy with the current state

@max-sixty
Copy link
Collaborator

OK, let's merge and we can iterate if needed from here. Thanks as ever @crusaderky

@shoyer - do you have any thoughts on is_scalar from this thread?

@max-sixty max-sixty merged commit d1935ff into pydata:master Aug 6, 2019
dcherian added a commit to yohai/xarray that referenced this pull request Aug 7, 2019
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 8, 2019
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)
@crusaderky crusaderky deleted the annotations branch August 9, 2019 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants