Fix in vectorized item assignment #1746

Merged
merged 14 commits into from
Dec 9, 2017

Conversation

fujiisoup
Member

@fujiisoup fujiisoup commented Nov 29, 2017

Found bugs in nputils.NumpyVindexAdapter.__setitem__ and DataArray.__setitem__.

I will add more tests later.
Test case suggestions would be appreciated.

ind = Variable(['a'], [0, 1])
da[dict(x=ind, y=ind)] = 0
expected = DataArray([[0, 1], [1, 0], [1, 1]], dims=['x', 'y'])
self.assertDataArrayIdentical(expected, da)
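For context, the test exercises pointwise (vectorized) assignment: the paired indexers pick out the points (x=0, y=0) and (x=1, y=1), matching NumPy's fancy-index assignment semantics. A pure-NumPy sketch of the same semantics (the all-ones starting array is an assumption for illustration, since the `da` setup is elided above):

```python
import numpy as np

# Start from an all-ones (3, 2) array (assumed for illustration).
a = np.ones((3, 2))

# Pointwise (vectorized) assignment: zero out (0, 0) and (1, 1) only,
# not the outer product of the two index arrays.
a[np.array([0, 1]), np.array([0, 1])] = 0

print(a.tolist())  # [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
```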
Collaborator

For the future (no need to change this time), you can use assert_identical, from the newer testing tools

Member Author

I always forget this, probably because I usually base new tests on existing ones...
(I think many contributors do the same.)
I think we should update the entire test suite at some point (similar to #1741), rather than migrating gradually.

mixed_positions, vindex_positions = _advanced_indexer_subspaces(key)
self._array[key] = np.moveaxis(value, vindex_positions,
mixed_positions)
if np.isscalar(value):
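For context on the `np.moveaxis` call above: with mixed basic/advanced indexing, NumPy places the broadcast indexer subspace at the front of the result, so a value laid out in the array's own dimension order needs its axes moved before assignment. A minimal NumPy sketch (the shapes are assumptions for illustration):

```python
import numpy as np

a = np.zeros((3, 4, 5))
idx = (np.array([0, 1]), slice(None), np.array([0, 1]))

# Mixed advanced indexing: the indexed subspace (length 2) moves to the
# front, so a[idx] has shape (2, 4), not (4, 2).
print(a[idx].shape)  # (2, 4)

# A value laid out in dimension order, shape (4, 2), must be moved to
# subspace-first order before it can be assigned.
value = np.arange(8.0).reshape(4, 2)
a[idx] = np.moveaxis(value, -1, 0)
```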
Member

I would suggest using utils.is_scalar() and utils.to_0d_array for coercing a scalar value to a numpy array, then using the original code path here.

Member Author

I just noticed that we should support broadcasting here, rather than special-casing scalars.

value = Variable(dims[-value.ndim:], value).set_dims(dims).data
value = Variable(dims[-value.ndim:], value)
# broadcast to become assignable
value = value.set_dims(dims).data
Member Author

I decided to revert nputils.NumpyVindexAdapter to the master version and to do the broadcasting here instead.
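The broadcasting that `set_dims` performs here is analogous to NumPy's `broadcast_to`, which prepends the missing leading dimensions. A minimal NumPy sketch (the shapes and values are assumptions for illustration):

```python
import numpy as np

# Target has dims ('x', 'y') with shape (3, 2); the assigned value only
# covers the trailing 'y' dimension.
value = np.array([10.0, 20.0])

# Broadcast against the missing leading dimension, much as
# Variable.set_dims does by prepending the absent dims.
broadcast = np.broadcast_to(value, (3, 2))
print(broadcast.tolist())  # [[10.0, 20.0], [10.0, 20.0], [10.0, 20.0]]
```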

@fujiisoup fujiisoup changed the title [WIP] Fix in vectorized item assignment Fix in vectorized item assignment Nov 29, 2017
@fujiisoup
Member Author

Tests are failing due to #1738.
I think this is ready for review.

@@ -484,7 +484,9 @@ def __setitem__(self, key, value):
if isinstance(key, basestring):
self.coords[key] = value
else:
# xarray-style array indexing
# DataArray key -> Variable key
key = {k: v.variable if isinstance(v, DataArray) else v
Member

Do we want to enforce consistency for coordinates here? My inclination would be that we should support exactly the same keys in setitem as are valid in getitem. Ideally we should also reuse the same code. That means we should raise errors if there are multiple indexers with inconsistent alignment.

Member Author

> My inclination would be that we should support exactly the same keys in setitem as are valid in getitem.

Reasonable. I will add validation.
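A hedged sketch of what "the same keys in setitem as in getitem" means in practice: the same dict-style key with a DataArray indexer should both select and assign to the same points (the array contents below are assumptions for illustration, not taken from the PR):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros(3), dims='x')
ind = xr.DataArray([0, 1], dims='y')

# The same key is valid for both getitem and setitem.
sub = da[{'x': ind}]   # selects points 0 and 1 along 'x'
da[{'x': ind}] = 1     # assigns to the same points

print(sub.shape)            # (2,)
print(da.values.tolist())   # [1.0, 1.0, 0.0]
```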

if new_order:
value = duck_array_ops.asarray(value)
if not isinstance(value, Variable):
value = as_compatible_data(value)
if value.ndim > len(dims):
raise ValueError(
Member

Do we still need this special case error message now that we call set_dims below?

Member Author

I think value = Variable(dims[-value.ndim:], value) fails if value.ndim > len(dims).
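The failure mode comes from Python's slicing behavior: when `value.ndim` exceeds `len(dims)`, `dims[-value.ndim:]` does not raise but silently returns all of `dims`, so the `Variable` constructor would then pair too few dimension names with the array. A pure-Python sketch (the specific ndim is an assumption for illustration):

```python
dims = ('x', 'y')   # target has 2 dimensions
value_ndim = 3      # value has 3 dimensions (assumed for illustration)

# Negative slicing clamps rather than over-extending:
print(dims[-value_ndim:])  # ('x', 'y') -- only 2 names for 3 axes

# So Variable(dims[-value.ndim:], value) would mismatch dims and shape,
# hence the explicit ndim check and error message above.
```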

# Coordinates in key, value and self[key] should be consistent.
obj = self[key]
if isinstance(value, DataArray):
assert_coordinate_consistent(value, obj.coords)
Member

I was actually thinking of checking the consistency of coords on each DataArray argument in key instead. I guess we should probably check both!

# Coordinates in key, value and self[key] should be consistent.
obj = self[key]
if isinstance(value, DataArray):
assert_coordinate_consistent(value, obj.coords)
Member

I think if you use obj.coords.variables here we can skip the awkward getattr(coords[k], 'variable', coords[k]) above.

for k in obj.dims:
# make sure there are no conflict in dimension coordinates
if (k in coords and k in obj.coords):
coord = getattr(coords[k], 'variable', coords[k]) # Variable
Member

It would be better to insist that coords always has the same type (e.g., a dict of with Variable values).

Member Author

That's really reasonable. Updated.

@fujiisoup
Member Author

I will update the docs, since this is a backward-incompatible change.

Member

@shoyer shoyer left a comment

Do we also need to update assignment for .loc, or does this already handle it? It would be good to at least add tests. We can leave that for another PR if you prefer.

doc/indexing.rst Outdated
.. note::

If an indexer is a :py:class:`~xarray.DataArray`, its coordinates should not
conflict with the selected subpart of the target array.
Member

We should probably note the exception for explicitly indexed dimensions with .loc/.sel?

"""
for k in obj.dims:
# make sure there are no conflict in dimension coordinates
if (k in coords and k in obj.coords):
Member

Nit: you can drop the extra parentheses inside the if.

assert_coordinate_consistent(value, obj.coords.variables)
# DataArray key -> Variable key
key = {k: v.variable if isinstance(v, DataArray) else v
for k, v in self._item_key_to_dict(key).items()}
Member

Do we need to check that coordinates are consistent on the key?

Member Author

This is done a few lines above, in obj = self[key], where .isel checks the coordinates in the key.

But I am a little concerned about this otherwise-unused indexing operation, though I think this implementation is the simplest.

Member

Hmm. This could indeed be a significant performance hit. That said, I'm OK leaving this for now, with a TODO note to optimize it later.

@fujiisoup
Member Author

@shoyer Thanks for the review.

> Do we also need to update assignment for .loc, or does this already handle it?

Done.
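A hedged sketch of label-based vectorized assignment through `.loc`, with coordinate labels assumed for illustration (not taken from the PR):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros(3), dims='x', coords={'x': [10, 20, 30]})
ind = xr.DataArray([10, 20], dims='y')  # labels, not positions

# Label-based vectorized assignment: labels 10 and 20 are remapped to
# positions 0 and 1 before the positional assignment.
da.loc[{'x': ind}] = 1
print(da.values.tolist())  # [1.0, 1.0, 0.0]
```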

As I commented above, coordinate validation requires a (practically) unused indexing operation.
I am concerned about the efficiency of that indexing, though __setitem__ is probably used less frequently than __getitem__.

pos_indexers[k] = DataArray(pos_indexers[k],
coords=coords, dims=v.dims)
pos_indexers, new_indexes = remap_label_indexers(
self, method, tolerance, **indexers)
Member

Only indent by one additional level for continuation lines.

@shoyer
Member

shoyer commented Dec 8, 2017

Looks good to me. @fujiisoup please hit the big green button when you're ready!
