-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sel along 1D non-index coordinates #3925
Conversation
Hello @TomNicholas! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
|
I think the functionality is useful, but I'm concerned it would make the API more confusing. In particular, one maxim we've generally held is that changing the number of dimensions doesn't change functionality. (ofc it changes whether it's possible to create an index, though the indexing API doesn't change) What are others' thoughts? |
Sorry, could you explain what you mean here? How would this PR violate that? |
Of course. I mean that if someone changes the dimensionality of a non-index coord from 1D to 2D, then running Whereas now, neither would work. Which in some ways is worse, but also less surprising. I think these are always difficult trade-offs... |
Gotcha, thanks. Hmm, this seems like a grey area... I think selecting along 1D non-dimension coords is probably way more common than along 2D coords, but I'm biased.
BTW I can immediately think of at least one other place where this rule is broken (#3774) - where going from 0D coords to 1D coords of length 1 changes whether |
if new_idx is not None: | ||
new_indexes[dim] = new_idx | ||
|
||
elif dim in data_obj.coords and len(data_obj.coords[dim] == 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be data_obj.coords[dim].ndim == 1
instead of len(data_obj.coords[dim] == 1)
? The later is checking that the coordinate's first dimension has length 1.
coords_dtype = data_obj.coords[dim].dtype | ||
label = maybe_cast_to_coords_dtype(label, coords_dtype) | ||
idxr, new_idx = convert_label_indexer(index, label, dim, | ||
method, tolerance) | ||
pos_indexers[dim] = idxr | ||
if new_idx is not None: | ||
new_indexes[dim] = new_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you refactor so this code block doesn't need to be repeated?
Every index is a 1D coordinate, so I think you could do this by switching around the order of these if
checks, e.g.,
if dim in data_obj.coords and data_obj.coords[dim].ndim == 1:
if dim in data_obj.indexes:
index = data_obj.indexes[dim]
else:
index = data_obj.coords[dim].to_index()
(dim,) = data_obj[dim].dims
coords_dtype = data_obj.coords[dim].dtype
label = maybe_cast_to_coords_dtype(label, coords_dtype)
idxr, new_idx = convert_label_indexer(index, label, dim,
method, tolerance)
pos_indexers[dim] = idxr
if new_idx is not None:
new_indexes[dim] = new_idx
I think this is generally a good idea! In the future, creating an
One minor concern I have here is about efficiency: building an Here's an microbenchmark that hopefully illustrates the issue:
pandas is 1000x faster than NumPy if the index is pre-existing, but 100x slower if the index is new. That's a 1e5 fold slow-down! I think users will appreciate the flexibility, but if there's some way we warn users that they really should set the index ahead of time when they are doing repeating indexing that could also be welcome. Figuring out how to save the state for counting the number of times a new index is created could be pretty messy, though. I guess we could stuff it into |
Related to my microbenchmark, it might also be worth considering pure NumPy versions of common indexing operations, to avoid the need to repeatedly create hash-tables. But that could be quite a bit of work to do comprehensively. |
OK! May be a sizeable change but I update on @shoyer 's view, let's do it. |
I think it's a good use case for some kind of |
In #5692 it is possible to perform selection using non-dimension coordinates with an index, although there's no easy way yet to set an index for such coordinates (this will be done in a follow-up PR by updating the API of |
FYI I probably am not going to work on this PR again - especially as I remember getting quite confused by the indexing internals, which have now changed. If we can just wait for this feature to be enabled via the indexing refactor then I would rather just do that. Perhaps I should close this? |
@benbovy, please could you give an example how it is possible? # sel based on a non-dim coordinate
# (using this coordinate directly .sel(product_id=26) would result in error "'no index found for coordinate product_id")
%timeit xds.sel(product=xds.product_id==26)
1.54 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# sel based on the dim itself
%timeit xds.sel(product='GN91 Glove Medium')
499 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit xds.where(xds.product_id==26, drop=True)
4.17 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
With #6971 on the way I guess we can close this. |
As a user, I find not being able to select along one-dimensional non-dimensional coordinates actually comes up fairly often. I think it's quite common to use multiple coordinates to be able to choose between plotting in different coordinate systems (or units) easily.
I've tried to close #2028 in the simplest (but also least efficient) way which was suggested by @shoyer (suggestion 1 here).
This should be temporary anyway: it will get superseded by the explicit indexes refactor. If there is another approach which would achieve the same functionality as this PR but actually bring us closer to #1603 then I would be happy to take a stab at that instead.
I don't really know what to do about the failing test in groupby arithmetic - I think it's caused here but I'm not sure what to replace the triple error type catching (?!) with.
isort -rc . && black . && mypy . && flake8
whats-new.rst
for all changes andapi.rst
for new API