-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #3009: indexof1d
to handle null values
#3169
Closes #3009: indexof1d
to handle null values
#3169
Conversation
242ddf3
to
1ebebf2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks.
based on how |
This PR (closes Bears-R-Us#3009) refactors `indexof1d` to use `find` since they have similar functionality and `find` is fairly optimized and correctly handles null values (once we `dropna=False` to the `Groupby`). The two major difference is when there are how missing values are handled and how many indices get returned when there are duplicates in the search space. `find` would only return the index of the first occurrence and use `-1` to denote missing values, but `indexof1d` returns the indices of all occurrences and removes missing values. To enable this, I added the flags `all_occurrences` and `remove_missing` to `find` The approach I took involved adding a segmented `mink/maxk`, which I went back and forth on whether it should be user facing. I implemented this by permuting the values and calling the existing `mink/maxk`. I'm not sure if this is the most efficient approach, but my goal was focus on correctness first and we can optimize later if needed. Wrote tests for `indexof1d` both for the reproducer and in general.
1ebebf2
to
6d386b9
Compare
updated to add a |
I think this example is really good for understanding the broad strokes of what the changes to >>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
# find with defaults (all_occurrences and remove_missing both False)
>>> ak.find(arr1, arr2)
array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])
# set remove_missing to True, only difference from default
# is missing values are excluded
>>> ak.find(arr1, arr2, remove_missing=True)
array([0 1 2 5 8 5 11 5 0])
# set all_occurrences to True, the first index of each list
# is the first occurence and should match the default
>>> ak.find(arr1, arr2, all_occurrences=True).to_list()
[[-1],
[-1],
[-1],
[0, 4],
[1, 3, 10],
[-1],
[-1],
[-1],
[2, 6, 12, 13],
[-1],
[5, 7],
[-1],
[8, 9, 14],
[-1],
[5, 7],
[-1],
[-1],
[11, 15],
[5, 7],
[0, 4]]
# set both remove_missing and all_occurrences to True, missing values
# will be empty segments
>>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list()
[[],
[],
[],
[0, 4],
[1, 3, 10],
[],
[],
[],
[2, 6, 12, 13],
[],
[5, 7],
[],
[8, 9, 14],
[],
[5, 7],
[],
[],
[11, 15],
[5, 7],
[0, 4]] also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a tricky one. Thank you for doing it!
It think it looks good, although I didn't check the array-view case as closely as I should b/c I'm still not very familiar with that part of the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! indexof1d is behaving as expected from my testing. Neat trick flattening the segarray!
This PR (closes #3009) refactors
indexof1d
to usefind
since they have similar functionality andfind
is fairly optimized and correctly handles null values (once wedropna=False
to theGroupby
). The one major difference is when there are duplicates in the search space.find
will only return the index of the first occurrence, butindexof1d
returns the indices of all occurrences. To enable this, I added anall_occurrences
flag tofind
The approach I took for returning all occurrences involved adding a segmented
mink/maxk
, and I went back and forth on whether it should be user facing (see #3170). I implemented this by permuting the values and calling the existingmink/maxk
on each segment. It's unlikely that this is the most efficient approach, but my goal was to focus on correctness this go around and optimize later if needed.Wrote tests for
indexof1d
both for the reproducer and in general.I'm especially interested in feedback from @brandon-neth to make sure I haven't missed any intended functionality of
indexof1d
, since they're the original author.