Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #3009: indexof1d to handle null values #3169

Merged
merged 3 commits into from
May 22, 2024

Conversation

stress-tess
Copy link
Member

@stress-tess stress-tess commented May 9, 2024

This PR (closes #3009) refactors indexof1d to use find since they have similar functionality and find is fairly optimized and correctly handles null values (once we dropna=False to the Groupby). The one major difference is when there are duplicates in the search space. find will only return the index of the first occurrence, but indexof1d returns the indices of all occurrences. To enable this, I added an all_occurrences flag to find

The approach I took for returning all occurrences involved adding a segmented mink/maxk, and I went back and forth on whether it should be user facing (see #3170). I implemented this by permuting the values and calling the existing mink/maxk on each segment. It's unlikely that this is the most efficient approach, but my goal was to focus on correctness this go around and optimize later if needed.

Wrote tests for indexof1d both for the reproducer and in general.

I'm especially interested in feedback from @brandon-neth to make sure I haven't missed any intended functionality of indexof1d, since they're the original author.

@stress-tess stress-tess marked this pull request as draft May 9, 2024 13:47
@stress-tess stress-tess force-pushed the 3009_indexof1d_null branch 2 times, most recently from 242ddf3 to 1ebebf2 Compare May 10, 2024 18:37
@stress-tess stress-tess marked this pull request as ready for review May 10, 2024 18:49
Copy link
Collaborator

@brandon-neth brandon-neth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks.

@stress-tess stress-tess mentioned this pull request May 14, 2024
@stress-tess stress-tess marked this pull request as draft May 14, 2024 12:37
@stress-tess stress-tess marked this pull request as draft May 14, 2024 12:37
@stress-tess
Copy link
Member Author

stress-tess commented May 14, 2024

based on how indexof1d is used, see #3109 (comment), I think I should add a flag to find which toggles whether indices that aren't found return -1 or are excluded from the output. Either way, I think we should hold off merging this until #3109 is merged so I can make sure those tests still pass. For those reasons, I'm converting this to draft for the time being

This PR (closes Bears-R-Us#3009) refactors `indexof1d` to use `find` since they have similar functionality and `find` is fairly optimized and correctly handles null values (once we `dropna=False` to the `Groupby`). The two major difference is when there are how missing values are handled and how many indices get returned when there are duplicates in the search space. `find` would only return the index of the first occurrence and use `-1` to denote missing values, but `indexof1d` returns the indices of all occurrences and removes missing values. To enable this, I added the flags `all_occurrences` and `remove_missing` to `find`

The approach I took involved adding a segmented `mink/maxk`, which I went back and forth on whether it should be user facing. I implemented this by permuting the values and calling the existing `mink/maxk`. I'm not sure if this is the most efficient approach, but my goal was focus on correctness first and we can optimize later if needed.

Wrote tests for `indexof1d` both for the reproducer and in general.
@stress-tess stress-tess force-pushed the 3009_indexof1d_null branch from 1ebebf2 to 6d386b9 Compare May 17, 2024 17:24
@stress-tess stress-tess marked this pull request as ready for review May 17, 2024 17:28
@stress-tess stress-tess requested a review from brandon-neth May 17, 2024 17:28
@stress-tess
Copy link
Member Author

updated to add a remove_missing flag to find. @brandon-neth, I think I changed enough to warrant a rereview if you don't mind

@stress-tess
Copy link
Member Author

stress-tess commented May 20, 2024

I think this example is really good for understanding the broad strokes of what the changes to find look like:

>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
# find with defaults (all_occurrences and remove_missing both False)
>>> ak.find(arr1, arr2)
array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])
 # set remove_missing to True, only difference from default
 # is missing values are excluded
 >>> ak.find(arr1, arr2, remove_missing=True)
array([0 1 2 5 8 5 11 5 0])
# set all_occurrences to True, the first index of each list
# is the first occurence and should match the default
>>> ak.find(arr1, arr2, all_occurrences=True).to_list()
[[-1],
 [-1],
 [-1],
 [0, 4],
 [1, 3, 10],
 [-1],
 [-1],
 [-1],
 [2, 6, 12, 13],
 [-1],
 [5, 7],
 [-1],
 [8, 9, 14],
 [-1],
 [5, 7],
 [-1],
 [-1],
 [11, 15],
 [5, 7],
 [0, 4]]
# set both remove_missing and all_occurrences to True, missing values
# will be empty segments
>>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list()
[[],
 [],
 [],
 [0, 4],
 [1, 3, 10],
 [],
 [],
 [],
 [2, 6, 12, 13],
 [],
 [5, 7],
 [],
 [8, 9, 14],
 [],
 [5, 7],
 [],
 [],
 [11, 15],
 [5, 7],
 [0, 4]]

also find was written by bill and was hard to understand even before this PR, so I recommend reading thru the comments and working through some examples and print out arrays as you go.

Copy link
Contributor

@ajpotts ajpotts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a tricky one. Thank you for doing it!

It think it looks good, although I didn't check the array-view case as closely as I should b/c I'm still not very familiar with that part of the API.

arkouda/alignment.py Outdated Show resolved Hide resolved
arkouda/alignment.py Show resolved Hide resolved
arkouda/pdarraysetops.py Show resolved Hide resolved
Copy link
Collaborator

@brandon-neth brandon-neth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! indexof1d is behaving as expected from my testing. Neat trick flattening the segarray!

@stress-tess stress-tess merged commit c05c599 into Bears-R-Us:master May 22, 2024
22 checks passed
@stress-tess stress-tess deleted the 3009_indexof1d_null branch May 22, 2024 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

indexof1d to handle null values
3 participants