Closes #3009: `indexof1d` to handle null values #3169

stress-tess · 2024-05-09T13:44:31Z

This PR (closes #3009) refactors indexof1d to use find since they have similar functionality and find is fairly optimized and correctly handles null values (once we dropna=False to the Groupby). The one major difference is when there are duplicates in the search space. find will only return the index of the first occurrence, but indexof1d returns the indices of all occurrences. To enable this, I added an all_occurrences flag to find

The approach I took for returning all occurrences involved adding a segmented mink/maxk, and I went back and forth on whether it should be user facing (see #3170). I implemented this by permuting the values and calling the existing mink/maxk on each segment. It's unlikely that this is the most efficient approach, but my goal was to focus on correctness this go around and optimize later if needed.

Wrote tests for indexof1d both for the reproducer and in general.

I'm especially interested in feedback from @brandon-neth to make sure I haven't missed any intended functionality of indexof1d, since they're the original author.

brandon-neth

Looks great! Thanks.

stress-tess · 2024-05-14T12:42:31Z

based on how indexof1d is used, see #3109 (comment), I think I should add a flag to find which toggles whether indices that aren't found return -1 or are excluded from the output. Either way, I think we should hold off merging this until #3109 is merged so I can make sure those tests still pass. For those reasons, I'm converting this to draft for the time being

This PR (closes Bears-R-Us#3009) refactors `indexof1d` to use `find` since they have similar functionality and `find` is fairly optimized and correctly handles null values (once we `dropna=False` to the `Groupby`). The two major difference is when there are how missing values are handled and how many indices get returned when there are duplicates in the search space. `find` would only return the index of the first occurrence and use `-1` to denote missing values, but `indexof1d` returns the indices of all occurrences and removes missing values. To enable this, I added the flags `all_occurrences` and `remove_missing` to `find` The approach I took involved adding a segmented `mink/maxk`, which I went back and forth on whether it should be user facing. I implemented this by permuting the values and calling the existing `mink/maxk`. I'm not sure if this is the most efficient approach, but my goal was focus on correctness first and we can optimize later if needed. Wrote tests for `indexof1d` both for the reproducer and in general.

stress-tess · 2024-05-17T17:30:07Z

updated to add a remove_missing flag to find. @brandon-neth, I think I changed enough to warrant a rereview if you don't mind

stress-tess · 2024-05-20T18:46:08Z

I think this example is really good for understanding the broad strokes of what the changes to find look like:

>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
# find with defaults (all_occurrences and remove_missing both False)
>>> ak.find(arr1, arr2)
array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])
 # set remove_missing to True, only difference from default
 # is missing values are excluded
 >>> ak.find(arr1, arr2, remove_missing=True)
array([0 1 2 5 8 5 11 5 0])
# set all_occurrences to True, the first index of each list
# is the first occurence and should match the default
>>> ak.find(arr1, arr2, all_occurrences=True).to_list()
[[-1],
 [-1],
 [-1],
 [0, 4],
 [1, 3, 10],
 [-1],
 [-1],
 [-1],
 [2, 6, 12, 13],
 [-1],
 [5, 7],
 [-1],
 [8, 9, 14],
 [-1],
 [5, 7],
 [-1],
 [-1],
 [11, 15],
 [5, 7],
 [0, 4]]
# set both remove_missing and all_occurrences to True, missing values
# will be empty segments
>>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list()
[[],
 [],
 [],
 [0, 4],
 [1, 3, 10],
 [],
 [],
 [],
 [2, 6, 12, 13],
 [],
 [5, 7],
 [],
 [8, 9, 14],
 [],
 [5, 7],
 [],
 [],
 [11, 15],
 [5, 7],
 [0, 4]]

also find was written by bill and was hard to understand even before this PR, so I recommend reading thru the comments and working through some examples and print out arrays as you go.

ajpotts

This was a tricky one. Thank you for doing it!

It think it looks good, although I didn't check the array-view case as closely as I should b/c I'm still not very familiar with that part of the API.

arkouda/alignment.py

arkouda/pdarraysetops.py

brandon-neth

Looks good to me! indexof1d is behaving as expected from my testing. Neat trick flattening the segarray!

stress-tess requested review from ajpotts, brandon-neth, jaketrookman and drculhane May 9, 2024 13:44

stress-tess marked this pull request as draft May 9, 2024 13:47

stress-tess force-pushed the 3009_indexof1d_null branch 2 times, most recently from 242ddf3 to 1ebebf2 Compare May 10, 2024 18:37

stress-tess marked this pull request as ready for review May 10, 2024 18:49

stress-tess mentioned this pull request May 10, 2024

Fixes #3001: nan broadcast bug #3173

Merged

brandon-neth approved these changes May 13, 2024

View reviewed changes

stress-tess mentioned this pull request May 14, 2024

Dataframe Indexing #3109

Merged

stress-tess marked this pull request as draft May 14, 2024 12:37

stress-tess force-pushed the 3009_indexof1d_null branch from 1ebebf2 to 6d386b9 Compare May 17, 2024 17:24

stress-tess marked this pull request as ready for review May 17, 2024 17:28

stress-tess requested a review from brandon-neth May 17, 2024 17:28

ajpotts approved these changes May 20, 2024

View reviewed changes

arkouda/alignment.py Outdated Show resolved Hide resolved

arkouda/alignment.py Show resolved Hide resolved

arkouda/pdarraysetops.py Show resolved Hide resolved

stress-tess and others added 2 commits May 20, 2024 17:13

update and add examples in response to PR feedback

c3d8969

Merge branch 'master' into 3009_indexof1d_null

6ce5c2e

brandon-neth approved these changes May 22, 2024

View reviewed changes

stress-tess merged commit c05c599 into Bears-R-Us:master May 22, 2024
22 checks passed

stress-tess deleted the 3009_indexof1d_null branch May 22, 2024 17:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closes #3009: `indexof1d` to handle null values #3169

Closes #3009: `indexof1d` to handle null values #3169

stress-tess commented May 9, 2024 •

edited

Loading

brandon-neth left a comment

stress-tess commented May 14, 2024 •

edited

Loading

stress-tess commented May 17, 2024

stress-tess commented May 20, 2024 •

edited

Loading

ajpotts left a comment

brandon-neth left a comment

Closes #3009: indexof1d to handle null values #3169

Closes #3009: indexof1d to handle null values #3169

Conversation

stress-tess commented May 9, 2024 • edited Loading

brandon-neth left a comment

Choose a reason for hiding this comment

stress-tess commented May 14, 2024 • edited Loading

stress-tess commented May 17, 2024

stress-tess commented May 20, 2024 • edited Loading

ajpotts left a comment

Choose a reason for hiding this comment

brandon-neth left a comment

Choose a reason for hiding this comment

Closes #3009: `indexof1d` to handle null values #3169

Closes #3009: `indexof1d` to handle null values #3169

stress-tess commented May 9, 2024 •

edited

Loading

stress-tess commented May 14, 2024 •

edited

Loading

stress-tess commented May 20, 2024 •

edited

Loading