
Optimizations for GroupBy on Strings #1776

Closed
reuster986 opened this issue Sep 12, 2022 · 0 comments · Fixed by #1851
@reuster986 (Collaborator)

The current GroupBy logic for strings is optimized for grouping a single array of long (or variable-length) strings. However, there are at least two other cases that could benefit from different logic:

  • A single array of short strings. Currently, all strings are hashed to 128-bit values, which are then sorted to achieve the grouping. However, if the maximum string length is 16 bytes or less, then directly sorting the strings would use the same number of radix digits as sorting the hashes, or fewer, and the hashes would never need to be computed. In practice, it might even make sense to set the length threshold a little higher, e.g. directly sort strings when the max length is 20 bytes or less.
  • Multiple arrays, at least one of which is strings. Currently, when grouping by multiple arrays, arkouda first hashes any Strings arrays (via `Strings._get_grouping_keys()`) so that all arrays are represented by integers, and then hashes all of the integer representations (inside `UniqueMsg.chpl`), accumulating them into a single array of 128-bit hashes for sorting. In this situation, strings are effectively hashed twice, which is wasteful. I think we can skip the first hash and rework `UniqueMsg` to hash each Strings array only once.
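A minimal Python sketch (not arkouda code) of the short-string case: a string of at most 16 bytes can be left-justified into a 128-bit integer whose ordering matches the lexicographic ordering of the bytes, so sorting it needs no more radix digits than sorting a 128-bit hash while skipping the hash computation entirely. The helper names are hypothetical, and MD5 stands in for whichever 128-bit hash the server actually uses:

```python
import hashlib

def pack16(s: bytes) -> int:
    # Hypothetical helper: left-justify a <=16-byte string into a 128-bit
    # integer so that integer order matches byte-wise (lexicographic) order.
    assert len(s) <= 16
    return int.from_bytes(s.ljust(16, b"\x00"), "big")

def hash128(s: bytes) -> int:
    # 128-bit stand-in hash, as GroupBy currently uses for long strings.
    return int.from_bytes(hashlib.md5(s).digest(), "big")

words = [b"pear", b"apple", b"fig", b"apple"]
packed = [pack16(w) for w in words]

# Same 128-bit width either way, but packing preserves sort order
# (so grouping is correct) and avoids computing any hash at all.
assert sorted(words) == [w for _, w in sorted(zip(packed, words))]
assert all(hash128(w).bit_length() <= 128 for w in words)
```

The zero-padding on the right is what keeps prefixes ordered before their extensions (`b"a"` before `b"ab"`), matching a direct byte-wise sort.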
@reuster986 reuster986 added the performance Performance needs improving label Sep 12, 2022
@reuster986 reuster986 self-assigned this Sep 12, 2022
@stress-tess stress-tess self-assigned this Oct 12, 2022
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Oct 21, 2022
This PR (closes Bears-R-Us#1847):
- Adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in Bears-R-Us#1776
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Oct 21, 2022
This PR (Closes Bears-R-Us#1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This is hopefully more efficient than uniquing/sorting the hashed arrays (since 16 bytes <= 128 bits from the hash, and we don't have to compute the hash)

The performance of these changes will be tracked by the `String Groupby Performance` benchmark and the benchmark from Issue Bears-R-Us#1847. This is an initial implementation; next steps will be moving the concat into a uint logic to take advantage of `computeOnSegments`. I will also combine the logic in `assumeSortedShortcut` and `uniqueAndCount` in the next iteration.
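The byte-packing this PR describes can be sketched in NumPy (hypothetical helper name; the actual implementation lives in the Chapel server): each string of max length < 16 bytes is split into a high and a low uint64, and the pair is then grouped with ordinary numeric sort logic:

```python
import numpy as np

def pack_to_two_uints(strings):
    # Hypothetical sketch of the packing: concatenate each string's bytes
    # (max length <= 16) into a pair of uint64 values, so the regular
    # numeric GroupBy/unique logic can be applied to them.
    hi = np.zeros(len(strings), dtype=np.uint64)
    lo = np.zeros(len(strings), dtype=np.uint64)
    for i, s in enumerate(strings):
        b = s.encode().ljust(16, b"\x00")
        hi[i] = int.from_bytes(b[:8], "big")
        lo[i] = int.from_bytes(b[8:], "big")
    return hi, lo

words = ["pear", "apple", "fig", "apple"]
hi, lo = pack_to_two_uints(words)

# np.lexsort treats the LAST key as primary, so (lo, hi) sorts by hi
# first -- the same order a direct sort of the raw strings would give.
order = np.lexsort((lo, hi))
assert [words[i] for i in order] == ["apple", "apple", "fig", "pear"]
```

Because 16 packed bytes occupy no more bits than the 128-bit hash did, the radix sort does the same amount of work at most, and the hash computation disappears.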
Ethan-DeBandi99 pushed a commit that referenced this issue Oct 21, 2022
This PR (closes #1847):
- Adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in #1776

Co-authored-by: Pierce Hayes <[email protected]>
joshmarshall1 pushed a commit that referenced this issue Oct 26, 2022
This PR (Closes #1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This is hopefully more efficient than uniquing/sorting the hashed arrays (since 16 bytes <= 128 bits from the hash, and we don't have to compute the hash)
- Combines the logic in `assumeSortedShortcut` and `uniqueAndCount`

The performance of these changes will be tracked by the `String Groupby Performance` and `Small String GroupBy Performance` benchmarks

Co-authored-by: Pierce Hayes <[email protected]>
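The first bullet in the merged PR (removing the duplicate string hash) can be illustrated in plain Python; `hashlib.md5` stands in for the server's 128-bit hash and none of these names are arkouda APIs. Rather than hashing each Strings array up front and then hashing the integer representations again, every key column is fed into a single 128-bit digest per row:

```python
import hashlib

def row_key(row):
    # Hash all key columns (strings or ints) once per row into one
    # 128-bit digest -- the "hash each strings array only once" idea.
    h = hashlib.md5()
    for v in row:
        b = v.encode() if isinstance(v, str) else str(v).encode()
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(b).to_bytes(8, "big") + b)
    return h.digest()

rows = list(zip(["a", "b", "a"], [1, 2, 1]))
keys = [row_key(r) for r in rows]

# Equal rows collide into the same group; distinct rows (almost
# surely) get distinct 128-bit keys.
assert keys[0] == keys[2] and keys[0] != keys[1]
```

Sorting these per-row digests then yields the grouping with one hash pass over the string data instead of two.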