
Optimizations for GroupBy on Strings #1776

Closed
reuster986 opened this issue Sep 12, 2022 · 0 comments · Fixed by #1851
@reuster986 (Collaborator)

The current GroupBy logic for strings is optimized for grouping a single array of long (or variable-length) strings. However, there are at least two other cases that could benefit from different logic:

  • A single array of short strings. Currently, all strings are hashed to 128-bit values, which are then sorted to achieve the grouping. However, if the maximum string length is 16 bytes or less, then directly sorting the strings would use the same number of radix digits as sorting the hashes, or fewer, and the hashes would never need to be computed. In practice, it might even make sense to set the length threshold a little higher, e.g. directly sort strings when the max length is 20 bytes or less.
  • Multiple arrays, at least one of which is strings. Currently, when grouping by multiple arrays, arkouda first hashes any Strings arrays (via `Strings._get_grouping_keys()`) so that all arrays are represented by integers, and then hashes all of the integer representations (inside `UniqueMsg.chpl`), accumulating them into a single array of 128-bit hashes for sorting. In this situation, strings are effectively hashed twice, which is wasteful. I think we can skip the first hash and rework `UniqueMsg` to hash each Strings array only once.
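A minimal Python sketch (not arkouda code) of the short-string case: a string of at most 16 bytes can be left-justified into a 128-bit integer whose ordering matches the lexicographic ordering of the bytes, so sorting it needs no more radix digits than sorting a 128-bit hash while skipping the hash computation entirely. The helper names are hypothetical, and MD5 stands in for whichever 128-bit hash the server actually uses:

```python
import hashlib

def pack16(s: bytes) -> int:
    # Hypothetical helper: left-justify a <=16-byte string into a 128-bit
    # integer so that integer order matches byte-wise (lexicographic) order.
    assert len(s) <= 16
    return int.from_bytes(s.ljust(16, b"\x00"), "big")

def hash128(s: bytes) -> int:
    # 128-bit stand-in hash, as GroupBy currently uses for long strings.
    return int.from_bytes(hashlib.md5(s).digest(), "big")

words = [b"pear", b"apple", b"fig", b"apple"]
packed = [pack16(w) for w in words]

# Same 128-bit width either way, but packing preserves sort order
# (so grouping is correct) and avoids computing any hash at all.
assert sorted(words) == [w for _, w in sorted(zip(packed, words))]
assert all(hash128(w).bit_length() <= 128 for w in words)
```

The zero-padding on the right is what keeps prefixes ordered before their extensions (`b"a"` before `b"ab"`), matching a direct byte-wise sort.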
@reuster986 reuster986 added the performance Performance needs improving label Sep 12, 2022
@reuster986 reuster986 self-assigned this Sep 12, 2022
@stress-tess stress-tess self-assigned this Oct 12, 2022
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Oct 21, 2022
This PR (closes Bears-R-Us#1847):
- Adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in Bears-R-Us#1776
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Oct 21, 2022
This PR (Closes Bears-R-Us#1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This is hopefully more efficient than uniquing/sorting the hashed arrays (since 16 bytes <= 128 bits from the hash, and we don't have to compute the hash)

The performance of these changes will be tracked by the `String Groupby Performance` benchmark and the benchmark from Issue Bears-R-Us#1847. This is an initial implementation; next steps will be moving the concat into a uint logic to take advantage of `computeOnSegments`. I will also combine the logic in `assumeSortedShortcut` and `uniqueAndCount` in the next iteration.
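The byte-packing this PR describes can be sketched in NumPy (hypothetical helper name; the actual implementation lives in the Chapel server): each string of max length < 16 bytes is split into a high and a low uint64, and the pair is then grouped with ordinary numeric sort logic:

```python
import numpy as np

def pack_to_two_uints(strings):
    # Hypothetical sketch of the packing: concatenate each string's bytes
    # (max length <= 16) into a pair of uint64 values, so the regular
    # numeric GroupBy/unique logic can be applied to them.
    hi = np.zeros(len(strings), dtype=np.uint64)
    lo = np.zeros(len(strings), dtype=np.uint64)
    for i, s in enumerate(strings):
        b = s.encode().ljust(16, b"\x00")
        hi[i] = int.from_bytes(b[:8], "big")
        lo[i] = int.from_bytes(b[8:], "big")
    return hi, lo

words = ["pear", "apple", "fig", "apple"]
hi, lo = pack_to_two_uints(words)

# np.lexsort treats the LAST key as primary, so (lo, hi) sorts by hi
# first -- the same order a direct sort of the raw strings would give.
order = np.lexsort((lo, hi))
assert [words[i] for i in order] == ["apple", "apple", "fig", "pear"]
```

Because 16 packed bytes occupy no more bits than the 128-bit hash did, the radix sort does the same amount of work at most, and the hash computation disappears.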
Ethan-DeBandi99 pushed a commit that referenced this issue Oct 21, 2022
This PR (closes #1847):
- Adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in #1776

Co-authored-by: Pierce Hayes <[email protected]>
joshmarshall1 pushed a commit that referenced this issue Oct 26, 2022
This PR (Closes #1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This is hopefully more efficient than uniquing/sorting the hashed arrays (since 16 bytes <= 128 bits from the hash, and we don't have to compute the hash)
- Combines the logic in `assumeSortedShortcut` and `uniqueAndCount`

The performance of these changes will be tracked by the `String Groupby Performance` and `Small String GroupBy Performance` benchmarks

Co-authored-by: Pierce Hayes <[email protected]>
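The first bullet in the merged PR (removing the duplicate string hash) can be illustrated in plain Python; `hashlib.md5` stands in for the server's 128-bit hash and none of these names are arkouda APIs. Rather than hashing each Strings array up front and then hashing the integer representations again, every key column is fed into a single 128-bit digest per row:

```python
import hashlib

def row_key(row):
    # Hash all key columns (strings or ints) once per row into one
    # 128-bit digest -- the "hash each strings array only once" idea.
    h = hashlib.md5()
    for v in row:
        b = v.encode() if isinstance(v, str) else str(v).encode()
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(b).to_bytes(8, "big") + b)
    return h.digest()

rows = list(zip(["a", "b", "a"], [1, 2, 1]))
keys = [row_key(r) for r in rows]

# Equal rows collide into the same group; distinct rows (almost
# surely) get distinct 128-bit keys.
assert keys[0] == keys[2] and keys[0] != keys[1]
```

Sorting these per-row digests then yields the grouping with one hash pass over the string data instead of two.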