Optimizations for GroupBy on Strings #1776

Labels: performance (Performance needs improving)
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue on Oct 21, 2022:

This PR (closes Bears-R-Us#1847) adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in Bears-R-Us#1776.
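The three size buckets that benchmark tracks can be sketched with a small generator like the following. This is an illustrative assumption, not the benchmark's actual code: the function name, RNG seed, counts, and the 32-byte cap on the large bucket are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def random_strings(n, min_len, max_len):
    """Generate n random lowercase ASCII strings with lengths in [min_len, max_len]."""
    lengths = rng.integers(min_len, max_len + 1, size=n)
    return ["".join(chr(c) for c in rng.integers(97, 123, size=length))
            for length in lengths]

# Three size classes mirroring the benchmark's buckets
small  = random_strings(1000, 1, 7)    # < 8 bytes
medium = random_strings(1000, 8, 15)   # < 16 bytes
large  = random_strings(1000, 16, 32)  # > 16 bytes (capped at 32 here)
```

Grouping each bucket separately is what lets the benchmark isolate the small-string fast path (strings that fit in one or two 8-byte words) from the general hashing path.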
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue on Oct 21, 2022:

This PR (closes Bears-R-Us#1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`.
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This should be more efficient than uniquing/sorting the hashed arrays, since 16 bytes <= 128 bits from the hash and we avoid the cost of computing the hash.

The performance of these changes will be tracked by the `String Groupby Performance` benchmark and the benchmark from issue Bears-R-Us#1847. This is an initial implementation; the next step is moving the concat into the uint logic to take advantage of `computeOnSegments`. I will also combine the logic in `assumeSortedShortcut` and `uniqueAndCount` in the next iteration.
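The "concat the bytes into 1 or 2 uint arrays" idea can be sketched in NumPy. The real implementation lives in Arkouda's Chapel server code; this `pack_small_strings` helper is a hypothetical illustration of the packing, not the library's API:

```python
import numpy as np

def pack_small_strings(strings):
    """Pack strings of at most 16 bytes into (hi, lo) uint64 arrays.

    Bytes are laid out big-endian and zero-padded on the right, so
    comparing the (hi, lo) pairs numerically agrees with comparing
    the strings lexicographically -- letting ordinary numeric
    sort/unique machinery do the grouping.
    """
    n = len(strings)
    buf = np.zeros((n, 16), dtype=np.uint8)
    for i, s in enumerate(strings):
        b = s.encode()
        if len(b) > 16:
            raise ValueError("only valid for strings of at most 16 bytes")
        buf[i, : len(b)] = np.frombuffer(b, dtype=np.uint8)
    words = buf.view(">u8")          # reinterpret as shape (n, 2) uint64
    return words[:, 0], words[:, 1]  # high word, low word

strings = ["pear", "apple", "fig", "apple", "banana"]
hi, lo = pack_small_strings(strings)

# Plain numeric sort on the packed pair: primary key hi, secondary key lo
order = np.lexsort((lo, hi))
```

Packing big-endian with right-side zero padding is what makes numeric order match string order, and equal strings map to identical pairs, so uniquing the two uint arrays groups the strings with no hashing at all.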
Ethan-DeBandi99 pushed a commit that referenced this issue on Oct 21, 2022:

This PR (closes #1847) adds a benchmark comparing strings containing < 8 bytes, < 16 bytes, and > 16 bytes. This will track the single array of small strings optimization in #1776. Co-authored-by: Pierce Hayes <[email protected]>
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue on Oct 25, 2022:

This PR (closes Bears-R-Us#1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`.
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This should be more efficient than uniquing/sorting the hashed arrays, since 16 bytes <= 128 bits from the hash and we avoid the cost of computing the hash.
- Combines the logic in `assumeSortedShortcut` and `uniqueAndCount`.

The performance of these changes will be tracked by the `String Groupby Performance` benchmark and the benchmark from issue Bears-R-Us#1847. This is an initial implementation; the next step is moving the concat into the uint logic to take advantage of `computeOnSegments`.
stress-tess pushed a commit to stress-tess/arkouda that referenced this issue on Oct 25, 2022:

This PR (closes Bears-R-Us#1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`.
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This should be more efficient than uniquing/sorting the hashed arrays, since 16 bytes <= 128 bits from the hash and we avoid the cost of computing the hash.
- Combines the logic in `assumeSortedShortcut` and `uniqueAndCount`.

The performance of these changes will be tracked by the `String Groupby Performance` and `Small String GroupBy Performance` benchmarks.
joshmarshall1 pushed a commit that referenced this issue on Oct 26, 2022:

This PR (closes #1776):
- Removes the duplicated string hash in `Strings._get_grouping_keys()`, since hashing will be performed on all arrays during `unique`.
- When performing a groupby on a single strings array with max length < 16 bytes, we now concat the bytes into 1 or 2 uint arrays and then use the regular numeric unique logic. This should be more efficient than uniquing/sorting the hashed arrays, since 16 bytes <= 128 bits from the hash and we avoid the cost of computing the hash.
- Combines the logic in `assumeSortedShortcut` and `uniqueAndCount`.

The performance of these changes will be tracked by the `String Groupby Performance` and `Small String GroupBy Performance` benchmarks. Co-authored-by: Pierce Hayes <[email protected]>
The current `GroupBy` logic for strings is optimized for grouping a single array of long (or variable-length) strings. However, there are at least two other cases that could benefit from different logic: […]

Currently, a groupby involving strings will first hash the strings (inside `Strings._get_grouping_keys()`) so that all arrays are represented by integers, and then it will hash all the integer representations (inside `UniqueMsg.chpl`) and accumulate those into a single array of 128-bit hashes for sorting. In this situation, strings effectively get hashed twice, which is wasteful. I think we can skip the first hash and rework `UniqueMsg` to just hash each strings array once.