Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make string sorting more consistent #677

Closed
ronawho opened this issue Feb 16, 2021 · 8 comments
Closed

Make string sorting more consistent #677

ronawho opened this issue Feb 16, 2021 · 8 comments
Labels
performance Performance needs improving Recommend Closing This issue is recommended to be closed.

Comments

@ronawho
Copy link
Contributor

ronawho commented Feb 16, 2021

String argsort operates on the raw variable-length strings, but coargsort operates on the hash of the strings. Should these be unified? It's not obvious to me why they differ or when it's appropriate to sort the hash vs the raw strings.

From a performance perspective, it seems to be slightly faster to sort small strings directly, but at some point sorting the hash is faster. Sorting the hash should roughly be the time to hash + 4 * int(32) radix sort time (since we're sorting 128-bit hashes).

Here's a branch that compares twoPhaseArgSort to hash+radixSort:master...ronawho:str-argsort-hash

16-node-cs-hdr performance results with varying string sizes:

make test-bin StringArgsortSpeedTest

# maxLen=8 (default)
./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=8
Sorted 1600000000 strings (7629.0 MB) in 10.15 seconds (751.92 MB/s)

./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=8 --useHashArgsort
Sorted 1600000000 strings (7629.0 MB) in 10.94 seconds (697.59 MB/s)

# maxLen=32
./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=32
Sorted 1600000000 strings (25939.0 MB) in 35.15 seconds (737.89 MB/s)

./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=32 --useHashArgsort
Sorted 1600000000 strings (25939.0 MB) in 11.64 seconds (2227.67 MB/s)

# maxLen=64
./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=64
Sorted 1600000000 strings (50354.0 MB) in 69.18 seconds (727.90 MB/s)

./test-bin/StringArgsortSpeedTest -nl 16 --size=$((10**8)) --maxLen=6 --useHashArgsort
Sorted 1600000000 strings (50353.0 MB) in 11.73 seconds (4293.39 MB/s)
@reuster986
Copy link
Collaborator

Good question. Ideally, arkouda would offer users both "grouping" semantics and "sorting" semantics for strings, with ak.GroupBy and set operations implicitly using "grouping" semantics. Unfortunately, I originally wrote ak.GroupBy to use coargsort under the hood, back when it was only integers and there was no distinction between grouping and sorting.

So ideally I would like to rework ak.GroupBy to call its own server command that hashes strings (by default), and change ak.coargsort to actually sort the strings (ak.argsort already sorts). What do you think of this plan?

@ronawho
Copy link
Contributor Author

ronawho commented Feb 17, 2021

Ah, ok -- yeah that makes a lot of sense. I didn't understand how sorting the hashes was useful in terms of actually sorting the raw strings, but that makes a lot of sense that coargsort was primarily be used for grouping. I agree, let's make coargsort sort the raw strings, and have groupby call a new server command.

@mhmerrill mhmerrill added the performance Performance needs improving label Apr 5, 2021
@mhmerrill
Copy link
Contributor

@reuster986 @ronawho did we satisfy this issue yet?

@ronawho
Copy link
Contributor Author

ronawho commented Apr 5, 2021

Not so far as I know.

@reuster986
Copy link
Collaborator

The high-level actions required by this issue are:

  • Create a server message specifically for GroupBy that takes over the grouping logic currently employed by coargsort
  • Modify coargsort to actually sort the data, rather than just grouping it (for strings)

Personally, I think it makes sense to wait on this issue until we have support for complex objects in the symbol table. Then, we can design the new GroupBy server message to generate an actual GroupBy object. This would also be a good time to refactor the logic of findSegmentsMsg to make it cleaner and possibly more performant.

In short, this is a pretty major effort that we will have to plan for in the future.

@mhmerrill
Copy link
Contributor

@reuster986 @ronawho can we close this?

@reuster986
Copy link
Collaborator

Once #1330 gets merged it will check the first box in my comment above, but the second box is still not done. However, this issue is related to #846 and we could maybe close one of them and roll the requirements into the other.

@stress-tess
Copy link
Member

I think we should close this issue in favor of #1348, and roll over any requirements might are not currently captured

@stress-tess stress-tess added the Recommend Closing This issue is recommended to be closed. label May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance needs improving Recommend Closing This issue is recommended to be closed.
Projects
None yet
Development

No branches or pull requests

4 participants