Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create benchmark for sorting relevant distributions of data #973

Open
7 of 9 tasks
reuster986 opened this issue Nov 16, 2021 · 0 comments
Open
7 of 9 tasks

Create benchmark for sorting relevant distributions of data #973

reuster986 opened this issue Nov 16, 2021 · 0 comments
Assignees

Comments

@reuster986
Copy link
Collaborator

reuster986 commented Nov 16, 2021

I can't seem to find the discussion right now, but we've been mulling over whether to explore a different sorting algorithm, such as the twoArrayRadixSort already in Chapel. The current LSD radix sort has the advantages of simplicity and complete insensitivity to data distribution, which are powerful. However, there are conditions (such as mostly-sorted input) under which an MSD radix sort would perform significantly faster, and it's not clear to me which sort is best for the majority of data distributions we are likely to encounter in the wild.

To that end, I'd like to create a benchmark that evaluates the performance of sorting and co-sorting on several distributions of data. For now, only the existing LSD radix sort would be used, and I would expect the performance to be the same on all inputs of a given size, regardless of distribution. But if/when we include the Chapel twoArrayRadixSort as a runtime choice, the benchmarking infrastructure will exist to compare the two algorithms side by side.

Cases I would like to test:

  • uniform random integers
    • [0, 2**16)
    • [0, 2**32)
    • [0, 2**64)
  • power-law integers and floats
  • RMAT-generated vertex identifiers (ints)
  • block-sorted data, i.e. concatenate(arrays), where each arrays[i] is sorted
  • refinements, e.g. coargsort([a, b]), where a is sorted but b is not
  • datetime64[ns]-like data, i.e. values whose range is much smaller than their magnitude
  • IPv4/IPv6-like data, e.g. with 90% of values in [0, 2**32) and 10% in [2**32, 2**128) (this would require a coargsort)
  • Strings of uniformly distributed length
  • Strings with log-normally distributed length

Any other suggestions or additional cases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant