Create benchmark for sorting relevant distributions of data #973

reuster986 · 2021-11-16T16:32:57Z

I can't seem to find the discussion right now, but we've been mulling over whether to explore a different sorting algorithm, such as the twoArrayRadixSort already in Chapel. The current LSD radix sort has the advantages of simplicity and complete insensitivity to data distribution, which are powerful. However, there are conditions (such as mostly-sorted input) under which an MSD radix sort would perform significantly faster, and it's not clear to me which sort is best for the majority of data distributions we are likely to encounter in the wild.
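To make the "insensitivity to data distribution" point concrete, here is a minimal pure-Python sketch of an LSD radix sort. It is not Arkouda's Chapel implementation; the function name, the 8-bit digit size, and the bucket-list approach are assumptions chosen for readability. The point it illustrates is that the number of passes is fixed by the key width, so already-sorted input gets no shortcut.

```python
import random


def lsd_radix_sort(values, key_bits=64, digit_bits=8):
    """Sort non-negative integers with a least-significant-digit radix sort.

    Returns (sorted_list, passes). The pass count depends only on
    key_bits and digit_bits, never on how the input is ordered.
    """
    radix = 1 << digit_bits
    mask = radix - 1
    passes = 0
    for shift in range(0, key_bits, digit_bits):
        # Stable bucket pass on the current digit, lowest digit first.
        buckets = [[] for _ in range(radix)]
        for v in values:
            buckets[(v >> shift) & mask].append(v)
        values = [v for bucket in buckets for v in bucket]
        passes += 1
    return values, passes


rng = random.Random(0)
data = [rng.randrange(2**64) for _ in range(1000)]

# Random and pre-sorted inputs take exactly the same number of passes.
out_random, passes_random = lsd_radix_sort(data)
out_sorted, passes_sorted = lsd_radix_sort(sorted(data))
```

An MSD variant, by contrast, can stop recursing into a bucket once it is small or already ordered, which is where the potential win on mostly-sorted input would come from.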

To that end, I'd like to create a benchmark that evaluates the performance of sorting and co-sorting on several distributions of data. For now, only the existing LSD radix sort would be used, and I would expect the performance to be the same on all inputs of a given size, regardless of distribution. But if/when we include the Chapel twoArrayRadixSort as a runtime choice, the benchmarking infrastructure will exist to compare the two algorithms side by side.
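A rough sketch of what such a harness could look like, in plain Python so it runs anywhere. Everything here is hypothetical: `time_sort`, `run_benchmark`, and the case names are illustrative, and the built-in `sorted` is only a stand-in for the algorithms that would actually be compared (the existing LSD sort and, later, twoArrayRadixSort).

```python
import random
import time


def time_sort(sort_fn, data, trials=3):
    """Return the best wall-clock time (seconds) of sort_fn over a few trials."""
    best = float("inf")
    for _ in range(trials):
        work = list(data)  # copy so every trial sees the same input
        start = time.perf_counter()
        sort_fn(work)
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmark(sort_fns, cases):
    """Time every algorithm on every named input.

    Returns a nested dict: {case_name: {algorithm_name: seconds}}.
    """
    return {
        case_name: {algo: time_sort(fn, data) for algo, fn in sort_fns.items()}
        for case_name, data in cases.items()
    }


rng = random.Random(0)
n = 10_000
cases = {
    "uniform_2**32": [rng.randrange(2**32) for _ in range(n)],
    "presorted": sorted(rng.randrange(2**32) for _ in range(n)),
}
# Stand-in only: a real run would register the actual sort implementations.
sort_fns = {"builtin_sorted": sorted}
results = run_benchmark(sort_fns, cases)
```

Keeping the cases and the algorithms in separate dictionaries means a second algorithm can be added later without touching the data generators.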

Cases I would like to test:

- uniform random integers
  - [0, 2**16)
  - [0, 2**32)
  - [0, 2**64)
- power-law integers and floats
- RMAT-generated vertex identifiers (ints)
- block-sorted data, i.e. concatenate(arrays), where each arrays[i] is sorted
- refinements, e.g. coargsort([a, b]), where a is sorted but b is not
- datetime64[ns]-like data, i.e. values whose range is much smaller than their magnitude
- IPv4/IPv6-like data, e.g. with 90% of values in [0, 2**32) and 10% in [2**32, 2**128) (this would require a coargsort)
- Strings of uniformly distributed length
- Strings with log-normally distributed length
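For illustration, several of the cases above could be generated along these lines. This is a sketch using only the Python standard library, not the code that would go into the benchmark; all function names and default parameters (the Pareto exponent, the base timestamp, the one-day span, the log-normal `mu`/`sigma`) are assumptions picked for the example.

```python
import random

rng = random.Random(0)


def uniform_ints(n, bits):
    """Uniform integers in [0, 2**bits)."""
    return [rng.randrange(1 << bits) for _ in range(n)]


def power_law_ints(n, alpha=1.5):
    """Heavy-tailed integers from truncated Pareto variates (all >= 1)."""
    return [int(rng.paretovariate(alpha)) for _ in range(n)]


def block_sorted(n, blocks=4):
    """Concatenation of independently sorted blocks."""
    size = n // blocks
    out = []
    for _ in range(blocks):
        out.extend(sorted(rng.randrange(1 << 32) for _ in range(size)))
    return out


def refinement(n, groups=10):
    """Pair (a, b): a is sorted with many ties, b is unsorted and breaks them."""
    a = sorted(rng.randrange(groups) for _ in range(n))
    b = uniform_ints(n, 32)
    return a, b


def datetime_like(n, base=1_600_000_000 * 10**9, span=86_400 * 10**9):
    """Nanosecond timestamps within one day of a base far from zero."""
    return [base + rng.randrange(span) for _ in range(n)]


def ip_like(n, v4_fraction=0.9):
    """128-bit values split into (hi, lo) 64-bit words for a two-key sort."""
    hi, lo = [], []
    for _ in range(n):
        if rng.random() < v4_fraction:
            value = rng.randrange(1 << 32)  # IPv4-sized
        else:
            value = rng.randrange(1 << 32, 1 << 128)  # IPv6-sized
        hi.append(value >> 64)
        lo.append(value & ((1 << 64) - 1))
    return hi, lo


def lognormal_strings(n, mu=2.0, sigma=1.0):
    """Lowercase strings whose lengths follow a log-normal distribution."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for _ in range(n):
        length = max(1, int(rng.lognormvariate(mu, sigma)))
        out.append("".join(rng.choice(alphabet) for _ in range(length)))
    return out
```

Splitting the IP-like values into high and low words is what makes that case a co-sort: the keys no longer fit in a single 64-bit array.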

Any other suggestions or additional cases?


Data distributions for sort benchmarking #973

reuster986 self-assigned this Nov 16, 2021

reuster986 mentioned this issue Nov 16, 2021

Data distributions for sort benchmarking #973 #977

Merged

reuster986 mentioned this issue Nov 30, 2021

Support Chapel's twoArrayRadixSort as an optional algorithm #984

Closed

reuster986 added a commit that referenced this issue Dec 16, 2021

Merge pull request #977 from Bears-R-Us/sort-test-cases

3934fc1

Data distributions for sort benchmarking #973
