I can't seem to find the discussion right now, but we've been mulling over whether to explore a different sorting algorithm, such as the `twoArrayRadixSort` already in Chapel. The current LSD radix sort has the advantages of simplicity and complete insensitivity to data distribution, which are powerful. However, there are conditions (such as mostly-sorted input) under which an MSD radix sort would perform significantly faster, and it's not clear to me which sort is best for the majority of data distributions we are likely to encounter in the wild.

To that end, I'd like to create a benchmark that evaluates the performance of sorting and co-sorting on several distributions of data. For now, only the existing LSD radix sort would be used, and I would expect the performance to be the same on all inputs of a given size, regardless of distribution. But if/when we include the Chapel `twoArrayRadixSort` as a runtime choice, the benchmarking infrastructure will exist to compare the two algorithms side by side.
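For context on why LSD is distribution-insensitive: it always makes a fixed number of stable per-digit passes, regardless of input order. A minimal NumPy sketch (not Arkouda's actual implementation; the 16-bit digit width and `uint64` key type are illustrative assumptions):

```python
import numpy as np

def lsd_radix_sort(a, bits_per_digit=16):
    # Sketch of an LSD radix sort over unsigned 64-bit keys.
    # Digits are processed least- to most-significant; each pass is a
    # stable sort on one digit, so the total work is fixed by the key
    # width and count, independent of distribution or presortedness.
    a = np.asarray(a, dtype=np.uint64)
    mask = np.uint64((1 << bits_per_digit) - 1)
    for shift in range(0, 64, bits_per_digit):
        digits = (a >> np.uint64(shift)) & mask
        # stable counting-sort step on this digit (argsort stands in
        # for the counting sort used in a real implementation)
        a = a[np.argsort(digits, kind="stable")]
    return a
```

An MSD variant, by contrast, can terminate early within buckets that are already sorted, which is exactly where the benchmark's mostly-sorted cases should separate the two.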
Cases I would like to test:

- uniform random integers
  - `[0, 2**16)`
  - `[0, 2**32)`
  - `[0, 2**64)`
- power-law integers and floats
- RMAT-generated vertex identifiers (ints)
- block-sorted data, i.e. `concatenate(arrays)`, where each `arrays[i]` is sorted
- refinements, e.g. `coargsort([a, b])`, where `a` is sorted but `b` is not
- `datetime64[ns]`-like data, i.e. values whose range is much smaller than their magnitude
- IPv4/IPv6-like data, e.g. with 90% of values in `[0, 2**32)` and 10% in `[2**32, 2**128)` (this would require a `coargsort`)
- strings of uniformly distributed length
- strings with log-normally distributed length
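To make a few of the cases above concrete, here is a rough sketch of input generators (NumPy for illustration only; the size, seed, block count, and power-law exponent are arbitrary assumptions, and `np.lexsort` stands in for `coargsort`):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # placeholder benchmark size

def uniform_ints(bits):
    # uniform random integers in [0, 2**bits)
    return rng.integers(0, 1 << bits, size=N, dtype=np.uint64)

def power_law_ints(alpha=2.5):
    # Pareto-distributed integers >= 1 via inverse-CDF sampling
    u = rng.random(N)
    return ((1.0 - u) ** (-1.0 / (alpha - 1.0))).astype(np.uint64)

def block_sorted(num_blocks=100):
    # concatenate(arrays) where each arrays[i] is sorted
    return np.concatenate([np.sort(uniform_ints(32)[: N // num_blocks])
                           for _ in range(num_blocks)])

def datetime_like():
    # range much smaller than magnitude: ns timestamps within one day
    base = np.uint64(1_600_000_000_000_000_000)  # ~2020-09 in ns since epoch
    return base + rng.integers(0, 86_400 * 10**9, size=N, dtype=np.uint64)

def ip_like():
    # 90% "IPv4" keys with zero high bits, 10% "IPv6" keys with a
    # nonzero high word; keys are (hi, lo) uint64 pairs, co-sorted
    # lexicographically as a coargsort would
    hi = np.where(rng.random(N) < 0.9, 0,
                  rng.integers(1, 1 << 63, size=N)).astype(np.uint64)
    lo = uniform_ints(64)
    order = np.lexsort((lo, hi))  # stand-in for coargsort([hi, lo])
    return hi[order], lo[order]
```

The RMAT and string-length cases would follow the same pattern, each producing a named array (or tuple of arrays, for the co-sort cases) that the harness feeds to the sort under test.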
Any other suggestions or additional cases?