Improve in-place primitive sorts by 13-67% #4473

psvri · 2023-07-01T18:53:52Z

Which issue does this PR close?

Closes #.

Rationale for this change

The current sort implementation for primitive types first sorts indices and then performs a take operation. The kernel can be improved by sorting the values directly.
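To make the difference concrete, here is a minimal sketch that models the two strategies on plain slices rather than Arrow arrays. The function names `sort_via_indices` and `sort_direct` are hypothetical and are not the kernel's actual code; nulls and sort options are left out on purpose.

```rust
/// Modelled old strategy: sort a permutation of indices, then gather ("take").
fn sort_via_indices(values: &[i64]) -> Vec<i64> {
    let mut indices: Vec<usize> = (0..values.len()).collect();
    indices.sort_unstable_by(|&a, &b| values[a].cmp(&values[b]));
    // The gather needs the extra index buffer and one indexed read per element.
    indices.iter().map(|&i| values[i]).collect()
}

/// Modelled new strategy: copy the values once and sort the copy in place.
fn sort_direct(values: &[i64]) -> Vec<i64> {
    let mut out = values.to_vec();
    out.sort_unstable();
    out
}

fn main() {
    let input = [5_i64, -3, 9, 0, 5, 2];
    // Both strategies must agree on the sorted values.
    assert_eq!(sort_via_indices(&input), sort_direct(&input));
    println!("{:?}", sort_direct(&input));
}
```

The direct variant only applies when the caller does not need the indices afterwards, for example to reorder other columns.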

The results for i64 on my laptop are as follows:

sort 2^10               time:   [9.1287 µs 9.1711 µs 9.2279 µs]
                        change: [-27.180% -25.980% -24.697%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  11 (11.00%) high severe

sort 2^12               time:   [70.191 µs 70.437 µs 70.741 µs]
                        change: [-15.190% -13.941% -12.614%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

sort nulls 2^10         time:   [4.9898 µs 5.0080 µs 5.0319 µs]
                        change: [-68.212% -67.754% -67.288%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  10 (10.00%) high severe

sort nulls 2^12         time:   [34.333 µs 34.641 µs 34.983 µs]
                        change: [-58.256% -57.089% -56.063%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

What changes are included in this PR?

I reworked the sort kernel so that primitive types are sorted directly without using sort_by_indices. I have also included a new primitive benchmark, sort_kernel_primitives.rs.

Are there any user-facing changes?

No

psvri · 2023-07-01T18:56:07Z

I haven't created an issue for this. Let me know if it's required.

arrow-ord/src/sort.rs

tustvold · 2023-07-03T09:43:21Z

Is there some way we might reduce the amount of unsafe code here? Given this is a rare special case (where you don't need the indices to sort other columns), I'm keen to keep the maintenance overheads down.

arrow-ord/src/sort.rs

psvri · 2023-07-03T16:50:11Z

Is there some way we might reduce the amount of unsafe code here? Given this is a rare special case (where you don't need the indices to sort other columns), I'm keen to keep the maintenance overheads down.

I have removed all unsafe code in the latest commit.

Dandandan · 2023-07-03T18:24:42Z

arrow-ord/src/sort.rs

+        mutable_slice.sort_unstable_by(|a, b| a.compare(*b));
+        if sort_options.descending {
+            mutable_slice.reverse();
+        }


This looks like it should be faster for the descending case?

Suggested change

-        mutable_slice.sort_unstable_by(|a, b| a.compare(*b));
-        if sort_options.descending {
-            mutable_slice.reverse();
-        }
+        if sort_options.descending {
+            mutable_slice.sort_unstable_by(|a, b| b.compare(*a));
+        } else {
+            mutable_slice.sort_unstable_by(|a, b| a.compare(*b));
+        }

Tested it just now on my laptop. The difference is only between 2% and 3%.

Thanks for checking. I would argue it's also a bit more simple :)

I think given the marginal speed difference it makes sense to save on codegen by using reverse 👍
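For reference, the two variants discussed in this thread can be sketched on a plain `i64` slice as follows. The free function `compare` is a hypothetical stand-in for the kernel's comparison, and the function names are made up for illustration.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for the comparison used by the kernel.
fn compare(a: i64, b: i64) -> Ordering {
    a.cmp(&b)
}

/// Variant in the PR: always sort ascending, then reverse for descending.
fn sort_then_reverse(values: &mut [i64], descending: bool) {
    values.sort_unstable_by(|a, b| compare(*a, *b));
    if descending {
        values.reverse();
    }
}

/// Suggested variant: choose the comparator up front.
fn sort_with_flipped_comparator(values: &mut [i64], descending: bool) {
    if descending {
        values.sort_unstable_by(|a, b| compare(*b, *a));
    } else {
        values.sort_unstable_by(|a, b| compare(*a, *b));
    }
}

fn main() {
    let mut a = [4_i64, 1, 7, 1, 3];
    let mut b = a;
    sort_then_reverse(&mut a, true);
    sort_with_flipped_comparator(&mut b, true);
    // Both variants produce the same descending order.
    assert_eq!(a, b);
}
```

Each closure has its own type, so the second variant instantiates the generic sort routine twice; that is the codegen cost weighed against the measured difference above.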

tustvold

Looks good, just some minor nits.

tustvold · 2023-07-03T20:16:39Z

arrow-ord/src/sort.rs

+) -> Result<ArrayRef, ArrowError>
+where
+    T: ArrowPrimitiveType,
+    <T as arrow_array::ArrowPrimitiveType>::Native: ArrowNativeTypeOp,


This shouldn't be necessary given the constraints on ArrowPrimitiveType::Native
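A minimal sketch of why such a clause is redundant, assuming the bound is declared on the associated type itself. `NativeOp`, `PrimitiveType` and `Int64Type` below are hypothetical stand-ins defined locally, not the real arrow traits.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for ArrowNativeTypeOp.
trait NativeOp: Copy {
    fn compare(self, rhs: Self) -> Ordering;
}

// Hypothetical stand-in for ArrowPrimitiveType.
trait PrimitiveType {
    // The bound is stated once, on the associated type.
    type Native: NativeOp;
}

impl NativeOp for i64 {
    fn compare(self, rhs: Self) -> Ordering {
        self.cmp(&rhs)
    }
}

struct Int64Type;

impl PrimitiveType for Int64Type {
    type Native = i64;
}

// No `where T::Native: NativeOp` clause is needed: `T: PrimitiveType`
// already implies it, so `compare` can be called directly.
fn sort_values<T: PrimitiveType>(values: &mut [T::Native]) {
    values.sort_unstable_by(|a, b| a.compare(*b));
}

fn main() {
    let mut v = [3_i64, 1, 2];
    sort_values::<Int64Type>(&mut v);
    assert_eq!(v, [1, 2, 3]);
}
```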

tustvold · 2023-07-03T20:18:54Z

arrow-ord/src/sort.rs

+    let array_data = values.to_data();
+    let input_values = array_data.buffer(0);


Suggested change

-    let array_data = values.to_data();
-    let input_values = array_data.buffer(0);
+    let array = array.as_primitive::<T>();
+    let input_values = array.values().as_ref();

This not only avoids marshaling to ArrayData; the existing code is also technically exploiting an implementation detail, namely that PrimitiveArray returns ArrayData with a zero offset.

tustvold · 2023-07-03T20:19:39Z

arrow-ord/src/sort.rs

+    if values.null_count() > 0 {
+        let nulls = array_data.nulls().unwrap();


Suggested change

-    if values.null_count() > 0 {
-        let nulls = array_data.nulls().unwrap();
+    if let Some(nulls) = array.nulls().filter(|n| n.null_count() > 0) {

tustvold · 2023-07-03T20:20:45Z

arrow-ord/src/sort.rs

+    let array_data = values.to_data();
+    let input_values = array_data.buffer(0);
+
+    let mut null_bit_buffer = None;


It might be nicer to use an expression style here, rather than using mut

e.g.

let nulls = match array.nulls().filter(|n| n.null_count() > 0) {
    Some(nulls) => ...,
    None => ...
}

tustvold · 2023-07-03T20:21:20Z

arrow-ord/src/sort.rs

+
+    let result_capacity = values.len()
+        * std::mem::size_of::<<T as arrow_array::ArrowPrimitiveType>::Native>();
+    let mut mutable_buffer = MutableBuffer::new(result_capacity);


Have you considered just using Vec here?

tustvold · 2023-07-03T20:22:39Z

arrow-ord/src/sort.rs

+        let nulls = array_data.nulls().unwrap();
+
+        let mut validity_buffer = BooleanBufferBuilder::new(values.len());
+        let values_slice;


I personally prefer the expression style, e.g.

let values_slice = match sort_options.nulls_first { true => ..., false => ... }

It makes it easier to see what is going on and where the value is being set
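As an illustration of the expression style, here is a small sketch that models nullable values as `Option<i64>`. That model is an assumption made for brevity (Arrow keeps validity in a separate bitmap), and `sort_with_nulls` is a hypothetical name.

```rust
/// Sorts the non-null values and places the nulls at the front or the back.
/// `nulls_first` mirrors the sort option of the same name.
fn sort_with_nulls(input: &[Option<i64>], nulls_first: bool) -> Vec<Option<i64>> {
    let null_count = input.iter().filter(|v| v.is_none()).count();
    let mut out: Vec<Option<i64>> = vec![None; input.len()];

    // Expression style: the slice is bound exactly once, instead of
    // declaring `let values_slice;` and assigning it in two branches.
    let values_slice = match nulls_first {
        true => &mut out[null_count..],
        false => &mut out[..input.len() - null_count],
    };

    // Copy the valid values into the chosen region, then sort that region.
    for (slot, v) in values_slice.iter_mut().zip(input.iter().flatten()) {
        *slot = Some(*v);
    }
    values_slice.sort_unstable();
    out
}

fn main() {
    let input = [Some(3), None, Some(1), None, Some(2)];
    println!("{:?}", sort_with_nulls(&input, true));
    println!("{:?}", sort_with_nulls(&input, false));
}
```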

tustvold · 2023-07-03T20:22:58Z

arrow-ord/src/sort.rs

+        mutable_slice.sort_unstable_by(|a, b| a.compare(*b));
+        if sort_options.descending {
+            mutable_slice.reverse();
+        }


I think given the marginal speed difference it makes sense to save on codegen by using reverse 👍

Dandandan · 2023-07-03T20:30:50Z

arrow-ord/src/sort.rs

+        null_bit_buffer = Some(validity_buffer.finish().into());
+    } else {
+        mutable_slice.copy_from_slice(&input_values[..values.len()]);
+        mutable_slice.sort_unstable_by(|a, b| a.compare(*b));


Suggested change

-        mutable_slice.sort_unstable_by(|a, b| a.compare(*b));
+        mutable_slice.sort_unstable();

Should be the same?

Not for floats: we use total ordering, not the default partial ordering.

Ah forgot about that :)
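A small standard-library sketch of the distinction. It uses `f64::total_cmp` as the total order; that the kernel's `compare` behaves the same way for floats is an assumption here, and `sort_floats_total` is a hypothetical helper.

```rust
/// Sorts floats with a total order, so NaN and signed zeros have a defined position.
fn sort_floats_total(values: &mut [f64]) {
    // `values.sort_unstable()` would not compile: f64 is PartialOrd but not Ord.
    values.sort_unstable_by(|a, b| a.total_cmp(b));
}

fn main() {
    // `copysign(1.0)` forces a positive NaN, which the total order places last.
    let mut values = [2.0_f64, f64::NAN.copysign(1.0), -0.0, 0.0, -1.5];
    sort_floats_total(&mut values);

    assert_eq!(values[0], -1.5);
    assert!(values[1] == 0.0 && values[1].is_sign_negative()); // -0.0 before +0.0
    assert!(values[2] == 0.0 && values[2].is_sign_positive());
    assert_eq!(values[3], 2.0);
    assert!(values[4].is_nan());
}
```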

psvri · 2023-07-04T02:39:19Z

I will make these changes today.

tustvold · 2023-07-04T03:40:57Z

arrow-ord/src/sort.rs

@@ -57,11 +58,137 @@ pub fn sort(
    values: &dyn Array,
    options: Option<SortOptions>,
 ) -> Result<ArrayRef, ArrowError> {
-    if let DataType::RunEndEncoded(_, _) = values.data_type() {
-        return sort_run(values, options, None);
+    match values.data_type() {


If you changed sort_native_type to take PrimitiveArray instead of dyn Array, you could use the downcast_primitive_array macro here.

psvri · 2023-07-04T14:25:50Z

Thanks for the comments above; the latest commit addresses them. It simplified the code a lot.

tustvold

LGTM thank you

tustvold · 2023-07-04T14:47:48Z

arrow-ord/src/sort.rs

+        values => return sort_native_type(values, options),
+        DataType::RunEndEncoded(_, _) => return sort_run(values, options, None),
+        _ => {
+            let indices = sort_to_indices(values, options, None)?;
+            return take(values, &indices, None)


Suggested change

-        values => return sort_native_type(values, options),
-        DataType::RunEndEncoded(_, _) => return sort_run(values, options, None),
-        _ => {
-            let indices = sort_to_indices(values, options, None)?;
-            return take(values, &indices, None)
+        values => sort_native_type(values, options),
+        DataType::RunEndEncoded(_, _) => sort_run(values, options, None),
+        _ => {
+            let indices = sort_to_indices(values, options, None)?;
+            take(values, &indices, None)

psvri added 3 commits July 1, 2023 23:02

Adding sort_primitives benchmark

5fbd34a

Adding sort_primitives improvements

596f81a

Fix lints

0e20980

github-actions bot added the arrow Changes to the arrow crate label Jul 1, 2023
