Datum based comparison kernels (#4596) #4701
Conversation
[features]
dyn_cmp_dict = []
simd = ["arrow-array/simd"]
The SIMD variants added a non-trivial amount of code complexity, and behaved differently. I opted to simply remove them; we can always add back some sort of SIMD specialization for primitives in the future should users feel strongly
For other kernels, as I recall, we found that properly written Rust code would be vectorized by LLVM more effectively than our hand-rolled SIMD kernels. Do you think that is still the case?
cc @jhorstmann
It is empirically not the case here: LLVM is really bad at optimising horizontal operations, such as creating a packed bitmask from a comparison. The LLVM-generated kernels are ~2x slower
Correct, though I feel like LLVM has gotten a little better at this. On the other hand, the SIMD kernels also did not generate "perfect" code, due to combining multiple masks into a single u64 (except for u8 comparisons, which have 64 lanes).
I think the tradeoff against improved compile time is ok, and longer term this allows more code cleanups. The decimal types, for example, already had quite hacky support for SIMD. The last SIMD usages would then be the aggregation kernels.
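To make the "horizontal operation" concrete, here is a minimal stdlib-only sketch (not the actual arrow-rs kernel) of packing 64 comparison results into a single u64 bitmask, the kind of reduction the discussion says LLVM struggles to auto-vectorize:

```rust
// Sketch: pack the results of comparing a 64-element chunk against a scalar
// into one u64, bit i set iff chunk[i] == rhs. This is the per-chunk
// "horizontal" step a packed boolean buffer kernel performs.
fn pack_bitmask(chunk: &[i64; 64], rhs: i64) -> u64 {
    chunk
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &v)| acc | (((v == rhs) as u64) << i))
}

fn main() {
    let mut chunk = [0i64; 64];
    chunk[0] = 5;
    chunk[3] = 5;
    // Only indices 0 and 3 equal 5, so bits 0 and 3 are set.
    assert_eq!(pack_bitmask(&chunk, 5), 0b1001);
    println!("{:#b}", pack_bitmask(&chunk, 5));
}
```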
arrow-ord/src/cmp.rs
Outdated
Ok(BooleanArray::new(values, nulls))
}

fn values(a: &dyn Array) -> (Option<Vec<usize>>, &dyn Array) {
This formulation not only allows for mixed dictionary and non-dictionary arrays (as before), but also allows for mixed dictionary key sizes.
The cost of hydrating the keys in this way is negligible compared to the execution time of the kernels, and it avoids a huge amount of codegen
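A stdlib-only sketch of the hydration idea (illustrative names, not the actual arrow-rs code): keys of any integer width are normalized into one `Vec<usize>`, so the comparison logic is instantiated once rather than per key type:

```rust
// Sketch: hydrate dictionary keys of any integer width into a common
// Vec<usize>, so downstream code works with a single index type.
fn normalize_keys<K>(keys: &[K]) -> Vec<usize>
where
    K: Copy + TryInto<usize>,
{
    keys.iter()
        .map(|&k| k.try_into().ok().expect("negative or oversized key"))
        .collect()
}

fn main() {
    let k8: Vec<u8> = vec![2, 0, 1, 1];
    let k32: Vec<u32> = vec![2, 0, 1, 1];
    // Mixed key widths hydrate to the same usize representation.
    assert_eq!(normalize_keys(&k8), normalize_keys(&k32));
    assert_eq!(normalize_keys(&k8), vec![2, 0, 1, 1]);
}
```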
Benchmarks
arrow-ord/src/cmp.rs
Outdated
}
}

trait Dictionary: Array {
I could see us making this public, perhaps with some additional methods, so that it can form the basic pattern for "how to handle dictionaries". FYI @alamb
I agree -- methods like "apply a unary function to the values, returning a Dictionary of the same type" would be very helpful. Given the right documentation, I think this would be genuinely useful and less confusing.
One thought I had is whether there is a more general formulation that might be helpful (for example, for DictionaryArrays, REE arrays, and maybe StringViewArrays), although maybe they could just have their own functions
I don't have enough experience with REE arrays to want to propose an abstraction for them; I honestly don't really know how to handle them efficiently...
StringViewArrays shouldn't require any additional logic, as StringView isn't a "nested" type
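A hypothetical stdlib-only sketch of the kind of public dictionary abstraction discussed above (the names are illustrative, not the actual arrow-rs API), including the "apply a unary function to the values" method, which costs O(distinct values) rather than O(rows):

```rust
// Toy dictionary: distinct values plus one key per row indexing into them.
struct SimpleDict<T> {
    values: Vec<T>,   // distinct values
    keys: Vec<usize>, // one index into `values` per logical row
}

impl<T: Clone> SimpleDict<T> {
    /// Apply `f` to the values only, keeping the keys unchanged.
    fn map_values(&self, f: impl Fn(&T) -> T) -> Self {
        SimpleDict {
            values: self.values.iter().map(&f).collect(),
            keys: self.keys.clone(),
        }
    }

    /// Logical value at row `i`.
    fn get(&self, i: usize) -> &T {
        &self.values[self.keys[i]]
    }
}

fn main() {
    let d = SimpleDict { values: vec![10, 20], keys: vec![1, 0, 1] };
    let doubled = d.map_values(|v| v * 2);
    // `f` ran twice (once per distinct value), not three times.
    assert_eq!(*doubled.get(0), 40);
    assert_eq!(*doubled.get(1), 20);
    assert_eq!(*doubled.get(2), 40);
}
```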
Latest benchmarks
These are very neat results. To what do you attribute the performance improvements like the following?
For the case of dictionaries, it is likely because the process of normalizing the keys allows for unchecked indexing, which in turn helps LLVM optimise the code properly, as there isn't a branch in the main loop. For the case of primitives I don't have a very good intuition; however, these kernels have historically been extremely sensitive to inlining. It is possible that collocating more of the code allows LLVM to better optimise it... I haven't looked in too much detail. It could equally be the …
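A stdlib-only sketch of the unchecked-indexing point (illustrative names, not the actual kernel): validating every key once up front leaves a branch-free hot loop that is much easier for LLVM to vectorize:

```rust
// Sketch: compare the dictionary-encoded values against a scalar. The bounds
// check is hoisted out of the loop, so the loop body has no branch.
fn eq_dict_scalar(values: &[i64], keys: &[usize], rhs: i64) -> Vec<bool> {
    // Validate all keys once, outside the hot loop.
    assert!(keys.iter().all(|&k| k < values.len()));
    keys.iter()
        // SAFETY: every key was checked against values.len() above.
        .map(|&k| unsafe { *values.get_unchecked(k) } == rhs)
        .collect()
}

fn main() {
    let values = vec![1, 5, 9];
    let keys = vec![0, 1, 1, 2];
    assert_eq!(
        eq_dict_scalar(&values, &keys, 5),
        vec![false, true, true, false]
    );
}
```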
Ran the benchmarks with the pre-existing SIMD kernels, and they are roughly twice as fast, although they have different semantics for floats. As written, this PR will therefore represent a performance regression for users with the SIMD feature enabled. I can look to bring back the SIMD versions if people feel strongly, although my preference would be to avoid this complexity.
I read through this PR's code and tests carefully, and I really like the new formulation -- very nicely done @tustvold 👏
For other reviewers, the size of the diff might obscure the fact that the core logic of all the comparisons is now combined into several relatively easy to read generic functions that are all shared.
My biggest comment / suggestion is to make it easier to create Scalars from Rust types, now that Scalar plays much more prominently in the kernel API.
I do wonder if some of the compilation speedup is because arrow-rs no longer instantiates all the code for the different comparisons unless it is actually called. Thus the benefits for downstream application binary size may not be as significant (but it certainly seems like an improvement nonetheless).
cc @viirya and @jhorstmann
@@ -44,10 +44,3 @@ half = { version = "2.1", default-features = false, features = ["num-traits"] }

[dev-dependencies]
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }

[package.metadata.docs.rs]
features = ["dyn_cmp_dict"]
Is this an API change? (if someone was using arrow-ord directly)
I see that dyn_cmp_dict is still used in arrow / arrow-string
@@ -129,7 +129,8 @@ impl GetDbSchemasBuilder {
}

if let Some(catalog_filter_name) = catalog_filter {
filters.push(eq_utf8_scalar(&catalog_name, &catalog_filter_name)?);
let scalar = StringArray::from_iter_values([catalog_filter_name]);
As a user I find this construction somewhat awkward for creating a simple scalar.
What would you think about creating convenience functions that would let this code read like:
let scalar = Scalar::new_utf8(catalog_filter_name);
(doesn't have to be part of this PR, I can add it as a follow on ticket/PR)
I'll have a think about what we can do here
let s = UInt32Array::from(vec![tt]);
eq(arr, &Scalar::new(&s))
Similarly, I think this would read better like:
let s = Scalar::new_uint32(tt);
eq(arr, &s)
Proposal in #4704
@@ -43,6 +43,8 @@

//! ```
//!

pub mod cmp;
#[doc(hidden)]
What is the reason for hidden -- is it to discourage new uses of these functions? I also note you added "deprecated" notes to the functions in there, which should help people migrate.
So that it is clear where to look for comparison kernels
let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r);

let values = downcast_primitive_array! {
(l, r) => apply(op, l.values().as_ref(), l_s, l_v, r.values().as_ref(), r_s, r_v),
Just confirming that this handles Date, Time, Timestamp, etc. types, right?
match (l_s, r_s) {
(None, None) => {
assert_eq!(l.len(), r.len());
collect_bool(l.len(), neg, |idx| unsafe {
Is it worth adding a safety note here related to the fact that the index is coming from a valid array?
op: impl Fn(T::Item, T::Item) -> bool,
) -> BooleanBuffer {
assert_eq!(l_v.len(), r_v.len());
collect_bool(l_v.len(), neg, |idx| unsafe {
Same comment here about documenting the safety of using idx.
})
}

trait ArrayOrd {
Could you possibly add some more documentation to this trait?
}

fn is_eq(l: Self::Item, r: Self::Item) -> bool {
l.is_eq(r)
We don't make use of generics in the public interface, so it will instantiate all the variants regardless of whether they're called. This is actually unlike before, so the speedup is quite possibly even more significant.
I recommend leaving this PR open for another day or two to allow others the chance to comment (or request more time to comment).
I will probably look to get this in Friday morning GMT to make the next arrow release.
We are currently several arrow releases behind in our engine, so I can't easily test the performance impact on our internal benchmark queries. The compile-time improvements seem really nice though, and if needed we could implement comparison kernels for our limited set of datatypes using portable_simd.
Which issue does this PR close?
Closes #4596
Rationale for this change
Adds datum-based comparison kernels, deprecating the old kernels and drastically reducing the amount of generated code.
Benchmarks show no major performance regressions, and if anything non-trivial performance improvements.
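To illustrate the datum idea without an arrow dependency, here is a plain-Rust sketch (illustrative names, not the actual arrow-rs API): a single kernel entry point that accepts either an array or a scalar operand, instead of separate `eq` / `eq_scalar` / `eq_dyn` variants:

```rust
// A "datum" is either an array or a scalar; `get` returns the backing
// values plus a flag saying whether to broadcast them.
trait Datum {
    fn get(&self) -> (&[i64], bool);
}

struct ArrayDatum(Vec<i64>);
struct ScalarDatum(i64);

impl Datum for ArrayDatum {
    fn get(&self) -> (&[i64], bool) {
        (&self.0, false)
    }
}
impl Datum for ScalarDatum {
    fn get(&self) -> (&[i64], bool) {
        (std::slice::from_ref(&self.0), true)
    }
}

/// One comparison entry point for any array/scalar combination.
fn eq(l: &dyn Datum, r: &dyn Datum) -> Vec<bool> {
    let (lv, l_scalar) = l.get();
    let (rv, r_scalar) = r.get();
    let len = if l_scalar { rv.len() } else { lv.len() };
    (0..len)
        .map(|i| lv[if l_scalar { 0 } else { i }] == rv[if r_scalar { 0 } else { i }])
        .collect()
}

fn main() {
    let a = ArrayDatum(vec![1, 2, 3]);
    let s = ScalarDatum(2);
    // The same function handles array/scalar in either position.
    assert_eq!(eq(&a, &s), vec![false, true, false]);
    assert_eq!(eq(&s, &a), vec![false, true, false]);
}
```

Because the dispatch on array-vs-scalar happens inside one generic kernel, each comparison is instantiated once rather than once per operand combination, which is where the codegen reduction comes from.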
Additionally a release build on master with dyn_cmp_dict takes
But with the changes in this PR takes
What changes are included in this PR?
Are there any user-facing changes?