Consistently coerce dictionaries for arithmetic #6785
Conversation
Do we have any numbers that illustrate this claim?
I think @viirya added this feature, so perhaps he has more context.
They are on the linked ticket - apache/arrow-rs#4407
Hmm, I don't get the question. The dictionary behavior follows the original behavior, and it basically follows the kernel behavior too. I previously cleaned up the decimal type part, but the dictionary behavior is, I think, unchanged. For dictionaries, I think there is still a storage advantage in memory and on disk (e.g. shuffle); otherwise there would be no reason to have them. It might be true that dictionaries are slower to process in the computation kernels, but in terms of overall cost I'm not sure it is always a win to get rid of them, unless we have something that can automatically detect dictionary sparsity and convert to dictionary encoding before writing to storage.
I would like to remove the special case support for dictionaries as it already causes an inordinate amount of code complexity, and that is without it properly covering cases like temporal arithmetic. The question is therefore can we remove the explicit support in favor of coercion or is it really important to some use-case?
It can only ever be faster where the dictionary key is smaller than the primitive, e.g. an Int8 key dictionary over Int64 values.
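As purely illustrative arithmetic for that claim (not code from this PR), a kernel that operates on the keys of a `Dictionary(Int8, Int64)` touches one byte per row instead of eight:

```rust
use std::mem::size_of;

// Illustrative only: bytes a kernel scans per row when operating on
// Int8 dictionary keys versus the decoded Int64 values.
fn main() {
    let rows = 1_000_000;
    println!("key bytes scanned:   {}", rows * size_of::<i8>()); // 1 MB
    println!("value bytes scanned: {}", rows * size_of::<i64>()); // 8 MB
}
```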
I think this change makes sense, but I defer to @viirya -- if he is ok with it in principle I will review the PR carefully.
I agree this can largely reduce code complexity. My only concern is whether this is a clear win, or whether it is a win for some cases but not all. Dictionary processing might be slower for some cases, but for others I think it might be faster, e.g. dictionary + scalar, which I remember computes directly on the dictionary values. Not to mention there is an extra coercion (casting) operation. Also, I'm not sure about selection kernels; do we have them in DataFusion? I guess I'm okay with this (+0?), but we might need to look at the performance impact, if any.
The benchmarks on apache/arrow-rs#4407 show that for the wrapping addition used by DataFusion, casting first is faster than the logic for operating on the dictionary. It should be noted this is in part because of the inefficiency of the way this is currently implemented in arrow-rs, with the checked kernel variant performing better; however, as DataFusion does not use that variant, the change in this PR should result in a performance improvement for DF. That said, consistency is imo the biggest justification for this change. Aside from arithmetic on dictionaries not supporting the full set of types (e.g. temporal), the coercion behaviour is rather inconsistent: if the left argument is a dictionary and the right argument is not, both will be coerced to a dictionary and a dictionary kernel used; in all other cases main will coerce both to primitives. This makes me think it is unlikely this code is actually being used in practice.
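To make that inconsistency concrete, here is a hedged sketch of the decision described above; the function name and the simplified fallback arm are illustrative, not the actual DataFusion coercion code:

```rust
use arrow::datatypes::DataType;

// Illustrative sketch of the pre-#6785 coercion behaviour described
// above; not the real function, which handles many more cases.
fn old_coerced_type(lhs: &DataType, rhs: &DataType) -> DataType {
    use DataType::*;
    match (lhs, rhs) {
        // Dictionary on the left, non-dictionary on the right: both sides
        // were wrapped in a dictionary and a dictionary kernel was used.
        (Dictionary(key, value), r) if !matches!(r, Dictionary(..)) => {
            Dictionary(key.clone(), value.clone())
        }
        // Every other combination (including dictionary on the right!)
        // coerced both sides to the primitive value type instead.
        (Dictionary(_, value), _) | (_, Dictionary(_, value)) => (**value).clone(),
        // Grossly simplified fallback for this sketch.
        (l, _) => l.clone(),
    }
}
```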
I also looked at the issue/PRs that added these kernels, #5193 and #5194.
My reading of those PRs is that the goal was being able to calculate on dictionary-encoded arrays at all (i.e. to support it), rather than a use case where doing the dictionary calculation natively was critical. Therefore, since this PR still supports the calculation (via casting to primitive and calling the primitive kernels), I think it doesn't undo any of that work.
As @viirya has already looked at this, I think we can merge it. Thank you @tustvold for your work cleaning up and keeping DataFusion running cleanly and efficiently.
```diff
@@ -226,13 +226,13 @@ fn math_decimal_coercion(
     use arrow::datatypes::DataType::*;

     match (lhs_type, rhs_type) {
-        (Dictionary(key_type, value_type), _) => {
+        (Dictionary(_, value_type), _) => {
```
Maybe it is worth a comment here explaining that we are expanding dictionaries out to the underlying value type, with a pointer back to this PR / a summary of #6785 (comment).
I think that is critical context
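As a hedged illustration of what such a documented coercion might look like (the function name is hypothetical, not the actual DataFusion source):

```rust
use arrow::datatypes::DataType;

/// Hypothetical sketch: dictionaries are expanded out to their
/// underlying value type before arithmetic coercion (see #6785).
fn unwrap_dictionary(data_type: &DataType) -> &DataType {
    match data_type {
        // The key type is irrelevant once the array is cast to its
        // primitive value type, so only the value type is kept.
        DataType::Dictionary(_, value_type) => value_type.as_ref(),
        other => other,
    }
}
```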
```diff
-    let left_expr = try_cast(col("a", schema)?, schema, lhs)?;
-    let right_expr = try_cast(col("b", schema)?, schema, rhs)?;
-    let arithmetic_op = binary_simple(left_expr, op, right_expr, schema);
+    let op = binary_op(col("a", schema)?, op, col("b", schema)?, schema)?;
```
this is a much nicer formulation
Yea, it just casts to primitive before the arithmetic operations, so dictionary arrays still go through them. From an end-to-end perspective, it might not be noticeable if you don't look at the output array type.
For performance, I'd look more at end-to-end query benchmarks instead of kernel benchmarks only. Arithmetic operations are only a small piece of query execution time (µs/ms vs. minutes/hours), I think. That is the concern I mentioned above: this is not simply a kernel change, but may also impact other operations in the query engine. We will also benchmark the new DataFusion release when we upgrade our internal integration.
To clarify, this PR does not alter the return type of the evaluation; the output will still always be primitive. I therefore don't foresee this having downstream implications. This PR only changes the way that DF evaluates these expressions internally.
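A hedged end-to-end sketch of what the new evaluation path amounts to, assuming a recent arrow-rs (this uses the arrow-rs public kernels directly, not DataFusion internals):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray, Int64Array, Int8Array};
use arrow::compute::{cast, kernels::numeric::add};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Sketch of cast-then-compute: expand the dictionary input to its value
// type, run the primitive kernel, and observe the output is primitive.
fn main() -> Result<(), ArrowError> {
    let keys = Int8Array::from(vec![0_i8, 1, 1, 0]);
    let values: ArrayRef = Arc::new(Int64Array::from(vec![10_i64, 20]));
    let dict = DictionaryArray::try_new(keys, values)?;

    // Expand Dictionary(Int8, Int64) to its underlying Int64 form.
    let lhs = cast(&dict, &DataType::Int64)?;
    let rhs: ArrayRef = Arc::new(Int64Array::from(vec![1_i64, 1, 1, 1]));

    let sum = add(&lhs, &rhs)?;
    assert_eq!(sum.data_type(), &DataType::Int64); // still primitive
    Ok(())
}
```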
* Coerce dictionaries for arithmetic
* Clippy
Which issue does this PR close?
Closes #.
Rationale for this change
The support for arithmetic on dictionaries is inconsistent, results in a huge amount of generated code, and doesn't yield significant performance savings. There is an upstream PR proposing to remove this functionality - apache/arrow-rs#4407
In particular, as far as I can tell the current logic has the following behaviour:

* If the left argument is a dictionary and the right argument is not, both are coerced to a dictionary and a dictionary kernel is used
* In all other cases, both arguments are coerced to primitives and the primitive kernels are used

By coercing the dictionaries we can avoid all this complexity and provide a more consistent user experience.
As an aside, it is unclear why you would ever use a dictionary of primitives: it will almost always be slower to process, and especially once you factor in dictionary sparsity, it may be significantly larger (see the sketch below).
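A back-of-envelope sketch of that size argument, with illustrative numbers and ignoring validity bitmaps:

```rust
// Illustrative arithmetic only: approximate buffer sizes for a
// Dictionary(Int32, Int64) column versus a plain Int64 column.
fn dict_bytes(rows: usize, distinct: usize) -> usize {
    rows * 4 + distinct * 8 // Int32 keys + Int64 dictionary values
}

fn plain_bytes(rows: usize) -> usize {
    rows * 8 // Int64 values
}

fn main() {
    // Dense dictionary (10 distinct values): encoding wins.
    assert!(dict_bytes(1_000, 10) < plain_bytes(1_000)); // 4_080 < 8_000
    // Sparse dictionary (900 distinct values): encoding is larger.
    assert!(dict_bytes(1_000, 900) > plain_bytes(1_000)); // 11_200 > 8_000
}
```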
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?