-
Notifications
You must be signed in to change notification settings - Fork 842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Binary Dictionary Arithmetic Support #4407
Remove Binary Dictionary Arithmetic Support #4407
Conversation
Agree that the performance benefit of specialized kernels is probably not worth the complexity and added code.
This kind of makes sense to me, for many operations involving scalars, the dictionary would still be unique afterwards, while an operation with two dictionaries would lead to combinatoric explosion and no longer is beneficial to dictionary encode the results. Operations like In our engine we had a similar issue with string replace or concat operations, where we decided that such operations on two dictionary arrays would always result in a string array, but with dictionary array and literal string it would be beneficial to build a new dictionary. I did not review the code in detail, maybe this is already happening, but could the dyn kernels automatically downcast/materialize dictionary arrays so that dictionary arrays are still supported as inputs? |
I think I would prefer that this was delegated to the query engines type coercion machinery, to ensure this is visible to the planner and avoid unnecessary casts back and forth. A PR to do this in DataFusion can be found - apache/datafusion#6785 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could soften the blow to removing this functionality by adding some documentation somewhere that explains how calculate on dictionaries (namely cast them to the underlying type).
Perhaps some overview discussion on https://docs.rs/arrow/latest/arrow/compute/kernels/index.html or https://docs.rs/arrow/latest/arrow/compute/kernels/arithmetic/index.html?
Or perhaps on the kernels themselves like https://docs.rs/arrow/latest/arrow/compute/kernels/arithmetic/fn.add_dyn.html 🤔
I think @viirya should also have a chance to review / comment on this prior to merge. The justification as i understand it is that the primitive kernels are much faster anyways for this kind of operation and so including native dictionary creation is both slower as well as larger code and harder to work with |
I will find some time looking at this soon if you are not hurry to merge this in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For kernel-level change, this looks good to me as the performance numbers don't show regression.
I'm also thinking about the question around dictionary of primitives. But it is not an issue of this change. Because even if they still have some advantages on other aspects, for arithmetic operations we still can cast them to primitives without performance regression.
Which issue does this PR close?
Relates to #3999
Rationale for this change
As part of #3999 I'm trying to improve the consistency and correctness of the arithmetic kernels, however, I am repeatedly bashing my head against the dictionary support and therefore wanted to float this idea to see what people think.
My major rationale is:
I think what would help me be less frustrated bashing my head against this would be some motivating use-case for this functionality, currently I can't see a compelling reason to ever use a DictionaryArray of primitives for query computation, they're almost always just worse
Performance of arithmetic using this feature, vs just casting first, run using (#4405)
What changes are included in this PR?
Are there any user-facing changes?
Yes, I suspect this will have downstream implications. Tagging @alamb @viirya @wjones127 @jhorstmann