Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistently coerce dictionaries for arithmetic #6785

Merged
merged 3 commits into from
Jun 30, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 28, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - apache/arrow-rs#4407

In particular as far as I can tell the current logic has the following behaviour:

  • Dictionary + Dictionary => Non-Dictionary (or error if feature not enabled)
  • Dictionary + Scalar Dictionary => Dictionary
  • Dictionary + Non-Dictionary => Dictionary
  • Non-Dictionary + Dictionary => Non-Dictionary

By coercing the dictionaries we can avoid all this complexity, and provide a more consistent user experience

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 28, 2023
@tustvold tustvold requested a review from viirya June 28, 2023 09:11
@alamb
Copy link
Contributor

alamb commented Jun 28, 2023

The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - apache/arrow-rs#4407

Do we have any numbers that illustrate this claim?

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger

I think @viirya added this feature so perhaps he has more context

@tustvold
Copy link
Contributor Author

Do we have any numbers that illustrate this claim?

They are on the linked ticket - apache/arrow-rs#4407

@viirya
Copy link
Member

viirya commented Jun 28, 2023

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger
I think @viirya added this feature so perhaps he has more context

Hmm, I don't get the question. For the dictionary behavior, it follows original behavior. It basically follows kernel behavior too. Previously I cleaned up the decimal type part, but for dictionary behavior, I think it is unchanged.

I think for dictionary, it still has storage advantage in memory and disk (e.g. shuffle), otherwise it has no reason to have it. It might be true that dictionary is slower to process in computation kernel, but for overall cost I'm not sure it is always a win to get rid of dictionary. Except we have something that can automatically detect dictionary sparsity and convert to dictionary before outputting to storage.

@tustvold
Copy link
Contributor Author

tustvold commented Jun 28, 2023

Hmm, I don't get the question.

I would like to remove the special case support for dictionaries as it already causes an inordinate amount of code complexity, and that is without it properly covering cases like temporal arithmetic. The question is therefore can we remove the explicit support in favor of coercion or is it really important to some use-case?

I think for dictionary, it still has storage advantage in memory and disk (e.g. shuffle),

It can only ever be faster where the dictionary key is smaller than the primitive, e.g. DictionaryArray<Int32Type, PrimitiveArray<Int32Type>> will always be larger than PrimitiveArray<Int32Type>. In practice as the selection kernels do not recompute dictionaries (for good reason) even for DictionaryArray<Int32Type, PrimitiveArray<Decimal256Type>> the dictionary representation will likely be larger and more expensive to handle, especially for comparison and arithmetic kernels that evaluate on dictionary values directly.

@alamb
Copy link
Contributor

alamb commented Jun 28, 2023

I think this change makes sense, but I defer to @viirya -- if he is ok with it in principle I will review the PR carefully

@viirya
Copy link
Member

viirya commented Jun 28, 2023

I think I agree with this can largely reduce code complexity. My little concern might be if this is a clear win? Or it is probably a win for some cases but not for all? Dictionary process might be slower for some cases, but for some I think it might be faster, e.g., dictionary + scalar which I remember is directly computing on dictionary values. Not to mention there is extra coercion (casting) operation.

Also I'm not sure about selection kernels, do we have them in DataFusion?

I guess I'm okay with this (+0?), but we might need to look at performance impact if any.

@tustvold
Copy link
Contributor Author

tustvold commented Jun 29, 2023

but we might need to look at performance impact if any.

The benchmarks on apache/arrow-rs#4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary. It should be noted this is in part because of the inefficiency of the current way this is implemented in arrow-rs, with the checked kernel variant performing better, however, as DataFusion does not use this, the change in this PR should result in a performance improvement for DF.

However, consistency is imo the biggest justification for this change, aside from the issues with arithmetic on dictionaries not supporting the full set of types (e.g. temporal), the coercion behaviour is rather inconsistent. In particular if the left argument is a dictionary and the right argument is not both will be coerced to a dictionary and a dictionary kernel used, in all other cases main will coerce both to primitives. This makes me think it unlikely this code is actually being used in practice.

❯ create table foo as select (arrow_cast('1', 'Dictionary(Int32, Int32)')) as foo;
0 rows in set. Query took 0.003 seconds.
❯ select arrow_typeof(foo) from foo;
+--------------------------+
| arrow_typeof(foo.foo)    |
+--------------------------+
| Dictionary(Int32, Int32) |
+--------------------------+
1 row in set. Query took 0.002 seconds.
❯ explain select foo / 23 from foo;
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: CAST(foo.foo AS Dictionary(Int32, Int64)) / Dictionary(Int32, Int64(23)) AS foo.foo / Int64(23) |
|               |   TableScan: foo projection=[foo]                                                                           |
| physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Dictionary(Int32, Int64)) / 23 as foo.foo / Int64(23)]                  |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                             |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

❯ explain select 23 / foo from foo;
+---------------+-------------------------------------------------------------------------+
| plan_type     | plan                                                                    |
+---------------+-------------------------------------------------------------------------+
| logical_plan  | Projection: Int64(23) / CAST(foo.foo AS Int64)                          |
|               |   TableScan: foo projection=[foo]                                       |
| physical_plan | ProjectionExec: expr=[23 / CAST(foo@0 AS Int64) as Int64(23) / foo.foo] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                         |
|               |                                                                         |
+---------------+-------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

❯ explain select foo / foo from foo;
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                          |
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo / CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo |
|               |   TableScan: foo projection=[foo]                                                                                             |
| physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Int32) / CAST(foo@0 AS Int32) as foo.foo / foo.foo]                                       |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                               |
|               |                                                                                                                               |
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

@tustvold tustvold changed the title Coerce dictionaries for arithmetic Consistently coerce dictionaries for arithmetic Jun 29, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also looked at the issue/pr that added these kernels, #5193 and #5194

My reading of those PRs is that the issue was being able to calculate on dictionary encoded arrays (e.g. to support it) rather than a usecase where doing the dictionary calculation natively was critical. Therefore since this PR still supports the calculation (via casting to primitive and calling the primitive kernels) I think it doesn't undo any of that work

As @viirya has already looked at this, I think we can merge this in. Thank you for your work @tustvold to clean up and keep DataFusion running cleanly and efficiently

@@ -226,13 +226,13 @@ fn math_decimal_coercion(
use arrow::datatypes::DataType::*;

match (lhs_type, rhs_type) {
(Dictionary(key_type, value_type), _) => {
(Dictionary(_, value_type), _) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is worth a comment here explaining we are expanding out dictionaries to the underlying type and leave a comment pointing back at this PR / a summary of #6785 (comment)

I think that is critical context

let left_expr = try_cast(col("a", schema)?, schema, lhs)?;
let right_expr = try_cast(col("b", schema)?, schema, rhs)?;
let arithmetic_op = binary_simple(left_expr, op, right_expr, schema);
let op = binary_op(col("a", schema)?, op, col("b", schema)?, schema)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a much nicer formulation

@viirya
Copy link
Member

viirya commented Jun 29, 2023

Therefore since this PR still supports the calculation (via casting to primitive and calling the primitive kernels) I think it doesn't undo any of that work

Yea, it just casts to primitive before arithmetic operations so dictionary arrays are still going through them. From end-to-end perspective, it might be not noticeable if you don't look at the output array type.

@viirya
Copy link
Member

viirya commented Jun 29, 2023

The benchmarks on apache/arrow-rs#4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary ..., the change in this PR should result in a performance improvement for DF.

For performance, I'd talk more about end-to-end query benchmark instead of kernel benchmark only. Arithmetic operations are only small pieces in the query execution time (µs/ms v.s. minute/hour), I think. That is what I mentioned above on my concern as this is not simply kernel change but also possible impact to other operations in the query engine. We will also do benchmark on new DataFusion release when we are going to upgrade internal integration.

@tustvold
Copy link
Contributor Author

but also possible impact to other operations in the query engine

To clarify this PR does not alter the return type of the evaluation, the output will still always be primitive. I therefore don't foresee this having downstream implications.

This PR only changes the way that DF evaluates dictionary op primitive. Now instead of coercing the arguments to dictionaries it coerces them to primitives. This is now consistent with the existing behaviour for all other combinations of dictionaries and primitives, dictionary op dictionary, primitive op dictionary and primitive op primitive, which coerce to primitives on main.

@tustvold tustvold merged commit 2f78536 into apache:main Jun 30, 2023
2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023
* Coerce dictionaries for arithmetic

* Clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants