Consistently coerce dictionaries for arithmetic #6785

tustvold · 2023-06-28T09:03:51Z

Which issue does this PR close?

Closes #.

Rationale for this change

The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - apache/arrow-rs#4407

In particular as far as I can tell the current logic has the following behaviour:

Dictionary + Dictionary => Non-Dictionary (or error if feature not enabled)
Dictionary + Scalar Dictionary => Dictionary
Dictionary + Non-Dictionary => Dictionary
Non-Dictionary + Dictionary => Non-Dictionary

By coercing the dictionaries we can avoid all this complexity, and provide a more consistent user experience

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb · 2023-06-28T10:12:51Z

The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - apache/arrow-rs#4407

Do we have any numbers that illustrate this claim?

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger

I think @viirya added this feature so perhaps he has more context

tustvold · 2023-06-28T10:13:20Z

Do we have any numbers that illustrate this claim?

They are on the linked ticket - apache/arrow-rs#4407

viirya · 2023-06-28T15:48:56Z

As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger
I think @viirya added this feature so perhaps he has more context

Hmm, I don't get the question. For the dictionary behavior, it follows original behavior. It basically follows kernel behavior too. Previously I cleaned up the decimal type part, but for dictionary behavior, I think it is unchanged.

I think for dictionary, it still has storage advantage in memory and disk (e.g. shuffle), otherwise it has no reason to have it. It might be true that dictionary is slower to process in computation kernel, but for overall cost I'm not sure it is always a win to get rid of dictionary. Except we have something that can automatically detect dictionary sparsity and convert to dictionary before outputting to storage.

tustvold · 2023-06-28T15:52:19Z

Hmm, I don't get the question.

I would like to remove the special case support for dictionaries as it already causes an inordinate amount of code complexity, and that is without it properly covering cases like temporal arithmetic. The question is therefore can we remove the explicit support in favor of coercion or is it really important to some use-case?

I think for dictionary, it still has storage advantage in memory and disk (e.g. shuffle),

It can only ever be faster where the dictionary key is smaller than the primitive, e.g. DictionaryArray<Int32Type, PrimitiveArray<Int32Type>> will always be larger than PrimitiveArray<Int32Type>. In practice as the selection kernels do not recompute dictionaries (for good reason) even for DictionaryArray<Int32Type, PrimitiveArray<Decimal256Type>> the dictionary representation will likely be larger and more expensive to handle, especially for comparison and arithmetic kernels that evaluate on dictionary values directly.

alamb · 2023-06-28T18:17:13Z

I think this change makes sense, but I defer to @viirya -- if he is ok with it in principle I will review the PR carefully

viirya · 2023-06-28T20:24:44Z

I think I agree with this can largely reduce code complexity. My little concern might be if this is a clear win? Or it is probably a win for some cases but not for all? Dictionary process might be slower for some cases, but for some I think it might be faster, e.g., dictionary + scalar which I remember is directly computing on dictionary values. Not to mention there is extra coercion (casting) operation.

Also I'm not sure about selection kernels, do we have them in DataFusion?

I guess I'm okay with this (+0?), but we might need to look at performance impact if any.

tustvold · 2023-06-29T11:31:14Z

but we might need to look at performance impact if any.

The benchmarks on apache/arrow-rs#4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary. It should be noted this is in part because of the inefficiency of the current way this is implemented in arrow-rs, with the checked kernel variant performing better, however, as DataFusion does not use this, the change in this PR should result in a performance improvement for DF.

However, consistency is imo the biggest justification for this change, aside from the issues with arithmetic on dictionaries not supporting the full set of types (e.g. temporal), the coercion behaviour is rather inconsistent. In particular if the left argument is a dictionary and the right argument is not both will be coerced to a dictionary and a dictionary kernel used, in all other cases main will coerce both to primitives. This makes me think it unlikely this code is actually being used in practice.

❯ create table foo as select (arrow_cast('1', 'Dictionary(Int32, Int32)')) as foo;
0 rows in set. Query took 0.003 seconds.
❯ select arrow_typeof(foo) from foo;
+--------------------------+
| arrow_typeof(foo.foo)    |
+--------------------------+
| Dictionary(Int32, Int32) |
+--------------------------+
1 row in set. Query took 0.002 seconds.
❯ explain select foo / 23 from foo;
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: CAST(foo.foo AS Dictionary(Int32, Int64)) / Dictionary(Int32, Int64(23)) AS foo.foo / Int64(23) |
|               |   TableScan: foo projection=[foo]                                                                           |
| physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Dictionary(Int32, Int64)) / 23 as foo.foo / Int64(23)]                  |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                             |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

❯ explain select 23 / foo from foo;
+---------------+-------------------------------------------------------------------------+
| plan_type     | plan                                                                    |
+---------------+-------------------------------------------------------------------------+
| logical_plan  | Projection: Int64(23) / CAST(foo.foo AS Int64)                          |
|               |   TableScan: foo projection=[foo]                                       |
| physical_plan | ProjectionExec: expr=[23 / CAST(foo@0 AS Int64) as Int64(23) / foo.foo] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                         |
|               |                                                                         |
+---------------+-------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

❯ explain select foo / foo from foo;
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                          |
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo / CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo |
|               |   TableScan: foo projection=[foo]                                                                                             |
| physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Int32) / CAST(foo@0 AS Int32) as foo.foo / foo.foo]                                       |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                               |
|               |                                                                                                                               |
+---------------+-------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

alamb

I also looked at the issue/pr that added these kernels, #5193 and #5194

My reading of those PRs is that the issue was being able to calculate on dictionary encoded arrays (e.g. to support it) rather than a usecase where doing the dictionary calculation natively was critical. Therefore since this PR still supports the calculation (via casting to primitive and calling the primitive kernels) I think it doesn't undo any of that work

As @viirya has already looked at this, I think we can merge this in. Thank you for your work @tustvold to clean up and keep DataFusion running cleanly and efficiently

alamb · 2023-06-29T16:46:48Z

datafusion/expr/src/type_coercion/binary.rs

@@ -226,13 +226,13 @@ fn math_decimal_coercion(
    use arrow::datatypes::DataType::*;

    match (lhs_type, rhs_type) {
-        (Dictionary(key_type, value_type), _) => {
+        (Dictionary(_, value_type), _) => {


maybe it is worth a comment here explaining we are expanding out dictionaries to the underlying type and leave a comment pointing back at this PR / a summary of #6785 (comment)

I think that is critical context

alamb · 2023-06-29T16:47:52Z

datafusion/physical-expr/src/expressions/binary.rs

-        let left_expr = try_cast(col("a", schema)?, schema, lhs)?;
-        let right_expr = try_cast(col("b", schema)?, schema, rhs)?;
-        let arithmetic_op = binary_simple(left_expr, op, right_expr, schema);
+        let op = binary_op(col("a", schema)?, op, col("b", schema)?, schema)?;


this is a much nicer formulation

viirya · 2023-06-29T19:11:35Z

Therefore since this PR still supports the calculation (via casting to primitive and calling the primitive kernels) I think it doesn't undo any of that work

Yea, it just casts to primitive before arithmetic operations so dictionary arrays are still going through them. From end-to-end perspective, it might be not noticeable if you don't look at the output array type.

viirya · 2023-06-29T19:14:53Z

The benchmarks on apache/arrow-rs#4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary ..., the change in this PR should result in a performance improvement for DF.

For performance, I'd talk more about end-to-end query benchmark instead of kernel benchmark only. Arithmetic operations are only small pieces in the query execution time (µs/ms v.s. minute/hour), I think. That is what I mentioned above on my concern as this is not simply kernel change but also possible impact to other operations in the query engine. We will also do benchmark on new DataFusion release when we are going to upgrade internal integration.

tustvold · 2023-06-30T10:01:37Z

but also possible impact to other operations in the query engine

To clarify this PR does not alter the return type of the evaluation, the output will still always be primitive. I therefore don't foresee this having downstream implications.

This PR only changes the way that DF evaluates dictionary op primitive. Now instead of coercing the arguments to dictionaries it coerces them to primitives. This is now consistent with the existing behaviour for all other combinations of dictionaries and primitives, dictionary op dictionary, primitive op dictionary and primitive op primitive, which coerce to primitives on main.

* Coerce dictionaries for arithmetic * Clippy

Coerce dictionaries for arithmetic

3e43960

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 28, 2023

tustvold requested a review from viirya June 28, 2023 09:11

tustvold mentioned this pull request Jun 28, 2023

Remove Binary Dictionary Arithmetic Support apache/arrow-rs#4407

Merged

Clippy

0cc90c2

Merge remote-tracking branch 'upstream/main' into dictionary-arithmetic

b3f164e

tustvold changed the title ~~Coerce dictionaries for arithmetic~~ Consistently coerce dictionaries for arithmetic Jun 29, 2023

alamb approved these changes Jun 29, 2023

View reviewed changes

viirya mentioned this pull request Jun 29, 2023

Remove multiply_fixed_point_dyn and use upstream kernel directly #6802

Closed

tustvold merged commit 2f78536 into apache:main Jun 30, 2023

2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023

Consistently coerce dictionaries for arithmetic (apache#6785)

e1233f4

* Coerce dictionaries for arithmetic * Clippy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistently coerce dictionaries for arithmetic #6785

Consistently coerce dictionaries for arithmetic #6785

tustvold commented Jun 28, 2023 •

edited

Loading

alamb commented Jun 28, 2023 •

edited

Loading

tustvold commented Jun 28, 2023

viirya commented Jun 28, 2023

tustvold commented Jun 28, 2023 •

edited

Loading

alamb commented Jun 28, 2023

viirya commented Jun 28, 2023

tustvold commented Jun 29, 2023 •

edited

Loading

alamb left a comment

alamb Jun 29, 2023

alamb Jun 29, 2023

viirya commented Jun 29, 2023

viirya commented Jun 29, 2023 •

edited

Loading

tustvold commented Jun 30, 2023

Consistently coerce dictionaries for arithmetic #6785

Consistently coerce dictionaries for arithmetic #6785

Conversation

tustvold commented Jun 28, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb commented Jun 28, 2023 • edited Loading

tustvold commented Jun 28, 2023

viirya commented Jun 28, 2023

tustvold commented Jun 28, 2023 • edited Loading

alamb commented Jun 28, 2023

viirya commented Jun 28, 2023

tustvold commented Jun 29, 2023 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

alamb Jun 29, 2023

Choose a reason for hiding this comment

alamb Jun 29, 2023

Choose a reason for hiding this comment

viirya commented Jun 29, 2023

viirya commented Jun 29, 2023 • edited Loading

tustvold commented Jun 30, 2023

tustvold commented Jun 28, 2023 •

edited

Loading

alamb commented Jun 28, 2023 •

edited

Loading

tustvold commented Jun 28, 2023 •

edited

Loading

tustvold commented Jun 29, 2023 •

edited

Loading

viirya commented Jun 29, 2023 •

edited

Loading