Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type coercion for UDFs in logical plan #3254

Merged
merged 5 commits into from
Sep 7, 2022

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Aug 24, 2022

Which issue does this PR close?

Related to #2355
Follows on from #3222 and #3250

Rationale for this change

We currently perform type coercion for UDFs in the physical planner/optimizer. I would like to have this logic in the logical plan so that other projects (such as Dask SQL) can benefit from this.

What changes are included in this PR?

  • Update the TypeCoercion rule to support Scalar UDFs

Are there any user-facing changes?

Yes, logical plans will be different in some cases.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions labels Aug 24, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 24, 2022

Codecov Report

Merging #3254 (55c8d13) into master (c89b10f) will increase coverage by 0.01%.
The diff coverage is 97.72%.

@@            Coverage Diff             @@
##           master    #3254      +/-   ##
==========================================
+ Coverage   85.58%   85.59%   +0.01%     
==========================================
  Files         296      296              
  Lines       54179    54231      +52     
==========================================
+ Hits        46367    46418      +51     
- Misses       7812     7813       +1     
Impacted Files Coverage Δ
datafusion/optimizer/src/type_coercion.rs 98.96% <97.72%> (-1.04%) ⬇️
datafusion/common/src/scalar.rs 85.06% <0.00%> (-0.10%) ⬇️
datafusion/core/src/catalog/information_schema.rs 94.37% <0.00%> (-0.05%) ⬇️
datafusion/core/tests/sql/mod.rs 97.77% <0.00%> (ø)
datafusion/expr/src/logical_plan/plan.rs 77.28% <0.00%> (ø)
datafusion/core/src/physical_plan/explain.rs 66.66% <0.00%> (ø)
datafusion/physical-expr/src/aggregate/sum.rs 69.79% <0.00%> (ø)
datafusion/physical-expr/src/array_expressions.rs 50.00% <0.00%> (ø)
datafusion/physical-expr/src/expressions/binary.rs 97.71% <0.00%> (ø)
...tafusion/physical-expr/src/expressions/try_cast.rs 98.79% <0.00%> (ø)
... and 4 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@andygrove andygrove changed the title Early draft: Add type coercion for UDFs in logical plan Add type coercion for UDFs in logical plan Sep 6, 2022
@github-actions github-actions bot removed physical-expr Physical Expressions logical-expr Logical plan and expressions core Core DataFusion crate labels Sep 6, 2022
@andygrove andygrove marked this pull request as ready for review September 6, 2022 14:09
@andygrove
Copy link
Member Author

@alamb Here is another type coercion rule that we can now add

@andygrove andygrove requested a review from liukun4515 September 6, 2022 14:10
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks great @andygrove

This will likely conflict with #3379

For what it is worth, I would love to move all type coercion rules out of physical planning and into this phase (aka consolidate all coercion)

/// `signature`, if possible.
///
/// See the module level documentation for more detail on coercion.
pub fn coerce_arguments_for_signature(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this code should already exist somewhere and it would be great to consolidate into a single implementation rather than have multiple implementations around

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is adapted from the coerce method in the physical-expr crate that operates on physical expressions rather than logical expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to public this method?
I think it's better to make it private.

let mut config = OptimizerConfig::default();
let plan = rule.optimize(&plan, &mut config)?;
assert_eq!(
"Projection: TestScalarUDF(CAST(Int32(123) AS Float32))\n EmptyRelation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@liukun4515
Copy link
Contributor

@andygrove @alamb Can we give a plan for the migration of type coercion?

@liukun4515
Copy link
Contributor

👍 looks great @andygrove

This will likely conflict with #3379

For what it is worth, I would love to move all type coercion rules out of physical planning and into this phase (aka consolidate all coercion)

we can do this after migrating the type coercion from the physical phase to logical phase.

@liukun4515
Copy link
Contributor

I will review it tomorrow, it's too later for me today

@alamb
Copy link
Contributor

alamb commented Sep 7, 2022

@andygrove @alamb Can we give a plan for the migration of type coercion?

@liukun4515 there is #2355 (which I will also add to the description of this PR). Is that what you had in mind?

@alamb alamb mentioned this pull request Sep 7, 2022
16 tasks
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good -- let's wait for @liukun4515 to review and then merge it in

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
only comments about public/private api

/// `signature`, if possible.
///
/// See the module level documentation for more detail on coercion.
pub fn coerce_arguments_for_signature(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to public this method?
I think it's better to make it private.

@liukun4515
Copy link
Contributor

liukun4515 commented Sep 7, 2022

After the refactor of the type coercion, do we need to forbid the creation of physical expr directly?

@alamb
Copy link
Contributor

alamb commented Sep 7, 2022

After the refactor of the type coercion, do we need to forbid to creation of physical expr directly?

I would personally prefer allow creating PhysicalExprs but not providing automatic coercion (leaving it up to the user to use the proper types)

@liukun4515
Copy link
Contributor

After the refactor of the type coercion, do we need to forbid to creation of physical expr directly?

I would personally prefer allow creating PhysicalExprs but not providing automatic coercion (leaving it up to the user to use the proper types)

A+B will get the common data type C in the logical phase, but it will also affected in the physical phase.
C+C will convert the D data type, but I can't find this changes after this #3222 merged.
I want to find out why

@liukun4515
Copy link
Contributor

After the refactor of the type coercion, do we need to forbid to creation of physical expr directly?

I would personally prefer allow creating PhysicalExprs but not providing automatic coercion (leaving it up to the user to use the proper types)

A+B will get the common data type C in the logical phase, but it will also affected in the physical phase. C+C will convert the D data type, but I can't find this changes after this #3222 merged. I want to find out why

I think I find a bug https://github.com/apache/arrow-datafusion/blob/c359018baa8bbb0a227e83df948c903cde4d701f/datafusion/expr/src/binary_rule.rs#L293 for the type coercion in arithmetic op.

@andygrove @alamb

If we move the type coercion to the logical phase, the type coercion will apply the binary op twice.
For the arithmetic operation, the decimal128(10,0) + decimal128(10,0) will get decimal128(11,0).
If the rule was applied to the op twice, the result type should be decimal(12,0);

@andygrove
Copy link
Member Author

After the refactor of the type coercion, do we need to forbid to creation of physical expr directly?

I would personally prefer allow creating PhysicalExprs but not providing automatic coercion (leaving it up to the user to use the proper types)

A+B will get the common data type C in the logical phase, but it will also affected in the physical phase. C+C will convert the D data type, but I can't find this changes after this #3222 merged. I want to find out why

I think I find a bug

https://github.com/apache/arrow-datafusion/blob/c359018baa8bbb0a227e83df948c903cde4d701f/datafusion/expr/src/binary_rule.rs#L293
for the type coercion in arithmetic op.

@andygrove @alamb

If we move the type coercion to the logical phase, the type coercion will apply the binary op twice. For the arithmetic operation, the decimal128(10,0) + decimal128(10,0) will get decimal128(11,0). If the rule was applied to the op twice, the result type should be decimal(12,0);

Thanks @liukun4515. I have filed #3388 to track this. I will look at this next.

@andygrove andygrove merged commit 7c04964 into apache:master Sep 7, 2022
@andygrove andygrove deleted the type-coercion-udf branch September 7, 2022 16:01
@ursabot
Copy link

ursabot commented Sep 7, 2022

Benchmark runs are scheduled for baseline = e6d1364 and contender = 7c04964. 7c04964 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants