Bug/union wrong casting #5342

Merged
merged 5 commits into from
Feb 28, 2023

Contributor

@berkaysynnada berkaysynnada commented Feb 20, 2023

Which issue does this PR close?

Closes #5212.

Rationale for this change

When UNION is applied to columns whose integer types differ in signedness or width, some cases fail with a cast error (described in the issue), even though they can be handled correctly.

What changes are included in this PR?

The existing casting match arms are modified so that the resulting column type is the narrowest type that does not cause a cast error.

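|        | Int64 | Int32 | Int16 | Int8  | UInt64 | UInt32 | UInt16 | UInt8  |
|--------|-------|-------|-------|-------|--------|--------|--------|--------|
| Int64  | Int64 | Int64 | Int64 | Int64 | Int64  | Int64  | Int64  | Int64  |
| Int32  | Int64 | Int32 | Int32 | Int32 | Int64  | Int64  | Int32  | Int32  |
| Int16  | Int64 | Int32 | Int16 | Int16 | Int64  | Int64  | Int32  | Int16  |
| Int8   | Int64 | Int32 | Int16 | Int8  | Int64  | Int64  | Int32  | Int16  |
| UInt64 | Int64 | Int64 | Int64 | Int64 | UInt64 | UInt64 | UInt64 | UInt64 |
| UInt32 | Int64 | Int64 | Int64 | Int64 | UInt64 | UInt32 | UInt32 | UInt32 |
| UInt16 | Int64 | Int32 | Int32 | Int32 | UInt64 | UInt32 | UInt16 | UInt16 |
| UInt8  | Int64 | Int32 | Int16 | Int16 | UInt64 | UInt32 | UInt16 | UInt8  |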
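
As a rough illustration, the table can be read as "pick the narrowest integer type whose range covers both inputs, capped at Int64". The sketch below is not the match-based code from this PR. `IntType`, `union_int_type`, `signed_of`, and `unsigned_of` are hypothetical names, and the enum is only a stand-in for the integer variants of Arrow's DataType.

```rust
/// Hypothetical stand-in for the integer variants of Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
}

impl IntType {
    fn is_signed(self) -> bool {
        matches!(
            self,
            IntType::Int8 | IntType::Int16 | IntType::Int32 | IntType::Int64
        )
    }

    fn bits(self) -> u32 {
        match self {
            IntType::Int8 | IntType::UInt8 => 8,
            IntType::Int16 | IntType::UInt16 => 16,
            IntType::Int32 | IntType::UInt32 => 32,
            IntType::Int64 | IntType::UInt64 => 64,
        }
    }
}

/// Narrowest signed type with at least `bits` bits, capped at Int64.
fn signed_of(bits: u32) -> IntType {
    match bits {
        0..=8 => IntType::Int8,
        9..=16 => IntType::Int16,
        17..=32 => IntType::Int32,
        _ => IntType::Int64,
    }
}

/// Narrowest unsigned type with at least `bits` bits, capped at UInt64.
fn unsigned_of(bits: u32) -> IntType {
    match bits {
        0..=8 => IntType::UInt8,
        9..=16 => IntType::UInt16,
        17..=32 => IntType::UInt32,
        _ => IntType::UInt64,
    }
}

/// Assumed reading of the table: same signedness keeps the wider type;
/// mixed signedness needs a signed type twice as wide as the unsigned side.
fn union_int_type(lhs: IntType, rhs: IntType) -> IntType {
    let widest = lhs.bits().max(rhs.bits());
    match (lhs.is_signed(), rhs.is_signed()) {
        (true, true) => signed_of(widest),
        (false, false) => unsigned_of(widest),
        (true, false) => signed_of(lhs.bits().max(rhs.bits() * 2)),
        (false, true) => signed_of(rhs.bits().max(lhs.bits() * 2)),
    }
}
```

Note that any signed type combined with UInt64 lands on Int64 because of the cap, so that one case cannot represent every UInt64 value.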

Are these changes tested?

Yes, the test function test_union_upcast_types is added, showing that the error described in the issue is resolved.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Feb 20, 2023
@iajoiner
Contributor

@berkaysynnada Thanks a lot for your work! I wonder what the type of the union of an Int64 and a UInt64 should be, though.

@ozankabak
Contributor

ozankabak commented Feb 20, 2023

Once you hit the widest fixed-size type, information loss becomes inevitable unless you have bigint or arbitrary-precision types at your disposal (for integral and floating point types, respectively). AFAIK DataFusion does not have such types at this time. If this is indeed true, we need to choose between losing very high numbers (by casting to Int64) and losing negative numbers (by casting to UInt64).
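
The trade-off can be seen with plain Rust integers. This is only a sketch of the value ranges involved and says nothing about how DataFusion performs the cast; `u64_to_i64` and `i64_to_u64` are names made up for this example.

```rust
use std::convert::TryFrom;

/// Checked conversion to Int64: values above i64::MAX have no representation.
fn u64_to_i64(v: u64) -> Option<i64> {
    i64::try_from(v).ok()
}

/// Checked conversion to UInt64: negative values have no representation.
fn i64_to_u64(v: i64) -> Option<u64> {
    u64::try_from(v).ok()
}
```

Whichever 64-bit type is chosen for the union, one of these two conversions can fail for some input value.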

@jackwener
Member

cc @liukun4515

Contributor

@alamb alamb left a comment

Thank you @berkaysynnada -- sorry for the delay in review. This is looking good

(Int32, _) | (_, Int32) => Some(Int32),
(Int16, _) | (_, Int16) => Some(Int16),
(Int8, _) | (_, Int8) => Some(Int8),
// start checking from Int64, that is the most inclusive integer type
Contributor

Could we express the same logic more concisely with something like:

    (Int64, _) | (_, Int64) => Some(Int64),
    (Int32, _) | (_, Int32) => Some(Int64),
    (Int16, _) | (_, Int16) => Some(Int32),
    (Int8, _) | (_, Int8) => Some(Int16),

Contributor

Likewise I think the same pattern could be used for UInt* variants

Contributor

Wouldn't that map (Int32, Int16) to Int64? I think @berkaysynnada's intent is to map to the narrowest correct type, which in this case would be Int32.

I am not sure if the match pattern is the most succinct representation of the table in the PR text though.

Contributor

> Wouldn't that map (Int32, Int16) to Int64?

Yes.

I guess I was thinking of the more general case for arithmetic, where i32::MAX + i16::MAX requires an i64 to store.

Contributor

I am confused by the sum. Aren't we just "merging" the schemas? In that case, the narrowest type that can represent both sides should be chosen, no? Am I missing something here?
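
The difference between the two cases can be sketched with plain Rust integers. Arithmetic can produce a value outside the range of the wider operand, whereas a union only ever contains values that already exist in one of the inputs. The function names below are hypothetical and exist only for this illustration.

```rust
/// Adds an Int32 value and an Int16 value in i32; None signals overflow.
fn add_in_i32(a: i32, b: i16) -> Option<i32> {
    a.checked_add(i32::from(b))
}

/// The same addition widened to i64, which cannot overflow for these inputs.
fn add_in_i64(a: i32, b: i16) -> i64 {
    i64::from(a) + i64::from(b)
}

/// A union needs no arithmetic: every Int16 value converts to i32 losslessly.
fn widen_for_union(b: i16) -> i32 {
    i32::from(b)
}
```

So the sum of an Int32 and an Int16 may need Int64, but their union fits in Int32.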

Contributor

I think I am just confused

For example this PR adds this rule:

(Int8, UInt16) => Some(Int32),

Which isn't just merging the schema (e.g. UInt16) in my mind, but is extending it in some way. In this case, so that the entire range of Int8 and UInt16 can be represented -- which now makes sense
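
A small range check shows why neither input type is wide enough for this rule. This is a sketch using Rust's integer limits as stand-ins for the Arrow types, and `covers_int8_and_uint16` is a hypothetical helper.

```rust
/// Returns true when the inclusive range [lo, hi] contains every Int8 value
/// and every UInt16 value.
fn covers_int8_and_uint16(lo: i128, hi: i128) -> bool {
    lo <= i128::from(i8::MIN) && hi >= i128::from(u16::MAX)
}
```

Called with the limits of Int16 or UInt16 it returns false (65535 does not fit in Int16, and -128 does not fit in UInt16), while the limits of Int32 cover both ranges.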

I think what would help me would be:

  1. Add some comments explaining the rationale for these rules
  2. (maybe) add test coverage (ideally looking like your beautiful table from the PR description) that helps to illustrate the point

However, I don't think this is needed prior to merging this PR

Contributor

Just added comments to explain the reasoning of the match patterns. Also took note to add some tests in a follow-on PR. Thanks for the review!

datafusion/expr/src/type_coercion/binary.rs

@alamb
Contributor

alamb commented Feb 28, 2023

Thank you!

@alamb alamb merged commit d076ab3 into apache:main Feb 28, 2023
@ursabot

ursabot commented Feb 28, 2023

Benchmark runs are scheduled for baseline = 0000d4f and contender = d076ab3. d076ab3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@berkaysynnada berkaysynnada deleted the bug/union_wrong_casting branch March 12, 2023 14:58
@andygrove andygrove added the bug Something isn't working label Mar 12, 2023
Labels
bug Something isn't working core Core DataFusion crate logical-expr Logical plan and expressions
Development

Successfully merging this pull request may close these issues.

Upcast types during union schema creation.
7 participants