fix: consistent PartialEq for Scalar #677

Open · wants to merge 2 commits into base: main
Conversation

@roeap (Collaborator) commented Feb 4, 2025

What changes are proposed in this pull request?

We currently have a custom implementation of Scalar::partial_cmp with correct NULL semantics. The derived PartialEq implementation is inconsistent with that impl, so this PR introduces a custom PartialEq implementation that delegates to the PartialOrd implementation.

We also extend the PartialOrd implementation to cover decimals via the bigdecimal crate, which handles arbitrary-precision decimals.
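As a rough sketch of the idea (using a simplified stand-in enum, not the kernel's actual Scalar), PartialEq can delegate to partial_cmp so that both impls agree on NULL semantics:

```rust
use std::cmp::Ordering;

// Hypothetical, simplified Scalar with SQL-style NULL semantics:
// any comparison involving NULL is undefined (None).
#[derive(Debug, Clone)]
enum Scalar {
    Integer(i64),
    Null,
}

impl PartialOrd for Scalar {
    fn partial_cmp(&self, other: &Scalar) -> Option<Ordering> {
        match (self, other) {
            // NULL compares as "unknown" with everything, including NULL
            (Scalar::Null, _) | (_, Scalar::Null) => None,
            (Scalar::Integer(a), Scalar::Integer(b)) => a.partial_cmp(b),
        }
    }
}

// A derived PartialEq would make Null == Null true, contradicting the
// ordering above. Defining eq via partial_cmp keeps the impls consistent.
impl PartialEq for Scalar {
    fn eq(&self, other: &Scalar) -> bool {
        matches!(self.partial_cmp(other), Some(Ordering::Equal))
    }
}

fn main() {
    assert!(Scalar::Integer(1) == Scalar::Integer(1));
    assert!(Scalar::Null != Scalar::Null); // NULL == NULL is not true
    assert!(Scalar::Null.partial_cmp(&Scalar::Integer(1)).is_none());
}
```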

How was this change tested?

Added tests for NULL handling in PartialEq / PartialOrd. Decimal comparisons are well covered by predicate tests, where they were previously disabled.

codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.14%. Comparing base (6a82a57) to head (6379c73).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
+ Coverage   84.11%   84.14%   +0.02%     
==========================================
  Files          77       77              
  Lines       17749    17779      +30     
  Branches    17749    17779      +30     
==========================================
+ Hits        14930    14960      +30     
  Misses       2106     2106              
  Partials      713      713              


@scovich (Collaborator) left a comment

I was hoping to stamp the PartialEq change but there's a bunch of decimal stuff here as well that still needs discussion...

Several matters of concern:

  • We now have at least three ways to represent a decimal value: kernel's Scalar::Decimal, whatever arrow does for its decimals, and now this BigDecimal.
    • I'm not sure whether delta-rs uses yet a fourth representation for its decimal scalars?
  • The bigdecimal crate becomes an unconditional dependency; is that desirable?
    • If we do decide we want it, should we adopt it everywhere by making Scalar::Decimal wrap a BigDecimal and delete our home-brew parse_decimal method?
  • Spark and arrow both use actual 128-bit values under the hood. I suspect most native engines will also use 128-bit values, but I didn't double-check e.g. duckdb yet. Meanwhile, BigDecimal allocates memory (via BigInt) to support arbitrary precision far beyond what we actually need.

kernel/src/expressions/scalars.rs (outdated; resolved)
@roeap (Collaborator, Author) commented Feb 4, 2025

> I was hoping to stamp the PartialEq change but there's a bunch of decimal stuff here as well that still needs discussion...

Not including the decimal changes here would have meant disabling a bunch of tests, so I thought it better to get it over with. Hopefully the in-list work can now move forward?

@roeap (Collaborator, Author) commented Feb 4, 2025

@scovich - I just checked how datafusion handles this, and they more or less handwave the issue: when precision and scale don't match, they return None from partial_cmp.

Assuming we are most interested in using these comparisons during file / partition skipping, in which case the values we get should match in precision and scale, maybe that's a way forward for us as well?

At least it would be better than what we have now, without needing to go too deep into the internals of decimals, which starts to feel more and more like engine territory.

@scovich (Collaborator) commented Feb 4, 2025

> @scovich - I just checked how datafusion handles this, and they more or less handwave the issue: when precision and scale don't match, they return None from partial_cmp.
>
> Assuming we are most interested in using these comparisons during file / partition skipping, in which case the values we get should match in precision and scale, maybe that's a way forward for us as well?
>
> At least it would be better than what we have now, without needing to go too deep into the internals of decimals, which starts to feel more and more like engine territory.

Agree this feels too much like engine territory. I like datafusion's approach -- simple and gets the job done 95% of the time with no new dependencies. In fact, we could argue that decimals of different scale/precision are different types, and that the correct way to reconcile them is by casting.
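The approach agreed on above can be sketched as follows. Note that decimal_partial_cmp is a hypothetical helper, not kernel code, and it assumes decimals are stored as an unscaled i128 plus (precision, scale):

```rust
use std::cmp::Ordering;

/// Datafusion-style decimal comparison: compare raw i128 values only when
/// precision and scale match; otherwise return None and leave reconciliation
/// to an explicit cast. Each decimal is (unscaled_value, precision, scale).
fn decimal_partial_cmp(
    (a, ap, asc): (i128, u8, u8),
    (b, bp, bsc): (i128, u8, u8),
) -> Option<Ordering> {
    if (ap, asc) != (bp, bsc) {
        // Different (precision, scale) are treated as different types.
        return None;
    }
    // Same scale: the underlying i128 values compare directly.
    a.partial_cmp(&b)
}

fn main() {
    // 1.25 vs 1.30, both Decimal(10, 2)
    assert_eq!(
        decimal_partial_cmp((125, 10, 2), (130, 10, 2)),
        Some(Ordering::Less)
    );
    // Mismatched scale: comparison is undefined
    assert_eq!(decimal_partial_cmp((125, 10, 2), (1250, 10, 3)), None);
}
```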

@roeap roeap requested a review from scovich February 4, 2025 17:13
@hntd187 (Collaborator) left a comment

+1 to hand-waving. @scovich, do you think it will ever become an issue for collation or widening, where we have to let the engine control the equality of things?

{ file = "../README.md", search = "delta_kernel = \"[a-z0-9\\.-]+\"", replace = "delta_kernel = \"{{version}}\"" },
{ file = "../README.md", search = "version = \"[a-z0-9\\.-]+\"", replace = "version = \"{{version}}\"" },
]
pre-release-hook = [
A collaborator commented on the lines above:
formatting :(

@scovich (Collaborator) left a comment

Change looks good, except there's no new unit test coverage for decimal partial compares? Since it was previously unsupported I wouldn't expect any existing unit test to give meaningful coverage?

> do you think it will ever become an issue for collation or widening where we have to let the engine control the equality of things?

Type widening only works for lossless casts, which means the target Decimal type can always fully represent all possible values of the source Decimal type. So the comparisons should Just Work as long as the appropriate casts are introduced.

AFAIK collations only affect string values?
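The widening argument above can be sketched as a lossless rescale of the unscaled value. The helper widen_decimal below is hypothetical, not a kernel API; it shows why comparisons Just Work once both sides are cast to a common scale:

```rust
/// Hypothetical sketch: a lossless widening cast from Decimal(p, s) to a
/// wider Decimal(p2, s2) multiplies the unscaled value by 10^(s2 - s).
/// After the cast, both sides share a scale and compare directly as i128.
fn widen_decimal(unscaled: i128, scale: u8, target_scale: u8) -> Option<i128> {
    // Widening only: the target scale must be >= the source scale.
    let shift = target_scale.checked_sub(scale)?;
    // checked_* guards against overflow of the (bounded) i128 representation.
    unscaled.checked_mul(10i128.checked_pow(shift as u32)?)
}

fn main() {
    // 1.5 as Decimal(2, 1) widened to scale 4 becomes 1.5000
    assert_eq!(widen_decimal(15, 1, 4), Some(15000));
    // Narrowing is not a widening cast, so it is rejected
    assert_eq!(widen_decimal(15000, 4, 1), None);
}
```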

@@ -117,7 +117,7 @@ fn test_default_partial_cmp_scalars() {
}

let expect_if_comparable_type = |s: &_, expect| match s {
Null(_) | Decimal(..) | Struct(_) | Array(_) => None,
@roeap (Collaborator, Author) commented:

@scovich - this line previously disabled some tests that use partial_cmp for decimals; those tests are now enabled.

3 participants