fix: consistent PartialEq for Scalar #677

Open · wants to merge 2 commits into base: main
Conversation

@roeap (Collaborator) commented Feb 4, 2025

What changes are proposed in this pull request?

We currently have a custom implementation of Scalar::partial_cmp with correct NULL semantics. The derived PartialEq implementation is inconsistent with that impl, so this PR introduces a custom PartialEq implementation that delegates to the PartialOrd implementation.

We also extend the PartialOrd implementation to cover decimals via the bigdecimal crate, which handles arbitrary-precision decimals.
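As a rough sketch of the idea (using a simplified stand-in enum, not the kernel's actual Scalar), PartialEq can delegate to partial_cmp so that both impls agree on NULL semantics:

```rust
use std::cmp::Ordering;

// Hypothetical, simplified Scalar with SQL-style NULL semantics:
// any comparison involving NULL is undefined (None).
#[derive(Debug, Clone)]
enum Scalar {
    Integer(i64),
    Null,
}

impl PartialOrd for Scalar {
    fn partial_cmp(&self, other: &Scalar) -> Option<Ordering> {
        match (self, other) {
            // NULL compares as "unknown" with everything, including NULL
            (Scalar::Null, _) | (_, Scalar::Null) => None,
            (Scalar::Integer(a), Scalar::Integer(b)) => a.partial_cmp(b),
        }
    }
}

// A derived PartialEq would make Null == Null true, contradicting the
// ordering above. Defining eq via partial_cmp keeps the impls consistent.
impl PartialEq for Scalar {
    fn eq(&self, other: &Scalar) -> bool {
        matches!(self.partial_cmp(other), Some(Ordering::Equal))
    }
}

fn main() {
    assert!(Scalar::Integer(1) == Scalar::Integer(1));
    assert!(Scalar::Null != Scalar::Null); // NULL == NULL is not true
    assert!(Scalar::Null.partial_cmp(&Scalar::Integer(1)).is_none());
}
```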

How was this change tested?

Added tests for NULL handling in PartialEq / PartialOrd. Decimal comparisons are well covered by predicate tests, where they were previously disabled.

codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.14%. Comparing base (6a82a57) to head (6379c73).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
+ Coverage   84.11%   84.14%   +0.02%     
==========================================
  Files          77       77              
  Lines       17749    17779      +30     
  Branches    17749    17779      +30     
==========================================
+ Hits        14930    14960      +30     
  Misses       2106     2106              
  Partials      713      713              


@scovich (Collaborator) left a comment

I was hoping to stamp the PartialEq change but there's a bunch of decimal stuff here as well that still needs discussion...

Several matters of concern:

  • We now have at least three ways to represent a decimal value: kernel's Scalar::Decimal, whatever arrow does for its decimals, and now this BigDecimal.
    • I'm not sure whether delta-rs uses yet a fourth representation for its decimal scalars?
  • The bigdecimal crate becomes an unconditional dependency; is that desirable?
    • If we do decide we want it, should we adopt it everywhere by making Scalar::Decimal wrap a BigDecimal and delete our home-brew parse_decimal method?
  • Spark and arrow both use actual 128-bit values under the hood. I suspect most native engines will also use 128-bit values, but I didn't double-check e.g. duckdb yet. Meanwhile, BigDecimal allocates memory (via BigInt) to support arbitrary precision far beyond what we actually need.

kernel/src/expressions/scalars.rs (outdated; resolved)
@roeap (Collaborator, Author) commented Feb 4, 2025

> I was hoping to stamp the PartialEq change but there's a bunch of decimal stuff here as well that still needs discussion...

Not including the decimal changes here would have meant disabling a bunch of tests, so I thought it better to get it over with. Hopefully the in-list work can now move forward?

@roeap (Collaborator, Author) commented Feb 4, 2025

@scovich - I just checked how datafusion handles this, and they more or less handwave the issue: when precision and scale don't match, they return None from partial_cmp.

Assuming we are most interested in using these comparisons during file / partition skipping, in which case the values we get should match in precision and scale, maybe that's a way forward for us as well?

At least it would be better than what we have now, without needing to go too deep into the internals of decimals, which starts to feel more and more like engine territory.

@scovich (Collaborator) commented Feb 4, 2025

> @scovich - I just checked how datafusion handles this, and they more or less handwave the issue: when precision and scale don't match, they return None from partial_cmp.
>
> Assuming we are most interested in using these comparisons during file / partition skipping, in which case the values we get should match in precision and scale, maybe that's a way forward for us as well?
>
> At least it would be better than what we have now, without needing to go too deep into the internals of decimals, which starts to feel more and more like engine territory.

Agree this feels too much like engine territory. I like datafusion's approach -- simple and gets the job done 95% of the time with no new dependencies. In fact, we could argue that decimals of different scale/precision are different types, and that the correct way to reconcile them is by casting.
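The approach agreed on above can be sketched as follows. Note that decimal_partial_cmp is a hypothetical helper, not kernel code, and it assumes decimals are stored as an unscaled i128 plus (precision, scale):

```rust
use std::cmp::Ordering;

/// Datafusion-style decimal comparison: compare raw i128 values only when
/// precision and scale match; otherwise return None and leave reconciliation
/// to an explicit cast. Each decimal is (unscaled_value, precision, scale).
fn decimal_partial_cmp(
    (a, ap, asc): (i128, u8, u8),
    (b, bp, bsc): (i128, u8, u8),
) -> Option<Ordering> {
    if (ap, asc) != (bp, bsc) {
        // Different (precision, scale) are treated as different types.
        return None;
    }
    // Same scale: the underlying i128 values compare directly.
    a.partial_cmp(&b)
}

fn main() {
    // 1.25 vs 1.30, both Decimal(10, 2)
    assert_eq!(
        decimal_partial_cmp((125, 10, 2), (130, 10, 2)),
        Some(Ordering::Less)
    );
    // Mismatched scale: comparison is undefined
    assert_eq!(decimal_partial_cmp((125, 10, 2), (1250, 10, 3)), None);
}
```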

@roeap roeap requested a review from scovich February 4, 2025 17:13
@hntd187 (Collaborator) left a comment

+1 to hand-waving. @scovich, do you think it will ever become an issue for collation or widening, where we have to let the engine control the equality of things?

{ file = "../README.md", search = "delta_kernel = \"[a-z0-9\\.-]+\"", replace = "delta_kernel = \"{{version}}\"" },
{ file = "../README.md", search = "version = \"[a-z0-9\\.-]+\"", replace = "version = \"{{version}}\"" },
]
pre-release-hook = [
A collaborator commented on the lines above:
formatting :(

@scovich (Collaborator) left a comment

Change looks good, except there's no new unit test coverage for decimal partial compares? Since it was previously unsupported I wouldn't expect any existing unit test to give meaningful coverage?

> do you think it will ever become an issue for collation or widening where we have to let the engine control the equality of things?

Type widening only works for lossless casts, which means the target Decimal type can always fully represent all possible values of the source Decimal type. So the comparisons should Just Work as long as the appropriate casts are introduced.

AFAIK collations only affect string values?
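The widening argument above can be sketched as a lossless rescale of the unscaled value. The helper widen_decimal below is hypothetical, not a kernel API; it shows why comparisons Just Work once both sides are cast to a common scale:

```rust
/// Hypothetical sketch: a lossless widening cast from Decimal(p, s) to a
/// wider Decimal(p2, s2) multiplies the unscaled value by 10^(s2 - s).
/// After the cast, both sides share a scale and compare directly as i128.
fn widen_decimal(unscaled: i128, scale: u8, target_scale: u8) -> Option<i128> {
    // Widening only: the target scale must be >= the source scale.
    let shift = target_scale.checked_sub(scale)?;
    // checked_* guards against overflow of the (bounded) i128 representation.
    unscaled.checked_mul(10i128.checked_pow(shift as u32)?)
}

fn main() {
    // 1.5 as Decimal(2, 1) widened to scale 4 becomes 1.5000
    assert_eq!(widen_decimal(15, 1, 4), Some(15000));
    // Narrowing is not a widening cast, so it is rejected
    assert_eq!(widen_decimal(15000, 4, 1), None);
}
```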

@@ -117,7 +117,7 @@ fn test_default_partial_cmp_scalars() {
}

let expect_if_comparable_type = |s: &_, expect| match s {
Null(_) | Decimal(..) | Struct(_) | Array(_) => None,
@roeap (Collaborator, Author) commented:

@scovich - this line previously disabled some tests that use partial_cmp for decimals; those tests are now enabled.

3 participants