fix: consistent PartialEq for Scalar #677
base: main
Conversation
Signed-off-by: Robert Pack <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
+ Coverage   84.11%   84.14%   +0.02%
==========================================
  Files          77       77
  Lines       17749    17779      +30
  Branches    17749    17779      +30
==========================================
+ Hits        14930    14960      +30
  Misses       2106     2106
  Partials      713      713

☔ View full report in Codecov by Sentry.
I was hoping to stamp the PartialEq change, but there's a bunch of decimal stuff here as well that still needs discussion...

Several matters of concern:
- We now have at least three ways to represent a decimal value: kernel's Scalar::Decimal, whatever arrow does for its decimals, and now this BigDecimal. I'm not sure whether delta-rs uses yet a fourth representation for its decimal scalars?
- The bigdecimal crate becomes an unconditional dependency, is that desirable?
  - If we do decide we want it, should we adopt it everywhere by making Scalar::Decimal wrap a BigDecimal and delete our home-brew parse_decimal method?
- Spark and arrow both use actual 128-bit values under the hood. I suspect most native engines will also use 128-bit values, but didn't double check e.g. duckdb yet. Meanwhile, BigDecimal allocates memory (via BigInt) in order to support arbitrary precision far beyond what we actually need (see the sketch after this list).
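Editor's note: a rough sketch of the allocation concern, not code from this PR. It assumes a decimal is carried as an i128 unscaled value plus a scale (as kernel's Scalar::Decimal does); lifting such a value into a BigDecimal goes through a heap-allocated BigInt, even though i128 already covers the full precision <= 38 range Delta decimals allow.

```rust
use bigdecimal::BigDecimal;
use num_bigint::BigInt;

// Hypothetical helper: convert an (unscaled, scale) decimal into a BigDecimal.
// BigInt stores its digits on the heap, so every comparison routed through this
// path pays an allocation that a plain i128 comparison would not.
fn to_big_decimal(unscaled: i128, scale: i64) -> BigDecimal {
    BigDecimal::new(BigInt::from(unscaled), scale)
}

fn main() {
    let a = to_big_decimal(12345, 2); // 123.45
    let b = to_big_decimal(1234500, 4); // 123.4500
    assert_eq!(a, b); // BigDecimal aligns scales before comparing
}
```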
Not including the Decimal stuff here would have meant disabling a bunch of tests, so I thought it better to get it over with? Hopefully the in-list stuff can now move forward?
@scovich - just checked how datafusion is handling this, and they more or less handwave the issue, i.e. when precision and scale don't match, they return None.

Assuming that we are most interested in using these comparisons during file / partition skipping, in which case the values we get should match in p and s, maybe that's a way forward for us as well? At least it would be better than what we have now, without needing to go too deep into the innards of decimals, which starts to feel more and more like engine territory?
Agree this feels too much like engine territory. I like datafusion's approach -- simple and gets the job done 95% of the time with no new dependencies. In fact, we could argue that decimals of different scale/precision are different types, and that the correct way to reconcile them is by casting.
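Editor's note: a minimal sketch of the approach discussed above, not the PR's actual code. The tuple-of-(unscaled value, precision, scale) representation and the function name are assumptions; the point is that decimals are compared only when precision and scale agree, and are otherwise treated as incomparable.

```rust
use std::cmp::Ordering;

// Compare two decimals given as (unscaled value, precision, scale).
// Mismatched precision/scale is reported as "incomparable" rather than
// trying to reconcile the two representations here.
fn decimal_partial_cmp(
    (v1, p1, s1): (i128, u8, u8),
    (v2, p2, s2): (i128, u8, u8),
) -> Option<Ordering> {
    if p1 == p2 && s1 == s2 {
        v1.partial_cmp(&v2)
    } else {
        None // different decimal "types": the engine should cast first
    }
}

fn main() {
    // 1.23 vs 1.24 at the same precision/scale: comparable.
    assert_eq!(decimal_partial_cmp((123, 10, 2), (124, 10, 2)), Some(Ordering::Less));
    // Same numeric value but different scale: reported as incomparable.
    assert_eq!(decimal_partial_cmp((123, 10, 2), (1230, 10, 3)), None);
}
```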
Signed-off-by: Robert Pack <[email protected]>
+1 to hand waving. @scovich do you think it will ever become an issue for collation or widening where we have to let the engine control the equality of things?
{ file = "../README.md", search = "delta_kernel = \"[a-z0-9\\.-]+\"", replace = "delta_kernel = \"{{version}}\"" }, | ||
{ file = "../README.md", search = "version = \"[a-z0-9\\.-]+\"", replace = "version = \"{{version}}\"" }, | ||
] | ||
pre-release-hook = [ |
formatting :(
Change looks good, except there's no new unit test coverage for decimal partial compares? Since it was previously unsupported I wouldn't expect any existing unit test to give meaningful coverage?
do you think it will ever become an issue for collation or widening where we have to let the engine control the equality of things?
Type widening only works for lossless casts, which means the target Decimal type can always fully represent all possible values of the source Decimal type. So the comparisons should Just Work as long as the appropriate casts are introduced.
AFAIK collations only affect string values?
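Editor's note: a rough sketch of that point with made-up helpers, not kernel code. Type widening only ever moves to a scale/precision that can hold every source value, so the narrower value can be rescaled losslessly and then compared directly.

```rust
use std::cmp::Ordering;

// Rescale an unscaled decimal value from `from_scale` to a wider `to_scale`.
// Returns None on overflow or if the target scale is actually narrower.
fn rescale(unscaled: i128, from_scale: u32, to_scale: u32) -> Option<i128> {
    let diff = to_scale.checked_sub(from_scale)?;
    unscaled.checked_mul(10i128.checked_pow(diff)?)
}

// Compare two decimals by casting both to the wider scale first.
fn cmp_after_widening(
    (a, a_scale): (i128, u32),
    (b, b_scale): (i128, u32),
) -> Option<Ordering> {
    let scale = a_scale.max(b_scale);
    Some(rescale(a, a_scale, scale)?.cmp(&rescale(b, b_scale, scale)?))
}

fn main() {
    // 1.5 at scale 1 equals 1.50 at scale 2 once widened.
    assert_eq!(cmp_after_widening((15, 1), (150, 2)), Some(Ordering::Equal));
}
```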
@@ -117,7 +117,7 @@ fn test_default_partial_cmp_scalars() {
}

let expect_if_comparable_type = |s: &_, expect| match s {
    Null(_) | Decimal(..) | Struct(_) | Array(_) => None,
@scovich - this line disabled some tests that use partial_cmp for decimals; those are now enabled.
What changes are proposed in this pull request?

We currently have a custom implementation of Scalar::partial_cmp with correct NULL semantics. The currently derived PartialEq implementation is inconsistent with that impl. This PR therefore introduces a custom implementation of PartialEq which uses the PartialOrd implementation.

We also extend the implementation of PartialOrd to cover decimals via the bigdecimal crate, which handles arbitrary-precision decimals.

How was this change tested?

Added tests for null handling in PartialEq / PartialOrd. Decimal comparisons are well covered in predicate tests, where they were previously disabled.
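Editor's note: a minimal sketch of the core idea, not kernel's actual code (MyScalar is a stand-in; the real Scalar has many more variants). Defining eq in terms of partial_cmp keeps the two impls consistent by construction and makes NULL unequal to everything, including another NULL.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone)]
enum MyScalar {
    Integer(i32),
    Null,
}

impl PartialOrd for MyScalar {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match (self, other) {
            (MyScalar::Integer(a), MyScalar::Integer(b)) => a.partial_cmp(b),
            // NULLs (and mismatched types) are incomparable.
            _ => None,
        }
    }
}

// Equality delegates to the ordering, so the two impls cannot disagree.
impl PartialEq for MyScalar {
    fn eq(&self, other: &Self) -> bool {
        matches!(self.partial_cmp(other), Some(Ordering::Equal))
    }
}

fn main() {
    assert_eq!(MyScalar::Integer(1), MyScalar::Integer(1));
    assert!(MyScalar::Null != MyScalar::Null); // SQL-style NULL semantics
}
```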