fix: disable checking for uint_8 and uint_16 if complex type readers are enabled #1376
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@              Coverage Diff              @@
##               main    #1376       +/-   ##
=============================================
- Coverage     56.12%   39.07%    -17.06%
- Complexity      976     2077      +1101
=============================================
  Files           119      263       +144
  Lines         11743    60765     +49022
  Branches       2251    12919     +10668
=============================================
+ Hits           6591    23742     +17151
- Misses         4012    32534     +28522
- Partials       1140     4489      +3349
☔ View full report in Codecov by Sentry.
makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, valueRanges + 1)
withParquetTable(path.toString, "tbl") {
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    checkSparkAnswer("select _9, _10 FROM tbl order by _11")
Do we already have logic to fall back to Spark when the complex type reader is enabled and when the query references uint Parquet fields?
No, we don't, for two reasons. First, in the plan we get the schema as understood by Spark, so the signed int_8 and int_16 values are indistinguishable from the unsigned ones; as a result we would fall back to Spark for both signed and unsigned integers. Second, too many unit tests fail because they check that the plan contains a Comet operator, and they would all need to be modified.
I'm open to putting it back though.
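To make the ambiguity concrete, here is a minimal sketch (the helper name is mine, and the widening rule is an assumption about Spark's Parquet schema conversion, not actual Comet code):

```scala
import org.apache.spark.sql.types._

// Assuming Spark widens unsigned Parquet ints (UINT_8 -> ShortType,
// UINT_16 -> IntegerType), a plan-level check only sees the Catalyst
// type, which a signed INT_16 / INT_32 column produces as well:
def mayComeFromUnsignedParquet(dt: DataType): Boolean = dt match {
  case ShortType | IntegerType => true // ambiguous: signed or unsigned origin
  case _ => false
}
```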
> As a result we fall back to Spark for both signed and unsigned integers.
Just 8- and 16-bit, or all integers? I'm fine with falling back for 8- and 16-bit for now, although it would be nice to have a config to override this (with the understanding that the behavior is incorrect for unsigned integers).
Just 8- and 16-bit.
I started with the fallback to Spark and a compat override. The reason I reverted it is that I couldn't see a way to reach compatibility with Spark even after/if apache/arrow-rs#7040 is addressed.
Let me do as you suggest. Marking this as draft in the meantime.
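For reference, the fallback-plus-override shape I had in mind looked roughly like this (a sketch only; the conf key and every name below are hypothetical, not the merged code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, ShortType, StructType}

// Hypothetical sketch: decide per scan whether Comet can run it, with a
// compat override to accept the known-incorrect unsigned behavior.
def shouldUseCometScan(spark: SparkSession, requiredSchema: StructType): Boolean = {
  val allowUnsigned = spark.conf
    .get("spark.comet.parquet.allowIncorrectUnsignedInts", "false")
    .toBoolean
  // ShortType/IntegerType columns may originate from UINT_8/UINT_16.
  val mayHideUnsignedInts =
    requiredSchema.exists(f => f.dataType == ShortType || f.dataType == IntegerType)
  allowUnsigned || !mayHideUnsignedInts
}
```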
@andygrove updated this to fall back, updated the unit tests, and removed the draft tag.
Which issue does this PR close?
Partly addresses the test failures caused by #1348.
Rationale for this change
As the issue points out, DataFusion Comet returns values that differ from Spark's for uint_8 and uint_16 Parquet types that may have the sign bit set.
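For a concrete sense of the divergence, a minimal illustration in plain Scala (not Comet code):

```scala
// uint_8 value 255 and signed value -1 share the byte 0xFF. Widening
// before interpreting yields 255; sign-extending yields -1.
val raw: Byte = 0xFF.toByte
val widened: Int = raw & 0xFF     // 255 -- what Spark returns
val signExtended: Int = raw.toInt // -1  -- the divergent reading
```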
What changes are included in this PR?
Rewrites the Parquet test files to not use the uint_8 and uint_16 types if the complex type readers are enabled.
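Roughly, that amounts to a conditional in the test data generator; a sketch, assuming the helper builds its schema from a Parquet message-type string (field names and syntax are approximations, not the verbatim diff):

```scala
// Approximate sketch: substitute signed annotations for the unsigned
// columns when the complex type reader is active.
val smallIntFields =
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    "optional int32 _9(INT_8); optional int32 _10(INT_16);"
  } else {
    "optional int32 _9(UINT_8); optional int32 _10(UINT_16);"
  }
```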
How are these changes tested?
Locally, using existing unit tests. Note that the unit tests still fail, but not because of unsigned ints.