
fix: disable checking for uint_8 and uint_16 if complex type readers are enabled #1376

Open · wants to merge 6 commits into main
Conversation

@parthchandra (Contributor)
Which issue does this PR close?

Partly addresses test failures caused by #1348

Rationale for this change

As the issue points out, DataFusion Comet returns different values from Spark for uint_8 and uint_16 Parquet types when the stored value has the sign bit set.
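To make the mismatch concrete, here is a minimal Scala sketch (illustrative only, not code from this PR) of how the same stored byte diverges depending on whether it is interpreted as signed or unsigned:

```scala
// Illustrative only: a Parquet uint_8 value of 200 (0xC8) has its high
// ("sign") bit set when stored in a single byte.
val raw: Byte = 200.toByte // as a signed JVM byte this is -56

// Interpreting the byte as unsigned recovers the value Spark reports:
val unsignedView: Int = raw & 0xFF // 200

// Interpreting it as signed yields a different answer:
val signedView: Int = raw.toInt // -56
```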

What changes are included in this PR?

Rewrites the Parquet test files so that they do not use the uint_8 and uint_16 types when the complex type readers are enabled.
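A hedged sketch of the kind of branching this implies in the test helper; the schema strings and structure here are illustrative and not the PR's actual diff (though `isComplexTypeReaderEnabled` does appear in the changed test code further down):

```scala
// Hypothetical sketch: pick the Parquet schema for the generated test file
// based on whether the complex type readers are enabled, replacing the
// unsigned 8/16-bit columns with their signed counterparts.
val schemaStr =
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    """message root {
      |  optional int32 _9  (INT_8);
      |  optional int32 _10 (INT_16);
      |}""".stripMargin
  } else {
    """message root {
      |  optional int32 _9  (UINT_8);
      |  optional int32 _10 (UINT_16);
      |}""".stripMargin
  }
```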

How are these changes tested?

Locally, using existing unit tests. Note that the unit tests still fail, but no longer because of unsigned ints.

codecov-commenter commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 39.07%. Comparing base (f09f8af) to head (501c52a).
Report is 22 commits behind head on main.

Files with missing lines                                Patch %   Lines
.../main/scala/org/apache/comet/DataTypeSupport.scala    25.00%   2 Missing and 1 partial ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala     0.00%   0 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1376       +/-   ##
=============================================
- Coverage     56.12%   39.07%   -17.06%     
- Complexity      976     2077     +1101     
=============================================
  Files           119      263      +144     
  Lines         11743    60765    +49022     
  Branches       2251    12919    +10668     
=============================================
+ Hits           6591    23742    +17151     
- Misses         4012    32534    +28522     
- Partials       1140     4489     +3349     

☔ View full report in Codecov by Sentry.

```scala
makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, valueRanges + 1)
withParquetTable(path.toString, "tbl") {
  if (CometSparkSessionExtensions.isComplexTypeReaderEnabled(conf)) {
    checkSparkAnswer("select _9, _10 FROM tbl order by _11")
```
@andygrove (Member) Feb 7, 2025


Do we already have logic to fall back to Spark when the complex type reader is enabled and when the query references uint Parquet fields?

@parthchandra (Contributor, Author)

No, we don't, for two reasons. First, in the plan we get the schema as understood by Spark, so all the signed int_8 and int_16 values are indistinguishable from the unsigned ones; as a result, we fall back to Spark for both signed and unsigned integers. Second, too many unit tests fail because we check that the plan contains a Comet operator, and they would need to be modified.
I'm open to putting it back, though.
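To make the first point concrete, a hedged sketch (hypothetical, not this PR's diff) of why a plan-time check has to be conservative — only Spark's logical types are visible, so a safe fallback rejects the small signed types outright:

```scala
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Hypothetical sketch: by plan time a Parquet uint_8/uint_16 column has
// already been mapped to a small signed Spark type, so it cannot be told
// apart from a genuinely signed column. A safe check falls back for both.
def isTypeSupported(dt: DataType, complexReaderEnabled: Boolean): Boolean =
  dt match {
    case ByteType | ShortType if complexReaderEnabled => false
    case _ => true
  }
```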

@andygrove (Member)

> As a result we fall back to Spark for both signed and unsigned integers.

Just 8 and 16 bit, or all integers? I'm fine with falling back for 8 and 16 bit for now, although it would be nice to have a config to override this (with the understanding that behavior is incorrect for unsigned integers).
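A sketch of what such an override could look like; the config key and helper here are hypothetical, not Comet's actual API:

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Hypothetical config key, named here for illustration only.
val allowIncompatibleKey = "spark.comet.scan.allowIncompatible"

// Fall back for 8/16-bit integer columns unless the user explicitly accepts
// potentially incorrect results for unsigned values.
def shouldFallBackForSmallInts(conf: SQLConf, dt: DataType): Boolean = {
  val allow = conf.getConfString(allowIncompatibleKey, "false").toBoolean
  dt match {
    case ByteType | ShortType => !allow
    case _ => false
  }
}
```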

@parthchandra (Contributor, Author)

Just 8 and 16 bit.
I started with the fallback to Spark and a compat override. The reason I reverted it is that I couldn't see a way to achieve compatibility with Spark even if/after apache/arrow-rs#7040 is addressed.
Let me do as you suggest. Marking this as draft in the meantime.

@parthchandra parthchandra marked this pull request as draft February 7, 2025 18:39
@parthchandra parthchandra marked this pull request as ready for review February 10, 2025 17:07
@parthchandra (Contributor, Author)

@andygrove Updated this to fall back to Spark, updated the unit tests, and removed the draft tag.
