ORC reader should not fall back to using the ORC file's write schema when reading an ORC file with an empty read schema #8838

Closed · Tracked by #8666
razajafri opened this issue on Jul 27, 2023 · 2 comments
Labels: invalid (This doesn't seem right)

Comments

razajafri (Collaborator) commented on Jul 27, 2023

Refer to OrcQuerySuite.scala#L503

razajafri changed the title from "reading ORC file with an empty read schema: Reader should not fall back to using the ORC file's write schema. Refer to OrcQuerySuite.scala#L503." to "ORC reader should not fall back to using the ORC file's write schema when reading an ORC file with an empty read schema" on Jul 27, 2023
razajafri self-assigned this on Jul 27, 2023
mythrocks (Collaborator) commented on Aug 1, 2023

I'm reconsidering whether this test needs to work on GPU.

The way I would have tested this is to read an ORC file after (re)setting the INCLUDE_COLUMNS config to "" (empty), expecting only nulls in the output.
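A rough sketch of that approach, assuming a spark-shell session and a placeholder file path (whether the config actually takes effect is exactly the problem described next):

```scala
// Illustrative only: mirrors the approach described above; the path is a placeholder.
import org.apache.orc.OrcConf

// Ask the ORC record reader to include no columns at all.
spark.sparkContext.hadoopConfiguration.set(OrcConf.INCLUDE_COLUMNS.getAttribute, "")

// If the config were honored, every column here would come back as null.
spark.read.orc("/tmp/no-read-schema-test.orc").show()
```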

The problem with this is that INCLUDE_COLUMNS is an internal ORC config, meant to be interpreted by the ORC record reader. It is internal in that its value cannot be set from user-land.

I don't think a Spark user can realistically exercise this config without writing their own reader or input format. At that point, GPU compatibility isn't a realistic concern.
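For reference, exercising column inclusion at that level means dropping below Spark to the ORC core reader API, along these lines (a minimal sketch with a placeholder path, using the standard org.apache.orc Reader API rather than anything in this repo):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(
  new Path("/tmp/no-read-schema-test.orc"), OrcFile.readerOptions(conf))

// One entry per column id; index 0 is the root struct. Leaving the field
// entries false asks the reader to materialize none of the file's columns.
val include = new Array[Boolean](reader.getSchema.getMaximumId + 1)
include(0) = true

val rows = reader.rows(reader.options().include(include))
val batch = reader.getSchema.createRowBatch()
while (rows.nextBatch(batch)) {
  println(s"read ${batch.size} rows with no column data materialized")
}
rows.close()
```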

I should have realized this sooner. Let's close this as WONT_FIX.

razajafri (Collaborator, Author) commented

No worries, thank you for looking into this further. I verified that there isn't a way for Spark to honor this from user-land.

scala> val df = spark.read.option(OrcConf.INCLUDE_COLUMNS.getAttribute,"").option("hive.io.file.read.all.columns", false).orc("/home/rjafri/dev/test-data/no-read-schema-test.orc/part-00000-d398a78f-fddb-45d7-bd8c-8e75de08e79c-c000.snappy.orc")
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]

scala> df.show
23/08/01 20:53:36 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

+---+---+
| _1| _2|
+---+---+
|  1|  1|
+---+---+

sameerz added the invalid label on Aug 2, 2023