ORC reader should not fall back to using the ORC file's write schema when reading an ORC file with an empty read schema #8838

Closed · Tracked by #8666
razajafri opened this issue on Jul 27, 2023 · 2 comments
Labels: invalid (This doesn't seem right)

Comments

razajafri (Collaborator) commented on Jul 27, 2023

Refer to OrcQuerySuite.scala#L503

razajafri changed the title from "reading ORC file with an empty read schema: Reader should not fall back to using the ORC file's write schema. Refer to OrcQuerySuite.scala#L503." to "ORC reader should not fall back to using the ORC file's write schema when reading an ORC file with an empty read schema" on Jul 27, 2023
razajafri self-assigned this on Jul 27, 2023
mythrocks (Collaborator) commented on Aug 1, 2023

I'm reconsidering whether this test needs to work on GPU.

The way I would have tested this is to read an ORC file after (re)setting the INCLUDE_COLUMNS config to "" (empty), expecting only nulls in the output.
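A rough sketch of that approach, assuming a spark-shell session and a placeholder file path (whether the config actually takes effect is exactly the problem described next):

```scala
// Illustrative only: mirrors the approach described above; the path is a placeholder.
import org.apache.orc.OrcConf

// Ask the ORC record reader to include no columns at all.
spark.sparkContext.hadoopConfiguration.set(OrcConf.INCLUDE_COLUMNS.getAttribute, "")

// If the config were honored, every column here would come back as null.
spark.read.orc("/tmp/no-read-schema-test.orc").show()
```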

The problem with this is that INCLUDE_COLUMNS is an internal ORC config, meant to be interpreted by the ORC record reader. It is internal in that its value cannot be set from user-land.

I don't think a Spark user can realistically exercise this config without writing their own reader or input format. At that point, GPU compatibility isn't a realistic concern.
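For reference, exercising column inclusion at that level means dropping below Spark to the ORC core reader API, along these lines (a minimal sketch with a placeholder path, using the standard org.apache.orc Reader API rather than anything in this repo):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(
  new Path("/tmp/no-read-schema-test.orc"), OrcFile.readerOptions(conf))

// One entry per column id; index 0 is the root struct. Leaving the field
// entries false asks the reader to materialize none of the file's columns.
val include = new Array[Boolean](reader.getSchema.getMaximumId + 1)
include(0) = true

val rows = reader.rows(reader.options().include(include))
val batch = reader.getSchema.createRowBatch()
while (rows.nextBatch(batch)) {
  println(s"read ${batch.size} rows with no column data materialized")
}
rows.close()
```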

I should have realized this sooner. Let's close this as WONT_FIX.

razajafri (Collaborator, Author) commented

No worries, thank you for looking into this further. I verified that there isn't a way for Spark to honor this from user-land.

scala> val df = spark.read.option(OrcConf.INCLUDE_COLUMNS.getAttribute,"").option("hive.io.file.read.all.columns", false).orc("/home/rjafri/dev/test-data/no-read-schema-test.orc/part-00000-d398a78f-fddb-45d7-bd8c-8e75de08e79c-c000.snappy.orc")
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]

scala> df.show
23/08/01 20:53:36 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

+---+---+
| _1| _2|
+---+---+
|  1|  1|
+---+---+

sameerz added the invalid label on Aug 2, 2023