[BUG] ORC read can corrupt data when specified schema does not match file schema ordering #3060

Closed
wbo4958 opened this issue Jul 28, 2021 · 0 comments
Labels: bug (Something isn't working), P0 (Must have for release)

wbo4958 commented Jul 28, 2021

@firestarman found the same issue as #3007 when the read schema cannot be pruned. I can reproduce it with the code below.

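      // Assumed definition (not part of the original report), inferred from the printed schema:
      // case class Testing(_col1: Int, _col2: String, _col3: Long)
      val df = Seq(Testing(1, "hello", 2021)).toDF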
      df.printSchema()
      // root
      // |-- _col1: integer (nullable = false)
      // |-- _col2: string (nullable = true)
      // |-- _col3: long (nullable = false)
      df.show()
      // +-----+-----+-----+
      //|_col1|_col2|_col3|
      //+-----+-----+-----+
      //|    1|hello| 2021|
      //+-----+-----+-----+
      df.write.mode("overwrite").orc(resource1)

      val schema = StructType(
        Seq(
          StructField("_col2", StringType),
          StructField("_col3", LongType),
          StructField("_col1", IntegerType),
          ))
      val dfRead = spark.read.schema(schema).orc(resource1)
      dfRead.show()

The GPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|     |    1|    5|
+-----+-----+-----+

while the CPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|    1| null| 2021|
+-----+-----+-----+

The CPU result does not match the written row (1, hello, 2021) either, so it looks like reading ORC on the CPU also has an issue.
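
To make the two possible behaviours concrete, here is a minimal sketch in plain Scala (no Spark, no plugin code) that contrasts resolving the reordered read schema by name with resolving it by position. The object `SchemaOrderSketch` and its helpers are hypothetical names introduced only for illustration. By-name resolution returns the values a reader would probably expect for the repro above. By-position resolution returns the file values in file order, which is at least consistent with the CPU output shown above (1, null, 2021) if "hello" cannot be read as a long. That is an interpretation, not something confirmed in this issue.

```scala
// Hypothetical, simplified model of the single row written in the repro.
// This is not Spark or plugin code; it only illustrates the two mapping strategies.
object SchemaOrderSketch {
  // File columns in the order they were written: (name, value).
  val fileColumns: Seq[(String, Any)] =
    Seq("_col1" -> 1, "_col2" -> "hello", "_col3" -> 2021L)

  // Read schema from the repro, with the fields reordered.
  val readSchema: Seq[String] = Seq("_col2", "_col3", "_col1")

  // By-name resolution: look up each requested field in the file by its name.
  // A field missing from the file resolves to None.
  def resolveByName(requested: Seq[String]): Seq[Option[Any]] = {
    val byName = fileColumns.toMap
    requested.map(byName.get)
  }

  // By-position resolution: ignore names and take the i-th file column
  // for the i-th requested field (None if the file has fewer columns).
  def resolveByPosition(requested: Seq[String]): Seq[Option[Any]] =
    requested.indices.map(i => fileColumns.lift(i).map(_._2))

  def main(args: Array[String]): Unit = {
    println(s"by name:     ${resolveByName(readSchema)}")
    println(s"by position: ${resolveByPosition(readSchema)}")
  }
}
```

With by-name resolution, `_col2` receives "hello", `_col3` receives 2021, and `_col1` receives 1. With by-position resolution, the requested `_col2` receives the integer 1, which is the kind of mismatch described in this issue.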

@wbo4958 wbo4958 added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) Jul 28, 2021
@wbo4958 wbo4958 changed the title [BUG] Data mess up for schema which can't be pruned [BUG] ORC read Data mess up for the schema which can't be pruned Jul 28, 2021
@wbo4958 wbo4958 self-assigned this Jul 28, 2021
@wbo4958 wbo4958 changed the title [BUG] ORC read Data mess up for the schema which can't be pruned [BUG] ORC read Data mess up for the disorder read schema which can't be pruned Jul 28, 2021
@jlowe jlowe added the label P0 (Must have for release) and removed the label ? - Needs Triage (Need team to review and classify) Jul 28, 2021
@jlowe jlowe changed the title [BUG] ORC read Data mess up for the disorder read schema which can't be pruned [BUG] ORC read can corrupt data when specified schema does not match file schema ordering Jul 28, 2021
@wbo4958 wbo4958 closed this as completed Jul 29, 2021