[BUG] No error detection in corrupted ORC files #13461

Open
revans2 opened this issue May 26, 2023 · 1 comment
Labels: 0 - Backlog, bug, cuIO, libcudf, Spark

Comments


revans2 commented May 26, 2023

Describe the bug
This is related to #13460

After we write bad data, reading it back in on the GPU produces no errors and no crash. We just happily return corrupt data.

scala> spark.time(spark.read.parquet("./target/TMP_PAR").selectExpr("MIN(ts)", "MAX(ts)").show(false))
+--------------------------+--------------------------+
|min(ts)                   |max(ts)                   |
+--------------------------+--------------------------+
|2023-05-23 08:34:20.007655|2023-05-23 08:34:20.993544|
+--------------------------+--------------------------+

Time taken: 189 ms

scala> spark.time(spark.read.orc("./target/TMP_ORC").selectExpr("MIN(ts)", "MAX(ts)").show(false))
+-------------------+-------------------+
|min(ts)            |max(ts)            |
+-------------------+-------------------+
|2015-01-01 00:00:00|2015-01-01 00:00:08|
+-------------------+-------------------+

I realize that cuDF, for performance reasons, does not do much in the way of checks on the input data, but for input files we really should be doing something.
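As one illustration of what a cheap input check could look like, here is a minimal host-side sketch of a bounds-checked decoder for ORC's byte run-length encoding (a control byte of 0 to 127 means a run of control + 3 copies of the next byte; a negative control byte means that many literal bytes follow). The function name `decode_byte_rle` and its signature are hypothetical, and this is not how cuDF's GPU decoder is structured; it only shows that truncation and row-count overruns can be caught with a few comparisons.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a cuDF API.
// Decodes an ORC byte-RLE stream that is expected to yield exactly `expected`
// values. Returns false if the stream is truncated or the decoded count does
// not match, instead of silently producing garbage.
bool decode_byte_rle(std::vector<uint8_t> const& in,
                     std::size_t expected,
                     std::vector<uint8_t>& out)
{
  out.clear();
  out.reserve(expected);
  std::size_t pos = 0;
  while (pos < in.size()) {
    auto const control = static_cast<int8_t>(in[pos++]);
    if (control >= 0) {
      // Run: (control + 3) copies of the next byte.
      std::size_t const run = static_cast<std::size_t>(control) + 3;
      if (pos >= in.size()) { return false; }            // value byte is missing
      if (run > expected - out.size()) { return false; }  // run exceeds the row count
      out.insert(out.end(), run, in[pos++]);
    } else {
      // Literal: -control bytes copied verbatim.
      std::size_t const len = static_cast<std::size_t>(-static_cast<int>(control));
      if (len > in.size() - pos) { return false; }        // literal runs past the end
      if (len > expected - out.size()) { return false; }  // literal exceeds the row count
      auto const first = in.begin() + static_cast<std::ptrdiff_t>(pos);
      out.insert(out.end(), first, first + static_cast<std::ptrdiff_t>(len));
      pos += len;
    }
  }
  // A short stream is also an error: every expected value must be present.
  return out.size() == expected;
}
```

Checks like these would not catch every kind of corruption (a stream can be well-formed and still hold wrong values), but they turn the most obvious malformed inputs into a reported failure.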

revans2 added the bug, Needs Triage, and Spark labels on May 26, 2023
GregoryKimball added the 0 - Backlog, libcudf, and cuIO labels and removed the Needs Triage label on Jun 26, 2023
GregoryKimball removed this from libcudf on Oct 26, 2023
GregoryKimball moved this to To be revisited in libcudf on Feb 24, 2025
GregoryKimball commented

We previously added support for "error code" tracking in the cuDF Parquet reader, starting with #14167 and follow-on work in #14237 and #14706. We should expand this pattern to manage decoding errors in the cuDF ORC reader as well.
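To make the "error code" idea concrete, here is a rough host-only sketch of the general pattern: decoders OR a bit into a shared error word when they hit bad input, and the caller checks that word once after decoding and raises. All names here (`decode_error`, `error_code`, `decode_chunk`, `throw_on_decode_error`) are hypothetical and are not libcudf's API. In a real reader the error word would presumably live in device memory and be updated atomically from the decode kernels; `std::atomic` stands in for that here.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical error bits, not the actual libcudf enum.
enum decode_error : int32_t {
  DATA_STREAM_OVERRUN = 0x1,
  INVALID_RUN_LENGTH  = 0x2,
};

// Stand-in for a device-side error word shared by all decode threads.
struct error_code {
  std::atomic<int32_t> value{0};
  // Errors accumulate as a bitmask, so several failures can be reported at once.
  void set(decode_error e) { value.fetch_or(e, std::memory_order_relaxed); }
  int32_t get() const { return value.load(std::memory_order_relaxed); }
};

// Stand-in for one decode step: copies `count` values from `in`.
// On a request that would read past the end it records an error and
// clamps the copy instead of reading out of bounds.
std::vector<int64_t> decode_chunk(std::vector<int64_t> const& in,
                                  std::size_t count,
                                  error_code& err)
{
  if (count > in.size()) {
    err.set(DATA_STREAM_OVERRUN);
    count = in.size();
  }
  return std::vector<int64_t>(in.begin(),
                              in.begin() + static_cast<std::ptrdiff_t>(count));
}

// Host-side check performed once after decoding has finished.
void throw_on_decode_error(error_code const& err)
{
  int32_t const code = err.get();
  if (code != 0) {
    throw std::runtime_error("ORC decode failed with error code " + std::to_string(code));
  }
}
```

The appeal of this shape is that the per-value decode path only pays for a branch and, on failure, one atomic OR, while the caller pays for a single integer check per read.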

Projects
Status: To be revisited