[BUG] CUDF crashes trying to read a valid ORC file with a single null entry in it #12155

Closed
revans2 opened this issue Nov 15, 2022 · 3 comments
Labels
2 - In Progress (Currently a work in progress), bug (Something isn't working), cuIO (cuIO issue), Spark (Functionality that helps Spark RAPIDS)

Comments


revans2 commented Nov 15, 2022

Describe the bug
The file that causes this error is attached below.

Steps/Code to reproduce bug
Try to read this particular file using cuDF. It throws an exception like:

cuDF failure at: .../cpp/src/io/orc/reader_impl.cu:1081: Expected streams data within stripe

I am not an expert in ORC, but I tried to debug this a little. At the point where the error is thrown, we are trying to read two STRING columns (I assume the key and value for the map).

Expected behavior
I should be able to read the file and get the null out, like I can in Spark.

scala> spark.read.orc("503819761.orc").show()
+---------+
|query_map|
+---------+
|     null|
+---------+

or pandas

>>> pd.read_orc("503819761.orc")
  query_map
0      None
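
The comparison above (one reader fails, the others return the null row) can be captured in a small harness. This is only a sketch: `compare_readers` is a hypothetical helper, not part of cuDF or pandas, and it assumes each reader signals failure by raising a Python exception.

```python
def compare_readers(path, readers):
    """Run each reader on `path` and record whether it succeeded.

    `readers` maps a label to a callable that takes a file path.
    Returns a dict: label -> ("ok", result) or ("error", message).
    """
    outcomes = {}
    for name, read in readers.items():
        try:
            outcomes[name] = ("ok", read(path))
        except Exception as exc:
            # Assumption: a failed read surfaces as a raised exception.
            outcomes[name] = ("error", f"{type(exc).__name__}: {exc}")
    return outcomes
```

With cuDF and pandas installed, calling `compare_readers("503819761.orc", {"pandas": pd.read_orc, "cudf": cudf.read_orc})` would be expected to report "ok" for pandas and "error" for cuDF on an affected build.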
revans2 added the bug, Needs Triage, cuIO, and Spark labels Nov 15, 2022

revans2 commented Nov 15, 2022

error.zip has the offending ORC file in it. Just FYI, I also found another error while trying to write a separate repro case; I will file a separate issue for it. It might be related.


revans2 commented Nov 16, 2022

combined_error.zip: adding a second file that is like the first but also has a few more rows. This is closer to what our customer is seeing in production.

vuule self-assigned this Nov 18, 2022
GregoryKimball added the 2 - In Progress label and removed the Needs Triage label Nov 19, 2022

vuule commented Dec 4, 2022

Closed via #12160

vuule closed this as completed Dec 4, 2022