[BUG] The ORC output data of a query is not readable #1550
Comments
@wjxiz1992 is it possible to produce the same output but in a Parquet file? It would be good to know what data is supposed to be in the corrupted ORC file and see if we can reproduce the bad ORC file when loading the equivalent Parquet file then writing it to an ORC file just with libcudf.
Yes, I've put the Parquet output (also produced by the GPU) at spark-egx-02:/home/allxu/q0_out_gpu_parquet.
By "just with libcudf" I meant isolating the issue by removing Spark from the equation, i.e. using the cudf APIs directly from the Spark shell REPL, something like this:
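A minimal sketch of what that REPL session might look like, assuming the cudf jar is on the Spark shell classpath. The class and method names (`Table.readParquet`, `Table.writeORC`) are from the `ai.rapids.cudf` Java bindings of that era and the part-file name is hypothetical; adjust both to your environment.

```scala
// Sketch: round-trip the data through libcudf directly, bypassing Spark's
// write path, to see whether the ORC corruption reproduces.
import ai.rapids.cudf.Table
import java.io.File

// Hypothetical part-file name inside the Parquet output directory.
val table = Table.readParquet(
  new File("/home/allxu/q0_out_gpu_parquet/part-00000.parquet"))

// Write the same table back out as ORC using libcudf only.
table.writeORC(new File("/tmp/q0_out_libcudf.orc"))

// cudf Tables hold GPU memory, so release it explicitly.
table.close()
```

The resulting `/tmp/q0_out_libcudf.orc` can then be read back with Spark CPU and diffed against the Parquet data.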
and verify that the ORC file can be read by Spark CPU and looks correct relative to the Parquet file.

I've checked this, and the Parquet file does not replicate the corrupted ORC file either when writing with Spark GPU or when using the cudf APIs directly. So either the corruption problem is sensitive to the ordering of the data (the ORC and Parquet files are ordered quite differently), or it's some other issue (e.g. a race condition).

I noticed in the bad ORC file that one column in particular, a string column, is unreadable due to the corruption. The other columns are all readable by Spark CPU; however, the data in another string column isn't completely correct. Its first row has corrupted data relative to the Parquet file, but many other rows are correct. So the corruption isn't completely isolated to just the one string column.

Does this issue happen every time the query is run in this cluster?
Thanks for the explanation!
cudf 0.18 has already shipped, so we cannot fix this in the 0.4 release. I am moving this to 0.5 and have filed #1722 to mitigate the issue in the 0.4 release.
@revans2 This has been fixed as of rapidsai/cudf#7565; I tested with the latest 0.5 plugin jar and the 0.19 cuDF jar.
Re-enable the orc write since NVIDIA#1550 has been fixed Signed-off-by: Allen Xu <[email protected]>
Re-enable the orc write since #1550 has been fixed Signed-off-by: Allen Xu <[email protected]>
Describe the bug
When reading the ORC output produced by the plugin (using some DataFrame APIs to operate on the ORC data), there's an error:
Steps/Code to reproduce bug
The output is produced by an LHA query, but I think it's safe to just point to where it is on our EGX machines: spark-egx-02:/home/allxu/q0_out_gpu.
Expected behavior
No error should be seen.
Environment details (please complete the following information)
Additional context
It's a query from LHA; please reach out to me if you need more information about it.