[BUG] Data corruption writing ORC data with lots of nulls before timestamp #13460
Comments
It looks like the cudf ORC writer data corruption is triggered if there are >= 10000 nulls at the start of the series. Here is an example Python repro that fails on host read:
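A minimal sketch of that kind of repro (not the original snippet), assuming cuDF's Python API (`cudf.from_pandas`, `DataFrame.to_orc`) for the GPU-side write and pandas/pyarrow for the host-side read; the file name and exact values are placeholders:

```python
# Hedged sketch: >= 10,000 leading nulls followed by a few non-null timestamps,
# written with the cuDF ORC writer and then read back on the host.
import cudf
import pandas as pd

n_nulls = 10_000  # corruption reported at >= 10,000 leading nulls
tail = list(pd.date_range("2023-01-01", periods=8))  # ~8 non-null rows after the nulls

ts = pd.Series([pd.NaT] * n_nulls + tail, dtype="datetime64[ns]")
gdf = cudf.from_pandas(pd.DataFrame({"ts": ts}))
gdf.to_orc("repro.orc")  # GPU-side ORC write

# Host-side read; with the buggy writer this failed or returned wrong values.
host_df = pd.read_orc("repro.orc")
print(host_df.tail())
```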
Also, even though cudf can read this file, the data has changed:
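Continuing the sketch above, the cuDF-side round trip and comparison might look like this (again a hedged sketch, not the original output):

```python
# cuDF can read the file back without error, but in the buggy case the values
# reportedly no longer match what was written.
gdf_back = cudf.read_orc("repro.orc")
matches = gdf_back["ts"].to_pandas().equals(ts)
print("round trip matches:", matches)
```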
So perhaps the solution to #13460 would detect the RLE encoding failure and crash instead of returning zeros in this case. I'm working from a recent 23.06 commit.
Also, the crash on the CPU only appears to happen if we have about 8 or more rows of data after the nulls. If there are fewer, we get data corruption, but not a crash.
I saw the 10,000 limit too. I think this might be related to cudf/cpp/include/cudf/io/orc.hpp line 42 (at commit cc317ed).
Yup, that is it. I changed it to 5000 and now the corruption shows up at >= 5000 nulls.
Nope, I was wrong. We still get data corruption on longs. It just does not always throw the exception.
I also see the corruption with ints. So it looks like it is a generic issue that is not specific to timestamps.
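A variant of the same sketch with a plain integer column illustrates the point above that the issue is not timestamp-specific (names and values are again placeholders):

```python
# Hedged sketch: same pattern with a nullable int64 column instead of timestamps.
import cudf
import pandas as pd

vals = pd.Series([None] * 10_000 + list(range(8)), dtype="Int64")
cudf.from_pandas(pd.DataFrame({"x": vals})).to_orc("repro_int.orc")
print(pd.read_orc("repro_int.orc").tail())  # host read; corruption showed up here too
```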
Here's my C++ repro; it should be equivalent to @GregoryKimball's Python code above.
Issue #13460. Fixes the bug in `gpuCompactOrcDataStreams` where stream pointer would not get updated for empty row groups.
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
@revans2 can this issue be closed now? I didn't set the PR to close the issue because I only tested a very derived repro.
yes
Describe the bug
I am still working on a repro case for this in pure CUDF, but I wanted to get this up ASAP as I work on it.
We have a customer that got data corruption when trying to write out an ORC file that had lots of nulls before it hit a non-null timestamp value.
I am still working on a pure CUDF C++ repro case, but for now this is what I have.
I also wrote the same data out to a Parquet file, and if I transcode it to ORC I get the same error:
data.zip
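The Parquet-to-ORC transcode described above takes only a couple of cuDF calls; a sketch, assuming the attachment unpacks to a Parquet file (the name `data.parquet` is a placeholder):

```python
# Hedged sketch: transcode the attached Parquet data to ORC with cuDF.
# Reading the resulting ORC file on the CPU reportedly hit the same error.
import cudf

gdf = cudf.read_parquet("data.parquet")  # placeholder for the attached file
gdf.to_orc("transcoded.orc")
```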
The error that the CPU outputs when reading in the corrupt file is very similar to the error that I get when I try to read the file using the ORC Java command line tools.
I tested this with 23.04.1 and I didn't see this problem at all, so I think this is something that was introduced recently in CUDF.