[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890
Comments
Seems to be related to row group index; reading nonsensical offsets for the second row group (larger than the streams):
Even if we switch the offsets around, 272 does not make any sense. Trying to find whether the stream lengths are wrong or the offsets.
…orc` (#12325) Issue #11890 Motivating issue: The ORC reader reads nulls in row groups after the first one when reading a string column encoded with Pandas, with direct encoding. The root cause is that cuDF reads offsets from the row group index as larger than the stream sizes. This PR does not fix the issue, but ensures that the reader fails loudly when the row group index offsets are read as too large to be correct. This should prevent data corruption until the fix is implemented. This PR also sets up a mechanism to report decode errors from unsupported data. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) URL: #12325
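The "fail loudly" check described in that PR can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the libcudf implementation: `StreamInfo`, `RowIndexOffsetError`, and `validate_row_group_offsets` are hypothetical names, and the offsets are assumed to be plain byte offsets into uncompressed streams, listed in the same order as the streams.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StreamInfo:
    """Hypothetical description of one column stream in a stripe."""
    kind: str    # e.g. "DATA" or "LENGTH"
    length: int  # stream size in bytes


class RowIndexOffsetError(ValueError):
    """Raised when a row group offset cannot belong to its stream."""


def validate_row_group_offsets(
    streams: Sequence[StreamInfo],
    offsets_per_row_group: Sequence[Sequence[int]],
) -> None:
    """Raise instead of decoding when an offset points past its stream.

    Assumption: offsets_per_row_group[i][j] is the start offset of
    row group i within streams[j].
    """
    for rg_idx, offsets in enumerate(offsets_per_row_group):
        if len(offsets) != len(streams):
            raise RowIndexOffsetError(
                f"row group {rg_idx}: expected {len(streams)} offsets, "
                f"got {len(offsets)}"
            )
        for stream, offset in zip(streams, offsets):
            if offset > stream.length:
                raise RowIndexOffsetError(
                    f"row group {rg_idx}: offset {offset} exceeds "
                    f"{stream.kind} stream length {stream.length}"
                )
```

With a large DATA stream and a tiny LENGTH stream, an offset of 10000 passes when paired with DATA and raises when paired with LENGTH, which is the symptom discussed in this thread.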
Turns out that the row index streams are read in the incorrect order. The correct values are:
With a hack to read the streams in inverse order, the file is read correctly.
Now, as to why the row index streams are read in the wrong order:
So, why do we read these in inconsistent order? Either libcudf mixes up the order somehow, or the file stores this info in the incorrect order. Reading the same data written by cuDF with dictionary encoding disabled (RI stands for row index):
Reading the repro file:
There are tiny differences in the data encoding, but in both files the data stream is huge, while the length stream is tiny. The 10000 value should apply to the DATA stream in both cases, but the stripe footer order does not match the index streams.
@vuule I believe this is unrelated to Arrow and is either a bug in the libORC C++ library or a bug in libcudf. Here's a reproducer without arrow:

import cudf
import pyorc

with open("test_pyorc.orc", "wb") as output:
    with pyorc.Writer(output, "struct<col0:string>", struct_repr=pyorc.StructRepr.DICT) as writer:
        for _ in range(10001):
            writer.write({"col0": "*"})

print(cudf.read_orc("test_pyorc.orc"))
Thanks @kkraus14. I'll try to find what's going on in libORC C++. Meanwhile, I looked into why we haven't caught the bug in unit tests until recently. Turns out we dodged the issue by chance (?) in multiple tests. With some small changes, I can repro the issue in different tests - the bug is not specific to string data in this or the customer issue.
Fixes #11890 Use fixed order when reading row index streams. The order is `PRESENT`, `DATA`, `SECONDARY`/`LENGTH` (maps to `DATA2`). If any of these is absent in the column, relative order is maintained. Thus, the order is a sub-array of the one above. Also simplified some logic related to stream order, as we do not need to pass it from the host. Instead, we only pass a bitmap to denote which streams are present. Updated the xfail test, as it now passes :) Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Keith Kraus (https://github.com/kkraus14) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13242
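The fixed-order rule from that PR description can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the libcudf code: the `StreamBit` flag values and the `row_index_positions` helper are hypothetical, chosen only to show how a bitmap of present streams is enough to recover each stream's slot in the row index.

```python
from enum import IntFlag
from typing import Dict


class StreamBit(IntFlag):
    """Hypothetical bitmap flags; the bit values are illustrative only."""
    PRESENT = 1
    DATA = 2
    DATA2 = 4  # SECONDARY / LENGTH streams map to DATA2


# Fixed order in which row index entries are laid out.
FIXED_ORDER = (StreamBit.PRESENT, StreamBit.DATA, StreamBit.DATA2)


def row_index_positions(present_mask: int) -> Dict[str, int]:
    """Map each stream that is present to its slot in the row index.

    Absent streams are skipped, so the result is always a sub-array
    of FIXED_ORDER with relative order preserved.
    """
    positions: Dict[str, int] = {}
    next_slot = 0
    for kind in FIXED_ORDER:
        if present_mask & kind:
            positions[kind.name] = next_slot
            next_slot += 1
    return positions
```

For a string column with no nulls (no `PRESENT` stream), `row_index_positions(StreamBit.DATA | StreamBit.DATA2)` places `DATA` in slot 0 and `DATA2` in slot 1, regardless of the order in which the stripe footer lists the streams.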
Elements after the first row group are empty.
Only repros with a string column.
Does not repro when the file is created by cuDF.
Minimal repro code: