[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

vuule · 2022-10-10T21:17:04Z

Elements after the first row group are empty.

Only reproes with a string column.
Does not repro when the file is created by cuDF.

Minimal repro code:

expected = cudf.DataFrame({"str": ["*"] * 10001}, dtype="string")

buffer = BytesIO()
expected.to_pandas().to_orc(buffer)

got = cudf.read_orc(buffer)
assert_eq(expected, got)

The text was updated successfully, but these errors were encountered:

vuule · 2022-11-23T23:09:35Z

Seems to be related to row group index; reading nonsensical offsets for the second row group (larger than the streams):

strm_offset0, strm_offset1: 272 10000
strm_len, strm_len2: 10001 80

Even if we switch the offsets around 272 does not make any sense. Trying to find whether the stream lengths are wrong or the offsets.

…orc` (#12325) Issue #11890 Motivating issue: The ORC reader reads nulls in row groups after the first one when reading a string column encoded with Pandas, with direct encoding. The root cause is that cuDF reads offsets from the row group index as larger then the stream sizes. This PR does not fix the issue, but ensures that the reader fails loudly when the row group index offsets are read as too large to be correct. This should prevent data corruption until the fix is implemented. This PR also sets up a mechanism to report decode errors from unsupported data. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) URL: #12325

vuule · 2023-03-29T00:42:33Z

Turns out that the row index streams are read in incorrect order.
The values that we read from the pyarrow file are
byte offset, RLE run
272, 0
10000, 72

The correct values are
byte offset, RLE run
10000, 0
72, 272

With a hack to read the streams in inverse order, the file is read correctly.

vuule · 2023-03-29T00:48:22Z

Now, as to why the row index streams are read in wrong order:
libcudf ORC reader reader these streams in the same order as the order of corresponding data streams in the stripe footer.
When reading the pyarrow file these data streams' descriptors are read from the stripe footer as [LENGTH, DATA], while we read the row index streams in the [DATA, LENGTH] order. This difference in order causes the difference shown in the comment above.

vuule · 2023-03-29T00:59:49Z

So, why do we read these in inconsistent order? Either libcudf mixes up the order somehow, or the file stores this info in the incorrect order.
With some trace "logging" from the ORC reader, we can inspect the stripe footer and the row index, as they appear in the file (protobuf parsing always marches forward over the input, so the output order much match the input order).

Reading the same data written by cuDF with dictionary encode disabled (RI stands for row index):
StripeFooter begin
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 1 <- DATA stream type
Stream end
Stream begin
enum read: 2 <- LENGTH stream type
Stream end
ColumnEncoding begin
enum read: 0
ColumnEncoding end
ColumnEncoding begin
enum read: 2
ColumnEncoding end
StripeFooter end
RI: 10 RI: 22
RI: 10 RI: 3 RI: 0 RI: 0 RI: 0
RI: 10 RI: 20
RI: 10 RI: 4 RI: 10000 RI: 80 RI: 0

Reading the repro file:
StripeFooter begin
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 2 <- LENGTH stream type!!
Stream end
Stream begin
enum read: 1 <- DATA stream type!!
Stream end
ColumnEncoding begin
enum read: 0
ColumnEncoding end
ColumnEncoding begin
enum read: 2
ColumnEncoding end
StripeFooter end
RI: 10 RI: 24
RI: 10 RI: 3 RI: 0 RI: 0 RI: 0
RI: 10 RI: 23
RI: 10 RI: 5 RI: 10000 RI: 76 RI: 272 <- Same order as cuDF, DATA, then LENGTH

There are tiny differences in the data encode, but in both files the data stream is huge, while the length stream is tiny. The 10000 value should apply to the DATA stream in both cases, but the stripe footer order does not match the index streams.
TL;DR: It looks like the pyarrow file is invalid.

kkraus14 · 2023-04-17T21:41:11Z

@vuule I believe this is unrelated to Arrow and is either a bug in the libORC C++ library or a bug in libcudf. Here's a reproducer without arrow:

import cudf
import pyorc

with open("test_pyorc.orc", "wb") as output:
   with pyorc.Writer(output, "struct<col0:string>", struct_repr=pyorc.StructRepr.DICT) as writer:
      for _ in range(10001):
         writer.write({"col0": "*"})
         
print(cudf.read_orc("test_pyorc.orc"))

vuule · 2023-04-18T23:32:48Z

Thanks @kkraus14 .I'll try to find what's going on in libORC C++.

Meanwhile, I looked into why we haven't caught the bug in unit tests until recently. Turns out we dodged the issue by chance (?) in multiple tests. With some small changes, I can repro the issue in different tests - the bug is not specific to string data in this or the customer issue.

Fixes #11890 Use fixed order when reading row index streams. The order is `PRESENT`, `DATA`, `SECONDARY`/`LENGTH` (maps to `DATA2`). Is any of these is absent in the column, relative order is maintained. Thus, the order is a sub-array of the one above. Also simplified some logic related to stream order, as we do not need to pass it from the host. Instead, we only pass a bitmap to denote which streams are present. Updated the xfail test, as it now passes :) Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Keith Kraus (https://github.com/kkraus14) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13242

vuule added bug Something isn't working cuIO cuIO issue labels Oct 10, 2022

vuule self-assigned this Oct 10, 2022

GregoryKimball added this to the ORC continuous improvement milestone Nov 19, 2022

vuule mentioned this issue Dec 6, 2022

Fail loudly to avoid data corruption with unsupported input in read_orc #12325

Merged

3 tasks

GregoryKimball added this to libcudf Jan 15, 2023

vuule removed this from libcudf Jan 18, 2023

GregoryKimball mentioned this issue Mar 16, 2023

[BUG] reading ORC file fails #12920

Closed

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023

This was referenced Apr 21, 2023

[C++] Order of row index streams does not match the order of streams in the file footer apache/orc#1475

Open

Fix the row index stream order in ORC reader #13242

Merged

rapids-bot bot closed this as completed in #13242 May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

vuule commented Oct 10, 2022

vuule commented Nov 23, 2022

vuule commented Mar 29, 2023

vuule commented Mar 29, 2023

vuule commented Mar 29, 2023

kkraus14 commented Apr 17, 2023 •

edited

Loading

vuule commented Apr 18, 2023

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

Comments

vuule commented Oct 10, 2022

vuule commented Nov 23, 2022

vuule commented Mar 29, 2023

vuule commented Mar 29, 2023

vuule commented Mar 29, 2023

kkraus14 commented Apr 17, 2023 • edited Loading

vuule commented Apr 18, 2023

kkraus14 commented Apr 17, 2023 •

edited

Loading