Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

Closed
vuule opened this issue Oct 10, 2022 · 6 comments · Fixed by #13242
Closed

[BUG] cuDF ORC reader incorrectly reads file written by pyarrow #11890

vuule opened this issue Oct 10, 2022 · 6 comments · Fixed by #13242
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@vuule
Copy link
Contributor

vuule commented Oct 10, 2022

Elements after the first row group are empty.

Only reproes with a string column.
Does not repro when the file is created by cuDF.

Minimal repro code:

expected = cudf.DataFrame({"str": ["*"] * 10001}, dtype="string")

buffer = BytesIO()
expected.to_pandas().to_orc(buffer)

got = cudf.read_orc(buffer)
assert_eq(expected, got)
@vuule vuule added bug Something isn't working cuIO cuIO issue labels Oct 10, 2022
@vuule vuule self-assigned this Oct 10, 2022
@vuule
Copy link
Contributor Author

vuule commented Nov 23, 2022

Seems to be related to row group index; reading nonsensical offsets for the second row group (larger than the streams):

strm_offset0, strm_offset1: 272 10000
strm_len, strm_len2: 10001 80

Even if we switch the offsets around 272 does not make any sense. Trying to find whether the stream lengths are wrong or the offsets.

rapids-bot bot pushed a commit that referenced this issue Dec 12, 2022
…orc` (#12325)

Issue #11890

Motivating issue: The ORC reader reads nulls in row groups after the first one when reading a string column encoded with Pandas, with direct encoding. The root cause is that cuDF reads offsets from the row group index as larger then the stream sizes.

This PR does not fix the issue, but ensures that the reader fails loudly when the row group index offsets are read as too large to be correct. This should prevent data corruption until the fix is implemented. 

This PR also sets up a mechanism to report decode errors from unsupported data.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12325
@vuule vuule removed this from libcudf Jan 18, 2023
@vuule
Copy link
Contributor Author

vuule commented Mar 29, 2023

Turns out that the row index streams are read in incorrect order.
The values that we read from the pyarrow file are
byte offset, RLE run
272, 0
10000, 72

The correct values are
byte offset, RLE run
10000, 0
72, 272

With a hack to read the streams in inverse order, the file is read correctly.

@vuule
Copy link
Contributor Author

vuule commented Mar 29, 2023

Now, as to why the row index streams are read in wrong order:
libcudf ORC reader reader these streams in the same order as the order of corresponding data streams in the stripe footer.
When reading the pyarrow file these data streams' descriptors are read from the stripe footer as [LENGTH, DATA], while we read the row index streams in the [DATA, LENGTH] order. This difference in order causes the difference shown in the comment above.

@vuule
Copy link
Contributor Author

vuule commented Mar 29, 2023

So, why do we read these in inconsistent order? Either libcudf mixes up the order somehow, or the file stores this info in the incorrect order.
With some trace "logging" from the ORC reader, we can inspect the stripe footer and the row index, as they appear in the file (protobuf parsing always marches forward over the input, so the output order much match the input order).

Reading the same data written by cuDF with dictionary encode disabled (RI stands for row index):
StripeFooter begin
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 1 <- DATA stream type
Stream end
Stream begin
enum read: 2 <- LENGTH stream type
Stream end
ColumnEncoding begin
enum read: 0
ColumnEncoding end
ColumnEncoding begin
enum read: 2
ColumnEncoding end
StripeFooter end
RI: 10 RI: 22
RI: 10 RI: 3 RI: 0 RI: 0 RI: 0
RI: 10 RI: 20
RI: 10 RI: 4 RI: 10000 RI: 80 RI: 0

Reading the repro file:
StripeFooter begin
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 6
Stream end
Stream begin
enum read: 2 <- LENGTH stream type!!
Stream end
Stream begin
enum read: 1 <- DATA stream type!!
Stream end
ColumnEncoding begin
enum read: 0
ColumnEncoding end
ColumnEncoding begin
enum read: 2
ColumnEncoding end
StripeFooter end
RI: 10 RI: 24
RI: 10 RI: 3 RI: 0 RI: 0 RI: 0
RI: 10 RI: 23
RI: 10 RI: 5 RI: 10000 RI: 76 RI: 272 <- Same order as cuDF, DATA, then LENGTH

There are tiny differences in the data encode, but in both files the data stream is huge, while the length stream is tiny. The 10000 value should apply to the DATA stream in both cases, but the stripe footer order does not match the index streams.
TL;DR: It looks like the pyarrow file is invalid.

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 17, 2023

@vuule I believe this is unrelated to Arrow and is either a bug in the libORC C++ library or a bug in libcudf. Here's a reproducer without arrow:

import cudf
import pyorc

with open("test_pyorc.orc", "wb") as output:
   with pyorc.Writer(output, "struct<col0:string>", struct_repr=pyorc.StructRepr.DICT) as writer:
      for _ in range(10001):
         writer.write({"col0": "*"})
         
print(cudf.read_orc("test_pyorc.orc"))

@vuule
Copy link
Contributor Author

vuule commented Apr 18, 2023

Thanks @kkraus14 .I'll try to find what's going on in libORC C++.

Meanwhile, I looked into why we haven't caught the bug in unit tests until recently. Turns out we dodged the issue by chance (?) in multiple tests. With some small changes, I can repro the issue in different tests - the bug is not specific to string data in this or the customer issue.

rapids-bot bot pushed a commit that referenced this issue May 10, 2023
Fixes #11890 

Use fixed order when reading row index streams.
The order is `PRESENT`, `DATA`, `SECONDARY`/`LENGTH` (maps to `DATA2`). 
Is any of these is absent in the column, relative order is maintained. Thus, the order is a sub-array of the one above.

Also simplified some logic related to stream order, as we do not need to pass it from the host. Instead, we only pass a bitmap to denote which streams are present.
Updated the xfail test, as it now passes :)

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13242
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants