Fail loudly to avoid data corruption with unsupported input in `read_orc` #12325

vuule · 2022-12-06T23:56:39Z

Description

Motivating issue: The ORC reader returns nulls in row groups after the first one when reading a string column written by Pandas with direct encoding. The root cause is that cuDF reads offsets from the row group index that are larger than the stream sizes.

This PR does not fix the issue, but ensures that the reader fails loudly when the row group index offsets are read as too large to be correct. This should prevent data corruption until the fix is implemented.

This PR also sets up a mechanism to report decode errors from unsupported data.
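The check described above can be illustrated with a minimal Python sketch. It is not the libcudf implementation (which runs in CUDA kernels and reports an error count); the `Stream` class and `check_row_group_offsets` function are hypothetical names used only to show the idea of rejecting offsets that point past the end of a stream instead of decoding garbage.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """Hypothetical stand-in for one ORC data stream."""

    name: str
    size: int  # stream length in bytes


def check_row_group_offsets(stream, row_group_offsets):
    """Raise if any row group index offset points past the end of the stream."""
    bad = [
        (row_group, offset)
        for row_group, offset in enumerate(row_group_offsets)
        if offset > stream.size
    ]
    if bad:
        # Fail loudly: refusing to decode is safer than returning wrong data.
        raise RuntimeError(
            f"{len(bad)} row group offset(s) exceed size {stream.size} "
            f"of stream {stream.name!r}: {bad}"
        )


data = Stream(name="DATA", size=1024)
check_row_group_offsets(data, [0, 400, 800])  # all offsets in range: no error

try:
    check_row_group_offsets(data, [0, 400, 5000])  # 5000 is past the stream end
except RuntimeError as err:
    print(f"read aborted: {err}")
```

Collecting every out-of-range offset before raising mirrors the idea of an error count: the caller learns how many entries were invalid rather than only the first one.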

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-12-07T01:13:28Z

Codecov Report

Base: 88.37% // Head: 86.56% // Decreases project coverage by -1.80% ⚠️

Coverage data is based on head (4653907) compared to base (a9f9958).
Patch coverage: 96.19% of modified lines in pull request are covered.

❗ Current head 4653907 differs from pull request most recent head 5bf0f37. Consider uploading reports for the commit 5bf0f37 to get more accurate results

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.02   #12325      +/-   ##
================================================
- Coverage         88.37%   86.56%   -1.81%     
================================================
  Files               137      155      +18     
  Lines             22657    24510    +1853     
================================================
+ Hits              20022    21218    +1196     
- Misses             2635     3292     +657

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/column.py | `87.95% <ø> (-0.03%)` | ⬇️ |
| python/dask_cudf/dask_cudf/core.py | `73.90% <ø> (-0.31%)` | ⬇️ |
| python/dask_cudf/dask_cudf/groupby.py | `97.36% <ø> (ø)` | |
| python/dask_cudf/dask_cudf/io/tests/test_csv.py | `100.00% <ø> (ø)` | |
| python/dask_cudf/dask_cudf/io/tests/test_json.py | `100.00% <ø> (ø)` | |
| python/dask_cudf/dask_cudf/io/tests/test_orc.py | `100.00% <ø> (ø)` | |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | `100.00% <ø> (ø)` | |
| python/dask_cudf/dask_cudf/tests/test_core.py | `95.42% <ø> (-0.02%)` | ⬇️ |
| python/cudf/cudf/core/dataframe.py | `93.49% <66.66%> (-0.18%)` | ⬇️ |
| python/cudf/cudf/core/_base_index.py | `81.38% <100.00%> (+0.22%)` | ⬆️ |
| ... and 35 more | | |

ttnghia · 2022-12-08T03:05:30Z

cpp/src/io/orc/orc_gpu.hpp

@@ -287,6 +287,7 @@ void DecodeNullsAndStringDictionaries(ColumnDesc* chunks,
 * @param[in] num_rowgroups Number of row groups in row index data
 * @param[in] rowidx_stride Row index stride
 * @param[in] level Current nesting level being processed
+ * @param[out] error_count Number of errors during decode


Why not just return this number, instead of using the void return type and modifying this parameter? I understand that this may be a pointer to device memory but we will read it to host anyway, right?

vuule · 2022-12-09

`DecodeOrcColumnData` is asynchronous. The fact that we copy chunks to host immediately after calling `DecodeOrcColumnData` should not affect how it's implemented. If we returned the error code, we would be enforcing this synchronization even though it might not otherwise be required.

wence-

Comment on the xfail in the test (sorry this was a bit delayed)

wence- · 2022-12-08T09:30:29Z

python/cudf/cudf/tests/test_orc.py

+    try:
+        got = cudf.read_orc(buffer)
+    except RuntimeError:
+        pytest.mark.xfail(
+            reason="Unsupported file, "
+            "see https://github.com/rapidsai/cudf/issues/11890"
+        )
+    else:
+        assert_eq(expected, got)


This block of code is probably not doing what you want. I think the conditions you want to handle are:

1. The read fails with `RuntimeError` (this is an expected failure).
2. The read succeeds (and then we expect the data to match).
3. The read fails with some other error (this is an unexpected failure).

To handle this I think you want:

    @pytest.mark.xfail(reason="https://github.com/rapidsai/cudf/issues/11890", raises=RuntimeError)
    def test_reader_unsupported_offsets():
        expect = ...
        got = ...
        assert_eq(expect, got)

`pytest.mark.xfail` doesn't do anything programmatically, so as written your `except RuntimeError` block just turns into a test pass.

With #12244, as soon as the bug is fixed, this marked test will turn into a failure (an unexpected pass) so we will be reminded to remove the mark.
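The difference between the two patterns can be seen in a small self-contained sketch. The `read_unsupported` function is a hypothetical stand-in for a reader that rejects its input with `RuntimeError`; it is not part of cuDF, and the test names are illustrative only.

```python
import pytest


def read_unsupported(buffer):
    """Hypothetical reader that always rejects its input with a loud failure."""
    raise RuntimeError("unsupported row group index offsets")


def test_swallows_failure():
    # Pattern from the original diff: the mark object is built and discarded,
    # so the except branch ends normally and pytest reports the test as passed.
    try:
        read_unsupported(b"")
    except RuntimeError:
        pytest.mark.xfail(reason="unsupported file")


@pytest.mark.xfail(reason="unsupported file", raises=RuntimeError)
def test_expected_failure():
    # Suggested pattern: the mark is attached to the test function, so pytest
    # reports a RuntimeError as xfailed and any other exception as a failure.
    read_unsupported(b"")
```

Run under pytest, the first test shows up as passed even though the read failed, while the second is reported as xfailed. Once the read starts succeeding, the second test becomes an unexpected pass, which is the reminder to remove the mark.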

vuule · 2022-12-09

Done, thank you.

hyperbolic2346

I think IO is a special beast as far as the cudf mantra of not validating input. I also think in this case there isn't really any extra overhead to do it. I like this change.

cpp/src/io/orc/reader_impl.cu

vuule · 2022-12-12T21:28:23Z

@gpucibot merge

vuule added 3 commits December 6, 2022 15:12

check offsets and fail with an error if too large

7fbdf3c

test

43f6126

test

b30f5bb

vuule added the bug (Something isn't working), cuIO (cuIO issue), and breaking (Breaking change) labels Dec 6, 2022

vuule self-assigned this Dec 6, 2022

github-actions bot added the Python (Affects Python cuDF API.) and libcudf (Affects libcudf (C++/CUDA) code.) labels Dec 6, 2022

Merge branch 'bug-read_orc-strm-ofst-loud-fail' of https://github.com…

f0d52b3

…/vuule/cudf into bug-read_orc-strm-ofst-loud-fail

vuule changed the title ~~Fail loudly to avoid data corruption in read_orc~~ Fail loudly to avoid data corruption with unsupported input in read_orc Dec 8, 2022

vuule marked this pull request as ready for review December 8, 2022 01:17

vuule requested review from a team as code owners December 8, 2022 01:17

vuule requested review from shwina, isVoid, harrism and ttnghia December 8, 2022 01:17

galipremsagar approved these changes Dec 8, 2022

ttnghia reviewed Dec 8, 2022

wence- reviewed Dec 9, 2022

fix test xfail

876f8f8

vuule changed the base branch from branch-23.02 to branch-22.12 December 9, 2022 20:22

vuule changed the base branch from branch-22.12 to branch-23.02 December 9, 2022 20:22

hyperbolic2346 approved these changes Dec 9, 2022

ttnghia reviewed Dec 9, 2022


cpp/src/io/orc/reader_impl.cu (outdated, resolved)

change error_count type to size_type

5bf0f37

ttnghia approved these changes Dec 12, 2022

vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Dec 12, 2022

rapids-bot bot merged commit 8a40902 into rapidsai:branch-23.02 Dec 12, 2022

vuule deleted the bug-read_orc-strm-ofst-loud-fail branch August 10, 2023 03:26
