-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail loudly to avoid data corruption with unsupported input in read_orc
#12325
Fail loudly to avoid data corruption with unsupported input in read_orc
#12325
Conversation
…/vuule/cudf into bug-read_orc-strm-ofst-loud-fail
Codecov ReportBase: 88.37% // Head: 86.56% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-23.02 #12325 +/- ##
================================================
- Coverage 88.37% 86.56% -1.81%
================================================
Files 137 155 +18
Lines 22657 24510 +1853
================================================
+ Hits 20022 21218 +1196
- Misses 2635 3292 +657
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
read_orc
read_orc
@@ -287,6 +287,7 @@ void DecodeNullsAndStringDictionaries(ColumnDesc* chunks, | |||
* @param[in] num_rowgroups Number of row groups in row index data | |||
* @param[in] rowidx_stride Row index stride | |||
* @param[in] level Current nesting level being processed | |||
* @param[out] error_count Number of errors during decode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just return this number, instead of using the void return type and modifying this parameter? I understand that this may be a pointer to device memory but we will read it to host anyway, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DecodeOrcColumnData
is asynchronous. The fact that we copy chunks
to host immediately after calling DecodeOrcColumnData
should not impact how its implemented. If we return the error code we are enforcing this synchronization even though it might not be required otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on the xfail in the test (sorry this was a bit delayed)
python/cudf/cudf/tests/test_orc.py
Outdated
try: | ||
got = cudf.read_orc(buffer) | ||
except RuntimeError: | ||
pytest.mark.xfail( | ||
reason="Unsupported file, " | ||
"see https://github.com/rapidsai/cudf/issues/11890" | ||
) | ||
else: | ||
assert_eq(expected, got) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code is probably not doing what you want. I think the conditions you want to handle are:
- The read fails with
RuntimeError
(this is an expected failure) - The read succeeds (and then we expect the data to match)
- The read fails with some other error (this is an unexpected failure)
To handle this I think you want:
@pytest.mark.xfail(reason="https://github.com/rapidsai/cudf/issues/11890", raises=RuntimeError)
def test_reader_unsupported_offsets():
expect = ...
got = ...
assert_eq(expect, got)
pytest.mark.xfail
Doesn't do anything programmatically, so as written your "except RuntimeError" block just turns into a test pass.
With #12244, as soon as the bug is fixed, this marked test will turn into a failure (an unexpected pass) so we will be reminded to remove the mark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think IO is a special beast as far as the cudf mantra of not validating input. I also think in this case there isn't really any extra overhead to do it. I like this change.
@gpucibot merge |
Description
Issue #11890
Motivating issue: The ORC reader reads nulls in row groups after the first one when reading a string column encoded with Pandas, with direct encoding. The root cause is that cuDF reads offsets from the row group index as larger then the stream sizes.
This PR does not fix the issue, but ensures that the reader fails loudly when the row group index offsets are read as too large to be correct. This should prevent data corruption until the fix is implemented.
This PR also sets up a mechanism to report decode errors from unsupported data.
Checklist