Fail loudly to avoid data corruption with unsupported input in read_orc
#12325
@@ -1834,3 +1834,22 @@ def test_reader_empty_stripe(datadir, fname):
     expected = pd.read_orc(path)
     got = cudf.read_orc(path)
     assert_eq(expected, got)
+
+
+def test_reader_unsupported_offsets():
+    # needs enough data for more than one row group
+    expected = cudf.DataFrame({"str": ["*"] * 10001}, dtype="string")
+
+    buffer = BytesIO()
+    expected.to_pandas().to_orc(buffer)
+
+    # Reading this file should not lead to data corruption, even if it fails
+    try:
+        got = cudf.read_orc(buffer)
+    except RuntimeError:
+        pytest.mark.xfail(
+            reason="Unsupported file, "
+            "see https://github.com/rapidsai/cudf/issues/11890"
+        )
+    else:
+        assert_eq(expected, got)
This block of code is probably not doing what you want. I think the conditions you want to handle are:
To handle this I think you want:
@pytest.mark.xfail(reason="https://github.com/rapidsai/cudf/issues/11890", raises=RuntimeError)
def test_reader_unsupported_offsets():
    expect = ...
    got = ...
    assert_eq(expect, got)
With #12244, as soon as the bug is fixed, this marked test will turn into a failure (an unexpected pass), so we will be reminded to remove the mark.
done, thank you
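To make the suggestion above concrete, here is a minimal, stdlib-only sketch of how a strict `xfail` mark with `raises=RuntimeError` classifies the possible outcomes of the test. It is an emulation for illustration, not pytest's implementation: `run_as_strict_xfail` and the three `reader_*` functions are hypothetical names, and strict behaviour is assumed because the comment describes an unexpected pass turning into a failure.

```python
def run_as_strict_xfail(test_fn, raises=RuntimeError):
    """Hypothetical helper emulating xfail(raises=..., strict=True)."""
    try:
        test_fn()
    except raises:
        # The known bug is still present: reported as an expected failure.
        return "xfailed"
    except Exception:
        # Any other exception, including AssertionError from corrupted
        # data, is a genuine failure and is not hidden by the mark.
        return "failed"
    # The test passed although it was marked: strict mode reports a failure
    # so that someone removes the mark.
    return "failed (unexpected pass)"


def reader_raises():
    # Stand-in for read_orc failing loudly on an unsupported file.
    raise RuntimeError("unsupported file")


def reader_corrupts():
    # Stand-in for read_orc silently returning wrong data.
    assert ["*"] == ["?"]


def reader_fixed():
    # Stand-in for read_orc returning correct data once the bug is fixed.
    assert ["*"] == ["*"]


outcomes = {
    fn.__name__: run_as_strict_xfail(fn)
    for fn in (reader_raises, reader_corrupts, reader_fixed)
}
```

By contrast, calling `pytest.mark.xfail(...)` inside an `except` block only builds a mark object and discards it, so the original test would pass silently whenever `RuntimeError` is raised.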
Why not just return this number, instead of using the void return type and modifying this parameter? I understand that this may be a pointer to device memory but we will read it to host anyway, right?
`DecodeOrcColumnData` is asynchronous. The fact that we copy `chunks` to host immediately after calling `DecodeOrcColumnData` should not impact how it's implemented. If we return the error code, we are enforcing this synchronization even though it might not be required otherwise.
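The trade-off in this reply can be sketched with a loose Python analogy, in which a thread pool stands in for an asynchronous device stream. This is not cuDF's code: `decode_async`, the `error_count` buffer, and the rule that negative values count as unsupported input are all assumptions made for illustration. The point is only that writing the error count to an out-parameter lets the call return at once, while returning the count directly would force the caller to wait.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_async(executor, data, error_count):
    """Hypothetical stand-in for an asynchronous decode call.

    Queues the work and returns immediately. Problems are recorded in
    error_count[0], which plays the role of a device-side counter.
    """
    def work():
        for value in data:
            if value < 0:  # pretend negative values are unsupported input
                error_count[0] += 1

    # The returned future plays the role of the stream the work was queued on.
    return executor.submit(work)


with ThreadPoolExecutor(max_workers=1) as executor:
    error_count = [0]
    pending = decode_async(executor, [1, -2, 3, -4], error_count)
    # The caller is free to queue more work here before synchronizing.
    pending.result()  # explicit synchronization point, chosen by the caller
    num_errors = error_count[0]
```

Had `decode_async` returned the count itself, it would have had to call `result()` internally on every invocation, even for callers that never inspect the count right away.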