Fix invalid memory access in Parquet reader #14637
@nvdbaranec this PR addresses the issue, but is not the most elegant fix. I tried to find a location to zero out the pointers where I could piggyback on an existing H2D copy of the chunks, but didn't have much luck. Could you please take a look and suggest a better fix? |
/ok to test |
An alternative that doesn't require copying the chunks array to the device is to zero out the pointers at the beginning of Also, the read is somewhat benign since the result of the dereference is never used. This shouldn't impact users if it went out in 23.12. Ran the full PARQUET_TEST through compute-sanitizer after the change and got no additional errors reported. |
Yet another option may be to add another boolean argument to |
/ok to test |
/ok to test |
@@ -616,7 +616,9 @@ __global__ void __launch_bounds__(preprocess_block_size) gpuComputeStringPageBou

  // setup page info
  auto const mask = BitOr(decode_kernel_mask::STRING, decode_kernel_mask::DELTA_BYTE_ARRAY);
  if (!setupLocalPageInfo(s, pp, chunks, min_row, num_rows, mask_filter{mask}, true)) { return; }
Why does this PR change the `is_decode_step` value in some of these calls?
The string preprocessing was passing that as true, leading the setup call to believe the output buffers were valid and thus accessing invalid memory. With the new flag true and the old flag false, we get the behavior that was originally desired, but can now skip the bad pointer arithmetic.
Co-authored-by: Nghia Truong <[email protected]>
/ok to test |
/ok to test |
Two adjacent bool parameters are a bit sketchy, but cleanup can be a separate PR.
// if we're in the decoding step, jump directly to the first
// value we care about
I think this comment needs to be updated.
Could change this to move the `is_bounds_step` test inside the `else` block... I think all that's necessary is to just not zero out those values.
Well, it's a tri-state: preprocessing (false false), string bounds (false true), decode (true false). Could add an enum... |
Yeah an enum (with comment) should be much clearer, instead of just |
The change to enum is great. No need to decipher the steps. |
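For readers following the thread, the tri-state described above could be sketched roughly as below. This is only an illustration: the enum name, its enumerators, and both helper functions are hypothetical and are not claimed to match what was merged. The mapping from the old flag pairs follows the three combinations listed in the comment above.

```cpp
#include <cassert>

// Hypothetical names; a sketch of the three setup modes, not the merged code.
enum class page_processing_stage {
  PREPROCESS,     // formerly (is_decode_step = false, is_bounds_step = false)
  STRING_BOUNDS,  // formerly (is_decode_step = false, is_bounds_step = true)
  DECODE          // formerly (is_decode_step = true,  is_bounds_step = false)
};

// Per the review discussion, only the decode step may assume that the
// output buffers are valid, so only it should do pointer arithmetic on them.
constexpr bool may_access_output_buffers(page_processing_stage stage)
{
  return stage == page_processing_stage::DECODE;
}

// Maps the old pair of bools onto the enum. (true, true) is not one of the
// three combinations listed in the thread, so it is rejected here.
inline page_processing_stage from_legacy_flags(bool is_decode_step, bool is_bounds_step)
{
  assert(!(is_decode_step && is_bounds_step));
  if (is_decode_step) { return page_processing_stage::DECODE; }
  return is_bounds_step ? page_processing_stage::STRING_BOUNDS
                        : page_processing_stage::PREPROCESS;
}
```

With a single enum argument, a call site names its stage directly instead of relying on the order of two adjacent `bool` parameters.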
/ok to test |
/merge |
Fixes rapidsai#14633

When reading files in multiple passes, some pointer fields in `ColumnChunkDesc` that point to transient memory are not cleared out at the end of each pass. This can lead to trying to dereference deallocated memory during Parquet reader string preprocessing.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#14637
Description
Fixes #14633
When reading files in multiple passes, some pointer fields in `ColumnChunkDesc` that point to transient memory are not cleared out at the end of each pass. This can lead to trying to dereference deallocated memory during Parquet reader string preprocessing.
Checklist
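The failure mode in the description can be illustrated with a small host-side sketch. It assumes a simplified stand-in struct: `chunk_desc_sketch`, its fields, and both functions are hypothetical and do not reflect the real layout of `ColumnChunkDesc` or the fix that was merged. It shows only the general idea of resetting per-pass pointers so that later code can test for null instead of dereferencing freed memory.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a chunk descriptor; field names are illustrative.
struct chunk_desc_sketch {
  std::uint8_t const* compressed_data = nullptr;  // assumed to outlive a pass
  void* pass_output_buffer            = nullptr;  // transient: valid for one pass only
};

// Reset every transient pointer once a pass has finished, so that the next
// pass cannot observe an address whose memory has already been released.
inline void clear_transient_pointers(std::vector<chunk_desc_sketch>& chunks)
{
  for (auto& chunk : chunks) {
    chunk.pass_output_buffer = nullptr;
  }
}

// A later step can then check for null rather than trusting a stale address.
inline bool has_output_buffer(chunk_desc_sketch const& chunk)
{
  return chunk.pass_output_buffer != nullptr;
}
```

In this sketch the reset happens on the host; as the conversation notes, where to perform the reset relative to the host-to-device copy of the chunks is a separate design question.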