-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address potential race conditions in Parquet reader #14602
Address potential race conditions in Parquet reader #14602
Conversation
/ok to test |
There is another racecheck error in decode, might be somewhat related. Decided with @etseidl to investigate remaining errors before merging this fix. |
/ok to test |
cpp/src/io/parquet/page_decode.cuh
Outdated
// need this to ensure input_value_count is read by all threads before s->input_value_count | ||
// is modified below (just in case input_value count >= target_input_value_count). | ||
__syncwarp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one I'm not so sure about needing. In the worst case, thead 0 sets the local var, skips the loop (and the syncwarp within it) and then overwrites the shared value before other threads read it. But in that case it will just overwrite with the same value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually need to update s->nz_count, s->input_value_count and s->input_row_count if we never enter the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking no...they shouldn't have changed if the loop wasn't entered. But I'll admit this is one of the parts of the parquet code that I understand the least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, we should be able to return early if initially input_value_count >= target_input_value_count
, right?
That would simplify the logic and prevent the tool from reporting the race condition.
CC @nvdbaranec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change and verified that racecheck is happy
cpp/src/io/parquet/page_decode.cuh
Outdated
// need this to ensure input_value_count is read by all threads before s->input_value_count | ||
// is modified below (just in case input_value count >= target_input_value_count). | ||
__syncwarp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually need to update s->nz_count, s->input_value_count and s->input_row_count if we never enter the loop?
exit early from gpuUpdateValidityOffsetsAndRowIndices to avoid possible race observed warning for gpuDecodeRleBooleans so remove comment
/ok to test |
cpp/src/io/parquet/page_decode.cuh
Outdated
// ensure all threads read s->dict_pos before returning | ||
__syncwarp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this one. The return value from this function is explicitly stated to only be valid on thread 0. Looking at all the call sites, it's always thread 0 that actually does any work with the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is kind of like the one in gpuUpdateValidityOffsetsAndRowIndices
, except here the assignment back to s->dict_pos is done after this call returns. If the loop is entered, then all threads will hit the syncwarp there. It's only an issue if pos >= target_pos
. Given this has worked without problems for quite some time, I can get rid of this and the one in gpuDecodeRleBooleans
.
cpp/src/io/parquet/page_decode.cuh
Outdated
@@ -357,6 +360,9 @@ inline __device__ int gpuDecodeRleBooleans(page_state_s* s, state_buf* sb, int t | |||
uint8_t const* end = s->data_end; | |||
int64_t pos = s->dict_pos; | |||
|
|||
// ensure all threads read s->dict_pos before returning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as the one in gpuDecodeDictionaryIndices
@@ -294,7 +296,6 @@ __device__ thrust::pair<int, int> page_bounds(page_state_s* const s, | |||
pp->num_nulls = null_count; | |||
pp->num_valids = pp->num_input_values - null_count; | |||
} | |||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems dangerous to remove. Aren't all threads except 0 in danger of using the wrong pp->num_nulls
value right below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another only-valid-on-thread-0 result. I originally added more syncthreads before all the other returns, but @vuule pointed out that once this function returns, all that happens is thread 0 takes the return values and copies them to global memory (along with 2 shared mem fields) and then returns. The other threads simply return ignored garbage and exit.
Actually, I should probably move this entire function into gpuComputeStringPageBounds
, which would make the above more obvious. It made sense to be a standalone when it was part of the gpuComputePageStringSizes
kernel (and back then the syncthreads was necessary), but now that it's its own kernel, there's no need for it.
@@ -243,6 +243,8 @@ __device__ cuda::std::pair<int, int> gpuDecodeDictionaryIndices(page_state_s* s, | |||
int pos = s->dict_pos; | |||
int str_len = 0; | |||
|
|||
// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something along these lines?
// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive | |
// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive because the only path that does not include a sync will lead to s->dict_pos being overwritten with the same value |
/ok to test |
/merge |
Description
Related to #14597. Fixes reported errors by racecheck.
Checklist