Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address potential race conditions in Parquet reader #14602

Merged
merged 20 commits into from
Dec 15, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Dec 8, 2023

Description

Related to #14597. Fixes reported errors by racecheck.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner December 8, 2023 21:10
@etseidl etseidl requested review from shrshi and davidwendt December 8, 2023 21:10
Copy link

copy-pr-bot bot commented Dec 8, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 8, 2023
@vuule vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue labels Dec 8, 2023
@vuule
Copy link
Contributor

vuule commented Dec 8, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Dec 8, 2023

There is another racecheck error in decode, might be somewhat related. Decided with @etseidl to investigate remaining errors before merging this fix.

@vuule vuule marked this pull request as draft December 8, 2023 21:18
@vuule
Copy link
Contributor

vuule commented Dec 9, 2023

/ok to test

Comment on lines 699 to 701
// need this to ensure input_value_count is read by all threads before s->input_value_count
// is modified below (just in case input_value count >= target_input_value_count).
__syncwarp();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one I'm not so sure about needing. In the worst case, thead 0 sets the local var, skips the loop (and the syncwarp within it) and then overwrites the shared value before other threads read it. But in that case it will just overwrite with the same value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually need to update s->nz_count, s->input_value_count and s->input_row_count if we never enter the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking no...they shouldn't have changed if the loop wasn't entered. But I'll admit this is one of the parts of the parquet code that I understand the least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, we should be able to return early if initially input_value_count >= target_input_value_count, right?
That would simplify the logic and prevent the tool from reporting the race condition.
CC @nvdbaranec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change and verified that racecheck is happy

@etseidl etseidl changed the title Add sync to gpuDecodeStream in Parquet reader Address potential race conditions in Parquet reader Dec 10, 2023
@vuule vuule self-requested a review December 11, 2023 23:07
cpp/src/io/parquet/page_string_decode.cu Show resolved Hide resolved
Comment on lines 699 to 701
// need this to ensure input_value_count is read by all threads before s->input_value_count
// is modified below (just in case input_value count >= target_input_value_count).
__syncwarp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually need to update s->nz_count, s->input_value_count and s->input_row_count if we never enter the loop?

etseidl and others added 3 commits December 11, 2023 16:48
exit early from gpuUpdateValidityOffsetsAndRowIndices to avoid possible race
observed warning for gpuDecodeRleBooleans so remove comment
@etseidl etseidl marked this pull request as ready for review December 12, 2023 19:44
@vuule
Copy link
Contributor

vuule commented Dec 12, 2023

/ok to test

Comment on lines 246 to 247
// ensure all threads read s->dict_pos before returning
__syncwarp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one. The return value from this function is explicitly stated to only be valid on thread 0. Looking at all the call sites, it's always thread 0 that actually does any work with the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is kind of like the one in gpuUpdateValidityOffsetsAndRowIndices, except here the assignment back to s->dict_pos is done after this call returns. If the loop is entered, then all threads will hit the syncwarp there. It's only an issue if pos >= target_pos. Given this has worked without problems for quite some time, I can get rid of this and the one in gpuDecodeRleBooleans.

@@ -357,6 +360,9 @@ inline __device__ int gpuDecodeRleBooleans(page_state_s* s, state_buf* sb, int t
uint8_t const* end = s->data_end;
int64_t pos = s->dict_pos;

// ensure all threads read s->dict_pos before returning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as the one in gpuDecodeDictionaryIndices

cpp/src/io/parquet/page_string_decode.cu Show resolved Hide resolved
@@ -294,7 +296,6 @@ __device__ thrust::pair<int, int> page_bounds(page_state_s* const s,
pp->num_nulls = null_count;
pp->num_valids = pp->num_input_values - null_count;
}
__syncthreads();
Copy link
Contributor

@nvdbaranec nvdbaranec Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dangerous to remove. Aren't all threads except 0 in danger of using the wrong pp->num_nulls value right below?

Copy link
Contributor Author

@etseidl etseidl Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another only-valid-on-thread-0 result. I originally added more syncthreads before all the other returns, but @vuule pointed out that once this function returns, all that happens is thread 0 takes the return values and copies them to global memory (along with 2 shared mem fields) and then returns. The other threads simply return ignored garbage and exit.

Actually, I should probably move this entire function into gpuComputeStringPageBounds, which would make the above more obvious. It made sense to be a standalone when it was part of the gpuComputePageStringSizes kernel (and back then the syncthreads was necessary), but now that it's its own kernel, there's no need for it.

@@ -243,6 +243,8 @@ __device__ cuda::std::pair<int, int> gpuDecodeDictionaryIndices(page_state_s* s,
int pos = s->dict_pos;
int str_len = 0;

// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something along these lines?

Suggested change
// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive
// NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive because the only path that does not include a sync will lead to s->dict_pos being overwritten with the same value

@vuule
Copy link
Contributor

vuule commented Dec 14, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Dec 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit 2cb8f3d into rapidsai:branch-24.02 Dec 15, 2023
67 checks passed
@etseidl etseidl deleted the decode_levels_sync branch December 15, 2023 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants