Address potential race conditions in Parquet reader #14602

etseidl · 2023-12-08T21:10:07Z

Description

Related to #14597. Fixes errors reported by racecheck.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-12-08T21:10:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vuule · 2023-12-08T21:13:48Z

/ok to test

vuule · 2023-12-08T21:18:43Z

There is another racecheck error in decode that might be related. @etseidl and I decided to investigate the remaining errors before merging this fix.

vuule · 2023-12-09T00:55:49Z

/ok to test

etseidl · 2023-12-09T01:08:13Z

cpp/src/io/parquet/page_decode.cuh

+  // need this to ensure input_value_count is read by all threads before s->input_value_count
+  // is modified below (just in case input_value count >= target_input_value_count).
+  __syncwarp();


I'm not so sure we need this one. In the worst case, thread 0 sets the local variable, skips the loop (and the syncwarp within it), and then overwrites the shared value before other threads read it. But in that case it will just overwrite it with the same value.

Do we actually need to update s->nz_count, s->input_value_count, and s->input_row_count if we never enter the loop?

I'm thinking no... they shouldn't have changed if the loop wasn't entered. But I'll admit this is one of the parts of the Parquet code that I understand the least.

If that's the case, we should be able to return early if input_value_count >= target_input_value_count initially, right?
That would simplify the logic and prevent the tool from reporting the race condition.
CC @nvdbaranec

I made the change and verified that racecheck is happy.
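For readers following along, here is a minimal sketch of the early-return idea discussed in this thread. It is not the actual gpuUpdateValidityOffsetsAndRowIndices code: `demo_state`, `demo_update_counts`, `demo_update_kernel`, and `run_update_demo` are hypothetical names, and the loop body is a placeholder. It assumes a single warp of 32 threads.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the shared decode state; not the real page_state_s.
struct demo_state {
  int input_value_count;
};

// All lanes read the shared counter. If there is no work, every lane returns
// before any write happens, so there is no read/write pair to race.
__device__ void demo_update_counts(demo_state* s, int target_count)
{
  int count = s->input_value_count;
  if (count >= target_count) { return; }

  while (count < target_count) {
    count += 32;   // placeholder: pretend each pass consumes 32 values
    __syncwarp();  // every lane reaches this at least once after its read
  }
  if (threadIdx.x == 0) { s->input_value_count = count; }
}

__global__ void demo_update_kernel(int* out, int initial, int target)
{
  __shared__ demo_state s;
  if (threadIdx.x == 0) { s.input_value_count = initial; }
  __syncwarp();
  demo_update_counts(&s, target);
  __syncwarp();
  if (threadIdx.x == 0) { *out = s.input_value_count; }
}

// Host helper: launches a single warp and returns the final counter.
int run_update_demo(int initial, int target)
{
  int* d_out = nullptr;
  int h_out  = -1;
  cudaMalloc(&d_out, sizeof(int));
  demo_update_kernel<<<1, 32>>>(d_out, initial, target);
  cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return h_out;
}
```

With the early return, the only path that writes the shared counter is one where every lane has already passed a `__syncwarp()` after reading it, which is presumably why the tool stops reporting a hazard.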

vuule · 2023-12-12T00:02:57Z

cpp/src/io/parquet/page_decode.cuh

+  // need this to ensure input_value_count is read by all threads before s->input_value_count
+  // is modified below (just in case input_value count >= target_input_value_count).
+  __syncwarp();


do we actually need to update s->nz_count, s->input_value_count and s->input_row_count if we never enter the loop?

exit early from gpuUpdateValidityOffsetsAndRowIndices to avoid possible race observed warning for gpuDecodeRleBooleans so remove comment

vuule · 2023-12-12T20:51:50Z

/ok to test

nvdbaranec · 2023-12-13T19:48:41Z

cpp/src/io/parquet/page_decode.cuh

+  // ensure all threads read s->dict_pos before returning
+  __syncwarp();


Not sure about this one. The return value from this function is explicitly stated to only be valid on thread 0. Looking at all the call sites, it's always thread 0 that actually does any work with the value.

Yeah, this is kind of like the one in gpuUpdateValidityOffsetsAndRowIndices, except here the assignment back to s->dict_pos is done after this call returns. If the loop is entered, then all threads will hit the syncwarp there. It's only an issue if pos >= target_pos. Given this has worked without problems for quite some time, I can get rid of this and the one in gpuDecodeRleBooleans.
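A minimal sketch of the pattern described above, under the assumption of a single warp. The names `demo_dict_state`, `demo_decode_to`, `demo_dict_kernel`, and `run_dict_demo` are hypothetical and the loop body is a placeholder; this is not the real gpuDecodeDictionaryIndices.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the shared state; not the real page_state_s.
struct demo_dict_state {
  int dict_pos;
};

// Every lane reads s->dict_pos, but only lane 0's return value is used.
__device__ int demo_decode_to(demo_dict_state* s, int target_pos)
{
  int pos = s->dict_pos;
  while (pos < target_pos) {
    pos += 32;     // placeholder for decoding one batch
    __syncwarp();  // reached by all lanes whenever the loop is entered
  }
  return pos;
}

__global__ void demo_dict_kernel(int* out, int initial, int target)
{
  __shared__ demo_dict_state s;
  if (threadIdx.x == 0) { s.dict_pos = initial; }
  __syncwarp();

  int const pos = demo_decode_to(&s, target);

  // If the loop was skipped, no sync separates the reads above from this
  // store, but pos equals the old s.dict_pos, so the stored value is unchanged.
  if (threadIdx.x == 0) { s.dict_pos = pos; }
  __syncwarp();
  if (threadIdx.x == 0) { *out = s.dict_pos; }
}

// Host helper: launches a single warp and returns the final position.
int run_dict_demo(int initial, int target)
{
  int* d_out = nullptr;
  int h_out  = -1;
  cudaMalloc(&d_out, sizeof(int));
  demo_dict_kernel<<<1, 32>>>(d_out, initial, target);
  cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return h_out;
}
```

A race detector may still flag the unsynchronized path in a sketch like this, even though the value written is the one the other lanes would have read anyway.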

nvdbaranec · 2023-12-13T19:49:22Z

cpp/src/io/parquet/page_decode.cuh

@@ -357,6 +360,9 @@ inline __device__ int gpuDecodeRleBooleans(page_state_s* s, state_buf* sb, int t
  uint8_t const* end = s->data_end;
  int64_t pos        = s->dict_pos;

+  // ensure all threads read s->dict_pos before returning


Same comment as the one in gpuDecodeDictionaryIndices

nvdbaranec · 2023-12-13T19:58:39Z

cpp/src/io/parquet/page_string_decode.cu

@@ -294,7 +296,6 @@ __device__ thrust::pair<int, int> page_bounds(page_state_s* const s,
      pp->num_nulls  = null_count;
      pp->num_valids = pp->num_input_values - null_count;
    }
-    __syncthreads();


This seems dangerous to remove. Aren't all threads except 0 in danger of using the wrong pp->num_nulls value right below?

This is another only-valid-on-thread-0 result. I originally added more syncthreads before all the other returns, but @vuule pointed out that once this function returns, all that happens is thread 0 takes the return values and copies them to global memory (along with 2 shared mem fields) and then returns. The other threads simply return ignored garbage and exit.

Actually, I should probably move this entire function into gpuComputeStringPageBounds, which would make the above more obvious. It made sense as a standalone function when it was part of the gpuComputePageStringSizes kernel (and back then the syncthreads was necessary), but now that it has its own kernel, there's no need for it.
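The "only valid on thread 0" argument can be sketched as follows. This is an illustration with hypothetical names (`demo_page`, `demo_bounds`, `demo_bounds_kernel`, `run_bounds_demo`) and a simplified null count, not the real page_bounds logic.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the shared page info; not the real structure.
struct demo_page {
  int num_input_values;
  int num_nulls;
  int num_valids;
};

// Counts zero entries in `valid`. The return value is meaningful on thread 0 only.
__device__ int demo_bounds(demo_page* pp, int const* valid, int* scratch)
{
  int const t = threadIdx.x;
  if (t == 0) { *scratch = 0; }
  __syncthreads();

  int local_nulls = 0;
  for (int i = t; i < pp->num_input_values; i += blockDim.x) {
    local_nulls += (valid[i] == 0);
  }
  atomicAdd(scratch, local_nulls);
  __syncthreads();  // still needed: thread 0 must see the finished total

  if (t == 0) {
    pp->num_nulls  = *scratch;
    pp->num_valids = pp->num_input_values - *scratch;
  }
  // No trailing __syncthreads(): no other thread reads pp->num_nulls or
  // pp->num_valids after this point.
  return t == 0 ? pp->num_valids : -1;
}

__global__ void demo_bounds_kernel(int const* valid, int n, int* out)
{
  __shared__ demo_page pp;
  __shared__ int scratch;
  if (threadIdx.x == 0) { pp.num_input_values = n; }
  __syncthreads();

  int const valids = demo_bounds(&pp, valid, &scratch);
  // Only thread 0 copies results to global memory; the rest simply exit.
  if (threadIdx.x == 0) {
    out[0] = pp.num_nulls;
    out[1] = valids;
  }
}

// Host helper: returns the valid count and stores the null count in *h_nulls.
int run_bounds_demo(int const* h_valid, int n, int* h_nulls)
{
  int* d_valid = nullptr;
  int* d_out   = nullptr;
  int h_out[2] = {-1, -1};
  cudaMalloc(&d_valid, n * sizeof(int));
  cudaMalloc(&d_out, 2 * sizeof(int));
  cudaMemcpy(d_valid, h_valid, n * sizeof(int), cudaMemcpyHostToDevice);
  demo_bounds_kernel<<<1, 64>>>(d_valid, n, d_out);
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  cudaFree(d_valid);
  cudaFree(d_out);
  *h_nulls = h_out[0];
  return h_out[1];
}
```

If a later change made other threads read `pp->num_nulls` after the call, the barrier would have to come back, which is the concern raised in the review comment.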

vuule · 2023-12-13T23:06:29Z

cpp/src/io/parquet/page_decode.cuh

@@ -243,6 +243,8 @@ __device__ cuda::std::pair<int, int> gpuDecodeDictionaryIndices(page_state_s* s,
  int pos            = s->dict_pos;
  int str_len        = 0;

+  // NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive


Something along these lines?

Suggested change

-  // NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive
+  // NOTE: racecheck warns about a RAW involving s->dict_pos, which is likely a false positive because the only path that does not include a sync will lead to s->dict_pos being overwritten with the same value

vuule · 2023-12-14T22:33:42Z

/ok to test

vuule · 2023-12-15T00:22:04Z

/merge

add sync

5f814df

etseidl requested a review from a team as a code owner December 8, 2023 21:10

etseidl requested review from shrshi and davidwendt December 8, 2023 21:10

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 8, 2023

vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue labels Dec 8, 2023

vuule approved these changes Dec 8, 2023

vuule marked this pull request as draft December 8, 2023 21:18

etseidl and others added 6 commits December 8, 2023 14:10

oops, gpuDecodeStream only runs on one warp

837e83f

Merge branch 'branch-24.02' into decode_levels_sync

c4e60ea

remove TODO from prior PR rapidsai#14101

c237125

Merge branch 'branch-24.02' into decode_levels_sync

583a13d

fix a few more warnings

7e5c01f

add some sync calls to page_bounds

c789eba

etseidl commented Dec 9, 2023

etseidl changed the title ~~Add sync to gpuDecodeStream in Parquet reader~~ Address potential race conditions in Parquet reader Dec 10, 2023

etseidl and others added 6 commits December 11, 2023 09:42

remove TODO

ff45a2b

add syncwarp after read of s->dict_pos

f3f4ae0

Merge remote-tracking branch 'origin/branch-24.02' into decode_levels_sync

faf15a1

add syncthreads before early return from page_bounds

4877e12

add some comments

0fed54a

Merge branch 'rapidsai:branch-24.02' into decode_levels_sync

06251e6

vuule self-requested a review December 11, 2023 23:07

vuule reviewed Dec 12, 2023

etseidl and others added 3 commits December 11, 2023 16:48

remove some syncthreads that should not be necessary

caf8206

remove sync from gpuInitStringDescriptors

c5b0058

exit early from gpuUpdateValidityOffsetsAndRowIndices to avoid possible race observed warning for gpuDecodeRleBooleans so remove comment

Merge branch 'branch-24.02' into decode_levels_sync

7763e2f

etseidl marked this pull request as ready for review December 12, 2023 19:44

Merge branch 'branch-24.02' into decode_levels_sync

da1ada6

ttnghia approved these changes Dec 12, 2023

nvdbaranec reviewed Dec 13, 2023

nvdbaranec reviewed Dec 13, 2023

remove some syncs and add documenation instead

7d57a21

vuule reviewed Dec 13, 2023

add further clarification

4038267

nvdbaranec approved these changes Dec 14, 2023

Merge branch 'branch-24.02' into decode_levels_sync

6901c1e

rapids-bot bot merged commit 2cb8f3d into rapidsai:branch-24.02 Dec 15, 2023
67 checks passed

etseidl deleted the decode_levels_sync branch December 15, 2023 00:28

etseidl mentioned this pull request Dec 16, 2023

[BUG] Random Parquet CI failures #14597

Closed
