Fix race in ORC string dictionary creation #13214

Merged
merged 4 commits into from
Apr 25, 2023

@revans2 revans2 commented Apr 25, 2023

Description

Unfortunately this is really hard to reproduce. For whatever reason, I could only reproduce it on a relatively small data set with at least 140,001 rows, where one column is a LIST column in which every list is empty and another is a STRUCT column with two STRING child columns in which every string is empty. I also had to sort and partition the data before doing the write, and it had to be in a very specific environment with T4 GPUs. I don't know why all of those were needed to make the race happen regularly, but they were.

Because it is so complex to reproduce, I have not added any unit tests.

The problem was essentially a race when calculating dictionary duplication for strings in ORC. As part of this, a function LoadNonNullIndices was called that was supposed to set a value nnz in a shared memory structure s. In the normal case, a loop containing a __syncthreads() call was executed. But if the column had no rows (the LIST column), the loop was skipped, and it was a race whether the nnz value of 0 written by thread 0 was visible to all of the other threads.

What made this crash is that the nnz value determines what the rest of the kernel does: whether it reads data, writes to temp memory (which is not allocated if previous processing shows that there is no need for it), and so on. If a thread sees a non-zero nnz, it tries to do all of those things and bad stuff starts to happen.
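To make the pattern concrete, here is a minimal standalone sketch of the race and of the kind of fix described above. It is not the libcudf code: state_s, load_non_null_count, count_kernel and threads_seeing are hypothetical names, and only the nnz and num_rows fields are modeled. The assumption is that one extra barrier on the empty-column path is enough to publish thread 0's write to the rest of the block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel's shared state (not the libcudf struct).
struct state_s {
  int nnz;       // number of non-null rows found so far
  int num_rows;  // rows in this chunk; 0 for the all-empty LIST column case
};

__device__ void load_non_null_count(state_s* s, int t)
{
  if (t == 0) { s->nnz = 0; }
  // With zero rows the loop below never runs, so no barrier would order
  // thread 0's write before the other threads' reads. This extra barrier
  // closes that gap. The condition is uniform across the block.
  if (s->num_rows <= 0) { __syncthreads(); }
  for (int i = 0; i < s->num_rows; i += blockDim.x) {
    __syncthreads();  // makes nnz = 0 (or the previous batch) visible
    // Sketch assumption: every row is non-null, so each in-range thread counts one row.
    if (i + t < s->num_rows) { atomicAdd(&s->nnz, 1); }
    __syncthreads();  // makes this batch's count visible to all threads
  }
}

__global__ void count_kernel(int num_rows, int* out)
{
  __shared__ state_s s;
  int const t = threadIdx.x;
  if (t == 0) { s.num_rows = num_rows; }
  __syncthreads();
  load_non_null_count(&s, t);
  // Later stages of a real kernel would branch on nnz here; record what each thread saw.
  out[t] = s.nnz;
}

// Host helper: returns how many threads of one block observed nnz == expected.
int threads_seeing(int num_rows, int expected)
{
  constexpr int block_size = 256;
  int* d_out = nullptr;
  cudaMalloc(&d_out, block_size * sizeof(int));
  count_kernel<<<1, block_size>>>(num_rows, d_out);
  int h_out[block_size];
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  int matches = 0;
  for (int v : h_out) { matches += (v == expected) ? 1 : 0; }
  return matches;
}

int main()
{
  printf("threads that saw nnz == 0 for an empty column: %d\n", threads_seeing(0, 0));
  printf("threads that saw nnz == 1000 for 1000 rows: %d\n", threads_seeing(1000, 1000));
  return 0;
}
```

Removing the barrier on the num_rows <= 0 path brings back the undefined behavior, but as noted above it may not show up on every GPU or every run, so a sketch like this is not a reliable regression test.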

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

As a side note, I am not a C++ or CUDA expert, so I am happy to take any suggestions.

@revans2 revans2 added the bug, 3 - Ready for Review, cuIO, Spark, and non-breaking labels Apr 25, 2023
@revans2 revans2 requested a review from a team as a code owner April 25, 2023 18:22
@github-actions github-actions bot added the libcudf label Apr 25, 2023
@@ -72,6 +72,10 @@ static __device__ void LoadNonNullIndices(volatile dictinit_state_s* s,
Storage& temp_storage)
{
if (t == 0) { s->nnz = 0; }
if (s->chunk.num_rows <= 0) {

copyrights need to be updated


@nvdbaranec Is there anything similar in parquet code?

@abellina

Nice catch :)

vuule commented Apr 25, 2023

/merge

@rapids-bot rapids-bot bot merged commit 086726c into rapidsai:branch-23.06 Apr 25, 2023
@GregoryKimball

Thank you @revans2 for investigating this - excellent work

@revans2 revans2 deleted the fix_orc_race branch April 26, 2023 14:34