From 086726c873b359abfc25997ee837f56b46d36e85 Mon Sep 17 00:00:00 2001 From: "Robert (Bobby) Evans" Date: Tue, 25 Apr 2023 18:22:09 -0500 Subject: [PATCH] Fix race in ORC string dictionary creation (#13214) Unfortunately this is really hard to reproduce. For whatever reason I had to try and reproduce this on a relatively small data set with at least 140,001 rows or more, where one column is a LIST but all of the lists are empty lists and another column is a STRUCT column with two STRING child columns where all of the STRINGS are empty. I also had to sort and partition the data before doing the write, and it had to be in a very specific environment with T4 GPUs. I don't know why all of those were needed to make the race happen regularly, but it did. Because of this complexity in reproducing it I have not added in any unit tests. The problem was essentially a race when trying to calculate dictionary duplication for strings in ORC. As a part of this a function `LoadNonNullIndices` was being called that was supposed to set a value `nnz` in a shared memory location `s`. In the normal case a loop was taken where `__syncthreads()` was called, but if there were no rows in the column (the LIST column) then the loop was not taken and it was a race to see if `nnz` which was set to 0 by thread 0 showed up in all of the threads or not. What made this crash is that this `nnz` value is used to determine what happens in the rest of the kernel to see if it reads data, or writes to temp memory (which is not allocated if previous processing shows that there is no need for it), or any of that. If `nnz` is non-zero then it tries to do all of those things and bad stuff starts to happen. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: https://github.com/rapidsai/cudf/pull/13214 --- cpp/src/io/orc/dict_enc.cu | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/cpp/src/io/orc/dict_enc.cu b/cpp/src/io/orc/dict_enc.cu index 898df3ef0f9..a0e873f75d8 100644 --- a/cpp/src/io/orc/dict_enc.cu +++ b/cpp/src/io/orc/dict_enc.cu @@ -1,5 +1,5 @@ /* - * Copyright (c) 2019-2022, NVIDIA CORPORATION. + * Copyright (c) 2019-2023, NVIDIA CORPORATION. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -72,6 +72,10 @@ static __device__ void LoadNonNullIndices(volatile dictinit_state_s* s, Storage& temp_storage) { if (t == 0) { s->nnz = 0; } + if (s->chunk.num_rows <= 0) { + // A sync is needed for s->nnz if there are no times through the loop + __syncthreads(); + } for (uint32_t i = 0; i < s->chunk.num_rows; i += block_size) { const uint32_t* valid_map = s->chunk.leaf_column->null_mask(); auto column_offset = s->chunk.leaf_column->offset();