
[BUG] cudaErrorIllegalAddress on distinct_count and nunique for reduce #13576

Closed
revans2 opened this issue Jun 14, 2023 · 16 comments · Fixed by #13590
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments


revans2 commented Jun 14, 2023

Describe the bug
I recently needed to get an approximate distinct count for a dynamic optimization I was doing in Spark. As a part of this I tried to use nunique as a reduce aggregation and ran into cudaErrorIllegalAddress on anything more complicated than some very basic types. I then tried to use distinct_count for the same thing, because I wanted the result on the CPU and this made it very simple to do, but I hit the same problem.

Steps/Code to reproduce bug
I was doing a test with TPC-DS data and the keys to various hash aggregates, but I also saw it in a few unit tests. The general rule I found (though I don't fully trust it) is that it happens with anything that has more than one column, or any value larger than 64 bits. So strings cause this, structs cause this, and tables with more than one column cause this. Decimal_128 causes this. Lots of things, but here is one example test that I wrote that also causes it to crash.

TEST_F(DistinctCount, SimpleStruct)
{
//            .column(5, 3, 3, 1, 1)
//            .column(5, 3, null, null, 5)
//            .column(1, 3, 5, 7, 9)

    auto a  = cudf::test::fixed_width_column_wrapper<int32_t>{5, 3, 3, 1, 1};
    auto b  = cudf::test::fixed_width_column_wrapper<int32_t>({5, 3, XXX, XXX, 5}, nulls_at({2, 3}));
    auto c  = cudf::test::fixed_width_column_wrapper<int32_t>{1, 3, 5, 7, 9};

    cudf::table_view input_table({a, b, c});
    cudf::distinct_count(input_table);
}

I am happy to provide more data that causes crashes if people want me to. In the short term I have worked around the issue by hashing the row before sending it to distinct_count, because all I really care about is an approximate count, not an exact one. But I don't want to put anything into production until this gets fixed.

I did run compute-sanitizer on the failing test and got back a lot of issues, but this is the first one:

========= Invalid __global__ read of size 4 bytes
=========     at 0x170 in void cuco::experimental::detail::insert_if_n<(int)1, (int)128, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity, cuda::__4::atomic<unsigned long, (cuda::std::__4::__detail::thread_scope)1>, cuco::experimental::static_set_ref<int, (cuda::std::__4::__detail::thread_scope)1, cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, cuco::experimental::linear_probing<(int)1, const cudf::detail::experimental::compaction_hash<cudf::experimental::row::hash::device_row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>>>, cuco::experimental::detail::aow_storage_ref<(int)1, int, cuco::experimental::extent<unsigned long, (unsigned long)18446744073709551615>>, cuco::experimental::op::insert_tag>>(T3, long, T4, T5, T6 *, T7)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 140591841148928 bytes before the nearest allocation at 0x7fde16c00000 of size 20 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x23adbc]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart1071 [0x85215b]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudaLaunchKernel_ptsz [0x891da8]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:unsigned long cuco::experimental::static_set<int, cuco::experimental::extent<unsigned long, 18446744073709551615ul>, (cuda::std::__4::__detail::thread_scope)1, cudf::experimental::row::equality::device_row_comparator<false, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, cuco::experimental::linear_probing<1, cudf::detail::experimental::compaction_hash<cudf::experimental::row::hash::device_row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC> > const>, rmm::mr::stream_allocator_adaptor<default_allocator<char> >, cuco::experimental::aow_storage<1> >::insert_if<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity>(thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity, cuco::experimental::cuda_stream_ref) [0x40476a]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudf::detail::distinct_count(cudf::table_view const&, cudf::null_equality, rmm::cuda_stream_view) [0x400cf6]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudf::distinct_count(cudf::table_view const&, cudf::null_equality) [0x400de5]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:DistinctCount_SimpleStruct_Test::TestBody() [0x20b045]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x84c8ed]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::Test::Run() [clone .part.0] [0x83bd76]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::TestInfo::Run() [0x83c4aa]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::TestSuite::Run() [clone .part.0] [0x83ccbb]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::internal::UnitTestImpl::RunAllTests() [0x842e7a]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::UnitTest::Run() [0x83c641]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:main [0x1de237]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:__libc_start_main [0x240b3]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x20636e]
=========                in gtests/DISTINCT_COUNT

This was on the latest 23.08 code.

Expected behavior
No crashes.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Jun 14, 2023
@PointKernel PointKernel self-assigned this Jun 14, 2023
@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 14, 2023

ttnghia commented Jun 14, 2023

Address 0x0 is out of bounds
=========     and is 140591841148928 bytes before the nearest allocation

This looks like an overflow due to dereferencing a null pointer.
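The sanitizer's numbers are self-consistent with that reading: address 0x0 is exactly the allocation base (0x7fde16c00000) bytes before the nearest allocation, so the reported offset is just the allocation's own address and the faulting address is literally nullptr. A quick arithmetic check (plain standard C++, nothing cudf-specific):

```cpp
#include <cassert>
#include <cstdint>

// Allocation base reported by compute-sanitizer in the log above.
constexpr std::uint64_t kAllocBase = 0x7fde16c00000ULL;

// Distance from a null pointer (0x0) to that allocation base.
constexpr std::uint64_t kDistanceFromNull = kAllocBase - 0x0ULL;

// Matches the "140591841148928 bytes before the nearest allocation"
// line exactly, confirming the kernel dereferenced a null pointer.
static_assert(kDistanceFromNull == 140591841148928ULL,
              "offset equals the allocation base");
```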


wence- commented Jun 19, 2023

(gdb) break distinct_count.cu:160
No symbol table is loaded.  Use the "file" command.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (distinct_count.cu:160) pending.
(gdb) r
Starting program: /home/wence/Documents/src/rapids/doodles/c++/distinct-count 
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fb4d87ff000 (LWP 52639)]
[New Thread 0x7fb4bfb3d000 (LWP 52644)]
[New Thread 0x7fb4bf33c000 (LWP 52645)]
[New Thread 0x7fb4be8da000 (LWP 52646)]
2

Thread 1 "distinct-count" hit Breakpoint 1, operator()<cudf::experimental::row::equality::device_row_comparator<false, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator> > (__closure=0x7ffecb027820, row_equal=...) at /home/wence/Documents/src/rapids/cudf/cpp/src/stream_compaction/distinct_count.cu:160
160	      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
(gdb) p iter
$1 = {<thrust::iterator_adaptor<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, int, int, thrust::any_system_tag, thrust::random_access_traversal_tag, int, long>> = {<thrust::iterator_facade<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, int, thrust::any_system_tag, thrust::random_access_traversal_tag, int, long>> = {<No data fields>}, m_iterator = 0}, <No data fields>}
(gdb) p pred
$2 = {_row_bitmask = 0x0}


wence- commented Jun 19, 2023

That row_bitmask is constructed from:

      auto const [row_bitmask, null_count] =
        cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());

What does the documentation for bitmask_or say?

/**
 * @brief Performs bitwise OR of the bitmasks of columns of a table. Returns
 * a pair of resulting mask and count of unset bits.
 *
 * If any of the columns isn't nullable, it is considered all valid.
 * If no column in the table is nullable, an empty bitmask is returned.
 *
 * @param view The table of columns
 * @param mr Device memory resource used to allocate the returned device_buffer
 * @return A pair of resulting bitmask and count of unset bits
 */

So, if only some of the input columns are nullable (as is the case here), this returns a null pointer and a null count of zero (everything is valid).
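To make the failure mode concrete, here is a toy model of that documented behaviour (plain standard C++, not cudf's actual API; `Mask`, `bitmask_or_model`, and the `std::optional` representation are all invented for illustration):

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// A column's validity mask is optional; a missing mask means the
// column is non-nullable, i.e. every row in it is valid.
using Mask = std::optional<std::vector<bool>>;

// Returns {merged mask, count of rows that are invalid in *every*
// column}. If any column lacks a mask, the OR is all-valid, so an
// empty mask and a null count of zero come back; that empty mask is
// the null pointer that insert_if later dereferences.
std::pair<Mask, int> bitmask_or_model(std::vector<Mask> const& cols,
                                      int num_rows)
{
  for (auto const& m : cols) {
    if (!m) { return {std::nullopt, 0}; }  // some column is non-nullable
  }
  std::vector<bool> merged(num_rows, false);
  int null_count = 0;
  for (int row = 0; row < num_rows; ++row) {
    for (auto const& m : cols) { merged[row] = merged[row] || (*m)[row]; }
    if (!merged[row]) { ++null_count; }  // row is null in all columns
  }
  return {merged, null_count};
}
```

With one non-nullable column in the mix (as in the failing test), the model returns an empty mask and a zero null count, matching the `_row_bitmask = 0x0` seen in gdb.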

This is then used by cuco's insert_if_n kernel:

    auto const iter = thrust::counting_iterator<cudf::size_type>(0);
    // when nulls are equal, insert non-null rows only to improve efficiency
    if (nulls_equal == null_equality::EQUAL and has_nulls) {
      thrust::counting_iterator<size_type> stencil(0);
      auto const [row_bitmask, null_count] =
        cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
      row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};

      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
             static_cast<cudf::size_type>(null_count > 0);
    }
    // otherwise, insert all
    return key_set.insert(iter, iter + num_rows, stream.value());

AIUI, this check is designed to skip comparing rows when all entries in the row are null and nulls should compare equal.

So I think we need to check that all (rather than any) columns are nulls before dispatching to insert_if rather than insert.

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
index 7185dae77b..25f2a67cba 100644
--- a/cpp/src/stream_compaction/distinct_count.cu
+++ b/cpp/src/stream_compaction/distinct_count.cu
@@ -151,10 +151,12 @@ cudf::size_type distinct_count(table_view const& keys,
 
     auto const iter = thrust::counting_iterator<cudf::size_type>(0);
     // when nulls are equal, insert non-null rows only to improve efficiency
-    if (nulls_equal == null_equality::EQUAL and has_nulls) {
+    if (nulls_equal == null_equality::EQUAL and
+        std::all_of(keys.begin(), keys.end(), [](auto const& col) { return col.has_nulls(); })) {
       thrust::counting_iterator<size_type> stencil(0);
       auto const [row_bitmask, null_count] =
         cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
+      CUDF_EXPECTS(null_count > 0, "Unpossible!");
       row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};
 
       return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +


wence- commented Jun 19, 2023

So, if only some of the input columns are nullable (as is the case here), this returns a null pointer and a null count of zero (everything is valid).

Actually, this is a misreading of the documentation. I thought the documentation intended that the implementation of bitmask_or should just OR together those columns with non-empty bitmasks, but the implementation does not do that.

Ah, no, this is deliberate (as per #7406 (comment)). If any of the masks is null then the OR of the masks is null (i.e. everything is valid). So I think the first patch is correct.


ttnghia commented Jun 19, 2023

I think the logic here was wrong from the beginning. We want to avoid inserting if there is any null in the row, so we should check a "null OR" instead of bitmask_or (which is equivalent to a "valid OR"). In other words, the correct logic should use bitmask_and: only insert a row if that row doesn't contain any nulls, so the bitmask to check should be the result of bitmask_and.


ttnghia commented Jun 19, 2023

@wence- While looking for similar code, I only see one more instance of bitmask_or in Python:

_, has_null_group = bitmask_or([*index._columns])

I'm not familiar with Python code here so can you have a look and check if it is also correct, please?


wence- commented Jun 19, 2023

I think the logic here was wrong from the beginning. We want to avoid insert if there is any null in the row. So we should check "null_or" instead of bitmask_or (which is equivalent to "valid_or"). In other word, the correct logic should be using bitmask_and: only insert a row if that row doesn't contain any nulls, so the bitmask to check should be resulted from bitmask_and.

I think I disagree. We want to count distinct rows; if nulls should compare equal, then we still need to hash a row even if it only contains some nulls.

That is, the table:

 0   1
NA   2
 1   0
 2  NA

with null_equality::EQUAL should have a distinct_count of 4. If we were only to insert rows when the row has no nulls, then we would end up with a distinct_count of 3 (2 + (int)(null_count > 0)).

Do you agree?
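The semantics argued for above can be sketched in standard C++ (a toy model, not cudf code: `Row` and `distinct_count_nulls_equal` are invented names, and `std::optional` stands in for nullable values). Conveniently, `std::optional` already treats `nullopt` as equal to `nullopt` and orders it before any value, which matches `null_equality::EQUAL`:

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <vector>

// A row of nullable ints; std::nullopt plays the role of NA.
using Row = std::vector<std::optional<int>>;

// With nulls comparing equal, every row (including rows containing
// nulls) participates in the distinct count as-is, so a std::set of
// rows counts distinct rows directly.
int distinct_count_nulls_equal(std::vector<Row> const& table)
{
  return static_cast<int>(std::set<Row>(table.begin(), table.end()).size());
}
```

Running this on the 4-row table above yields 4, as the comment argues; inserting only fully non-null rows would give 3.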


wence- commented Jun 19, 2023

@wence- While looking for similar code, I only see one more instance of bitmask_or in Python:

_, has_null_group = bitmask_or([*index._columns])


I'm not familiar with Python code here so can you have a look and check if it is also correct, please?

This is safe afaict.

  1. It ignores the null mask itself anyway (it only checks the null count of the OR of the columns).
  2. The Cython wrapper wraps up an rmm::device_buffer for the null mask, but does this appropriately even for an empty buffer.


ttnghia commented Jun 19, 2023

Hmm, sorry, it seems that I was wrong. We want to insert a row if it still contains at least one non-null element, so using bitmask_or is correct. @wence- your patch above seems to be correct too. Thanks.

I thought that bitmask_or was incorrect because I was recalling similar (but not identical) situations in other places, such as here:

cudf::detail::bitmask_and(

@PointKernel Comment in the code misled me:

// when nulls are equal, insert non-null rows only to improve efficiency

So I thought that we were inserting only rows without any nulls. It should state this:

// when nulls are equal, insert only rows that are not all-nulls to improve efficiency


wence- commented Jun 19, 2023

So I thought that we were inserting only rows without any nulls. It should state this:

Yeah, I continue to be confused all the time by the interpretation of the null mask as "bit set => no null at this position".

So the way we detect that a row is fully null is to OR the bitmasks together (because setness [validity] in any column means that the row is not fully null).


ttnghia commented Jun 19, 2023

In terms of efficiency, I think using bitmask_or in the case of nulls comparing equal is not really efficient. Suppose we have millions of rows in the table, all of them repeated (so we have only 1 unique row), but no row is all-null. In that case, we will insert all rows.

The better option would be to use bitmask_and in the case of nulls comparing unequal, as in

// If the haystack table has nulls but they are compared unequal, don't insert them.

I really don't have a better solution for this. I (and maybe @wence- ) continue to be confused 😄


wence- commented Jun 19, 2023

I think this patch is a correctness fix, and only does insert_if if it is necessary:

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
index 7185dae77b..341648d548 100644
--- a/cpp/src/stream_compaction/distinct_count.cu
+++ b/cpp/src/stream_compaction/distinct_count.cu
@@ -150,15 +150,21 @@ cudf::size_type distinct_count(table_view const& keys,
       stream.value()};
 
     auto const iter = thrust::counting_iterator<cudf::size_type>(0);
-    // when nulls are equal, insert non-null rows only to improve efficiency
+    // when nulls are equal, only insert those rows that are not all null to improve efficiency
     if (nulls_equal == null_equality::EQUAL and has_nulls) {
       thrust::counting_iterator<size_type> stencil(0);
       auto const [row_bitmask, null_count] =
         cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
       row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};
 
-      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
-             static_cast<cudf::size_type>(null_count > 0);
+      // Unless all columns have a null mask, row_bitmask will be
+      // null, and null_count will be zero. Equally, unless there is
+      // some row which is null in all columns, null_count will be
+      // zero. So, it is only when null_count is not zero that we need
+      // to do a filtered insertion.
+      if (null_count > 0) {
+        return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) + 1;
+      }
     }
     // otherwise, insert all
     return key_set.insert(iter, iter + num_rows, stream.value());
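The control flow of that patch can be sketched with toy types (invented names, plain C++, no cuco/cudf): a filtered insertion runs only when `null_count > 0`, which also guarantees the merged bitmask exists, and the all-null rows collectively count as one extra distinct row:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// row_valid[i] == false means row i is null in *all* columns, i.e. the
// merged (OR'd) validity bit for that row is unset.
int distinct_count_fixed(std::vector<int> const& row_hash,
                         std::vector<bool> const& row_valid)
{
  std::set<int> key_set;
  int null_count = 0;
  for (bool v : row_valid) { null_count += !v; }
  if (null_count > 0) {
    // insert_if path: skip all-null rows, then count them as a single
    // extra distinct row (they all compare equal to each other).
    for (std::size_t i = 0; i < row_hash.size(); ++i) {
      if (row_valid[i]) { key_set.insert(row_hash[i]); }
    }
    return static_cast<int>(key_set.size()) + 1;
  }
  // otherwise, insert all rows
  for (int h : row_hash) { key_set.insert(h); }
  return static_cast<int>(key_set.size());
}
```

When no row is all-null (including the empty-bitmask case that crashed), the predicate would be always true anyway, so the unfiltered insert is both correct and cheaper.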


wence- commented Jun 19, 2023

In terms of efficiency, I think using bitmask_or in the case of nulls comparing equal is not really efficient. Suppose we have millions of rows in the table, all of them repeated (so we have only 1 unique row), but no row is all-null. In that case, we will insert all rows.

But this is no worse than the case without any nulls (which is presumably the more common one).

Do we have any benchmarks here?


ttnghia commented Jun 19, 2023

I don't think we have any benchmarks for such extreme cases, because such benchmarks would never be able to complete.

@wence- wence- assigned wence- and unassigned PointKernel Jun 20, 2023
wence- added a commit to wence-/cudf that referenced this issue Jun 20, 2023
If nulls should compare equal then we want to filter out rows for
which all column entries are invalid (bitmask not set). If any column
is not nullable, then bitmask_or returns an empty bitmask buffer (and
a null count of zero) indicating that the returned bitmask is fully
valid. When passed as a predicate to insert_if we get a null pointer
dereference.

To avoid this, only run predicated insertion if the null count
returned from bitmask_or is positive (which guarantees that the
validity bitmask exists). This also avoids running predicated
insertion when the predicate is always true.

Closes rapidsai#13576.

wence- commented Jun 20, 2023

#13590 fixes the distinct count issue (with the most recent patch from this issue). @revans2 can you check if this also fixes the original (non-narrowed-down) problem?


revans2 commented Jun 20, 2023

#13590 fixes the distinct count issue (with the most recent patch from this issue). @revans2 can you check if this also fixes the original (non-narrowed-down) problem?

Sure, happy to check...

rapids-bot bot pushed a commit that referenced this issue Jun 21, 2023
If nulls should compare equal then we want to filter out rows for which all column entries are invalid (bitmask not set). If any column is not nullable, then bitmask_or returns an empty bitmask buffer (and a null count of zero) indicating that the returned bitmask is fully valid. When passed as a predicate to insert_if we get a null pointer dereference.

To avoid this, only run predicated insertion if the null count returned from bitmask_or is positive (which guarantees that the validity bitmask exists). This also avoids running predicated insertion when the predicate is always true.

Closes #13576.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13590