
[BUG] cudaErrorIllegalAddress on distinct_count and nunique for reduce #13576

Closed
revans2 opened this issue Jun 14, 2023 · 16 comments · Fixed by #13590
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments


revans2 commented Jun 14, 2023

Describe the bug
I recently needed to get an approximate distinct count for a dynamic optimization I was doing in Spark. As a part of this I tried to use nunique as a reduce aggregation and ran into cudaErrorIllegalAddress on anything more complicated than some very basic types. I then tried to use distinct_count for the same thing, because I wanted the result on the CPU and this made it very simple to do, but I hit the same problem.

Steps/Code to reproduce bug
I was doing a test with TPC-DS data and the keys to various hash aggregates, but I also saw it in a few unit tests. The general rule I found (though I don't fully trust it) is that it happens with anything that has more than one column, or any value larger than 64 bits. So strings cause this, structs cause this, and tables with more than one column cause this. Decimal_128 causes this. Lots of things, but here is one example test that I wrote that also causes it to crash.

TEST_F(DistinctCount, SimpleStruct)
{
//            .column(5, 3, 3, 1, 1)
//            .column(5, 3, null, null, 5)
//            .column(1, 3, 5, 7, 9)

    auto a  = cudf::test::fixed_width_column_wrapper<int32_t>{5, 3, 3, 1, 1};
    auto b  = cudf::test::fixed_width_column_wrapper<int32_t>({5, 3, XXX, XXX, 5}, nulls_at({2, 3}));
    auto c  = cudf::test::fixed_width_column_wrapper<int32_t>{1, 3, 5, 7, 9};

    cudf::table_view input_table({a, b, c});
    cudf::distinct_count(input_table);
}

I am happy to provide more data that causes crashes if people want me to. In the short term I have worked around the issue by hashing the row before sending it to distinct_count, because all I really care about is an approximate count, not an exact one. But I don't want to put anything into production until this gets fixed.

I did run compute-sanitizer on the failing test and got back a lot of issues, but this is the first one:

========= Invalid __global__ read of size 4 bytes
=========     at 0x170 in void cuco::experimental::detail::insert_if_n<(int)1, (int)128, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity, cuda::__4::atomic<unsigned long, (cuda::std::__4::__detail::thread_scope)1>, cuco::experimental::static_set_ref<int, (cuda::std::__4::__detail::thread_scope)1, cudf::experimental::row::equality::device_row_comparator<(bool)0, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, cuco::experimental::linear_probing<(int)1, const cudf::detail::experimental::compaction_hash<cudf::experimental::row::hash::device_row_hasher<cudf::detail::default_hash, cudf::nullate::DYNAMIC>>>, cuco::experimental::detail::aow_storage_ref<(int)1, int, cuco::experimental::extent<unsigned long, (unsigned long)18446744073709551615>>, cuco::experimental::op::insert_tag>>(T3, long, T4, T5, T6 *, T7)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 140591841148928 bytes before the nearest allocation at 0x7fde16c00000 of size 20 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x23adbc]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart1071 [0x85215b]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudaLaunchKernel_ptsz [0x891da8]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:unsigned long cuco::experimental::static_set<int, cuco::experimental::extent<unsigned long, 18446744073709551615ul>, (cuda::std::__4::__detail::thread_scope)1, cudf::experimental::row::equality::device_row_comparator<false, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator>, cuco::experimental::linear_probing<1, cudf::detail::experimental::compaction_hash<cudf::experimental::row::hash::device_row_hasher<cudf::detail::MurmurHash3_32, cudf::nullate::DYNAMIC> > const>, rmm::mr::stream_allocator_adaptor<default_allocator<char> >, cuco::experimental::aow_storage<1> >::insert_if<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity>(thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, cudf::detail::row_validity, cuco::experimental::cuda_stream_ref) [0x40476a]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudf::detail::distinct_count(cudf::table_view const&, cudf::null_equality, rmm::cuda_stream_view) [0x400cf6]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:cudf::distinct_count(cudf::table_view const&, cudf::null_equality) [0x400de5]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:DistinctCount_SimpleStruct_Test::TestBody() [0x20b045]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x84c8ed]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::Test::Run() [clone .part.0] [0x83bd76]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::TestInfo::Run() [0x83c4aa]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::TestSuite::Run() [clone .part.0] [0x83ccbb]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::internal::UnitTestImpl::RunAllTests() [0x842e7a]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:testing::UnitTest::Run() [0x83c641]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:main [0x1de237]
=========                in gtests/DISTINCT_COUNT
=========     Host Frame:__libc_start_main [0x240b3]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x20636e]
=========                in gtests/DISTINCT_COUNT

This was on the latest 23.08 code.

Expected behavior
No crashes.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Jun 14, 2023
@PointKernel PointKernel self-assigned this Jun 14, 2023
@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 14, 2023

ttnghia commented Jun 14, 2023

Address 0x0 is out of bounds
=========     and is 140591841148928 bytes before the nearest allocation

This looks like an overflow due to dereferencing a null pointer.
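The sanitizer's numbers are self-consistent with that reading: address 0x0 is exactly the allocation base (0x7fde16c00000) bytes before the nearest allocation, so the reported offset is just the allocation's own address and the faulting address is literally nullptr. A quick arithmetic check (plain standard C++, nothing cudf-specific):

```cpp
#include <cassert>
#include <cstdint>

// Allocation base reported by compute-sanitizer in the log above.
constexpr std::uint64_t kAllocBase = 0x7fde16c00000ULL;

// Distance from a null pointer (0x0) to that allocation base.
constexpr std::uint64_t kDistanceFromNull = kAllocBase - 0x0ULL;

// Matches the "140591841148928 bytes before the nearest allocation"
// line exactly, confirming the kernel dereferenced a null pointer.
static_assert(kDistanceFromNull == 140591841148928ULL,
              "offset equals the allocation base");
```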


wence- commented Jun 19, 2023

(gdb) break distinct_count.cu:160
No symbol table is loaded.  Use the "file" command.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (distinct_count.cu:160) pending.
(gdb) r
Starting program: /home/wence/Documents/src/rapids/doodles/c++/distinct-count 
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fb4d87ff000 (LWP 52639)]
[New Thread 0x7fb4bfb3d000 (LWP 52644)]
[New Thread 0x7fb4bf33c000 (LWP 52645)]
[New Thread 0x7fb4be8da000 (LWP 52646)]
2

Thread 1 "distinct-count" hit Breakpoint 1, operator()<cudf::experimental::row::equality::device_row_comparator<false, cudf::nullate::DYNAMIC, cudf::experimental::row::equality::nan_equal_physical_equality_comparator> > (__closure=0x7ffecb027820, row_equal=...) at /home/wence/Documents/src/rapids/cudf/cpp/src/stream_compaction/distinct_count.cu:160
160	      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
(gdb) p iter
$1 = {<thrust::iterator_adaptor<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, int, int, thrust::any_system_tag, thrust::random_access_traversal_tag, int, long>> = {<thrust::iterator_facade<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, int, thrust::any_system_tag, thrust::random_access_traversal_tag, int, long>> = {<No data fields>}, m_iterator = 0}, <No data fields>}
(gdb) p pred
$2 = {_row_bitmask = 0x0}


wence- commented Jun 19, 2023

That row_bitmask is constructed from:

      auto const [row_bitmask, null_count] =
        cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());

What does the documentation for bitmask_or say?

/**
 * @brief Performs bitwise OR of the bitmasks of columns of a table. Returns
 * a pair of resulting mask and count of unset bits.
 *
 * If any of the columns isn't nullable, it is considered all valid.
 * If no column in the table is nullable, an empty bitmask is returned.
 *
 * @param view The table of columns
 * @param mr Device memory resource used to allocate the returned device_buffer
 * @return A pair of resulting bitmask and count of unset bits
 */

So, if only some of the input columns are nullable (as is the case here), this returns a null pointer and a null count of zero (everything is valid).
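To make the failure mode concrete, here is a toy model of that documented behaviour (plain standard C++, not cudf's actual API; `Mask`, `bitmask_or_model`, and the `std::optional` representation are all invented for illustration):

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// A column's validity mask is optional; a missing mask means the
// column is non-nullable, i.e. every row in it is valid.
using Mask = std::optional<std::vector<bool>>;

// Returns {merged mask, count of rows that are invalid in *every*
// column}. If any column lacks a mask, the OR is all-valid, so an
// empty mask and a null count of zero come back; that empty mask is
// the null pointer that insert_if later dereferences.
std::pair<Mask, int> bitmask_or_model(std::vector<Mask> const& cols,
                                      int num_rows)
{
  for (auto const& m : cols) {
    if (!m) { return {std::nullopt, 0}; }  // some column is non-nullable
  }
  std::vector<bool> merged(num_rows, false);
  int null_count = 0;
  for (int row = 0; row < num_rows; ++row) {
    for (auto const& m : cols) { merged[row] = merged[row] || (*m)[row]; }
    if (!merged[row]) { ++null_count; }  // row is null in all columns
  }
  return {merged, null_count};
}
```

With one non-nullable column in the mix (as in the failing test), the model returns an empty mask and a zero null count, matching the `_row_bitmask = 0x0` seen in gdb.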

This is then used by cuco's insert_if_n kernel:

    auto const iter = thrust::counting_iterator<cudf::size_type>(0);
    // when nulls are equal, insert non-null rows only to improve efficiency
    if (nulls_equal == null_equality::EQUAL and has_nulls) {
      thrust::counting_iterator<size_type> stencil(0);
      auto const [row_bitmask, null_count] =
        cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
      row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};

      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
             static_cast<cudf::size_type>(null_count > 0);
    }
    // otherwise, insert all
    return key_set.insert(iter, iter + num_rows, stream.value());

AIUI, this check is designed to skip comparing rows when all entries in the row are null and nulls should compare equal.

So I think we need to check that all (rather than any) columns are nulls before dispatching to insert_if rather than insert.

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
index 7185dae77b..25f2a67cba 100644
--- a/cpp/src/stream_compaction/distinct_count.cu
+++ b/cpp/src/stream_compaction/distinct_count.cu
@@ -151,10 +151,12 @@ cudf::size_type distinct_count(table_view const& keys,
 
     auto const iter = thrust::counting_iterator<cudf::size_type>(0);
     // when nulls are equal, insert non-null rows only to improve efficiency
-    if (nulls_equal == null_equality::EQUAL and has_nulls) {
+    if (nulls_equal == null_equality::EQUAL and
+        std::all_of(keys.begin(), keys.end(), [](auto const& col) { return col.has_nulls(); })) {
       thrust::counting_iterator<size_type> stencil(0);
       auto const [row_bitmask, null_count] =
         cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
+      CUDF_EXPECTS(null_count > 0, "Unpossible!");
       row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};
 
       return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +


wence- commented Jun 19, 2023

So, if only some of the input columns are nullable (as is the case here), this returns a null pointer and a null count of zero (everything is valid).

Actually, this is a misreading of the documentation. I thought the documentation intended that the implementation of bitmask_or should just OR together those columns with non-empty bitmasks, but the implementation does not do that.

Ah, no, this is deliberate (as per #7406 (comment)). If any of the masks is null then the OR of the masks is null (i.e. everything is valid). So I think the first patch is correct.


ttnghia commented Jun 19, 2023

I think the logic here was wrong from the beginning. We want to avoid inserting if there is any null in the row, so we should check a "null OR" instead of bitmask_or (which is equivalent to a "valid OR"). In other words, the correct logic should use bitmask_and: only insert a row if that row doesn't contain any nulls, so the bitmask to check should be the result of bitmask_and.


ttnghia commented Jun 19, 2023

@wence- While looking for similar code, I only see one more instance of bitmask_or in Python:

_, has_null_group = bitmask_or([*index._columns])

I'm not familiar with Python code here so can you have a look and check if it is also correct, please?


wence- commented Jun 19, 2023

I think the logic here was wrong from the beginning. We want to avoid insert if there is any null in the row. So we should check "null_or" instead of bitmask_or (which is equivalent to "valid_or"). In other word, the correct logic should be using bitmask_and: only insert a row if that row doesn't contain any nulls, so the bitmask to check should be resulted from bitmask_and.

I think I disagree. We want to count distinct rows; if nulls should compare equal, then we still need to hash a row even if it only contains some nulls.

That is, the table:

 0   1
NA   2
 1   0
 2  NA

with null_equality::EQUAL should have a distinct_count of 4. If we were only to insert rows when the row has no nulls, then we would end up with a distinct_count of 3 (2 + (int)(null_count > 0)).

Do you agree?
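The semantics argued for above can be sketched in standard C++ (a toy model, not cudf code: `Row` and `distinct_count_nulls_equal` are invented names, and `std::optional` stands in for nullable values). Conveniently, `std::optional` already treats `nullopt` as equal to `nullopt` and orders it before any value, which matches `null_equality::EQUAL`:

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <vector>

// A row of nullable ints; std::nullopt plays the role of NA.
using Row = std::vector<std::optional<int>>;

// With nulls comparing equal, every row (including rows containing
// nulls) participates in the distinct count as-is, so a std::set of
// rows counts distinct rows directly.
int distinct_count_nulls_equal(std::vector<Row> const& table)
{
  return static_cast<int>(std::set<Row>(table.begin(), table.end()).size());
}
```

Running this on the 4-row table above yields 4, as the comment argues; inserting only fully non-null rows would give 3.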


wence- commented Jun 19, 2023

@wence- While looking for similar code, I only see one more instance of bitmask_or in Python:

_, has_null_group = bitmask_or([*index._columns])


I'm not familiar with Python code here so can you have a look and check if it is also correct, please?

This is safe afaict.

  1. It ignores the null mask itself anyway (it only checks the null count of the OR of the columns).
  2. The Cython wrapper wraps up an rmm::device_buffer for the null mask, but does this appropriately even for an empty buffer.


ttnghia commented Jun 19, 2023

Hmm, sorry, it seems that I was wrong. We want to insert a row if it still contains at least one non-null element, so using bitmask_or is correct. @wence- your patch above seems to be correct too. Thanks.

I thought that bitmask_or was incorrect because I was recalling similar (but not identical) situations in other places, such as here:

cudf::detail::bitmask_and(

@PointKernel Comment in the code misled me:

// when nulls are equal, insert non-null rows only to improve efficiency

So I thought that we were inserting only rows without any nulls. It should state this:

// when nulls are equal, insert only rows that are not all-nulls to improve efficiency


wence- commented Jun 19, 2023

So I thought that we were inserting only rows without any nulls. It should state this:

Yeah, I continue to be confused all the time by the interpretation of the null mask as "bit set => no null at this position".

So the way we detect that a row is fully null is to OR the bitmasks together (because setness [validity] in any column means that the row is not fully null).


ttnghia commented Jun 19, 2023

In terms of efficiency, I think using bitmask_or in the case of nulls comparing equal is not really efficient. Suppose we have millions of rows in the table, all of them repeated (so we have only 1 unique row), but no row is all-null. In that case, we will insert all rows.

The better option would be to use bitmask_and in the case of nulls comparing unequal, as in

// If the haystack table has nulls but they are compared unequal, don't insert them.

I really don't have a better solution for this. I (and maybe @wence- ) continue to be confused 😄


wence- commented Jun 19, 2023

I think this patch is a correctness fix, and only does insert_if if it is necessary:

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
index 7185dae77b..341648d548 100644
--- a/cpp/src/stream_compaction/distinct_count.cu
+++ b/cpp/src/stream_compaction/distinct_count.cu
@@ -150,15 +150,21 @@ cudf::size_type distinct_count(table_view const& keys,
       stream.value()};
 
     auto const iter = thrust::counting_iterator<cudf::size_type>(0);
-    // when nulls are equal, insert non-null rows only to improve efficiency
+    // when nulls are equal, only insert those rows that are not all null to improve efficiency
     if (nulls_equal == null_equality::EQUAL and has_nulls) {
       thrust::counting_iterator<size_type> stencil(0);
       auto const [row_bitmask, null_count] =
         cudf::detail::bitmask_or(keys, stream, rmm::mr::get_current_device_resource());
       row_validity pred{static_cast<bitmask_type const*>(row_bitmask.data())};
 
-      return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) +
-             static_cast<cudf::size_type>(null_count > 0);
+      // Unless all columns have a null mask, row_bitmask will be
+      // null, and null_count will be zero. Equally, unless there is
+      // some row which is null in all columns, null_count will be
+      // zero. So, it is only when null_count is not zero that we need
+      // to do a filtered insertion.
+      if (null_count > 0) {
+        return key_set.insert_if(iter, iter + num_rows, stencil, pred, stream.value()) + 1;
+      }
     }
     // otherwise, insert all
     return key_set.insert(iter, iter + num_rows, stream.value());
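The control flow of that patch can be sketched with toy types (invented names, plain C++, no cuco/cudf): a filtered insertion runs only when `null_count > 0`, which also guarantees the merged bitmask exists, and the all-null rows collectively count as one extra distinct row:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// row_valid[i] == false means row i is null in *all* columns, i.e. the
// merged (OR'd) validity bit for that row is unset.
int distinct_count_fixed(std::vector<int> const& row_hash,
                         std::vector<bool> const& row_valid)
{
  std::set<int> key_set;
  int null_count = 0;
  for (bool v : row_valid) { null_count += !v; }
  if (null_count > 0) {
    // insert_if path: skip all-null rows, then count them as a single
    // extra distinct row (they all compare equal to each other).
    for (std::size_t i = 0; i < row_hash.size(); ++i) {
      if (row_valid[i]) { key_set.insert(row_hash[i]); }
    }
    return static_cast<int>(key_set.size()) + 1;
  }
  // otherwise, insert all rows
  for (int h : row_hash) { key_set.insert(h); }
  return static_cast<int>(key_set.size());
}
```

When no row is all-null (including the empty-bitmask case that crashed), the predicate would be always true anyway, so the unfiltered insert is both correct and cheaper.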


wence- commented Jun 19, 2023

In terms of efficiency, I think using bitmask_or in the case of nulls comparing equal is not really efficient. Suppose we have millions of rows in the table, all of them repeated (so we have only 1 unique row), but no row is all-null. In that case, we will insert all rows.

But this is no worse than the case without any nulls (which is presumably the more common one).

Do we have any benchmarks here?


ttnghia commented Jun 19, 2023

I don't think we have any benchmarks for such extreme cases, because such benchmarks would never be able to complete.

@wence- wence- assigned wence- and unassigned PointKernel Jun 20, 2023
wence- added a commit to wence-/cudf that referenced this issue Jun 20, 2023
If nulls should compare equal then we want to filter out rows for
which all column entries are invalid (bitmask not set). If any column
is not nullable, then bitmask_or returns an empty bitmask buffer (and
a null count of zero) indicating that the returned bitmask is fully
valid. When passed as a predicate to insert_if we get a null pointer
dereference.

To avoid this, only run predicated insertion if the null count
returned from bitmask_or is positive (which guarantees that the
validity bitmask exists). This also avoids running predicated
insertion when the predicate is always true.

Closes rapidsai#13576.

wence- commented Jun 20, 2023

#13590 fixes the distinct count issue (with the most recent patch from this issue). @revans2 can you check if this also fixes the original (non-narrowed-down) problem?


revans2 commented Jun 20, 2023

#13590 fixes the distinct count issue (with the most recent patch from this issue). @revans2 can you check if this also fixes the original (non-narrowed-down) problem?

Sure, happy to check...

rapids-bot bot pushed a commit that referenced this issue Jun 21, 2023
If nulls should compare equal then we want to filter out rows for which all column entries are invalid (bitmask not set). If any column is not nullable, then bitmask_or returns an empty bitmask buffer (and a null count of zero) indicating that the returned bitmask is fully valid. When passed as a predicate to insert_if we get a null pointer dereference.

To avoid this, only run predicated insertion if the null count returned from bitmask_or is positive (which guarantees that the validity bitmask exists). This also avoids running predicated insertion when the predicate is always true.

Closes #13576.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13590