[BUG] join output row count returns negative number when row count exceeds int32_t #9625

Closed
jlowe opened this issue Nov 8, 2021 · 0 comments · Fixed by #9626
Labels: bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

jlowe (Member) commented Nov 8, 2021

Describe the bug
When an inner join would produce more than 2^31 - 1 output rows (the maximum value of int32_t), hash_join::inner_join_size returns a negative number rather than the correct result.
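
A minimal sketch of how a row count can turn negative when it is stored in a 32-bit signed type. It does not use libcudf. `narrow_to_int32` and `self_join_rows` are hypothetical helpers written for illustration, and the sketch assumes the 32-bit type behaves like `int32_t` on a two's-complement platform.

```cpp
#include <cstdint>

// Hypothetical helper: number of output rows when a table whose keys are all
// equal is inner-joined with itself (every row matches every row).
std::uint64_t self_join_rows(std::uint64_t num_rows)
{
  return num_rows * num_rows;
}

// Hypothetical helper: what a 64-bit count looks like after being squeezed
// into a 32-bit signed integer. The unsigned narrowing is well defined
// (modulo 2^32); the final conversion to signed wraps on two's-complement
// platforms and is guaranteed to do so from C++20 onward.
std::int32_t narrow_to_int32(std::uint64_t count)
{
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(count));
}
```

Under these assumptions, 46340 all-equal rows still fit (`narrow_to_int32(self_join_rows(46340))` is 2147395600), while 46341 rows already wrap to a negative value.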

Steps/Code to reproduce bug
Apply the following patch and run JOIN_TEST.

diff --git a/cpp/tests/join/join_tests.cpp b/cpp/tests/join/join_tests.cpp
index d64b40c38b..e6ae709f00 100644
--- a/cpp/tests/join/join_tests.cpp
+++ b/cpp/tests/join/join_tests.cpp
@@ -1418,6 +1418,19 @@ TEST_F(JoinTest, HashJoinWithStructsAndNulls)
   }
 }
 
+TEST_F(JoinTest, HashJoinLargeOutputSize)
+{
+  // self-join a table of zeroes to generate an output row count that would overflow int32_t
+  std::size_t col_size = 65567;
+  rmm::device_buffer zeroes(col_size * sizeof(int32_t), rmm::cuda_stream_default);
+  CUDA_TRY(cudaMemsetAsync(zeroes.data(), 0, zeroes.size(), rmm::cuda_stream_default.value()));
+  cudf::column_view col_zeros(cudf::data_type{cudf::type_id::INT32}, col_size, zeroes.data());
+  cudf::table_view tview{{col_zeros}};
+  cudf::hash_join hash_join(tview, cudf::null_equality::UNEQUAL);
+  std::size_t output_size = hash_join.inner_join_size(tview);
+  EXPECT_EQ(col_size * col_size, output_size);
+}
+
 struct JoinDictionaryTest : public cudf::test::BaseFixture {
 };

Expected behavior
The output row count is correct even if the value exceeds 31 bits.

jlowe added the bug, Needs Triage, libcudf, and Spark labels Nov 8, 2021
jlowe removed the Needs Triage label Nov 8, 2021
jlowe self-assigned this Nov 8, 2021
rapids-bot closed this as completed in #9626 Nov 8, 2021
rapids-bot pushed a commit that referenced this issue Nov 8, 2021
Fixes #9625.  Updates `hash_join::compute_join_output_size` to use std::size_t instead of cudf::size_type as the intermediate type to hold the computed output size.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Alessandro Bellina (https://github.com/abellina)
  - MithunR (https://github.com/mythrocks)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - https://github.com/nvdbaranec

URL: #9626
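
The idea behind the fix can be sketched outside libcudf as follows. This is an illustration of accumulating in a wider type, not the actual `hash_join::compute_join_output_size` code. `size_type` here is a local alias assumed to mirror `cudf::size_type`, `total_join_size` is a hypothetical function, and `std::size_t` is assumed to be 64 bits wide.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Assumption: a local stand-in for cudf::size_type (a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical function: sum the number of matches found for each probe row.
// Each per-row count fits in size_type, but the running total is kept in
// std::size_t so it cannot overflow the 32-bit range.
std::size_t total_join_size(std::vector<size_type> const& matches_per_row)
{
  return std::accumulate(matches_per_row.begin(),
                         matches_per_row.end(),
                         std::size_t{0},
                         [](std::size_t acc, size_type matches) {
                           return acc + static_cast<std::size_t>(matches);
                         });
}
```

With 65567 probe rows that each match 65567 build rows, as in the test from the issue, the total is 65567 * 65567 = 4299031489, which only a type wider than 32 bits can represent.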