Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use thread_index_type to avoid index overflow in grid-stride loops #13895

Merged
merged 10 commits into from
Aug 24, 2023

Conversation

PointKernel
Copy link
Member

Description

This PR checks all related files under src/hash, src/bitmask and src/transform folders and fixes potential index overflow issues by using thread_index_type.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Aug 16, 2023
@PointKernel PointKernel self-assigned this Aug 16, 2023
@PointKernel PointKernel requested a review from a team as a code owner August 16, 2023 23:24
@PointKernel
Copy link
Member Author

Question for reviewers: I don't think we have tests exercising an input that could cause overflow. I hesitated to add those tests since it requires relatively large memory and may take a long time to execute. What do you think?

@PointKernel PointKernel added the 3 - Ready for Review Ready for review by team label Aug 16, 2023
Comment on lines 267 to 269
thread_index_type const tid = threadIdx.x + blockIdx.x * blockDim.x;
thread_index_type const stride = blockDim.x * gridDim.x;
thread_index_type thread_word_index = tid + first_word_index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we also need to upgrade first_ and last_ indices above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reply on implicit conversions for < or <= since they are always cast to the larger data type for comparison. first_word_index and last_word_index are explicitly used as size_type later in the loop thus setting them to thread_index_type seems not worth the effort.

Copy link
Contributor

@mythrocks mythrocks Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand, but I'd better confirm:

  1. first_word_index and last_word_index are int32_t.
  2. tid is uint32_t. So thread_word_index is eventually uint32_t.
  3. while (thread_word_idx <= last_word_index) compares the two as uint32_t.
  4. The place where thread_word_index might overflow is at line 275 (thread_word_index += stride). But both thread_word_index and stride are uint32_t already.

Yeah, it looks like this should work.

@karthikeyann
Copy link
Contributor

karthikeyann commented Aug 17, 2023

Question for reviewers: I don't think we have tests exercising an input that could cause overflow. I hesitated to add those tests since it requires relatively large memory and may take a long time to execute. What do you think?

The for-loop comparison happens against shorter-width type size_type. So, is overflow possible to test?

@PointKernel
Copy link
Member Author

PointKernel commented Aug 17, 2023

The for-loop comparison happens against shorter-width type size_type. So, is overflow possible to test?

It should be possible if the input size is smaller than INT_MAX and larger than INT_MAX - stride, then the "last" iteration with int32_t index will produce negative indices when doing tid += stride; which is smaller than the end condition thus causing illegal memory access in the iteration right after.

@karthikeyann
Copy link
Contributor

Benchmarks could be a good place for very large sizes (not for testing, but to catch illegal accesses). But Only unit tests are run through memcheck regularly. So, it won't be caught regularly unless it's added in unit tests. Besides, memcheck will be very slow to run on very large inputs.

Almost all of our algorithms aren't tested for INT32 max sizes. So, it may be okay to not add unit tests for this.
@PointKernel Were you able to verify with a unit test case locally? (How much time does it take to run?)

@PointKernel
Copy link
Member Author

@PointKernel Were you able to verify with a unit test case locally? (How much time does it take to run?)

I wrote a test as below and realized bitmask is a special case: the loop end condition is no smaller than the number of bitmask words which is total_bits / num_bits_per_word. Since total_bits is size_type, the thread index would almost never overflow (unless we change the "word" type to something 1 bit only).

TEST_F(SetBitmaskTest, index_overflow)
{
  auto const begin = 0;
  auto const end   = INT_MAX - 10;
  auto const valid = true;
  auto const size  = end - begin;

  thrust::host_vector<bool> expected(size, valid);
  rmm::device_buffer mask = create_null_mask(size, cudf::mask_state::UNINITIALIZED);

  auto bitmask = static_cast<cudf::bitmask_type*>(mask.data());
  cudf::set_null_mask(bitmask, begin, end, valid);

  auto stream = cudf::get_default_stream();

  rmm::device_uvector<bool> output(size, stream);
  auto counting_iter = thrust::counting_iterator<cudf::size_type>{0};
  thrust::transform(rmm::exec_policy(stream),
                    counting_iter,
                    counting_iter + size,
                    output.begin(),
                    valid_bit_functor{bitmask});

  auto const result = thrust::all_of(
    rmm::exec_policy(stream), output.begin(), output.end(), thrust::identity<bool>{});

  EXPECT_EQ(result, valid);
}

Not sure if it's still relevant, to answer your question about runtime, the test took about 700 ms to run.

(base) yunsongw@yunsongw-dt:~/dev/rapids/cudf/cpp/build/release/gtests$ ./BITMASK_TEST --gtest_filter=SetBitmaskTest.index_overflow
Note: Google Test filter = SetBitmaskTest.index_overflow
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SetBitmaskTest
[ RUN      ] SetBitmaskTest.index_overflow
[       OK ] SetBitmaskTest.index_overflow (708 ms)
[----------] 1 test from SetBitmaskTest (708 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (708 ms total)
[  PASSED  ] 1 test.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@karthikeyann karthikeyann changed the title Use thread_index_type to avoid index overflow in grid-stride loops Use thread_index_type to avoid index overflow in grid-stride loops Aug 24, 2023
@karthikeyann karthikeyann changed the title Use thread_index_type to avoid index overflow in grid-stride loops Use thread_index_type to avoid index overflow in grid-stride loops Aug 24, 2023
@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit ff99f98 into rapidsai:branch-23.10 Aug 24, 2023
@PointKernel PointKernel deleted the fix-index-overflow branch May 23, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants