[BUG] compute-sanitizer errors in gtests #528

Closed
cjnolet opened this issue Feb 25, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@cjnolet
Member

cjnolet commented Feb 25, 2022

I noticed some strange failures in the gtests for the 3d rbc changes when running all of the gtests that I don't see when running only the ball cover tests. Running the gtests under compute-sanitizer trips several different memory errors. I'm creating this issue as a tracker and will put the individual memory errors in comments.

@cjnolet cjnolet added the bug Something isn't working label Feb 25, 2022
@cjnolet
Member Author

cjnolet commented Feb 25, 2022

The Hungarian tests look like they are failing. I also verified they are failing in 22.02:

========= COMPUTE-SANITIZER
Running main() from ../googletest/src/gtest_main.cc
Note: Google Test filter = Raft.Hung*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from Raft
[ RUN      ] Raft.HungarianIntFloat
========= Invalid __global__ write of size 4 bytes
=========     at 0xdd0 in void raft::lap::detail::kernel_dualUpdate_1<int, float>(T2 *, const T2 *, const int *, int, T1, T2)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7fb04a880a08 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7fb04a880a00 of size 8 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x23adbc]
=========                in /lib/x86_64-linux-gnu/libcuda.so
=========     Host Frame: [0x141cc]
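
The addresses in the report above can be checked with a little arithmetic. The following is a minimal host-only sketch, not part of raft; the helper names `capacity` and `element_index` are hypothetical, and it assumes the 4-byte write is a single `float` element (the kernel is instantiated as `<int, float>`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values copied from the sanitizer report above.
constexpr std::uint64_t kAllocBase  = 0x7fb04a880a00ULL;  // start of the nearest allocation
constexpr std::size_t   kAllocBytes = 8;                  // size of that allocation
constexpr std::uint64_t kFaultAddr  = 0x7fb04a880a08ULL;  // address of the invalid write
constexpr std::size_t   kWriteBytes = 4;                  // size of the write (assumed: one float)

// Number of whole elements of `elem_bytes` that fit in an allocation (hypothetical helper).
inline std::size_t capacity(std::size_t alloc_bytes, std::size_t elem_bytes) {
  assert(elem_bytes > 0);
  return alloc_bytes / elem_bytes;
}

// Element index that `addr` corresponds to, relative to `base` (hypothetical helper).
inline std::size_t element_index(std::uint64_t addr, std::uint64_t base, std::size_t elem_bytes) {
  assert(addr >= base && elem_bytes > 0);
  return static_cast<std::size_t>((addr - base) / elem_bytes);
}
```

With these values the allocation holds 2 elements and the faulting write lands on element index 2, the first slot past the end. That index matches the reported thread (2,0,0), which would be consistent with each thread writing its own element of a buffer that is too small.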

@cjnolet cjnolet changed the title [BUG] Memory errors in gtests [BUG] computer-sanitizer errors in gtests Feb 25, 2022
@cjnolet
Member Author

cjnolet commented Feb 25, 2022

I'm seeing an error in the interruptible tests as well. I'm also seeing this when running the distance tests, so I'm not sure whether it is related to interruptible or somehow to my environment. cc @achirkin

========= COMPUTE-SANITIZER
Running main() from ../googletest/src/gtest_main.cc
Note: Google Test filter = Raft.Inter*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from Raft
[ RUN      ] Raft.InterruptibleBasic
[       OK ] Raft.InterruptibleBasic (0 ms)
[ RUN      ] Raft.InterruptibleRepeatedGetToken
[       OK ] Raft.InterruptibleRepeatedGetToken (0 ms)
[ RUN      ] Raft.InterruptibleDelayedInit
[       OK ] Raft.InterruptibleDelayedInit (0 ms)
[ RUN      ] Raft.InterruptibleOpenMP
========= Program hit cudaErrorNotReady (error 600) due to "device not ready" on CUDA API call to cudaStreamQuery_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x34fb13]
=========                in /lib/x86_64-linux-gnu/libcuda.so
=========     Host Frame:cudaStreamQuery_ptsz [0x496c8]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcudart.so.11.0
=========     Host Frame:void raft::interruptible::synchronize_impl<cudaError (*)(CUstream_st*), rmm::cuda_stream_view>(cudaError (*)(CUstream_st*), rmm::cuda_stream_view) [0x1c67c0]
=========                in /share/workspace/rapids_projects/raft/cpp/build/./test_raft
=========     Host Frame:raft::Raft_InterruptibleOpenMP_Test::TestBody() [clone ._omp_fn.0] [0x2b1460]
=========                in /share/workspace/rapids_projects/raft/cpp/build/./test_raft
=========     Host Frame:../../../libgomp/team.c:126:gomp_thread_start [0x16208]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libgomp.so.1
=========     Host Frame:./nptl/pthread_create.c:474:start_thread [0x9450]
=========                in /lib/x86_64-linux-gnu/libpthread.so.0
=========     Host Frame:clone [0x117d53]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
========= Program hit cudaErrorNotReady (error 600) due to "device not ready" on CUDA API call to cudaStreamQuery_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x34fb13]
=========                in /lib/x86_64-linux-gnu/libcuda.so
=========     Host Frame:cudaStreamQuery_ptsz [0x496c8]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcudart.so.11.0
=========     Host Frame:void raft::interruptible::synchronize_impl<cudaError (*)(CUstream_st*), rmm::cuda_stream_view>(cudaError (*)(CUstream_st*), rmm::cuda_stream_view) [0x1c67c0]
=========                in /share/workspace/rapids_projects/raft/cpp/build/./test_raft

Even in the distance tests, though, it seems to arise from interruptible:

=========     Host Frame:void raft::interruptible::synchronize_impl<cudaError (*)(CUstream_st*), rmm::cuda_stream_view>(cudaError (*)(CUstream_st*), rmm::cuda_stream_view) [0x1c67c0]

@cjnolet
Member Author

cjnolet commented Feb 25, 2022

cc @ChuckHastings. Are you the correct poc for the lap/Hungarian code?

@ChuckHastings
Contributor

Yes.

rapids-bot bot pushed a commit that referenced this issue Feb 27, 2022
Addresses Hungarian bug described in #528.

The `dualUpdate` method was originally using an array of size one which was eventually changed to a scalar.  It really needs to be an array of size SP (number of subproblems in Date/Nagi nomenclature, number of batches as integrated into raft).

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #531
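
The sizing issue described in the commit message can be sketched on the host. This is an illustration only, not the raft implementation: `per_subproblem_min` is a hypothetical function, and it assumes the update produces one value per subproblem (here, a minimum over that subproblem's slack values), so the output needs SP slots rather than one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of a per-subproblem update:
// each subproblem `spid` writes exactly one result into out[spid].
inline std::vector<float> per_subproblem_min(const std::vector<std::vector<float>>& slacks) {
  const std::size_t sp = slacks.size();  // SP: number of subproblems (batches)
  std::vector<float> out(sp);            // one slot per subproblem, not a single scalar
  for (std::size_t spid = 0; spid < sp; ++spid) {
    assert(!slacks[spid].empty());
    out[spid] = *std::min_element(slacks[spid].begin(), slacks[spid].end());
  }
  return out;
}
```

If `out` had fewer than SP slots, every subproblem whose index is past the end would write out of bounds, which is the kind of access compute-sanitizer reports as an invalid `__global__` write.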
@achirkin
Contributor

I'm seeing an error in the interruptible tests as well. I'm also seeing this when running the distance tests, so I'm not sure whether it is related to interruptible or somehow to my environment. cc @achirkin

========= COMPUTE-SANITIZER
...
[ RUN      ] Raft.InterruptibleOpenMP
========= Program hit cudaErrorNotReady (error 600) due to "device not ready" on CUDA API call to cudaStreamQuery_ptsz.
...

If I understand this correctly, this is an expected behavior, isn't it? We call cudaStreamQuery in the loop, which returns cudaErrorNotReady until all the work in the stream is finished (or cancel is triggered). Though I wonder, is it possible to suppress this "error code" for the compute-sanitizer?..
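
The polling pattern described here can be sketched without CUDA. This is a simplified host-only illustration, not raft's `synchronize_impl`: `QueryStatus`, `poll_until_ready`, and `FakeStream` are hypothetical stand-ins, with `QueryStatus::NotReady` playing the role of `cudaErrorNotReady` and an atomic flag playing the role of the cancellation token.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <stdexcept>

// Stand-ins for cudaSuccess / cudaErrorNotReady (hypothetical, not the CUDA enum).
enum class QueryStatus { Success, NotReady };

// Polls `query` until it reports Success; throws if `cancelled` is set first.
// Returns how many NotReady answers were seen along the way.
inline int poll_until_ready(const std::function<QueryStatus()>& query,
                            const std::atomic<bool>& cancelled) {
  int not_ready_count = 0;
  for (;;) {
    if (cancelled.load()) { throw std::runtime_error("interrupted"); }
    if (query() == QueryStatus::Success) { return not_ready_count; }
    ++not_ready_count;  // NotReady is the normal "still running" answer, not a failure
  }
}

// A fake stream that finishes after a fixed number of queries (hypothetical test double).
struct FakeStream {
  int remaining;
  QueryStatus operator()() {
    assert(remaining >= 0);
    if (remaining == 0) { return QueryStatus::Success; }
    --remaining;
    return QueryStatus::NotReady;
  }
};
```

In a loop of this shape, every iteration before completion yields a "not ready" result by design, so a tool that reports each non-success return code will log one entry per poll.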

@cjnolet
Member Author

cjnolet commented Mar 1, 2022

Though I wonder, is it possible to suppress this "error code" for the compute-sanitizer?..

As it is currently, these errors really flood the compute-sanitizer output, making it hard to investigate the real errors. It would definitely be ideal if we could suppress these when running compute-sanitizer. Maybe in the worst case we could use a flag to turn this behavior off when running gtests?

@cjnolet cjnolet changed the title [BUG] computer-sanitizer errors in gtests [BUG] compute-sanitizer errors in gtests Mar 1, 2022
@achirkin
Contributor

achirkin commented Mar 2, 2022

I've contacted the compute-sanitizer devs; this will be fixed on their side. For now, we could use the --report-api-errors=no flag to view the other types of errors.

@cjnolet cjnolet closed this as completed Mar 29, 2022

3 participants