[BUG] Thrust 1.12 causes segfault in SVC pytest #3885

Closed
dantegd opened this issue May 21, 2021 · 1 comment · Fixed by #3968

dantegd commented May 21, 2021

Describe the bug
PR #3844 makes it easy to pin Thrust independently of the CUDA Toolkit (CTK). When I pinned it to 1.12 (the latest Thrust release), the PR ran into the following segfault:

cuml/test/test_svm.py::test_svm_skl_cmp_decision_function[params0] Fatal Python error: Aborted

Current thread 0x00007f4948292740 (most recent call first):
  File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
  File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/test/test_svm.py", line 304 in test_svm_skl_cmp_decision_function
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/galahad/miniconda3/envs/ns0520/bin/pytest", line 11 in <module>
[1]    72354 abort (core dumped)  pytest cuml/test/test_svm.py::test_svm_skl_cmp_decision_function -v

I traced the error to the call of SVC.fit, `cuSVC.fit(X_train, y_train)`, so creating a small reproducer shouldn't be hard.

A quick run of that pytest with cuda-gdb threw the following error:

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python, configfile: pytest.ini
plugins: benchmark-3.4.1, hypothesis-6.13.0, xdist-2.2.1, cov-2.12.0, timeout-1.4.2, anyio-3.1.0, asyncio-0.12.0, forked-1.3.0
collected 2 items                                                                                                                                                                                                 

cuml/test/test_svm.py [New Thread 0x7fff38634700 (LWP 74042)]
[New Thread 0x7fff40e35700 (LWP 74043)]
warning: Cuda API error detected: cudaMemsetAsync returned (0x1)

warning: Cuda API error detected: cudaGetLastError returned (0x1)


CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555559a87900

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 40, block (0,0,0), thread (192,0,0), device 0, sm 0, warp 5, lane 0]
0x0000555559a87910 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long)<<<(6,1,1),(256,1,1)>>> ()
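
Reading the demangled kernel name, the faulting launch looks like a Thrust transform with `thrust::identity<bool>` that reads through a `thrust::permutation_iterator` (a bool array indexed by an int array) and writes to a bool output, which amounts to a gather. Below is a host-side Python sketch of what such a gather computes. The function and variable names are hypothetical, and it does not claim to identify the root cause. It only illustrates how an out-of-range index (or a buffer shorter than expected) would turn into an illegal address on the device, where nothing checks the bounds.

```python
def gather(values, indices):
    """Host-side model of out[i] = values[indices[i]] with explicit bounds checks.

    On the GPU the equivalent read is unchecked, so an index outside
    [0, len(values)) would touch memory the kernel does not own.
    """
    n = len(values)
    out = []
    for pos, idx in enumerate(indices):
        if not 0 <= idx < n:
            raise IndexError(f"index {idx} at position {pos} is outside [0, {n})")
        out.append(values[idx])
    return out


flags = [True, False, True, False]
print(gather(flags, [2, 0, 3]))  # [True, True, False]
```

Checking the index array and the sizes of both buffers right before the corresponding Thrust call would be one way to narrow this down.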

Steps/Code to reproduce bug
The easiest way is to use PR #3844 (or wait until it is merged), change the Thrust version in cpp/cmake/thirdparty/get_thrust to 1.12, and build libcuml++ with it.
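
Once libcuml++ and the Python package are built against Thrust 1.12, a standalone script along these lines might exercise the same `SVC.fit` path without pytest. This is only a sketch: the toy dataset, the `rbf` kernel, `C=1.0`, and the helper names `make_data` and `run_repro` are assumptions and are not taken from `test_svm.py`, so it may or may not trigger the crash.

```python
import random


def make_data(n_samples=200, n_features=8, seed=0):
    """Build a toy, linearly separable binary classification problem."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Label by the sign of the first feature so the classes are separable.
    y = [1.0 if row[0] > 0.0 else 0.0 for row in X]
    return X, y


def run_repro():
    """Fit cuML's SVC on the toy data; returns None if cuML or NumPy is missing."""
    try:
        import numpy as np
        from cuml.svm import SVC
    except ImportError:
        return None
    X, y = make_data()
    X = np.asarray(X, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    clf = SVC(kernel="rbf", C=1.0)  # placeholder hyperparameters
    clf.fit(X, y)  # the call the issue traces the crash to
    return clf.decision_function(X)


if __name__ == "__main__":
    scores = run_repro()
    print("cuML not available" if scores is None else f"fit OK, {len(scores)} scores")
```

Running the script under cuda-gdb or compute-sanitizer should make it easier to see which launch faults.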

Expected behavior
The pytests should pass with no issues. This could also be a blocker for upgrading Thrust.

Environment details (please complete the following information):

  • Environment location: CI and bare metal
  • Linux Distro/Architecture: 20.04 and CentOS 7
  • GPU Model/Driver: 3080 and V11
  • CUDA: 11.2 using thrust 1.12
  • Method of cuDF & cuML install: from source

cc @tfeher who might be able to diagnose or triage this much faster than myself

dantegd added the bug and ? - Needs Triage labels May 21, 2021
tfeher self-assigned this May 28, 2021

tfeher commented May 28, 2021

I could reproduce the issue using the nightly Ubuntu 18.04 dev image, on V100. Looking into the details.

trxcllnt added a commit to trxcllnt/cuml that referenced this issue Jun 10, 2021
rapids-bot bot pushed a commit that referenced this issue Jun 16, 2021
…s, update dependencies (#3968)

* Updates dask/distributed versions to match cuDF (rapidsai/cudf#8458)
* Updates to Thrust v1.12.0 to align with cuDF and cuGraph
* Don't include the src and src_prims directories in `cuml::cuml++` target's public include paths
* Add missing `<cstddef>` and `<cstdint>` include directives
* Promote `trustworthiness_score` to public `cuml/metrics/metrics.hpp` header and update Cython
* Compile Cython with `-std=c++17`
* Remove `-Wstrict-prototypes` Cython warning
* Fixes linker error in debug builds
* Fixes #3885

Authors:
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #3968
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023