[BUG] cuml RF Regressor hangs indefinitely for n_bins that are not multiples of TPB #3919

Closed
venkywonka opened this issue Jun 1, 2021 · 2 comments · Fixed by #3921
Labels
? - Needs Triage Need team to review and classify bug Something isn't working

Comments

@venkywonka
Contributor

venkywonka commented Jun 1, 2021

Describe the bug
The RF regressor appears to hang indefinitely whenever n_bins is greater than the threads-per-block of the regression kernel (TPB=64 when this was encountered) AND is not a multiple of TPB.
In other words, in the current regression kernel implementation, every n_bins with n_bins > 64 && n_bins % 64 != 0 causes the kernel to deadlock.
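
The condition above can be written as a small predicate. This is a minimal host-side C++ sketch; the helper name `hits_reported_hang` and the default `tpb = 64` are illustrative assumptions based on this report and are not cuML code.

```cpp
#include <cassert>

// Hypothetical helper, not part of cuML: true when n_bins falls in the range
// reported to hang, i.e. n_bins > tpb and n_bins is not a multiple of tpb.
// tpb defaults to 64, the threads-per-block value mentioned in this report.
inline bool hits_reported_hang(int n_bins, int tpb = 64) {
  assert(n_bins > 0 && tpb > 0);
  return n_bins > tpb && n_bins % tpb != 0;
}
```

Under this predicate, n_bins values of 32, 64 and 128 are unaffected, while n_bins=100 (the value used in the reproducer below) is affected.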

Steps/Code to reproduce bug
Quickest way is to change this line from
(1, 1.0, True, 32)
to
(1, 1.0, True, 100)
and run this pytest command after building cuml from source:

pytest python/cuml/test/test_random_forest.py::test_rf_regression[special_reg0-1-1.0-True-100-float32-1.0] -v --run_quality --run_stress --run_unit

Expected behavior
The test should pass instead of hanging indefinitely.

Environment details (please complete the following information):

  • Environment location: [Bare-metal]
  • Linux Distro/Architecture: [Ubuntu 18.04 amd64]
  • GPU Model/Driver: [V100 and driver 460.35]
  • CUDA: [11.0]
  • Method of cuDF & cuML install: [from source]
    • If method of install is [from source], provide versions of cmake & gcc/g++ and commit hash of build
      • cmake: 3.20.1
      • g++: 9.3.0
      • commit hash: 95efa251e (branch-21.06)

Additional context
This seems to happen with regression when split_algo=1 and the new backend are used with n_bins=100, irrespective of the other hyperparameters.

@venkywonka venkywonka added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 1, 2021
@vinaydes
Contributor

vinaydes commented Jun 1, 2021

The line https://github.com/rapidsai/cuml/blob/branch-21.06/cpp/src/decisiontree/batched-levelalgo/kernels.cuh#L372 seems to be the issue.

for (IdxT tix = threadIdx.x; tix < max(TPB, nbins); tix += blockDim.x)

If nbins is not divisible by TPB and TPB < nbins, then in the last stride of the loop only some of the threads participate. The loop contains a __syncthreads() that the non-participating threads never reach, which causes the hang. The loop should instead be

for (IdxT tix = threadIdx.x; tix < ceil_div(nbins, TPB)*TPB; tix += blockDim.x)

This worked on my initial test.
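
To illustrate the reasoning above, here is a minimal host-side C++ sketch that counts how many times each thread of a block would execute the loop body under the original bound and under the rounded-up bound. It is a model only, not the cuML kernel: `trip_count`, `uniform_trip_count`, `original_bound` and `padded_bound` are hypothetical names, and `ceil_div` is reimplemented here on the assumption that it means ceiling division. A `__syncthreads()` inside the loop is only safe when every thread in the block reaches it the same number of times.

```cpp
#include <algorithm>
#include <cassert>

// Ceiling division; assumed to match the intent of ceil_div in the suggested fix.
inline int ceil_div(int a, int b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Number of times thread `tid` runs the body of
//   for (tix = tid; tix < bound; tix += tpb)
inline int trip_count(int tid, int bound, int tpb) {
  return tid < bound ? ceil_div(bound - tid, tpb) : 0;
}

// True when all tpb threads of a block run the loop body equally often,
// so a barrier inside the loop is reached by every thread on every pass.
inline bool uniform_trip_count(int bound, int tpb) {
  const int first = trip_count(0, bound, tpb);
  for (int tid = 1; tid < tpb; ++tid) {
    if (trip_count(tid, bound, tpb) != first) return false;
  }
  return true;
}

// Loop bound as quoted from the kernel line in this comment.
inline int original_bound(int nbins, int tpb) { return std::max(tpb, nbins); }

// Loop bound from the suggested fix: nbins rounded up to a multiple of tpb.
inline int padded_bound(int nbins, int tpb) { return ceil_div(nbins, tpb) * tpb; }
```

For nbins=100 and TPB=64, the original bound is 100, so threads 0 to 35 run the body twice while threads 36 to 63 run it once. The padded bound is 128, so every thread runs it twice. With the padded bound, the loop body presumably still has to guard accesses where tix >= nbins; that part is not shown in this thread.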

@venkywonka
Contributor Author

Thank you, Vinay. I will update the issue description based on this info.

@venkywonka venkywonka changed the title from "[BUG] cuml RF Regressor hangs indefinitely for some n_bins for new backend" to "[BUG] cuml RF Regressor hangs indefinitely for n_bins that are not multiples of TPB" Jun 1, 2021
@rapids-bot rapids-bot bot closed this as completed in #3921 Jun 1, 2021
rapids-bot bot pushed a commit that referenced this issue Jun 1, 2021
… `n_bins > TPB && n_bins % TPB != 0` (#3921)

* This mini-(but important)-PR fixes the bug in `pdf_to_cdf` device function that causes hang when `n_bins > TPB && n_bins % TPB != 0`
* This closes #3919

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Philip Hyunsu Cho (https://github.com/hcho3)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #3921
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this issue Oct 9, 2023