
[REVIEW] Fix for crash in RF when max_leaves parameter is specified #4126

Merged: 2 commits into rapidsai:branch-21.08 from fix-rf-n-leaves-issue on Jul 29, 2021

Conversation

vinaydes (Contributor)

Fixes issue #4046.
In `nodeSplitKernel`, each thread calls `leafBasedOnParams()`, which reads the global variable `n_leaves`. Different threads from the same threadblock read `n_leaves` at different times, and between two such reads its value can be changed by another threadblock. As a result, one or a few threads might conclude that `max_leaves` has been reached while the rest conclude otherwise. This disagreement caused a crash while partitioning the samples.

In the solution provided here, instead of every thread reading `n_leaves`, a single thread per threadblock reads the value and broadcasts it to the other threads via shared memory. This ensures that all threads in a threadblock agree on the `max_leaves` criterion. A sketch of the pattern is shown below.
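
A minimal sketch of the before/after pattern (the kernel, parameters, and buffers here are hypothetical stand-ins for illustration, not the actual cuML code):

```cuda
// sketch.cu -- illustrates the race and the shared-memory broadcast fix.
// All names (nodeSplitSketch, n_leaves, max_leaves, decisions) are
// illustrative placeholders, not the actual cuML identifiers.
#include <cuda_runtime.h>

__global__ void nodeSplitSketch(const int* n_leaves, int max_leaves, int* decisions)
{
  // Buggy pattern: every thread dereferences the global counter itself.
  // Another threadblock may increment it between two reads, so threads in
  // the same block can disagree on whether max_leaves was reached:
  //
  //   bool is_leaf = (*n_leaves >= max_leaves);  // value may differ per thread
  //
  // Fixed pattern: one thread takes a single snapshot and broadcasts it
  // through shared memory, so the whole block sees the same value.
  __shared__ int s_n_leaves;
  if (threadIdx.x == 0) { s_n_leaves = *n_leaves; }
  __syncthreads();

  bool is_leaf = (s_n_leaves >= max_leaves);  // identical for every thread
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  decisions[idx] = is_leaf ? 1 : 0;
}

int main()
{
  const int threads = 128, blocks = 4, n = threads * blocks;
  int *d_n_leaves, *d_decisions;
  cudaMalloc(&d_n_leaves, sizeof(int));
  cudaMalloc(&d_decisions, n * sizeof(int));

  int h_n_leaves = 100;  // pretend 100 leaves have been created so far
  cudaMemcpy(d_n_leaves, &h_n_leaves, sizeof(int), cudaMemcpyHostToDevice);

  nodeSplitSketch<<<blocks, threads>>>(d_n_leaves, /*max_leaves=*/128, d_decisions);
  cudaDeviceSynchronize();

  cudaFree(d_n_leaves);
  cudaFree(d_decisions);
  return 0;
}
```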

Performance results to be posted shortly.

@vinaydes vinaydes requested a review from a team as a code owner July 29, 2021 08:18
@vinaydes (Contributor, Author)

A minimal performance regression was observed overall, with no loss of accuracy.
Detailed performance results:
For max_depth = 32

Average performance gain =  -0.19167074530609718 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.830460  0.830460
1        higgs  0.747569  0.747569
2      epsilon  0.755010  0.755010
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  419.453003  419.453003
1                year   88.178551   88.178551

[performance chart]

For max_depth = 24

Average performance gain =  -0.12289140368544338 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.810743  0.810743
1        higgs  0.742905  0.742905
2      epsilon  0.756150  0.756150
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  418.358765  418.358765
1                year   87.964699   87.964699

[performance chart]

For max_depth = 18

Average performance gain =  -0.2654329881113543 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.753953  0.753953
1        higgs  0.734163  0.734163
2      epsilon  0.760830  0.760830
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  433.178467  433.178467
1                year   87.441147   87.441147

[performance chart]

The GBM-bench parameter file used for benchmarking can be located here.

@vinaydes vinaydes changed the title [WIP] Fix for crash in RF when max_leaves parameter is specified [REVIEW] Fix for crash in RF when max_leaves parameter is specified Jul 29, 2021
@venkywonka venkywonka (Contributor) left a comment

awesome fix vinay! 🙏🏻

@teju85 teju85 (Member) left a comment

Changes LGTM. Thank you Vinay and Venkat for the fix.

@teju85 teju85 added bug Something isn't working non-breaking Non-breaking change labels Jul 29, 2021
@dantegd (Member) commented Jul 29, 2021

@gpucibot merge

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@9406d53).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.08    #4126   +/-   ##
===============================================
  Coverage                ?   85.91%           
===============================================
  Files                   ?      232           
  Lines                   ?    18377           
  Branches                ?        0           
===============================================
  Hits                    ?    15788           
  Misses                  ?     2589           
  Partials                ?        0           
Flag       Coverage Δ
dask       48.03% <0.00%> (?)
non-dask   78.45% <0.00%> (?)

Flags with carried forward coverage won't be shown.



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9406d53...3a814fa.

@rapids-bot rapids-bot bot merged commit 5f5fc49 into rapidsai:branch-21.08 Jul 29, 2021
@vinaydes vinaydes deleted the fix-rf-n-leaves-issue branch February 8, 2023 08:34
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Fix for crash in RF when max_leaves parameter is specified (rapidsai#4126)


Authors:
  - Vinay Deshpande (https://github.com/vinaydes)

Approvers:
  - Venkat (https://github.com/venkywonka)
  - Rory Mitchell (https://github.com/RAMitchell)
  - Thejaswi. N. S (https://github.com/teju85)

URL: rapidsai#4126
Labels: bug (Something isn't working) · CUDA/C++ · non-breaking (Non-breaking change)