
[REVIEW] Fix for crash in RF when max_leaves parameter is specified #4126

Merged: 2 commits into rapidsai:branch-21.08 from fix-rf-n-leaves-issue on Jul 29, 2021

Conversation

vinaydes (Contributor)

Fixes issue #4046.
In `nodeSplitKernel`, each thread calls `leafBasedOnParams()`, which reads the global variable `n_leaves`. Different threads from the same threadblock read `n_leaves` at different times, and between two such reads its value can be changed by another threadblock. As a result, one or a few threads might conclude that `max_leaves` has been reached while the rest conclude otherwise. This disagreement caused a crash while partitioning the samples.

In the solution provided here, instead of every thread reading `n_leaves`, a single thread per threadblock reads the value and broadcasts it to the other threads via shared memory. This ensures that all threads in a threadblock agree on the `max_leaves` criterion. A sketch of the pattern is shown below.
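
A minimal sketch of the before/after pattern (the kernel, parameters, and buffers here are hypothetical stand-ins for illustration, not the actual cuML code):

```cuda
// sketch.cu -- illustrates the race and the shared-memory broadcast fix.
// All names (nodeSplitSketch, n_leaves, max_leaves, decisions) are
// illustrative placeholders, not the actual cuML identifiers.
#include <cuda_runtime.h>

__global__ void nodeSplitSketch(const int* n_leaves, int max_leaves, int* decisions)
{
  // Buggy pattern: every thread dereferences the global counter itself.
  // Another threadblock may increment it between two reads, so threads in
  // the same block can disagree on whether max_leaves was reached:
  //
  //   bool is_leaf = (*n_leaves >= max_leaves);  // value may differ per thread
  //
  // Fixed pattern: one thread takes a single snapshot and broadcasts it
  // through shared memory, so the whole block sees the same value.
  __shared__ int s_n_leaves;
  if (threadIdx.x == 0) { s_n_leaves = *n_leaves; }
  __syncthreads();

  bool is_leaf = (s_n_leaves >= max_leaves);  // identical for every thread
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  decisions[idx] = is_leaf ? 1 : 0;
}

int main()
{
  const int threads = 128, blocks = 4, n = threads * blocks;
  int *d_n_leaves, *d_decisions;
  cudaMalloc(&d_n_leaves, sizeof(int));
  cudaMalloc(&d_decisions, n * sizeof(int));

  int h_n_leaves = 100;  // pretend 100 leaves have been created so far
  cudaMemcpy(d_n_leaves, &h_n_leaves, sizeof(int), cudaMemcpyHostToDevice);

  nodeSplitSketch<<<blocks, threads>>>(d_n_leaves, /*max_leaves=*/128, d_decisions);
  cudaDeviceSynchronize();

  cudaFree(d_n_leaves);
  cudaFree(d_decisions);
  return 0;
}
```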

Performance results to be posted shortly.

@vinaydes vinaydes requested a review from a team as a code owner July 29, 2021 08:18
@vinaydes (Contributor, Author)

A minimal performance regression was observed overall, with no loss of accuracy.
Detailed performance results:
For max_depth = 32

Average performance gain =  -0.19167074530609718 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.830460  0.830460
1        higgs  0.747569  0.747569
2      epsilon  0.755010  0.755010
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  419.453003  419.453003
1                year   88.178551   88.178551

[performance chart]

For max_depth = 24

Average performance gain =  -0.12289140368544338 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.810743  0.810743
1        higgs  0.742905  0.742905
2      epsilon  0.756150  0.756150
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  418.358765  418.358765
1                year   87.964699   87.964699

[performance chart]

For max_depth = 18

Average performance gain =  -0.2654329881113543 %
Classification accuracy =
  dataset_name  Baseline   PR 4126
0      airline  0.753953  0.753953
1        higgs  0.734163  0.734163
2      epsilon  0.760830  0.760830
Regression MSE =
         dataset_name    Baseline     PR 4126
0  airline_regression  433.178467  433.178467
1                year   87.441147   87.441147

[performance chart]

The GBM-bench parameter file used for benchmarking can be located here.

@vinaydes vinaydes changed the title [WIP] Fix for crash in RF when max_leaves parameter is specified [REVIEW] Fix for crash in RF when max_leaves parameter is specified Jul 29, 2021
@venkywonka venkywonka (Contributor) left a comment

awesome fix vinay! 🙏🏻

@teju85 teju85 (Member) left a comment

Changes LGTM. Thank you Vinay and Venkat for the fix.

@teju85 teju85 added bug Something isn't working non-breaking Non-breaking change labels Jul 29, 2021
@dantegd (Member) commented Jul 29, 2021

@gpucibot merge

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@9406d53).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.08    #4126   +/-   ##
===============================================
  Coverage                ?   85.91%           
===============================================
  Files                   ?      232           
  Lines                   ?    18377           
  Branches                ?        0           
===============================================
  Hits                    ?    15788           
  Misses                  ?     2589           
  Partials                ?        0           
Flag       Coverage Δ
dask       48.03% <0.00%> (?)
non-dask   78.45% <0.00%> (?)

Flags with carried forward coverage won't be shown.



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9406d53...3a814fa.

@rapids-bot rapids-bot bot merged commit 5f5fc49 into rapidsai:branch-21.08 Jul 29, 2021
@vinaydes vinaydes deleted the fix-rf-n-leaves-issue branch February 8, 2023 08:34
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Fix for crash in RF when max_leaves parameter is specified (rapidsai#4126)


Authors:
  - Vinay Deshpande (https://github.com/vinaydes)

Approvers:
  - Venkat (https://github.com/venkywonka)
  - Rory Mitchell (https://github.com/RAMitchell)
  - Thejaswi. N. S (https://github.com/teju85)

URL: rapidsai#4126
Labels: bug (Something isn't working) · CUDA/C++ · non-breaking (Non-breaking change)