Performance optimization of RF split kernels by removing empty cycles #3818
Conversation
```diff
   // variables
   auto end = range_start + range_len;
-  auto len = nbins * 2;
+  // auto len = nbins * 2;
```
Can we simply remove this line instead of commenting it?
Yes. I need to redo the regression part anyway after merging with #3845. I'll remove it then.
```diff
-  auto cdf_spred_len = 2 * nbins;
   IdxT stride = blockDim.x * gridDim.x;
   IdxT tid = threadIdx.x + blockIdx.x * blockDim.x;
+  // auto cdf_spred_len = 2 * nbins;
```
Can we simply remove this line instead of commenting it?
Same as above.
```diff
@@ -655,7 +708,7 @@ __global__ void computeSplitRegressionKernel(
   __syncthreads();

   /* Make a second pass over the data to compute gain */

   auto coloffset = col * input.M;
```
Is `coloffset` used anywhere in the kernel?
Codecov Report

```
@@           Coverage Diff            @@
##   branch-21.06    #3818    +/-   ##
===============================================
  Coverage           ?   85.43%
===============================================
  Files              ?      226
  Lines              ?    17281
  Branches           ?        0
===============================================
  Hits               ?    14764
  Misses             ?     2517
  Partials           ?        0
===============================================
```

Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
gbm-bench results for this PR:
- Accuracy remains unchanged for both classification and regression
- Fit time improves for both classification and regression
- Removing sklearn to zoom in on the impact of this PR
- Improvement in percentage terms
- Covtype and Fraud are relatively tiny datasets, so their fit time performance change is not dominated by
@gpucibot merge
…rapidsai#3818)

The compute split kernels for classification and regression end up doing a lot of work that is not required. This PR removes many of these empty work cycles through the following changes:

1. When computing the split for a node, launch a number of thread blocks proportional to the number of samples in that node. Before this PR, the number of thread blocks was fixed for all nodes.
2. Check whether a node is a leaf before launching the kernel; if it is a leaf, do not launch any thread blocks for it.
3. Don't call update on a split if no valid split is found for a feature.
4. Skip the round trip to global memory before evaluating the best split if only one thread block is operating on a node.

**Performance improvement observed**

Classification problem on a synthetic dataset, `computeSplitClassificationKernel` timings:
```
branch-0.20: 22.91 seconds
This branch:  5.27 seconds
Gain: 4.35x
```
Regression problem on a synthetic dataset, `computeSplitRegressionKernel` timings:
```
branch-0.20: 36.46 seconds
This branch: 34.03 seconds
Gain: 1.07x
```
Empty cycles are not the major performance issue in the regression code, therefore we do not currently see a large improvement there.

Authors:
- Vinay Deshpande (https://github.com/vinaydes)
- Rory Mitchell (https://github.com/RAMitchell)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- Thejaswi. N. S (https://github.com/teju85)
- Philip Hyunsu Cho (https://github.com/hcho3)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#3818
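The per-node launch sizing in changes 1 and 2 above can be sketched on the host side as follows. This is a minimal illustration, not the actual cuML code; the `Node` struct and the function and parameter names are hypothetical:

```cpp
#include <algorithm>

// Hypothetical node record: how many samples it covers and whether it is a leaf.
struct Node {
  int n_samples;
  bool is_leaf;
};

// Change 1: size the grid proportionally to the node's sample count,
// instead of launching one fixed-size grid for every node.
int blocksForNode(const Node& node, int threads_per_block, int max_blocks) {
  // Ceiling division: enough blocks to cover every sample once.
  int blocks = (node.n_samples + threads_per_block - 1) / threads_per_block;
  return std::min(blocks, max_blocks);
}

// Change 2: skip the launch entirely for leaf nodes, so no empty
// thread blocks are ever scheduled for nodes that need no split.
int blocksToLaunch(const Node& node, int threads_per_block, int max_blocks) {
  if (node.is_leaf) return 0;  // leaf: nothing to compute, launch nothing
  return blocksForNode(node, threads_per_block, max_blocks);
}
```

With a fixed grid, small nodes would still occupy the full grid while most blocks found no work; sizing the grid per node removes those empty cycles.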