Improve initial setup time and memory consumption in fast histogram #2543
Conversation
Here are some results on the URL dataset (see here for description), using a c3.8xlarge instance (60 GB RAM, 32 cores):
Codecov Report
@@ Coverage Diff @@
## master #2543 +/- ##
==========================================
- Coverage 35.11% 34.86% -0.26%
==========================================
Files 79 80 +1
Lines 6971 7062 +91
Branches 680 695 +15
==========================================
+ Hits 2448 2462 +14
- Misses 4422 4512 +90
+ Partials 101 88 -13
Continue to review full report at Codecov.
@hcho3 Let me know if it is ready to merge.
@hcho3 Does this PR change the training speed, or only the setup time? (So I know whether I have to redo my benchmarks based on this PR.)
Updating to dmlc/xgboost#2543 (24/07/2017)
@hcho3 It crashes on my custom reputation dataset (2,250,000 observations x 23,636 features) when it reaches the feature grouping part. If I disable feature grouping, it works (after 18 minutes of columnar access generation).
Working without feature grouping:
> model <- xgb.train(params = list(nthread = 40,
+ #max_depth = 3,
+ num_leaves = 127,
+ tree_method = "hist",
+ grow_policy = "depthwise",
+ eta = 0.25,
+ max_bin = 255,
+ eval_metric = "auc",
+ debug_verbose = 2,
+ enable_feature_grouping = 0),
+ data = train,
+ nrounds = 10,
+ watchlist = list(test = test),
+ verbose = 2,
+ early_stopping_rounds = 50)
[13:33:22] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[13:33:22] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[13:33:29] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[13:35:08] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[13:35:20] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[13:53:24] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 1201.81 sec
[13:54:20] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 98 extra nodes, 0 pruned nodes, max_depth=6
[13:54:20] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0497 ( 0.09%)
InitNewNode: 0.0312 ( 0.06%)
BuildHist: 51.9788 (92.85%)
EvaluateSplit: 3.8466 ( 6.87%)
ApplySplit: 0.0780 ( 0.14%)
========================================
Total: 55.9844
[13:54:20] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 0.34489 sec
[13:54:20] amalgamation/../src/learner.cc:373: EvalOneIter(): 0 sec
[1] test-auc:0.500636
Will train until test_auc hasn't improved in 50 rounds.
[13:54:58] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 90 extra nodes, 0 pruned nodes, max_depth=6
[13:54:58] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0094 ( 0.02%)
InitNewNode: 0.0467 ( 0.12%)
BuildHist: 34.3774 (90.31%)
EvaluateSplit: 3.5781 ( 9.40%)
ApplySplit: 0.0546 ( 0.14%)
========================================
Total: 38.0662
[13:54:58] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 0.0157399 sec
[13:54:58] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.00942326 sec
I also tested on Bosch: it used to start nearly instantly, but it now takes a long time to start (8 min 30 s).
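The "quantizing data matrix entries into quantile indices" step in the logs above boils down to mapping each raw feature value onto a precomputed quantile bin. A minimal sketch of that mapping (an illustration, not xgboost's actual code; `BinIndex` and its clamping behavior are assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: map a raw feature value to the index of the first
// quantile cut point that is >= the value. `cuts` holds the ascending
// upper boundaries of each bin and must be non-empty.
std::uint32_t BinIndex(const std::vector<double>& cuts, double value) {
  auto it = std::lower_bound(cuts.begin(), cuts.end(), value);
  if (it == cuts.end()) --it;  // clamp values beyond the last cut into the top bin
  return static_cast<std::uint32_t>(it - cuts.begin());
}
```

Once every entry is replaced by such a small integer index, histograms can be built by simple array indexing, which is why `BuildHist` dominates the per-iteration profile above.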
@Laurae2 Sorry for the late reply. I was out of town for a few days. The improvement was solely concerned with setup time. Thanks again for your contribution. I do have a few questions for you.
I'm trying to guess whether feature grouping crashes due to lack of memory.
@tqchen I do need to take another look at this pull request. While I'm at it, I'm also inclined to completely rewrite the feature grouping logic to make it more parallel. (The logic is inherently sequential now, as it has to inspect one feature at a time.) Let me get back to you on this.
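To see why the grouping logic is inherently sequential, consider a greedy sketch of conflict-based feature bundling (a hypothetical illustration, not xgboost's actual implementation; `Conflicts` and `GroupFeatures` are assumed names): each feature's placement depends on the groups formed by all features before it.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Sorted row indices where a feature is nonzero.
using RowSet = std::vector<std::size_t>;

// True if two features are ever nonzero in the same row.
bool Conflicts(const RowSet& a, const RowSet& b) {
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] == b[j]) return true;  // co-occurrence in this row
    (a[i] < b[j]) ? ++i : ++j;
  }
  return false;
}

// Greedy grouping: walk features one at a time and place each into the
// first group it does not conflict with. The one-at-a-time scan is what
// makes this hard to parallelize.
std::vector<std::vector<std::size_t>> GroupFeatures(
    const std::vector<RowSet>& features) {
  std::vector<std::vector<std::size_t>> groups;  // feature indices per group
  std::vector<RowSet> occupied;                  // union of nonzero rows per group
  for (std::size_t f = 0; f < features.size(); ++f) {
    bool placed = false;
    for (std::size_t g = 0; g < groups.size(); ++g) {
      if (!Conflicts(occupied[g], features[f])) {
        groups[g].push_back(f);
        RowSet merged;
        std::merge(occupied[g].begin(), occupied[g].end(),
                   features[f].begin(), features[f].end(),
                   std::back_inserter(merged));
        occupied[g] = std::move(merged);
        placed = true;
        break;
      }
    }
    if (!placed) {
      groups.push_back({f});
      occupied.push_back(features[f]);
    }
  }
  return groups;
}
```

A parallel rewrite would have to either partition the features up front or tolerate approximate groupings, since two threads placing features concurrently would race on the `occupied` sets.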
@hcho3 Issue is probably here: https://github.com/hcho3/xgboost/blob/14a33f6cdf2b388f64293b0ac0718d5ab945b37f/src/common/column_matrix.h#L157-L189 or in a part where OpenMP is used on all cores (threads are probably sharing a common variable, leading to negative scalability).
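The shared-variable pattern hypothesized above can be illustrated with a minimal sketch (using `std::thread` rather than OpenMP for self-containment; `SumShared` and `SumPrivate` are assumed names, not xgboost code): when every thread does read-modify-write on one shared location, the threads serialize on that cache line and adding cores can make things slower.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Anti-pattern: all threads hammer one shared atomic. Correct result,
// but every increment contends on the same cache line.
std::int64_t SumShared(const std::vector<int>& data, int nthread) {
  std::atomic<std::int64_t> total{0};
  std::vector<std::thread> pool;
  for (int t = 0; t < nthread; ++t) {
    pool.emplace_back([&, t] {
      for (std::size_t i = t; i < data.size(); i += nthread)
        total += data[i];  // contended read-modify-write
    });
  }
  for (auto& th : pool) th.join();
  return total;
}

// Fix: each thread accumulates into a private local, and the partial
// sums are combined once at the end. Nothing is shared inside the loop.
std::int64_t SumPrivate(const std::vector<int>& data, int nthread) {
  std::vector<std::int64_t> partial(nthread, 0);
  std::vector<std::thread> pool;
  for (int t = 0; t < nthread; ++t) {
    pool.emplace_back([&, t] {
      std::int64_t local = 0;  // thread-private, no contention
      for (std::size_t i = t; i < data.size(); i += nthread) local += data[i];
      partial[t] = local;
    });
  }
  for (auto& th : pool) th.join();
  return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

Both return the same value; only the second scales with the thread count.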
It leads to the creation of a very large RDS file which contains the dataset (see #2326 (comment) for the training script).
However, when it took 8 min 30 s on Bosch I had the same issue as 2., i.e., low CPU usage. The very slow part was when xgboost was trying to use all available cores.
Log for my laptop, i7-4600U:
> model <- xgb.train(params = list(nthread = 4,
+ max_depth = 6,
+ num_leaves = 63,
+ tree_method = "hist",
+ grow_policy = "depthwise",
+ eta = 0.05,
+ max_bin = 255,
+ eval_metric = "auc",
+ debug_verbose = 2,
+ enable_feature_grouping = 1),
+ data = train,
+ nrounds = 1,
+ watchlist = list(test = test),
+ verbose = 2,
+ early_stopping_rounds = 50)
[21:05:28] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[21:05:29] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[21:05:33] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 969)...
[21:05:33] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[21:05:37] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[21:05:49] amalgamation/../src/tree/updater_fast_hist.cc:82: Grouping features together...
[21:08:17] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 167.956 sec
[21:08:24] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 82 extra nodes, 0 pruned nodes, max_depth=6
[21:08:24] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0625 ( 0.83%)
InitNewNode: 0.0000 ( 0.00%)
BuildHist: 7.2841 (96.83%)
EvaluateSplit: 0.0499 ( 0.66%)
ApplySplit: 0.1260 ( 1.67%)
========================================
Total: 7.5224
[21:08:29] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 4.39619 sec
[21:08:29] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.0469153 sec
[1] test-auc:0.606967
Will train until test_auc hasn't improved in 50 rounds.
@hcho3 Parallelizing this part would be useful. I've also noticed that during the initialization, lots of time was spent with just a single thread working.
I had accidentally closed the pull request. Sorry for any confusion caused.
@khotilov I was originally planning to postpone the rewrite until after this pull request, but I changed my mind. Let me go ahead and fix the feature grouping logic.
@Laurae2 Thanks! I will promptly investigate the issue and get back to you.
@hcho3 You can reproduce the issue by doing the following on the Bosch dataset (or any other large dataset) on any machine, even one without 40 threads:
model <- xgb.train(params = list(nthread = 40,
max_depth = 6,
num_leaves = 63,
tree_method = "hist",
grow_policy = "depthwise",
eta = 0.05,
max_bin = 255,
eval_metric = "auc",
debug_verbose = 2,
enable_feature_grouping = 1),
data = train,
nrounds = 2,
watchlist = list(test = test),
verbose = 2,
                   early_stopping_rounds = 50)
Log of the i7-7700K, which is roughly 2.5x faster per thread (3 min vs 8 min here) than my 72-core server:
[12:52:52] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[12:52:52] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[12:52:55] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 969)...
[12:52:55] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[12:52:56] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[12:56:04] amalgamation/../src/tree/updater_fast_hist.cc:82: Grouping features together...
[12:56:18] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 205.547 sec
[12:56:19] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 90 extra nodes, 0 pruned nodes, max_depth=6
[12:56:19] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0029 ( 0.24%)
InitNewNode: 0.0624 ( 5.17%)
BuildHist: 1.1251 (93.29%)
EvaluateSplit: 0.0157 ( 1.30%)
ApplySplit: 0.0000 ( 0.00%)
========================================
Total: 1.2060
[12:56:19] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 0.0781817 sec
[12:56:19] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.0336745 sec
[1] test-auc:0.606966
Will train until test_auc hasn't improved in 50 rounds.
[12:56:20] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 86 extra nodes, 0 pruned nodes, max_depth=6
[12:56:20] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0000 ( 0.00%)
InitNewNode: 0.0000 ( 0.00%)
BuildHist: 0.7504 (85.57%)
EvaluateSplit: 0.0953 (10.86%)
ApplySplit: 0.0313 ( 3.57%)
========================================
Total: 0.8770
[12:56:20] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 0.0135319 sec
[12:56:20] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.0468612 sec
[2] test-auc:0.607018
@Laurae2 Sorry for the delay. I've been spending most of my time working on the first release of dmlc/tree-lite. Let me look at it this week for sure. Thanks!
Updating to dmlc/xgboost#2543 (02/08/2017)
@hcho3 I changed this (https://github.com/Laurae2/ez_xgb/blob/devel/src/common/column_matrix.h#L157) and it's way faster now. Only feature grouping is still causing me issues (crashes).
Old:
[01:47:42] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[01:47:42] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[01:47:44] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 969)...
[01:47:45] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[01:47:45] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[01:55:08] amalgamation/../src/tree/updater_fast_hist.cc:82: Grouping features together...
[01:56:07] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 504.929 sec
[01:56:08] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 82 extra nodes, 0 pruned nodes, max_depth=6
[01:56:08] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0156 ( 1.70%)
InitNewNode: 0.0157 ( 1.70%)
BuildHist: 0.7186 (77.95%)
EvaluateSplit: 0.1563 (16.96%)
ApplySplit: 0.0157 ( 1.70%)
========================================
Total: 0.9219
[01:56:08] amalgamation/../src/gbm/gbtree.cc:274: CommitModel(): 0.0940418 sec
[01:56:08] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.0105848 sec
[1] test-auc:0.606967
New:
[11:10:58] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[11:10:58] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[11:11:00] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 969)...
[11:11:01] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[11:11:01] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[11:11:09] amalgamation/../src/tree/updater_fast_hist.cc:82: Grouping features together...
[11:12:07] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 69.2763 sec
[11:12:08] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 82 extra nodes, 0 pruned nodes, max_depth=6
[11:12:08] amalgamation/../src/tree/updater_fast_hist.cc:273:
InitData: 0.0157 ( 1.71%)
InitNewNode: 0.0000 ( 0.00%)
BuildHist: 0.6918 (75.03%)
EvaluateSplit: 0.1832 (19.87%)
ApplySplit: 0.0312 ( 3.39%)
========================================
Total: 0.9220
[11:12:09] amalgamation/../src/gbm/gbtree.cc:205: CommitModel(): 0.0937798 sec
[11:12:09] amalgamation/../src/learner.cc:373: EvalOneIter(): 0.0104179 sec
[1] test-auc:0.606967
It improved the columnar access structure generation, and it is now approximately the same speed as LightGBM (also clocked at approximately 4 minutes) on my custom reputation dataset.
@Laurae2 With some difficulty, I've managed to reproduce the crash. The crash is really due to run-away memory usage. Lines 157-189 of column_matrix.h are not related to the crash; I tried to reproduce negative scaling in that section but could not.
I don't have access to my 1TB server, but I have some extra results below.
i7-7700, 1 thread, baremetal server:
[12:39:49] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[12:39:50] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[12:40:29] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[12:41:26] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[12:42:48] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[12:43:51] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 241.215 sec
i7-7700, 8 threads, baremetal server:
[12:31:22] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[12:31:22] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[12:31:29] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[12:32:33] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[12:32:57] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[12:33:28] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 126.432 sec
20 cores Ivy Bridge, 1 thread, virtualized:
[13:05:39] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[13:05:42] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[13:07:19] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[13:08:55] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[13:10:59] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[13:13:09] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 447.5 sec
20 cores Ivy Bridge, 10 threads, virtualized:
[13:27:20] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[13:27:20] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[13:27:28] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[13:29:06] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[13:29:23] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[13:33:30] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 370.016 sec
20 cores Ivy Bridge, 20 threads, virtualized:
[13:14:48] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[13:14:49] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[13:14:57] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[13:16:34] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[13:16:48] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[13:25:36] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 647.326 sec
20 cores Ivy Bridge, 40 threads, virtualized:
[12:43:17] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[12:43:18] amalgamation/../src/common/hist_util.cc:37: Generating sketches...
[12:43:25] amalgamation/../src/common/hist_util.cc:75: Computing quantiles for features [0, 23636)...
[12:45:07] amalgamation/../src/tree/updater_fast_hist.cc:70: Quantizing data matrix entries into quantile indices...
[12:45:15] amalgamation/../src/tree/updater_fast_hist.cc:75: Generating columnar access structure...
[13:03:55] amalgamation/../src/tree/updater_fast_hist.cc:92: Done initializing training: 1237.75 sec
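A back-of-the-envelope reading of the i7-7700 numbers above (an editorial illustration, not part of the original discussion): inverting Amdahl's law gives an estimate of what fraction of the setup actually runs in parallel. With speedup S measured at n threads, S = 1 / ((1 - p) + p/n), so p = (1 - 1/S) / (1 - 1/n).

```cpp
// Estimate the parallel fraction p of a workload from Amdahl's law,
// given the observed speedup at nthread threads:
//   S = 1 / ((1 - p) + p / n)   =>   p = (1 - 1/S) / (1 - 1/n)
double ParallelFraction(double speedup, int nthread) {
  return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / nthread);
}
```

Plugging in 241.215 s (1 thread) vs 126.432 s (8 threads) gives a speedup of about 1.91, i.e. only roughly 54% of the setup is parallel, which is consistent with the single-thread bottleneck reported earlier in the thread. (The virtualized Ivy Bridge runs scale negatively, so this model does not apply there at all.)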
With a larger instance, I've reproduced negative scaling:
(I've wrapped the lines 157-189 in a …) But wow, negative scaling is really bad on your Ivy Bridge virtualized machine. I will try to modify this loop to eliminate all data sharing, but if it doesn't work out, I'd be happy with making this sequential.
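One way to "eliminate all data sharing" in a loop like the columnar access generation is to give each thread a disjoint slice of the output, so nothing is ever written by two threads. A hedged sketch (assumed names `FillColumns`, `rows`, `columns`; not xgboost's actual code):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Build a column-major copy of a rectangular row-major matrix.
// Each thread owns a disjoint range of output columns, so the threads
// share no mutable state and need no synchronization beyond join().
void FillColumns(const std::vector<std::vector<int>>& rows,
                 std::vector<std::vector<int>>* columns, int nthread) {
  const std::size_t ncol = columns->size();
  std::vector<std::thread> pool;
  for (int t = 0; t < nthread; ++t) {
    pool.emplace_back([&, t] {
      // Thread t fills columns [lo, hi): disjoint from every other thread.
      const std::size_t lo = ncol * t / nthread;
      const std::size_t hi = ncol * (t + 1) / nthread;
      for (std::size_t c = lo; c < hi; ++c)
        for (std::size_t r = 0; r < rows.size(); ++r)
          (*columns)[c].push_back(rows[r][c]);
    });
  }
  for (auto& th : pool) th.join();
}
```

The trade-off: column ranges may have very different costs on sparse data, so static slicing can leave some threads idle, which is one reason a sequential fallback can still be the pragmatic choice.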
Closing due to stale PR, @hcho3.
@tqchen Sorry I haven't gotten around to working on it for a while. I will send a new PR when it's ready.
This is a response to issue #2326.
Set the new option (use_columnar_access=0) to disable the columnar access structure entirely, to further reduce initial setup time and memory usage. Column access may slow down with this option.