Optimized BuildHist function #5156

Merged
merged 1 commit into from
Jan 30, 2020

SmirnovEgorRu
Contributor

@SmirnovEgorRu SmirnovEgorRu commented Dec 24, 2019

Optimizations for histogram building, as part of issue #5104.
This PR contains the changes from #5138; it will be rebased after #5138 is merged.

@SmirnovEgorRu
Contributor Author

Current performance, also including the changes from #5138:

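| higgs1m | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction | Total, sec |
| --- | --- | --- | --- | --- | --- | --- |
| Master | 33 | 29 | 90 | 26 | 3 | 185 |
| Before reverting | 3.7 | 3.5 | 6.2 | 0.0 | 1.6 | 17.7 |
| This PR | 27.7 | 1.7 | 9.5 | 2.1 | 1.6 | 47.3 |

| airline-ohe | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction | Total, sec |
| --- | --- | --- | --- | --- | --- | --- |
| Master | 26 | 27 | 67 | 12 | 2 | 157 |
| Before reverting | 9.0 | 6.1 | 28.8 | 0.0 | 0.7 | 63.7 |
| This PR | 21.4 | 2.9 | 42.2 | 1.0 | 0.7 | 93.5 |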

@SmirnovEgorRu SmirnovEgorRu changed the title from "[WIP] Optimize BuildHist function" to "Optimized BuildHist function" Dec 31, 2019
@SmirnovEgorRu
Contributor Author

@hcho3, @trivialfis, I finalized the PR from my side. Could you please take a look?
It contains the changes from #5138; I will rebase after that one is merged into master.

@SmirnovEgorRu SmirnovEgorRu force-pushed the opt_hist_2 branch 2 times, most recently from b8b7c67 to 48da1df Compare January 8, 2020 00:59
@hcho3 hcho3 self-requested a review January 8, 2020 04:02
@hcho3
Collaborator

hcho3 commented Jan 8, 2020

One distributed test is stuck: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-5156/8/pipeline/112. I had to kill it by hand. I'm looking at the code now to see what went wrong; probably a worker is not calling AllReduce().

@SmirnovEgorRu SmirnovEgorRu force-pushed the opt_hist_2 branch 2 times, most recently from e21888b to 8b7acd6 Compare January 9, 2020 22:41
@SmirnovEgorRu
Contributor Author

@hcho3, I have fixed the issue.
I also tested performance in distributed mode: for small data sets it is similar to before, and for large ones a gain is observed. That is reasonable, because the number of "AllReduce" calls is the same.

CC @trivialfis

Collaborator

@hcho3 hcho3 left a comment

LGTM. Also thanks for adding tests.

@hcho3
Collaborator

hcho3 commented Jan 16, 2020

Reminder to myself: Write a follow-up PR so that we can control how many threads common::ParallelFor2d will launch.

@SmirnovEgorRu
Contributor Author

@hcho3, I addressed your comments and also added an nthreads parameter for common::ParallelFor2d. CI is green.
@trivialfis, could you please review the PR?

@trivialfis
Member

Yup. Will review tonight. Sorry for the long wait.

Member

@trivialfis trivialfis left a comment

Huge thanks for the effort! Overall looks good to me. I will run some benchmarks tomorrow for memory usage, distributed environment etc. Will merge if no regression is found.

@trivialfis
Member

@hcho3

> Reminder to myself: Write a follow-up PR so that we can control how many threads common::ParallelFor2d will launch.

I'm a little bit concerned that it's possible for a user to change nthread during training.

@hcho3
Collaborator

hcho3 commented Jan 17, 2020

@trivialfis I think nthread is a configurable parameter. Once it's set, I don't think it would change in the middle of training. See

xgboost/src/learner.cc

Lines 208 to 210 in e526871

    if (generic_parameters_.nthread != 0) {
      omp_set_num_threads(generic_parameters_.nthread);
    }

@trivialfis
Member

That's reassuring.

@trivialfis
Member

trivialfis commented Jan 17, 2020

@SmirnovEgorRu Could you please take a look at this dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/url/url_svmlight.tar.gz ? It's extremely sparse, and my benchmark shows a regression in both training time and memory usage:

Before:

System  : 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0]
Xgboost : 1.0.0-SNAPSHOT
LightGBM: None
CatBoost: None
#jobs   : 6
Running 'xgb-cpu' ...
{
  "url": {
    "xgb-cpu": {
      "accuracy": {
        "AUC": 0.9920843803290437,
        "Accuracy": 0.9776493762859277,
        "Log_Loss": 0.4229395458885813,
        "Precision": 0.9781071399946979,
        "Recall": 0.9887536809959334
      },
      "train_time": 172.10726167000007
    }
  }
}
Results written to file 'url.json'
Child terminated but following descendants are still running: 4483

Could not remove sub-cgroup /sys/fs/cgroup/memory/cgmemtime/4462: Device or resource busy
Child user:  400.110 s
Child sys :   70.966 s
Child wall:  229.677 s
Child high-water RSS                    :   34821560 KiB
Recursive and acc. high-water RSS+CACHE :   38068864 KiB

After:

System  : 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0]
Xgboost : 1.0.0-SNAPSHOT
LightGBM: None
CatBoost: None
#jobs   : 6
Running 'xgb-cpu' ...
{
  "url": {
    "xgb-cpu": {
      "accuracy": {
        "AUC": 0.9920843803290437,
        "Accuracy": 0.9776493762859277,
        "Log_Loss": 0.4229395458885813,
        "Precision": 0.9781071399946979,
        "Recall": 0.9887536809959334
      },
      "train_time": 203.096213423
    }
  }
}
Results written to file 'url.json'
Child terminated but following descendants are still running: 3351

Could not remove sub-cgroup /sys/fs/cgroup/memory/cgmemtime/3330: Device or resource busy
Child user:  408.533 s
Child sys :   82.425 s
Child wall:  263.838 s
Child high-water RSS                    :   46879656 KiB
Recursive and acc. high-water RSS+CACHE :   46983308 KiB
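
To put the two runs side by side, the relative change can be worked out directly from the figures above. This is only an arithmetic sketch: the helper `PercentChange` and the constant names are hypothetical, and the values are copied from the `train_time` and "Child high-water RSS" lines of the two logs.

```cpp
// Relative change of `after` with respect to `before`, in percent.
// Hypothetical helper, used only to summarize the two logs above.
inline double PercentChange(double before, double after) {
  return (after - before) / before * 100.0;
}

// Values copied from the "Before" and "After" runs above.
constexpr double kTrainTimeBeforeSec = 172.10726167000007;
constexpr double kTrainTimeAfterSec = 203.096213423;
constexpr double kPeakRssBeforeKiB = 34821560.0;
constexpr double kPeakRssAfterKiB = 46879656.0;
```

With these inputs, `PercentChange` gives roughly +18% for training time and roughly +35% for peak RSS, while the accuracy metrics are identical in both runs.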

@trivialfis
Member

@SmirnovEgorRu BTW, the memory usage is measured by https://github.com/gsauthof/cgmemtime .

@SmirnovEgorRu
Contributor Author

@trivialfis @hcho3
Thank you for your tests. I added further changes to reduce the number of partial histograms. As a result, for the URL data set I obtained the following numbers:

| Branch | Time of Update, sec | Memory usage, KB |
| --- | --- | --- |
| master | 74.51 | 24638840 |
| this PR | 23.53 | 24109408 |

So, now it's better for both execution time and memory consumption.
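
For readers unfamiliar with the term, the sketch below illustrates what a "partial histogram" is assumed to mean in this thread: each worker accumulates gradient sums for its own slice of rows into a private buffer, and the buffers are summed afterwards. The types and function names (`GradPair`, `BuildPartial`, `ReducePartials`) are hypothetical simplifications, not XGBoost's actual data structures, and the workers are left to the caller rather than run with OpenMP.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified types; XGBoost's real structures differ.
struct GradPair {
  double grad;
  double hess;
};
using Hist = std::vector<GradPair>;

// Accumulate rows [begin, end) into one private (partial) histogram.
// bin_of_row[i] is the histogram bin that row i falls into.
inline void BuildPartial(const std::vector<std::size_t>& bin_of_row,
                         const std::vector<GradPair>& gpair,
                         std::size_t begin, std::size_t end, Hist* partial) {
  for (std::size_t i = begin; i < end; ++i) {
    (*partial)[bin_of_row[i]].grad += gpair[i].grad;
    (*partial)[bin_of_row[i]].hess += gpair[i].hess;
  }
}

// Sum all partial histograms into the final histogram.
inline Hist ReducePartials(const std::vector<Hist>& partials, std::size_t n_bins) {
  Hist total(n_bins);  // value-initialized, so all sums start at zero
  for (const Hist& p : partials) {
    for (std::size_t b = 0; b < n_bins; ++b) {
      total[b].grad += p[b].grad;
      total[b].hess += p[b].hess;
    }
  }
  return total;
}
```

In a layout like this, memory grows with the number of partial buffers alive at once, so allocating fewer of them would be expected to lower peak memory.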

@SmirnovEgorRu
Contributor Author

@hcho3, @trivialfis, CI is also green. Do you see any blockers to merging the pull request?

@SmirnovEgorRu
Contributor Author

@hcho3 @trivialfis, I have already created a new PR, #5244, which should finalize the effort to restore the reverted optimizations. Could you please approve the current PR or provide new comments, so that review of the next one can begin?

Collaborator

@hcho3 hcho3 left a comment

LGTM. Also thanks for splitting AddHistRows from BuildLocalHistograms.

Comment on lines +115 to +123
{
  size_t tid = omp_get_thread_num();
  size_t chunck_size = num_blocks_in_space / nthreads + !!(num_blocks_in_space % nthreads);

  size_t begin = chunck_size * tid;
  size_t end = std::min(begin + chunck_size, num_blocks_in_space);
  for (auto i = begin; i < end; i++) {
    func(space.GetFirstDimension(i), space.GetRange(i));
  }
Collaborator

Why are we manually splitting the loop range here? Is it because Visual Studio doesn't support size_t for the loop variable?

Contributor Author

Only because I need to know which tasks are executed on which thread, so that I can allocate the minimum possible number of histograms (this now helps to achieve even lower memory consumption on the URL data set).
As far as I know, this mapping is not exactly defined in the OMP standard for "#pragma omp parallel for schedule(static)" and can differ between OMP implementations. So I implemented it explicitly with "#pragma omp parallel" to get the same behavior on each platform.
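
The explicit split described here can be sketched as a small pure function that returns the block range a given thread would own. This is an illustration under stated assumptions, not the PR's code: `ChunkForThread` is a hypothetical name, it assumes `nthreads > 0`, and unlike the snippet under review it also clamps `begin` so that trailing threads get an explicitly empty range.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of XGBoost): returns the half-open range
// [begin, end) of block indices assigned to thread `tid` when `num_blocks`
// blocks are split into contiguous chunks across `nthreads` threads.
// Assumes nthreads > 0.
inline std::pair<std::size_t, std::size_t> ChunkForThread(std::size_t num_blocks,
                                                          std::size_t nthreads,
                                                          std::size_t tid) {
  // Ceiling division: each thread owns at most `chunk_size` blocks.
  const std::size_t chunk_size =
      num_blocks / nthreads + (num_blocks % nthreads != 0 ? 1 : 0);
  // Clamp both ends so that threads past the last block get an empty range.
  const std::size_t begin = std::min(chunk_size * tid, num_blocks);
  const std::size_t end = std::min(begin + chunk_size, num_blocks);
  return {begin, end};
}
```

Because the mapping depends only on `tid`, `nthreads`, and the block count, it can be evaluated before the parallel region starts, which is what makes it possible to decide up front which thread needs a histogram for which node.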

    hist_allocated_additionally++;
  }
  // map pair {tid, nid} to index of allocated histogram from hist_memory_
  tid_nid_to_hist_[{tid, nid}] = hist_total++;
Collaborator

I'd like to hear your reasoning: why did you choose std::map for tid_nid_to_hist_ but std::vector for threads_to_nids_map_? Is it due to memory efficiency?

Contributor Author

Agreed, both of them can be implemented with either std::vector or std::map.
There is no significant difference between them.

One reason I see to use std::map for tid_nid_to_hist_ instead of std::vector is this line:

const size_t idx = tid_nid_to_hist_.at({tid, nid});

If I had std::vector here, I would need to add a check, something like this:

CHECK_NE(idx, std::numeric_limits<size_t>::max());

I would also need to fill all elements of tid_nid_to_hist_ with std::numeric_limits<size_t>::max() initially.
With std::map, calling .at() on a missing key throws an exception without any additional lines of code.
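
The trade-off can be made concrete with a minimal sketch of both lookups. The names here (`TidNid`, `LookupInMap`, `LookupInVector`, `kNotAllocated`) and the dense `tid * n_nodes + nid` layout for the vector variant are assumptions for illustration only. The sketch throws `std::out_of_range` in the vector case so the two variants behave alike, where the comment above suggests `CHECK_NE` instead.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the two lookup strategies discussed above.
using TidNid = std::pair<std::size_t, std::size_t>;

// std::map variant: .at() itself throws std::out_of_range when no
// histogram was registered for {tid, nid}.
inline std::size_t LookupInMap(const std::map<TidNid, std::size_t>& table,
                               std::size_t tid, std::size_t nid) {
  return table.at({tid, nid});
}

// Sentinel marking "no histogram allocated" in the vector variant.
constexpr std::size_t kNotAllocated = std::numeric_limits<std::size_t>::max();

// std::vector variant: a dense table of nthreads * n_nodes entries that must
// be pre-filled with kNotAllocated and checked explicitly on every lookup.
inline std::size_t LookupInVector(const std::vector<std::size_t>& table,
                                  std::size_t n_nodes, std::size_t tid,
                                  std::size_t nid) {
  const std::size_t idx = table.at(tid * n_nodes + nid);
  if (idx == kNotAllocated) {
    throw std::out_of_range("no histogram allocated for {tid, nid}");
  }
  return idx;
}
```

The vector variant gives constant-time lookups at the cost of the extra initialization and check, while the map variant gets the missing-key check for free.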

@trivialfis
Member

Will look into this once we branch out 1.0. Thanks for the patience.

@SmirnovEgorRu
Contributor Author

@trivialfis, let me make sure I understand. Do we plan to include this change in version 1.0, as originally discussed with @hcho3 in #5008 (comment)?

@trivialfis
Member

trivialfis commented Jan 29, 2020

I don't plan to add major changes at the last minute. That's fine, as we used to have a release every 2 or 3 months, and I would like to resume that pace once this 1.0 thing is over. So your changes should be available shortly even if they are not in the next release. Besides, there's a nightly build available for download.

Will merge once we have a release branch.

@hcho3
Collaborator

hcho3 commented Jan 29, 2020

@trivialfis I agree that, in general, we should not merge a major change right before a release. However, since you and I have approved the original version of this PR, can we merge this? The new version is only a little different from the original version, and the difference is confined to a small part of the codebase.

#5244 will have to wait, on the other hand.

@trivialfis
Member

trivialfis commented Jan 29, 2020

Em .. Currently I'm on holiday and only have a laptop available, so I can't run any meaningful tests. If you are confident, then I will let you decide.

@SmirnovEgorRu
Contributor Author

@hcho3 @trivialfis, if you need some specific testing, such as running particular benchmarks or workloads to gain more confidence, I'm happy to help.

I would be grateful for the chance to have this in the next XGB release.

@trivialfis
Member

trivialfis commented Jan 29, 2020

@SmirnovEgorRu Basically memory usage, computation time, accuracy (auc, rmse metrics) for these representative datasets (like higgs for dense wide columns, url for sparse), with restricted number of threads, max_depth, num_boosted_rounds etc, and maybe set cpu affinity env for OMP manually (not necessary but sometimes fun to see the difference). I usually do this myself as I can have a consistent environment for each run. For example the number posted in:

#5156 (comment)

and

#5244 (comment)

seem to have been run in different environments or with different parameters.

@hcho3
Collaborator

hcho3 commented Jan 29, 2020

It would be great if you could run the Higgs and Airline datasets.

@hcho3
Collaborator

hcho3 commented Jan 29, 2020

@trivialfis I am confident about this PR. I’m inclined to merge, as long as @SmirnovEgorRu runs some more benchmarks as requested.

@trivialfis
Member

trivialfis commented Jan 29, 2020

@hcho3 Got it. My concerns are mostly about the consistency of the posted numbers across different PRs. As noted above, it would be nice to have benchmarks performed on the same platform with fixed parameters. The variance can be difficult to control.

@SmirnovEgorRu
Contributor Author

SmirnovEgorRu commented Jan 30, 2020

@hcho3 @trivialfis , I prepared measurements on Higgs and Airline.

Higgs:
1000 iterations, depth = 8.

| nthreads | 1 | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- | --- |
| This PR, sec | 280.9 | 87.4 | 48.4 | 44.8 | 45.3 |
| master, sec | 281.6 | 145.9 | 135.6 | 167.6 | 376.5 |

Log-loss is the same in all cases for this PR and master:
LogLoss for train data set = 0.404697
LogLoss for test data set = 0.525144

| niter | 50 | 200 | 500 | 1000 |
| --- | --- | --- | --- | --- |
| This PR, sec | 3.5 | 10.7 | 22.7 | 44.8 |
| master, sec | 13.7 | 41.4 | 91.8 | 167.6 |

(For this table, nthreads is fixed to 48)

Airline + one-hot-encoding:
1000 iterations, depth = 8.

| nthreads | 1 | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- | --- |
| This PR, sec | 953.7 | 211.4 | 105 | 94.65 | 87.09 |
| master, sec | 949.3 | 265.5 | 159.6 | 142.8 | 272.5 |

Log-loss is the same in all cases for this PR and master:
LogLoss for train data set = 0.383229
LogLoss for test data set = 0.461403

| niter | 50 | 200 | 500 | 1000 |
| --- | --- | --- | --- | --- |
| This PR, sec | 28.82 | 40.03 | 60.41 | 94.65 |
| master, sec | 35.65 | 57.2 | 89.37 | 142.8 |

(For this table, nthreads is fixed to 48)

HW: AWS c5.metal, CLX 8275 @3.0GHz, 24 cores per socket, 2 sockets, HT: on, 96 threads in total
Benchmark scripts are from: https://github.com/dmlc/xgboost-bench

P.S. @trivialfis, yes, you're right: I used different hardware for the URL measurements, because this machine was unavailable recently. I will try to measure this on the 8275 too.

@hcho3
Collaborator

hcho3 commented Jan 30, 2020

@SmirnovEgorRu For the niter table, what is the number of threads you used? And are all the numbers end-to-end time?

@SmirnovEgorRu
Contributor Author

@hcho3, for the niter table I used 48 threads (to use only the physical cores and not HT).
Yes, these numbers measure the whole xgb.train(...) call.

@hcho3
Collaborator

hcho3 commented Jan 30, 2020

And niter=1000 for the nthread table?

@SmirnovEgorRu
Contributor Author

@hcho3 , yes, I just used default parameters in the benchmarks.

@hcho3
Collaborator

hcho3 commented Jan 30, 2020

Thanks for the clarification.

@SmirnovEgorRu
Contributor Author

@hcho3 @trivialfis,

For the whole URL data set, on the same c5.metal AWS instance, I obtained:

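| nthreads | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- |
| This PR, sec | 60.7 | 41.4 | 43.8 | 51.2 |
| master, sec | 179.6 | 123.1 | 159.5 | 179.7 |

| nthreads | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- |
| This PR, GB | 18.835 | 20.156 | 22.232 | 26.282 |
| master, GB | 19.304 | 20.673 | 22.753 | 26.815 |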

I use the following line to fit URL:

    output = xgb.train(params={'max_depth': 6, 'verbosity': 3, 'tree_method': 'hist'},
                       dtrain=dtrain, num_boost_round=10)

The accuracy metrics are also the same.

I hope this data is what you requested. Is it right?

@hcho3
Collaborator

hcho3 commented Jan 30, 2020

@SmirnovEgorRu Yes, thanks for running the benchmarks.

@hcho3 hcho3 added the Blocking label Jan 30, 2020
@hcho3 hcho3 merged commit c671632 into dmlc:master Jan 30, 2020
@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020