support multiple batches in gpu_hist #5014

Merged: 30 commits merged into dmlc:master from the gpu-hist-batches branch on Nov 16, 2019

Conversation

@rongou (Contributor) commented Nov 5, 2019

In the gpu_hist tree method, support reading in multiple ELLPACK pages from disk. Part of #4357.

The main changes are in updater_gpu_hist.cu:

  • In InitRoot(), when building the initial histogram, we now loop through all the batches and accumulate the histograms.
  • Then, for each node, we loop through the batches again, first updating the positions of the rows in the batch, then building the left/right histograms within the batch. After that, we run AllReduce on the accumulated histograms (see the sketch after this list).
  • In FinalisePosition, we loop through the batches once more to update the positions of all the rows in the dataset.
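
For illustration, here is a condensed sketch of that batch-loop pattern, consistent with the BuildHistBatches method visible in one of the diffs below; the loop body beyond the page assignment is an assumption, with BuildHist and AllReduceHist standing in for the actual helpers in updater_gpu_hist.cu:

```cpp
// Sketch: build the gradient histogram for node `nidx` by streaming every
// ELLPACK page from disk and accumulating its contribution, then reducing
// the partial histograms across workers. Helper names are stand-ins.
void BuildHistBatches(int nidx, DMatrix* p_fmat) {
  for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
    page = batch.Impl();  // point the device shard at the current page
    BuildHist(nidx);      // accumulate this page's rows into the histogram
  }
  AllReduceHist(nidx);    // combine partial histograms across workers
}
```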

I found a problem with writing out the ELLPACK pages: the compressed buffers can't really be concatenated. So we now accumulate the CSR pages in memory first, before compressing and writing them to disk (see the sketch below).
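
A rough sketch of that write path, purely for illustration (WriteEllpackPages and CompressAndWrite are hypothetical names, not the actual implementation):

```cpp
// Hypothetical sketch of "accumulate CSR rows first, compress once at the end".
void WriteEllpackPages(DMatrix* dmat, SparsePageWriter* writer) {
  SparsePage accumulated;  // uncompressed CSR rows buffered in host memory
  for (auto& batch : dmat->GetBatches<SparsePage>()) {
    accumulated.Push(batch);  // concatenating CSR pages is straightforward
  }
  // Compressed ELLPACK buffers cannot simply be concatenated, so compression
  // happens only after all rows for the page have been gathered.
  CompressAndWrite(accumulated, writer);  // hypothetical helper
}
```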

A few things are still O(n) in the dataset size, such as the prediction output and the gradient pair list. We could potentially stream these from main memory as well, probably in a follow-up PR.

So far I've focused on correctness and haven't really looked at performance; I'll try to run some benchmarks.

@RAMitchell @trivialfis @sriramch

@RAMitchell (Member) left a comment

Looks simpler than I expected. I think the way we use RowPartitioner can be improved; otherwise it looks pretty good.

```diff
 int gidx =
     matrix.gidx_iter[ridx * matrix.info.row_stride + idx % matrix.info.row_stride];
 if (gidx != matrix.info.n_bins) {
   // If we are not using shared memory, accumulate the values directly into
   // global memory
   GradientSumT* atomic_add_ptr =
       use_shared_memory_histograms ? smem_arr : d_node_hist;
-  dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx]);
+  dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx + base_rowid]);
```
Member:

I think this can be tidied up a little. base_rowid can be a member of EllpackMatrix. EllpackMatrix could have a method that returns the row index of each of its elements.
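
A hypothetical sketch of that suggestion, reusing the matrix.info and matrix.gidx_iter names from the snippet above (the RowIndex method is an assumption, not the code actually merged):

```cpp
// Sketch: EllpackMatrix owns base_rowid and maps a flat element index back
// to an absolute row index, assuming a row-major ELLPACK layout.
struct EllpackMatrix {
  EllpackInfo info;
  common::CompressedIterator<uint32_t> gidx_iter;
  size_t base_rowid{0};  // first row id covered by this page

  __device__ size_t RowIndex(size_t idx) const {
    return base_rowid + idx / info.row_stride;
  }
};
```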

@rongou (Author):

Done.

```cpp
page = batch.Impl();
if (page->n_rows != row_partitioner->GetRows().size()) {
  row_partitioner.reset();  // Release the device memory first before reallocating
  row_partitioner.reset(new RowPartitioner(device_id, page->n_rows));
```
Member:

I am wondering whether it is better for us to keep a row_partitioner for each batch instead of resetting one object; memory usage will increase somewhat. Alternatively, what is stopping us from creating the RowPartitioner object on the stack whenever we need it?

@rongou (Author):

I changed the row_partitioner back to using absolute row IDs. PTAL.

@RAMitchell (Member):

Also, given the significant changes to our core algorithm, I think you should run https://github.com/NVIDIA/gbm-bench to check for performance regressions. You could also benchmark your external memory changes there with some modification. It might take some time, but it's worth becoming familiar with this benchmark suite.

@hcho3 (Collaborator) commented Nov 6, 2019

@RAMitchell Thanks for the link. Can I use it to set up a regular benchmark for performance/accuracy regressions?

@RAMitchell (Member) commented Nov 6, 2019

@hcho3 yes, if we have the budget to run it! I've been running it on an 8-GPU machine, but 2 or 4 is fine. It downloads and caches the datasets, so I would suggest using persistent storage for that part.

@RAMitchell (Member):

@rongou I have approved this on the basis of the code, but please do benchmark before we merge to test for performance regression.

@hcho commented Nov 7, 2019 via email

@trivialfis (Member) left a comment

Can we have a high-level test in Python that verifies external memory produces exactly the same result as normal operation?

```diff
@@ -166,6 +166,15 @@ struct BatchParam {
   int max_bin;
   /*! \brief Number of rows in a GPU batch, used for finding quantiles on GPU. */
   int gpu_batch_nrows;
+  /*! \brief Page size for external memory mode. */
+  size_t gpu_page_size;
```
Member:

Do we expose this to users?

@rongou (Author):

Yes. See below.

```diff
@@ -227,15 +228,15 @@ std::unique_ptr<DMatrix> CreateSparsePageDMatrixWithRC(
   size_t j = 0;
   if (rem_cols > 0) {
     for (; j < std::min(static_cast<size_t>(rem_cols), cols_per_row); ++j) {
-      row_data << " " << (col_idx+j) << ":" << (col_idx+j+1)*10;
+      row_data << label(*gen) << " " << (col_idx+j) << ":" << (col_idx+j+1)*10*i;
```
Member:

We implemented a dummy generator in the test helpers, as the generators in std are not guaranteed to be reproducible across different compilers and platforms.

@rongou (Author):

I don't think we need to be reproducible across compilers/platforms; we just need to be deterministic across multiple runs. Anyway, this is @sriramch's code. Comments?

@sriramch (Contributor):

That is correct. We just need the feature IDs and values to be consistent when invoked multiple times for the same row/column configuration, when the deterministic flag is passed.
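
For illustration, a generator with those properties can be built from a mixing hash, so the value for a given (row, column) cell depends only on the cell itself and not on any std:: distribution; this sketch is hypothetical, not the helper actually used in the tests:

```cpp
#include <cstdint>

// Hypothetical deterministic generator: the output depends only on
// (row, col), so it is identical across compilers, platforms, and runs.
inline uint64_t MixHash(uint64_t row, uint64_t col) {
  uint64_t x = row * 0x9E3779B97F4A7C15ULL + col;
  x ^= x >> 33;
  x *= 0xFF51AFD7ED558CCDULL;
  x ^= x >> 33;
  return x;
}

// Feature value in [0, 1) for a given cell.
inline double DeterministicValue(uint64_t row, uint64_t col) {
  return static_cast<double>(MixHash(row, col)) /
         static_cast<double>(UINT64_MAX);
}
```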

```diff
@@ -21,6 +21,8 @@ struct GenericParameter : public XGBoostParameter<GenericParameter> {
   int nthread;
   // primary device, -1 means no gpu.
   int gpu_id;
+  // gpu page size in external memory mode, 0 means using the default.
+  size_t gpu_page_size;
```
Member:

Parameters are configurable by users; please don't define this one twice. Make one of them a normal variable, and if we don't want it to be user-configurable, don't use a parameter at all. Note that pickling might lose some of this information, and Dask uses pickle to move the booster around between workers.

@rongou (Author):

It's only defined as a configurable parameter once here; the other one is really just plumbing. For now this is mostly used for testing, but perhaps users may want to set it depending on how much GPU memory they have.
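
For context, this is the usual shape of a single DMLC parameter declaration (a sketch only; the exact default, bounds, and description text in the PR are assumptions):

```cpp
// Sketch of declaring gpu_page_size once as a DMLC parameter field.
DMLC_DECLARE_PARAMETER(GenericParameter) {
  DMLC_DECLARE_FIELD(gpu_page_size)
      .set_default(0)
      .set_lower_bound(0)
      .describe("GPU page size in external memory mode; 0 means the default.");
}
```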

Member:

Hmm, we want to do parameter validation, like detecting unused parameters, and this may add some extra difficulty. Do you think it's possible to make it a DMatrix parameter instead of a global one? Maybe in another PR? Sorry for nitpicking here.

Member:

Agreed, we need to be careful adding global parameters due to the upcoming work on serialisation. Unless you see a strong motivation for users tuning this, let's leave it out for now.

@rongou (Author):

I'm not quite sure whether this is useful for end users. Is there a way to make a parameter hidden/internal? It's really useful for the tests, since we then don't have to build a dataset bigger than 32 MB.

@rongou (Contributor) commented Nov 7, 2019

@trivialfis I think the GpuHist.ExternalMemory C++ test is close to what you want, right?

@RAMitchell I'm running the benchmarks now and will report back with the numbers.

@rongou (Contributor) commented Nov 9, 2019

@RAMitchell I ran the benchmarks 3 times on my desktop with a Titan V. Here are the average training times:

| Dataset | master | this PR |
|---------|--------|---------|
| airline | 158.8517289 | 159.8197227 |
| bosch | 18.8830594 | 19.95914053 |
| covtype | 54.04702564 | 54.0232536 |
| epsilon | 712.1971298 | 682.5995135 |
| fraud | 2.265036382 | 2.316629445 |
| higgs | 56.99770937 | 56.53523625 |
| year | 9.098267321 | 9.279944749 |

Most are about the same, but epsilon is noticeably faster for some reason. I'm running the same benchmarks on a GCP VM with a V100 and will report back once they're done.

@RAMitchell (Member):

LGTM

@rongou (Contributor) commented Nov 11, 2019

@RAMitchell here are the 3-run average training times on the V100:

| Dataset | master | this PR |
|---------|--------|---------|
| airline | 137.6487908 | 138.3818154 |
| bosch | 15.30501916 | 14.95584854 |
| covtype | 54.80562482 | 57.31219463 |
| epsilon | 307.9471577 | 286.3518352 |
| fraud | 2.122671061 | 2.145682581 |
| higgs | 50.20506154 | 50.47706387 |
| year | 9.318194856 | 9.318194856 |

Again there is a noticeable speedup for epsilon.

@codecov-io commented Nov 11, 2019

Codecov Report

Merging #5014 into master will not change coverage. The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master    #5014   +/-   ##
=======================================
  Coverage   71.52%   71.52%
=======================================
  Files          11       11
  Lines        2311     2311
=======================================
  Hits         1653     1653
  Misses        658      658
```

Last update 97abcc7...a6bf0dd.

```cpp
this->UpdatePosition(candidate.nid, (*p_tree)[candidate.nid]);
monitor.StopCuda("UpdatePosition");
if (ExpandEntry::ChildIsValid(param, tree.GetDepth(left_child_nidx), num_leaves)) {
  for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
```
Contributor:

Wouldn't iterating over all the batches while building the root node, again for each node in the tree, and again when finalizing positions in every iteration significantly add to the training time when external memory mode is enabled? It would be interesting to see the difference between this version and the one in master on one of the representative datasets with external memory enabled.

@rongou (Contributor) commented Nov 12, 2019

@RAMitchell @trivialfis here are the 3-run average metrics on the Titan V (the V100 is similar). Note that all the numbers I've reported so far are for the in-memory mode; I will report external memory performance once I have the numbers.

| Dataset | Metric | master | this PR | diff |
|---------|--------|--------|---------|------|
| airline | AUC | 0.843446422 | 0.8437542634 | 0.0003078414077 |
| | Accuracy | 0.7183621187 | 0.7187676434 | 0.0004055247306 |
| | Log_Loss | 0.5295130551 | 0.5290879854 | -0.0004250697381 |
| | Precision | 0.6538399493 | 0.654240727 | 0.0004007776586 |
| | Recall | 0.8639324626 | 0.8640286929 | 0.00009623029934 |
| bosch | AUC | 0.6910459197 | 0.6910459197 | 0 |
| | Accuracy | 0.9585385428 | 0.9585385428 | 0 |
| | Log_Loss | 0.2393587625 | 0.2393587625 | 0 |
| | Precision | 0.04381863651 | 0.04381863651 | 0 |
| | Recall | 0.2828070175 | 0.2828070175 | 0 |
| covtype | Accuracy | 0.9391495917 | 0.9391495917 | 0 |
| | F1 | 0.9390102704 | 0.9390102704 | 0 |
| | Precision | 0.9392223225 | 0.9392223225 | 0 |
| | Recall | 0.9391495917 | 0.9391495917 | 0 |
| epsilon | AUC | 0.947596176 | 0.947596176 | 0 |
| | Accuracy | 0.87099 | 0.87099 | 0 |
| | Log_Loss | 0.3012408595 | 0.3012408595 | 0 |
| | Precision | 0.8443296287 | 0.8443296287 | 0 |
| | Recall | 0.9094184766 | 0.9094184766 | 0 |
| fraud | AUC | 0.9618900951 | 0.9618900951 | 0 |
| | Accuracy | 0.9995611109 | 0.9995611109 | 0 |
| | Log_Loss | 0.003617382829 | 0.003617382829 | 0 |
| | Precision | 0.950617284 | 0.950617284 | 0 |
| | Recall | 0.7857142857 | 0.7857142857 | 0 |
| higgs | AUC | 0.8396768819 | 0.8394438219 | -0.0002330599985 |
| | Accuracy | 0.7338934848 | 0.7335419697 | -0.0003515151515 |
| | Log_Loss | 0.5221688581 | 0.5224705406 | 0.0003016824741 |
| | Precision | 0.6908124497 | 0.6905024365 | -0.0003100131171 |
| | Recall | 0.9014905058 | 0.9013507388 | -0.0001397670777 |
| year | MeanAbsError | 6.228921157 | 6.228921157 | 0 |
| | MeanSquaredError | 79.77385217 | 79.77385217 | 0 |
| | MedianAbsError | 4.267578125 | 4.267578125 | 0 |

@trivialfis (Member):

Restarted the test.

```diff
@@ -626,6 +631,14 @@ struct GPUHistMakerDevice {
     return std::vector<DeviceSplitCandidate>(result_all.begin(), result_all.end());
   }

+  // Build gradient histograms for a given node across all the batches in the DMatrix.
+  void BuildHistBatches(int nidx, DMatrix* p_fmat) {
+    for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
```
Contributor:

Wouldn't this result in pages being recycled aggressively while walking through the batches, with ELLPACK pages repeatedly created and destructed, leading to a large number of data copies to and from the device?

@rongou (Contributor) commented Nov 13, 2019

@trivialfis hmm, it looks like the accuracy in external memory mode is not quite as good as the in-core version's. I'll add a Python test to verify the model accuracy, as you suggested.

@sriramch (Contributor):

I think the sparse pages are going to be reused heavily during iteration (due to the use of the threaded iterator), so page state has to be reinitialized with every page read.

If pages are persisted on a spinning disk, then every iteration costs a page read from disk plus creation of the page state in system and/or device memory; the latter may involve moving ~32 MB of data per page.

Thus, for every node created in every tree in every iteration, this may turn out to be prohibitively expensive for a large dataset with very many pages.
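
A rough cost model of this concern, assuming every page must be re-read and its device state rebuilt on each pass:

$$\text{I/O per boosting iteration} \approx N_{\text{nodes}} \times N_{\text{pages}} \times S_{\text{page}}$$

where $N_{\text{nodes}}$ is the number of nodes expanded per tree, $N_{\text{pages}}$ is the number of ELLPACK pages, and $S_{\text{page}}$ is the page size (~32 MB per the discussion above), so the cost grows multiplicatively with tree size and dataset size.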

@rongou (Contributor) commented Nov 14, 2019

With a generated synthetic dataset (20 million rows × 100 columns) that doesn't fit in my Titan V, gpu_hist in external memory mode is about twice as fast as hist. But for now the results are not quite right, so take that with a grain of salt.

For this PR the main goal is to get correct results in external memory mode. There is probably plenty of low-hanging fruit for optimization.

@trivialfis (Member) commented Nov 14, 2019

@rongou

> But for now the results are not quite right,

Could you elaborate on that a bit more? What's not quite right? ;-)

> hmm, it looks like the accuracy in external memory mode is not quite as good as the in-core version's. I'll add a Python test to verify the model accuracy, as you suggested.

Ah, now I know. Sorry for the noise.

@trivialfis (Member):

@rongou Sorry for creating some conflicts. I merged master into your branch; please just force-push if the merge stands in your way.

@trivialfis (Member) left a comment

You fixed it. :-) I see the new tests are passing; is it ready to go? The code looks good to me. I'm not sure whether you want to keep the generator here, as it might be useful outside XGBoost.

@rongou (Contributor) commented Nov 15, 2019

@trivialfis yeah, I think this is good to go. I'll work on follow-up PRs for performance tuning.

@trivialfis merged commit 0afcc55 into dmlc:master on Nov 16, 2019
@trivialfis (Member):

That's quite an achievement; ELLPACK was simply a function before.

The lock bot locked this as resolved and limited the conversation to collaborators on Feb 14, 2020.
@rongou deleted the gpu-hist-batches branch on November 18, 2022.