support multiple batches in gpu_hist #5014

Merged: 30 commits merged into dmlc:master from the gpu-hist-batches branch on Nov 16, 2019

Conversation

@rongou (Contributor) commented Nov 5, 2019

In the gpu_hist tree method, support reading in multiple ELLPACK pages from disk. Part of #4357.

The main changes are in updater_gpu_hist.cu:

  • In InitRoot(), when building the initial histogram, we now loop through all the batches and accumulate the histograms.
  • Then, for each node, we loop through the batches again, first updating the positions of the rows in the batch, then building the left/right histograms within the batch. After that, we run AllReduce on the accumulated histograms (see the sketch after this list).
  • In FinalisePosition, we loop through the batches once more to update the positions of all the rows in the dataset.
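
For illustration, here is a condensed sketch of that batch-loop pattern, consistent with the BuildHistBatches method visible in one of the diffs below; the loop body beyond the page assignment is an assumption, with BuildHist and AllReduceHist standing in for the actual helpers in updater_gpu_hist.cu:

```cpp
// Sketch: build the gradient histogram for node `nidx` by streaming every
// ELLPACK page from disk and accumulating its contribution, then reducing
// the partial histograms across workers. Helper names are stand-ins.
void BuildHistBatches(int nidx, DMatrix* p_fmat) {
  for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
    page = batch.Impl();  // point the device shard at the current page
    BuildHist(nidx);      // accumulate this page's rows into the histogram
  }
  AllReduceHist(nidx);    // combine partial histograms across workers
}
```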

I found a problem with writing out the ELLPACK pages: the compressed buffers can't really be concatenated. So we now accumulate the CSR pages in memory first, before compressing and writing them to disk (see the sketch below).
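
A rough sketch of that write path, purely for illustration (WriteEllpackPages and CompressAndWrite are hypothetical names, not the actual implementation):

```cpp
// Hypothetical sketch of "accumulate CSR rows first, compress once at the end".
void WriteEllpackPages(DMatrix* dmat, SparsePageWriter* writer) {
  SparsePage accumulated;  // uncompressed CSR rows buffered in host memory
  for (auto& batch : dmat->GetBatches<SparsePage>()) {
    accumulated.Push(batch);  // concatenating CSR pages is straightforward
  }
  // Compressed ELLPACK buffers cannot simply be concatenated, so compression
  // happens only after all rows for the page have been gathered.
  CompressAndWrite(accumulated, writer);  // hypothetical helper
}
```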

A few things are still O(n) in the dataset size, such as the prediction output and the gradient pair list. We could potentially stream these from main memory as well, probably in a follow-up PR.

So far I've focused on correctness and haven't really looked at performance; I'll try to run some benchmarks.

@RAMitchell @trivialfis @sriramch

@RAMitchell (Member) left a comment

Looks simpler than I expected. I think the way we use RowPartitioner can be improved; otherwise it looks pretty good.

```diff
 int gidx =
     matrix.gidx_iter[ridx * matrix.info.row_stride + idx % matrix.info.row_stride];
 if (gidx != matrix.info.n_bins) {
   // If we are not using shared memory, accumulate the values directly into
   // global memory
   GradientSumT* atomic_add_ptr =
       use_shared_memory_histograms ? smem_arr : d_node_hist;
-  dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx]);
+  dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx + base_rowid]);
```
Member:

I think this can be tidied up a little. base_rowid can be a member of EllpackMatrix. EllpackMatrix could have a method that returns the row index of each of its elements.
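
A hypothetical sketch of that suggestion, reusing the matrix.info and matrix.gidx_iter names from the snippet above (the RowIndex method is an assumption, not the code actually merged):

```cpp
// Sketch: EllpackMatrix owns base_rowid and maps a flat element index back
// to an absolute row index, assuming a row-major ELLPACK layout.
struct EllpackMatrix {
  EllpackInfo info;
  common::CompressedIterator<uint32_t> gidx_iter;
  size_t base_rowid{0};  // first row id covered by this page

  __device__ size_t RowIndex(size_t idx) const {
    return base_rowid + idx / info.row_stride;
  }
};
```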

@rongou (Author):

Done.

```cpp
page = batch.Impl();
if (page->n_rows != row_partitioner->GetRows().size()) {
  row_partitioner.reset();  // Release the device memory first before reallocating
  row_partitioner.reset(new RowPartitioner(device_id, page->n_rows));
```
Member:

I am wondering whether it is better for us to keep a row_partitioner for each batch instead of resetting one object; memory usage will increase somewhat. Alternatively, what is stopping us from creating the RowPartitioner object on the stack whenever we need it?

@rongou (Author):

I changed the row_partitioner back to using absolute row IDs. PTAL.

@RAMitchell (Member):

Also, given the significant changes to our core algorithm, I think you should run https://github.com/NVIDIA/gbm-bench to check for performance regressions. You could also benchmark your external memory changes there with some modification. It might take some time, but it's worth becoming familiar with this benchmark suite.

@hcho3 (Collaborator) commented Nov 6, 2019

@RAMitchell Thanks for the link. Can I use it to set up a regular benchmark for performance/accuracy regressions?

@RAMitchell (Member) commented Nov 6, 2019

@hcho3 yes, if we have the budget to run it! I've been running it on an 8-GPU machine, but 2 or 4 is fine. It downloads and caches the datasets, so I would suggest using persistent storage for that part.

@RAMitchell (Member):

@rongou I have approved this on the basis of the code, but please do benchmark before we merge to test for performance regression.

@hcho commented Nov 7, 2019 via email

@trivialfis (Member) left a comment

Can we have a high-level test in Python that verifies external memory produces exactly the same result as normal operation?

```diff
@@ -166,6 +166,15 @@ struct BatchParam {
   int max_bin;
   /*! \brief Number of rows in a GPU batch, used for finding quantiles on GPU. */
   int gpu_batch_nrows;
+  /*! \brief Page size for external memory mode. */
+  size_t gpu_page_size;
```
Member:

Do we expose this to users?

@rongou (Author):

Yes. See below.

```diff
@@ -227,15 +228,15 @@ std::unique_ptr<DMatrix> CreateSparsePageDMatrixWithRC(
   size_t j = 0;
   if (rem_cols > 0) {
     for (; j < std::min(static_cast<size_t>(rem_cols), cols_per_row); ++j) {
-      row_data << " " << (col_idx+j) << ":" << (col_idx+j+1)*10;
+      row_data << label(*gen) << " " << (col_idx+j) << ":" << (col_idx+j+1)*10*i;
```
Member:

We implemented a dummy generator in the test helpers, as the generators in std are not guaranteed to be reproducible across different compilers and platforms.

@rongou (Author):

I don't think we need to be reproducible across compilers/platforms; we just need to be deterministic across multiple runs. Anyway, this is @sriramch's code. Comments?

@sriramch (Contributor):

That is correct. We just need the feature IDs and values to be consistent when invoked multiple times for the same row/column configuration, when the deterministic flag is passed.
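
For illustration, a generator with those properties can be built from a mixing hash, so the value for a given (row, column) cell depends only on the cell itself and not on any std:: distribution; this sketch is hypothetical, not the helper actually used in the tests:

```cpp
#include <cstdint>

// Hypothetical deterministic generator: the output depends only on
// (row, col), so it is identical across compilers, platforms, and runs.
inline uint64_t MixHash(uint64_t row, uint64_t col) {
  uint64_t x = row * 0x9E3779B97F4A7C15ULL + col;
  x ^= x >> 33;
  x *= 0xFF51AFD7ED558CCDULL;
  x ^= x >> 33;
  return x;
}

// Feature value in [0, 1) for a given cell.
inline double DeterministicValue(uint64_t row, uint64_t col) {
  return static_cast<double>(MixHash(row, col)) /
         static_cast<double>(UINT64_MAX);
}
```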

```diff
@@ -21,6 +21,8 @@ struct GenericParameter : public XGBoostParameter<GenericParameter> {
   int nthread;
   // primary device, -1 means no gpu.
   int gpu_id;
+  // gpu page size in external memory mode, 0 means using the default.
+  size_t gpu_page_size;
```
Member:

Parameters are configurable by users; please don't define this one twice. Make one of them a normal variable, and if we don't want it to be user-configurable, don't use a parameter at all. Note that pickling might lose some of this information, and Dask uses pickle to move the booster around between workers.

@rongou (Author):

It's only defined as a configurable parameter once here; the other one is really just plumbing. For now this is mostly used for testing, but perhaps users may want to set it depending on how much GPU memory they have.
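
For context, this is the usual shape of a single DMLC parameter declaration (a sketch only; the exact default, bounds, and description text in the PR are assumptions):

```cpp
// Sketch of declaring gpu_page_size once as a DMLC parameter field.
DMLC_DECLARE_PARAMETER(GenericParameter) {
  DMLC_DECLARE_FIELD(gpu_page_size)
      .set_default(0)
      .set_lower_bound(0)
      .describe("GPU page size in external memory mode; 0 means the default.");
}
```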

Member:

Hmm, we want to do parameter validation, like detecting unused parameters, and this may add some extra difficulty. Do you think it's possible to make it a DMatrix parameter instead of a global one? Maybe in another PR? Sorry for nitpicking here.

Member:

Agreed, we need to be careful adding global parameters due to the upcoming work on serialisation. Unless you see a strong motivation for users tuning this, let's leave it out for now.

@rongou (Author):

I'm not quite sure whether this is useful for end users. Is there a way to make a parameter hidden/internal? It's really useful for the tests, since we then don't have to build a dataset bigger than 32 MB.

@rongou (Contributor) commented Nov 7, 2019

@trivialfis I think the GpuHist.ExternalMemory C++ test is close to what you want, right?

@RAMitchell I'm running the benchmarks now and will report back with the numbers.

@rongou (Contributor) commented Nov 9, 2019

@RAMitchell I ran the benchmarks 3 times on my desktop with a Titan V. Here are the average training times:

| Dataset | master | this PR |
|---------|--------|---------|
| airline | 158.8517289 | 159.8197227 |
| bosch | 18.8830594 | 19.95914053 |
| covtype | 54.04702564 | 54.0232536 |
| epsilon | 712.1971298 | 682.5995135 |
| fraud | 2.265036382 | 2.316629445 |
| higgs | 56.99770937 | 56.53523625 |
| year | 9.098267321 | 9.279944749 |

Most are about the same, but epsilon is noticeably faster for some reason. I'm running the same benchmarks on a GCP VM with a V100 and will report back once they're done.

@RAMitchell (Member):

LGTM

@rongou (Contributor) commented Nov 11, 2019

@RAMitchell here are the 3-run average training times on the V100:

| Dataset | master | this PR |
|---------|--------|---------|
| airline | 137.6487908 | 138.3818154 |
| bosch | 15.30501916 | 14.95584854 |
| covtype | 54.80562482 | 57.31219463 |
| epsilon | 307.9471577 | 286.3518352 |
| fraud | 2.122671061 | 2.145682581 |
| higgs | 50.20506154 | 50.47706387 |
| year | 9.318194856 | 9.318194856 |

Again there is a noticeable speedup for epsilon.

@codecov-io commented Nov 11, 2019

Codecov Report

Merging #5014 into master will not change coverage. The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master    #5014   +/-   ##
=======================================
  Coverage   71.52%   71.52%
=======================================
  Files          11       11
  Lines        2311     2311
=======================================
  Hits         1653     1653
  Misses        658      658
```

Last update 97abcc7...a6bf0dd.

```cpp
this->UpdatePosition(candidate.nid, (*p_tree)[candidate.nid]);
monitor.StopCuda("UpdatePosition");
if (ExpandEntry::ChildIsValid(param, tree.GetDepth(left_child_nidx), num_leaves)) {
  for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
```
Contributor:

Wouldn't iterating over all the batches while building the root node, again for each node in the tree, and again when finalizing positions in every iteration significantly add to the training time when external memory mode is enabled? It would be interesting to see the difference between this version and the one in master on one of the representative datasets with external memory enabled.

@rongou (Contributor) commented Nov 12, 2019

@RAMitchell @trivialfis here are the 3-run average metrics on the Titan V (the V100 is similar). Note that all the numbers I've reported so far are for the in-memory mode; I will report external memory performance once I have the numbers.

| Dataset | Metric | master | this PR | diff |
|---------|--------|--------|---------|------|
| airline | AUC | 0.843446422 | 0.8437542634 | 0.0003078414077 |
| | Accuracy | 0.7183621187 | 0.7187676434 | 0.0004055247306 |
| | Log_Loss | 0.5295130551 | 0.5290879854 | -0.0004250697381 |
| | Precision | 0.6538399493 | 0.654240727 | 0.0004007776586 |
| | Recall | 0.8639324626 | 0.8640286929 | 0.00009623029934 |
| bosch | AUC | 0.6910459197 | 0.6910459197 | 0 |
| | Accuracy | 0.9585385428 | 0.9585385428 | 0 |
| | Log_Loss | 0.2393587625 | 0.2393587625 | 0 |
| | Precision | 0.04381863651 | 0.04381863651 | 0 |
| | Recall | 0.2828070175 | 0.2828070175 | 0 |
| covtype | Accuracy | 0.9391495917 | 0.9391495917 | 0 |
| | F1 | 0.9390102704 | 0.9390102704 | 0 |
| | Precision | 0.9392223225 | 0.9392223225 | 0 |
| | Recall | 0.9391495917 | 0.9391495917 | 0 |
| epsilon | AUC | 0.947596176 | 0.947596176 | 0 |
| | Accuracy | 0.87099 | 0.87099 | 0 |
| | Log_Loss | 0.3012408595 | 0.3012408595 | 0 |
| | Precision | 0.8443296287 | 0.8443296287 | 0 |
| | Recall | 0.9094184766 | 0.9094184766 | 0 |
| fraud | AUC | 0.9618900951 | 0.9618900951 | 0 |
| | Accuracy | 0.9995611109 | 0.9995611109 | 0 |
| | Log_Loss | 0.003617382829 | 0.003617382829 | 0 |
| | Precision | 0.950617284 | 0.950617284 | 0 |
| | Recall | 0.7857142857 | 0.7857142857 | 0 |
| higgs | AUC | 0.8396768819 | 0.8394438219 | -0.0002330599985 |
| | Accuracy | 0.7338934848 | 0.7335419697 | -0.0003515151515 |
| | Log_Loss | 0.5221688581 | 0.5224705406 | 0.0003016824741 |
| | Precision | 0.6908124497 | 0.6905024365 | -0.0003100131171 |
| | Recall | 0.9014905058 | 0.9013507388 | -0.0001397670777 |
| year | MeanAbsError | 6.228921157 | 6.228921157 | 0 |
| | MeanSquaredError | 79.77385217 | 79.77385217 | 0 |
| | MedianAbsError | 4.267578125 | 4.267578125 | 0 |

@trivialfis (Member):

Restarted the test.

```diff
@@ -626,6 +631,14 @@ struct GPUHistMakerDevice {
     return std::vector<DeviceSplitCandidate>(result_all.begin(), result_all.end());
   }

+  // Build gradient histograms for a given node across all the batches in the DMatrix.
+  void BuildHistBatches(int nidx, DMatrix* p_fmat) {
+    for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
```
Contributor:

Wouldn't this result in pages being recycled aggressively while walking through the batches, with ELLPACK pages repeatedly created and destructed, leading to a large number of data copies to and from the device?

@rongou (Contributor) commented Nov 13, 2019

@trivialfis hmm, it looks like the accuracy in external memory mode is not quite as good as the in-core version's. I'll add a Python test to verify the model accuracy, as you suggested.

@sriramch (Contributor):

I think the sparse pages are going to be reused heavily during iteration (due to the use of the threaded iterator), so page state has to be reinitialized with every page read.

If pages are persisted on a spinning disk, then every iteration costs a page read from disk plus creation of the page state in system and/or device memory; the latter may involve moving ~32 MB of data per page.

Thus, for every node created in every tree in every iteration, this may turn out to be prohibitively expensive for a large dataset with very many pages.
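
A rough cost model of this concern, assuming every page must be re-read and its device state rebuilt on each pass:

$$\text{I/O per boosting iteration} \approx N_{\text{nodes}} \times N_{\text{pages}} \times S_{\text{page}}$$

where $N_{\text{nodes}}$ is the number of nodes expanded per tree, $N_{\text{pages}}$ is the number of ELLPACK pages, and $S_{\text{page}}$ is the page size (~32 MB per the discussion above), so the cost grows multiplicatively with tree size and dataset size.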

@rongou (Contributor) commented Nov 14, 2019

With a generated synthetic dataset (20 million rows × 100 columns) that doesn't fit in my Titan V, gpu_hist in external memory mode is about twice as fast as hist. But for now the results are not quite right, so take that with a grain of salt.

For this PR the main goal is to get correct results in external memory mode. There is probably plenty of low-hanging fruit for optimization.

@trivialfis (Member) commented Nov 14, 2019

@rongou

> But for now the results are not quite right,

Could you elaborate on that a bit more? What's not quite right? ;-)

> hmm, it looks like the accuracy in external memory mode is not quite as good as the in-core version's. I'll add a Python test to verify the model accuracy, as you suggested.

Ah, now I know. Sorry for the noise.

@trivialfis (Member):

@rongou Sorry for creating some conflicts. I merged master into your branch; please just force-push if the merge stands in your way.

@trivialfis (Member) left a comment

You fixed it. :-) I see the new tests are passing; is it ready to go? The code looks good to me. I'm not sure whether you want to keep the generator here, as it might be useful outside XGBoost.

@rongou (Contributor) commented Nov 15, 2019

@trivialfis yeah, I think this is good to go. I'll work on follow-up PRs for performance tuning.

@trivialfis merged commit 0afcc55 into dmlc:master on Nov 16, 2019
@trivialfis (Member):

That's quite an achievement; ELLPACK was simply a function before.

The lock bot locked this as resolved and limited the conversation to collaborators on Feb 14, 2020.
@rongou deleted the gpu-hist-batches branch on November 18, 2022.