support multiple batches in gpu_hist #5014
Conversation
This reverts commit 9f1fc21.
Looks simpler than I expected. I think the way we use RowPartitioner can be improved; otherwise it looks pretty good.
src/tree/updater_gpu_hist.cu
Outdated
  int gidx =
      matrix.gidx_iter[ridx * matrix.info.row_stride + idx % matrix.info.row_stride];
  if (gidx != matrix.info.n_bins) {
    // If we are not using shared memory, accumulate the values directly into
    // global memory
    GradientSumT* atomic_add_ptr =
        use_shared_memory_histograms ? smem_arr : d_node_hist;
-   dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx]);
+   dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[ridx + base_rowid]);
I think this can be tidied up a little. base_rowid can be a member of EllpackMatrix, and EllpackMatrix could have a method that returns the row index of each of its elements.
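A minimal sketch of that suggestion, with hypothetical names (GlobalRowId is illustrative, not the actual XGBoost method):

#include <cstddef>

// Sketch only: the page stores the global index of its first row, and a small
// helper maps a local (within-page) row index back to a global row id.
struct EllpackMatrix {
  size_t base_rowid;  // global index of the first row in this page

  __device__ size_t GlobalRowId(size_t local_ridx) const {
    return base_rowid + local_ridx;
  }
};

// The histogram kernel above could then become:
//   dh::AtomicAddGpair(atomic_add_ptr + gidx, d_gpair[matrix.GlobalRowId(ridx)]);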
Done.
src/tree/updater_gpu_hist.cu
Outdated
  page = batch.Impl();
  if (page->n_rows != row_partitioner->GetRows().size()) {
    row_partitioner.reset();  // Release the device memory first before reallocating
    row_partitioner.reset(new RowPartitioner(device_id, page->n_rows));
I am wondering if it would be better for us to keep a row_partitioner for each batch instead of resetting one of these objects; memory usage would increase somewhat. Alternatively, what is stopping us from creating the RowPartitioner object on the stack whenever we need it?
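For illustration, the stack-based alternative might look like this (a sketch only, reusing the constructor from the diff above; the loop body is elided):

// Sketch: a fresh RowPartitioner per batch, scoped to the loop body, so its
// device memory is released automatically when it goes out of scope.
for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
  RowPartitioner row_partitioner(device_id, batch.Impl()->n_rows);
  // ... partition this batch's rows ...
}  // destructor frees the device allocations here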
I changed the row_partitioner back to using absolute row ids. PTAL
Also, given the significant changes to our core algorithm, I think you should run https://github.com/NVIDIA/gbm-bench to check for performance regressions. You could also benchmark your external memory work here with some modification. It might take some time, but it's worth becoming familiar with this benchmark suite.
@RAMitchell Thanks for the link. Can I use it to set up a regular benchmark for performance/accuracy regressions?
@hcho3 yes! If we have the budget to run it. I've been running it on an 8-GPU machine, but 2 or 4 is fine. It downloads and caches the datasets, so I would suggest using persistent storage for that part.
@rongou I have approved this on the basis of the code, but please do benchmark before we merge to test for performance regression.
Just informing you that I have blocked y'all; this is just a reminder. While the wrong person is getting a crazy amount of email for no reason, the person who is supposed to receive these messages is out of the loop.
Can we have a high-level test in Python verifying that external memory produces exactly the same results as normal operation?
@@ -166,6 +166,15 @@ struct BatchParam {
   int max_bin;
   /*! \brief Number of rows in a GPU batch, used for finding quantiles on GPU. */
   int gpu_batch_nrows;
+  /*! \brief Page size for external memory mode. */
+  size_t gpu_page_size;
Do we expose this to users?
Yes. See below.
@@ -227,15 +228,15 @@ std::unique_ptr<DMatrix> CreateSparsePageDMatrixWithRC(
   size_t j = 0;
   if (rem_cols > 0) {
     for (; j < std::min(static_cast<size_t>(rem_cols), cols_per_row); ++j) {
-      row_data << " " << (col_idx+j) << ":" << (col_idx+j+1)*10;
+      row_data << label(*gen) << " " << (col_idx+j) << ":" << (col_idx+j+1)*10*i;
We implemented a dummy generator in the test helpers, as the generators in std are not guaranteed to be reproducible across different compilers and platforms.
I don't think we need to be reproducible across compilers/platforms; we just need to be deterministic across multiple runs. Anyway, this is @sriramch's code. Comments?
That is correct. We just need the feature IDs and values to be consistent when invoked multiple times for the same row/column configuration, when the deterministic flag is passed.
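For context, a generator with exactly that property can be a hand-rolled LCG; a minimal sketch, not necessarily the helper that was added:

#include <cstdint>

// Same seed -> same sequence on every compiler and platform, unlike the
// distributions in <random>, whose output is implementation-defined.
class SimpleLCG {
 public:
  explicit SimpleLCG(std::uint64_t seed) : state_(seed) {}
  std::uint64_t operator()() {
    state_ = 6364136223846793005ULL * state_ + 1442695040888963407ULL;
    return state_ >> 33;  // drop the weaker low bits
  }
 private:
  std::uint64_t state_;
};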
@@ -21,6 +21,8 @@ struct GenericParameter : public XGBoostParameter<GenericParameter> {
   int nthread;
   // primary device, -1 means no gpu.
   int gpu_id;
+  // gpu page size in external memory mode, 0 means using the default.
+  size_t gpu_page_size;
Parameters are configurable by users, so please don't define this one twice; make one of them a normal variable. If we don't want it to be configurable by the user, don't use a parameter at all. Note that pickling might lose some of this information, and dask uses pickle to move the booster around workers.
It's only defined as a configurable parameter once here; the other one is really just plumbing. For now this is mostly used for testing, but perhaps users may want to set it depending on the GPU memory they have.
Hmm, we want to do parameter validation, like detecting unused parameters, and this may add some extra difficulty. Do you think it's possible to set it as a DMatrix parameter instead of a global one? Maybe in another PR? Sorry for nitpicking here.
Agreed, we need to be careful about adding global parameters due to the upcoming work on serialisation. Unless you see a strong motivation for users tuning this, let's leave it out for now.
I'm not quite sure this is useful for end users. Is there a way to make a parameter hidden/internal? It's really useful for the tests, since we don't have to build a dataset bigger than 32 MB.
@trivialfis I think the … @RAMitchell Running the benchmarks now. Will report back with the numbers.
@RAMitchell I ran the benchmarks 3 times on my desktop with a Titan V. Here are the average training times:
Looks like most are about the same, but epsilon is much faster for some reason. I'm running the same benchmarks on a GCP VM with a V100, and will report back once they're done there.
LGTM
@RAMitchell here are the 3-run average training times on V100:
Again there is a noticeable speedup for epsilon.
Codecov Report

@@            Coverage Diff            @@
##           master    #5014    +/-   ##
=========================================
  Coverage    71.52%   71.52%
=========================================
  Files           11       11
  Lines         2311     2311
=========================================
  Hits          1653     1653
  Misses         658      658

Continue to review the full report at Codecov.
  this->UpdatePosition(candidate.nid, (*p_tree)[candidate.nid]);
  monitor.StopCuda("UpdatePosition");
  if (ExpandEntry::ChildIsValid(param, tree.GetDepth(left_child_nidx), num_leaves)) {
    for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
Wouldn't iterating over all the batches successively, while building the root node, building each node in the tree, and finalizing the positions in every iteration, significantly add to the training time when external memory mode is enabled? It would be interesting to see the difference between this version and the one in master for a representative dataset with external memory enabled.
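Schematically, the nesting described above looks like this; a compilable sketch with stand-in types, not the actual updater code:

#include <vector>

struct Page {};  // stands in for one ELLPACK batch read from disk

void BuildRootHistogram(const std::vector<Page>&) {}  // no-op stand-in
void UpdateNode(const Page&, int) {}                  // no-op stand-in

void TrainSketch(const std::vector<Page>& pages, int num_trees, int nodes_per_tree) {
  for (int t = 0; t < num_trees; ++t) {
    BuildRootHistogram(pages);            // one full pass over every page
    for (int nidx = 0; nidx < nodes_per_tree; ++nidx) {
      for (const Page& page : pages) {    // re-walks ALL pages for every node
        UpdateNode(page, nidx);
      }
    }
  }
}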
@RAMitchell @trivialfis here are the 3-run average metrics for the Titan V (the V100 is similar). Note that all the numbers I've reported so far are for the in-memory mode; I will report external memory performance once I have the numbers.
Restarted the test.
@@ -626,6 +631,14 @@ struct GPUHistMakerDevice {
     return std::vector<DeviceSplitCandidate>(result_all.begin(), result_all.end());
   }

+  // Build gradient histograms for a given node across all the batches in the DMatrix.
+  void BuildHistBatches(int nidx, DMatrix* p_fmat) {
+    for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
Wouldn't this result in pages getting recycled aggressively while walking through the batches, with ELLPACK pages repeatedly created and destroyed, leading to a large number of data copies to and from the device?
@trivialfis Hmm, it looks like in external memory mode the accuracy is not quite as good as with the in-core version. I think I'll add a Python test to verify the model accuracy, as you suggested.
I think the sparse pages are going to be reused heavily during iteration (due to the use of the threaded iterator), resulting in page states having to be reinitialized with every page read. If pages are persisted on a spinning disk, then every iteration costs a page read from disk plus creation of page state in system and/or device memory; the latter may involve data movement (~32 MB worth) for every page. Thus, for every node creation, for every tree, in every iteration, this may turn out to be prohibitively expensive for a large dataset with very many pages.
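To put hypothetical numbers on that concern (only the ~32 MB page size comes from this thread; every other figure below is an assumption for illustration):

#include <cstdio>

int main() {
  const double page_mb        = 32.0;   // page size mentioned above
  const double num_pages      = 100.0;  // hypothetical: ~3.2 GB of pages on disk
  const double num_trees      = 500.0;  // hypothetical training run
  const double nodes_per_tree = 63.0;   // a full binary tree of depth 6
  // If every node build re-reads every page from disk:
  const double total_gb = page_mb * num_pages * num_trees * nodes_per_tree / 1024.0;
  std::printf("approx. data re-read: %.0f GB\n", total_gb);  // ~98438 GB
  return 0;
}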
With a generated synthetic dataset (20 million rows × 100 columns) that doesn't fit in my Titan V, in external memory mode, … For this PR the main goal is to get correct results in external memory mode. There are probably a bunch of low-hanging fruits for optimization.
Could you elaborate on it a bit more? What's not quite right? ;-)
Ah, now I know. Sorry for the noise.
@rongou Sorry for creating some conflicts. I merged master into your branch; please just force-push if the merge stands in your way.
You fixed it. :-) I see the new tests are passing; is it ready to go? The code looks good to me. I'm not sure if you want to keep the generator here, as it might be useful outside XGBoost.
@trivialfis yeah, I think this is good to go. I will work on follow-up PRs for performance tuning.
That's quite an achievement; ellpack was simply a function before.
In the gpu_hist tree method, support reading in multiple ELLPACK pages from disk. Part of #4357.

The main changes are in updater_gpu_hist.cu:
- In InitRoot(), when building the initial histogram, we now loop through all the batches and accumulate the histograms.
- In FinalisePosition(), we loop through the batches again to update the positions for all the rows in the dataset.

I found some problems with writing out the ELLPACK pages: the compressed buffers can't really be concatenated, so we now accumulate the CSR pages in memory first before compressing and writing to disk.

A few things are still O(n) in dataset size, such as the prediction output and the gradient pair list. We could potentially stream these from main memory as well, probably in a follow-up PR.

So far I've focused on correctness and haven't really looked at performance. Will try to run some benchmarks.

@RAMitchell @trivialfis @sriramch
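The FinalisePosition change described above has the same shape as the BuildHistBatches hunk shown earlier; a sketch under assumed names (FinalisePositionInPage is hypothetical, not the literal code):

// Sketch: after a tree is built, walk every batch once more so that every row
// in the dataset, not just the currently resident page, gets its final leaf.
void FinalisePosition(RegTree const* p_tree, DMatrix* p_fmat) {
  for (auto& batch : p_fmat->GetBatches<EllpackPage>(batch_param)) {
    page = batch.Impl();              // swap in this batch's ELLPACK page
    FinalisePositionInPage(p_tree);   // hypothetical per-page helper
  }
}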