[REVIEW] Enable Multi-Node Multi-GPU functionality #4095

Merged: 49 commits merged into dmlc:master from mtjrider:mnmg on Mar 1, 2019

Conversation

@mtjrider (Contributor) commented Jan 31, 2019

To enable multi-node multi-GPU functionality, this PR adjusts dh::AllReducer::Init to query all rabit ranks (rabit::GetRank()) for GPU device information (additional node stats). This permits NCCL to construct a unique ID for a communicator clique spanning many rabit workers. Exposing this unique ID also permits a custom process schema for each GPU in the communicator clique (one process per GPU, one process for multiple GPUs, etc.).

In Dask, this would permit a user to construct a collection of XGBoost processes over a heterogeneous device cluster. It would also permit resource management systems to assign a rank to each GPU in the cluster.
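
For illustration, the one-process-per-GPU schema described above would look roughly like the following on each rabit worker: the tracker assigns the rank, the rank selects a local device, and NCCL handles the cross-worker reduction inside gpu_hist. This is a hedged sketch, not code from this PR; the data paths, the eight-GPUs-per-node figure, and the launch mechanism (dmlc-submit, Dask, etc.) are all placeholders.

# Sketch: one XGBoost process per GPU across rabit workers.
# Assumes the process was launched under a rabit tracker (e.g. dmlc-submit);
# file paths and the 8-GPUs-per-node figure are illustrative only.
import xgboost as xgb

xgb.rabit.init()
rank = xgb.rabit.get_rank()            # global rank assigned by the tracker

params = {
    "tree_method": "gpu_hist",
    "n_gpus": 1,                       # one GPU per process in this schema
    "gpu_id": rank % 8,                # map the rank onto a local device
    "objective": "binary:logistic",
}

dtrain = xgb.DMatrix("train.part-%d.libsvm" % rank)   # hypothetical per-worker shard
bst = xgb.train(params, dtrain, num_boost_round=100)

if rank == 0:
    bst.save_model("mnmg.model")
xgb.rabit.finalize()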

@RAMitchell (Member) left a comment

Thanks for the PR! In addition to my comments, there are a number of mistakes from the merge. I will comment on these changes after you clean up the merge.

Review threads (outdated, since resolved):
- src/common/device_helpers.cuh (3)
- src/tree/param.h (1)
- src/tree/updater_gpu_hist.cu (4)
@RAMitchell (Member)

We also need to find a way to write tests for this that can run on our Jenkins multi-GPU system.

@hcho3 (Collaborator) commented Jan 31, 2019

@RAMitchell @mt-jones Let me know if you need any assistance regarding the Jenkins environment.

teju85 and others added 8 commits January 31, 2019 15:55
- it now crashes on NCCL initialization, but at least we're attempting it properly
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
- this improves performance _significantly_ (7x faster for overall training,
  20x faster for xgboost proper)
@mtjrider (Contributor, Author) commented Feb 1, 2019

OK, I believe I've removed all of the merge conflict remnants, etc. Thanks for your patience!

Could you please re-review, or apply your comments to the pertinent lines of code in the clean, rebased version?

@rongou (Contributor) commented Feb 7, 2019

Looks like this PR doesn't build?

@hcho3 (Collaborator) commented Feb 7, 2019

@rongou We have a merge conflict against the latest master.

@trivialfis (Member)

Reminder to close #3499 after this is merged.

@mtjrider (Contributor, Author) commented Feb 19, 2019

> @RAMitchell @mt-jones Let me know if you need any assistance regarding the Jenkins environment.

I'm getting some SSL connection failures in some of the builds. Can you advise?

Note: I've resolved conflicts with master and updated the codebase so that it reflects the core changes we've made.

Review thread (outdated, since resolved): src/tree/param.h
@jeffdk commented Feb 27, 2019

489e499 fixes test 3. I'm going to give this build a try on some of my large data sets and see how things go in the non-distributed but multi-gpu scenario.

@mtjrider (Contributor, Author)

> 489e499 fixes test 3. I'm going to give this build a try on some of my large data sets and see how things go in the non-distributed but multi-gpu scenario.

@jeffdk As it happens, there was an issue where worker threads would not complete dumping their model before the file compare logic was entered. This caused a worker to compare a model file that didn't exist, or was only partially available.

I added a sleep call after the model dump to equilibrate worker states.
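
The shape of that fix, as described, is roughly the following sketch; the file names, the two-second pause, and the compare helper are assumptions for illustration, not code from this PR.

# Sketch of the described workaround: pause after dumping the model so every
# worker has finished writing before any worker starts comparing files.
# bst is an xgboost.Booster trained by this worker; paths are hypothetical.
import time

def dump_and_compare(bst, rank, reference_path="worker_0.model.txt"):
    local_path = "worker_%d.model.txt" % rank
    bst.dump_model(local_path)

    # Without a pause, a fast worker can open another worker's dump while it
    # is still missing or only partially written.
    time.sleep(2)

    with open(local_path) as mine, open(reference_path) as ref:
        assert mine.read() == ref.read(), "model dumps differ across workers"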

@mtjrider mtjrider changed the title [WIP] Enable Multi-Node Multi-GPU functionality [REVIEW] Enable Multi-Node Multi-GPU functionality Feb 27, 2019
@mtjrider (Contributor, Author)

@RAMitchell looks like I'm getting an error in the CI from one of the standard MGPU tests.

*** glibc detected *** /opt/python/bin/python: malloc(): memory corruption: 0x00007fac29861890 ***

Have you seen this before?

@jeffdk commented Feb 27, 2019

@mt-jones Ahh, yeah, a race condition in dumping/reading the models from each worker. I just read the commit message, tried again, and figured it pertained to the failing test :)

I was able to successfully train non-distributed mGPU on a 5-million-row data set, but running on a scaled-up version (~11M rows) results in nonsensical training:

[23:33:49] 11068163x21692 matrix with 21753698677 entries loaded from /data/p19_dense_binary.xgb
[0]	train-logloss:0.677015	train-error:0.052836
[1]	train-logloss:5.31965	train-error:0.161321
[2]	train-logloss:4.58369	train-error:0.137881
[3]	train-logloss:4.59535	train-error:0.138178
[4]	train-logloss:4.5807	train-error:0.139837
[5]	train-logloss:4.56639	train-error:0.139432
[6]	train-logloss:4.53935	train-error:0.139342
[7]	train-logloss:4.54767	train-error:0.139897
[8]	train-logloss:4.5392	train-error:0.141285
[9]	train-logloss:4.64014	train-error:0.142456
[10]	train-logloss:4.67762	train-error:0.143468
[11]	train-logloss:4.67607	train-error:0.143598
... 

Training completes and I have a model file
sample.model.gz

Hyperparameters:

 {
    "n_gpus": 8,
    "nthread": "96",
    "predictor": "cpu_predictor",
    "eta": 0.2,
    "max_bin": 31,
    "objective": "binary:logistic",
    "num_boost_round": 100,
    "tree_method": "gpu_hist",
    "max_depth": 10
  }

The only difference between the working 5M-row run and this one is the data. Any thoughts on how I might debug this or gather more information on what might be going on?
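
For reference, parameters like these are normally split between the params dict and the xgb.train call (num_boost_round is an argument to xgb.train, not a booster parameter). A minimal sketch under that assumption, reusing only the data path from the log above; the evals list is an assumption added to reproduce the train-logloss/train-error output:

# Sketch: single-node multi-GPU run with the parameters listed above.
# Only the data path comes from the log; everything else is illustrative.
import xgboost as xgb

dtrain = xgb.DMatrix("/data/p19_dense_binary.xgb")

params = {
    "n_gpus": 8,
    "nthread": 96,
    "predictor": "cpu_predictor",
    "eta": 0.2,
    "max_bin": 31,
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",
    "max_depth": 10,
}

# num_boost_round is passed to xgb.train directly rather than through params.
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtrain, "train")])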

@jeffdk commented Feb 28, 2019

> @RAMitchell looks like I'm getting an error in the CI from one of the standard MGPU tests.
>
> *** glibc detected *** /opt/python/bin/python: malloc(): memory corruption: 0x00007fac29861890 ***
>
> Have you seen this before?

FWIW: I get a similar error in the same test, though not at the same precise test parameters:

tests/python-gpu/test_gpu_linear.py Training on dataset: Boston
<about 25 sets of test parameters passed before this>
...
Training on dataset: Digits
Using parameters: {'n_gpus': -1, 'num_class': 10, 'eval_metric': 'merror', 'gpu_id': 1, 'top_k': 10, 'coordinate_selection': 'random', 'eta': 0.5, 'updater': 'gpu_coord_descent', 'objective': 'multi:softmax', 'alpha': 0.005, 'lambda': 0.005, 'tolerance': 1e-05, 'booster': 'gblinear'}
Segmentation fault

@mtjrider (Contributor, Author)

Noting here that I get the same error on XGBoost master.

@hcho3 (Collaborator) commented Feb 28, 2019

@mt-jones @RAMitchell Is there a URL that I can use to install NCCL2 for CUDA 10.1? The official download page requires a login, and the CI system can't answer the login prompt.

@rongou (Contributor) commented Feb 28, 2019

@hcho3

sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev

@hcho3 (Collaborator) commented Feb 28, 2019

@rongou We use CentOS NVIDIA Docker image to build XGBoost-GPU, so we can't use apt. Is there a tarball alternative?

@hcho3 (Collaborator) commented Feb 28, 2019

Currently, we run this script to install NCCL2 for CUDA 8.x and 9.x:

# NCCL2 (License: https://docs.nvidia.com/deeplearning/sdk/nccl-sla/index.html)
RUN \
  export CUDA_SHORT=`echo $CUDA_VERSION | egrep -o '[0-9]+\.[0-9]'` && \
  if [ "${CUDA_SHORT}" != "10.0" ]; then \
    wget https://developer.download.nvidia.com/compute/redist/nccl/v2.2/nccl_2.2.13-1%2Bcuda${CUDA_SHORT}_x86_64.txz && \
    tar xf "nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64.txz" && \
    cp nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64/include/nccl.h /usr/include && \
    cp nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64/lib/* /usr/lib && \
    rm -f nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64.txz && \
    rm -r nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64; \
  fi

I found out that it doesn't work for CUDA 10.0.

@mtjrider (Contributor, Author)

> @rongou We use CentOS NVIDIA Docker image to build XGBoost-GPU, so we can't use apt. Is there a tarball alternative?

If you navigate to the NCCL download page, you should be able to grab the network installer for CentOS 7, NCCL 2.4.2, CUDA 10.1.

Instructions:

sudo yum install libnccl-2.4.2-1+cuda10.1 libnccl-devel-2.4.2-1+cuda10.1 libnccl-static-2.4.2-1+cuda10.1

For the local installer CentOS7:
https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.4/prod/nccl-repo-rhel7-2.4.2-ga-cuda10.1-1-1.x86_64.rpm

OS agnostic:
https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.4/prod//nccl_2.4.2-1%2Bcuda10.1_x86_64.txz

@hcho3 (Collaborator) commented Feb 28, 2019

@mt-jones Awesome! Thanks a lot for the link. It should help me add a worker for CUDA 10.x.

@mtjrider (Contributor, Author)

@hcho3 it may also make sense to pull the source code from NCCL's GitHub page. It will always be up to date.

https://github.com/NVIDIA/nccl

The build process is pretty simple (for the OS agnostic solution):

$ make pkg.txz.build
$ ls build/pkg/txz/

@mtjrider (Contributor, Author)

@hcho3 I’m updating my response to the links question. Most of those will be behind a login wall.

To get the network RPM:
http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/nvidia-machine-learning-repo-rhel7-1.0.0-1.x86_64.rpm

RHEL7 package index here:
http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/

To install from network RPM:

sudo yum install libnccl-2.4.2-1+cuda10.1 libnccl-devel-2.4.2-1+cuda10.1 libnccl-static-2.4.2-1+cuda10.1

If you want a different CUDA+NCCL combo, just change the versions to match anything you see in the index I linked.

@RAMitchell (Member)

Current multi-GPU test failure is for the coordinate descent updater and seems unrelated to this PR. Let's see if it keeps happening if we rerun the tests.

#!/usr/bin/python
import xgboost as xgb
import time
from collections import OrderedDict
@mtjrider (Contributor, Author) commented on the hunk above:

This import statement may not be required given the correction to urllib3 version in the deployed Dockerfile.

@@ -35,7 +35,7 @@ ENV CPP=/opt/rh/devtoolset-2/root/usr/bin/cpp

 # Install Python packages
 RUN \
-    pip install numpy pytest scipy scikit-learn wheel
+    pip install numpy pytest scipy scikit-learn wheel kubernetes urllib3==1.22
@mtjrider (Contributor, Author) commented on the hunk above:

Necessary additions to fix unsatisfiable package sequence errors. Kubernetes is required for the dmlc_submit script.

@trivialfis (Member)

@RAMitchell

> Current multi-GPU test failure is for the coordinate descent updater and seems unrelated to this PR. Let's see if it keeps happening if we rerun the tests.

That might be related to #4194.

@RAMitchell RAMitchell merged commit 92b7577 into dmlc:master Mar 1, 2019
@mtjrider mtjrider deleted the mnmg branch March 5, 2019 20:37
@hcho3 hcho3 mentioned this pull request Mar 8, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019