[REVIEW] Enable Multi-Node Multi-GPU functionality #4095

Merged: 49 commits merged into dmlc:master from mtjrider:mnmg on Mar 1, 2019

Conversation

@mtjrider (Contributor) commented Jan 31, 2019

To enable multi-node multi-GPU functionality, this PR adjusts dh::AllReducer::Init to query all rabit ranks (rabit::GetRank()) for GPU device information (additional node stats). This permits NCCL to construct a unique ID for a communicator clique spanning many rabit workers. Exposing this unique ID also permits a custom process schema for each GPU in the communicator clique (one process per GPU, one process for multiple GPUs, etc.).

In Dask, this would permit a user to construct a collection of XGBoost processes over a heterogeneous device cluster. It would also permit resource management systems to assign a rank to each GPU in the cluster.
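
For illustration, the one-process-per-GPU schema described above would look roughly like the following on each rabit worker: the tracker assigns the rank, the rank selects a local device, and NCCL handles the cross-worker reduction inside gpu_hist. This is a hedged sketch, not code from this PR; the data paths, the eight-GPUs-per-node figure, and the launch mechanism (dmlc-submit, Dask, etc.) are all placeholders.

# Sketch: one XGBoost process per GPU across rabit workers.
# Assumes the process was launched under a rabit tracker (e.g. dmlc-submit);
# file paths and the 8-GPUs-per-node figure are illustrative only.
import xgboost as xgb

xgb.rabit.init()
rank = xgb.rabit.get_rank()            # global rank assigned by the tracker

params = {
    "tree_method": "gpu_hist",
    "n_gpus": 1,                       # one GPU per process in this schema
    "gpu_id": rank % 8,                # map the rank onto a local device
    "objective": "binary:logistic",
}

dtrain = xgb.DMatrix("train.part-%d.libsvm" % rank)   # hypothetical per-worker shard
bst = xgb.train(params, dtrain, num_boost_round=100)

if rank == 0:
    bst.save_model("mnmg.model")
xgb.rabit.finalize()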

@RAMitchell (Member) left a comment

Thanks for the PR! In addition to my comments, there are a number of mistakes from the merge. I will comment on these changes after you clean up the merge.

Review threads (outdated, since resolved):
- src/common/device_helpers.cuh (3)
- src/tree/param.h (1)
- src/tree/updater_gpu_hist.cu (4)
@RAMitchell (Member)

We also need to find a way to write tests for this that can run on our Jenkins multi-GPU system.

@hcho3 (Collaborator) commented Jan 31, 2019

@RAMitchell @mt-jones Let me know if you need any assistance regarding the Jenkins environment.

teju85 and others added 8 commits January 31, 2019 15:55
- it now crashes on NCCL initialization, but at least we're attempting it properly
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
- this improves performance _significantly_ (7x faster for overall training,
  20x faster for xgboost proper)
@mtjrider (Contributor, Author) commented Feb 1, 2019

OK, I believe I've removed all of the merge conflict remnants, etc. Thanks for your patience!

Could you please re-review, or apply your comments to the pertinent lines of code in the clean, rebased version?

@rongou (Contributor) commented Feb 7, 2019

Looks like this PR doesn't build?

@hcho3 (Collaborator) commented Feb 7, 2019

@rongou We have a merge conflict against the latest master.

@trivialfis (Member)

Reminder to close #3499 after this is merged.

@mtjrider (Contributor, Author) commented Feb 19, 2019

> @RAMitchell @mt-jones Let me know if you need any assistance regarding the Jenkins environment.

I'm getting some SSL connection failures in some of the builds. Can you advise?

Note: I've resolved conflicts with master and updated the codebase so that it reflects the core changes we've made.

Review thread (outdated, since resolved): src/tree/param.h
@jeffdk commented Feb 27, 2019

489e499 fixes test 3. I'm going to give this build a try on some of my large data sets and see how things go in the non-distributed but multi-gpu scenario.

@mtjrider (Contributor, Author)

> 489e499 fixes test 3. I'm going to give this build a try on some of my large data sets and see how things go in the non-distributed but multi-gpu scenario.

@jeffdk As it happens, there was an issue where worker threads would not complete dumping their model before the file compare logic was entered. This caused a worker to compare a model file that didn't exist, or was only partially available.

I added a sleep call after the model dump to equilibrate worker states.
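
The shape of that fix, as described, is roughly the following sketch; the file names, the two-second pause, and the compare helper are assumptions for illustration, not code from this PR.

# Sketch of the described workaround: pause after dumping the model so every
# worker has finished writing before any worker starts comparing files.
# bst is an xgboost.Booster trained by this worker; paths are hypothetical.
import time

def dump_and_compare(bst, rank, reference_path="worker_0.model.txt"):
    local_path = "worker_%d.model.txt" % rank
    bst.dump_model(local_path)

    # Without a pause, a fast worker can open another worker's dump while it
    # is still missing or only partially written.
    time.sleep(2)

    with open(local_path) as mine, open(reference_path) as ref:
        assert mine.read() == ref.read(), "model dumps differ across workers"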

@mtjrider mtjrider changed the title [WIP] Enable Multi-Node Multi-GPU functionality [REVIEW] Enable Multi-Node Multi-GPU functionality Feb 27, 2019
@mtjrider (Contributor, Author)

@RAMitchell looks like I'm getting an error in the CI from one of the standard MGPU tests.

*** glibc detected *** /opt/python/bin/python: malloc(): memory corruption: 0x00007fac29861890 ***

Have you seen this before?

@jeffdk commented Feb 27, 2019

@mt-jones Ahh, yeah, a race condition in dumping/reading the models from each worker. I just read the commit message, tried again, and figured it pertained to the failing test :)

I was able to successfully train non-distributed mGPU on a 5-million-row data set, but running on a scaled-up version (~11M rows) results in nonsensical training:

[23:33:49] 11068163x21692 matrix with 21753698677 entries loaded from /data/p19_dense_binary.xgb
[0]	train-logloss:0.677015	train-error:0.052836
[1]	train-logloss:5.31965	train-error:0.161321
[2]	train-logloss:4.58369	train-error:0.137881
[3]	train-logloss:4.59535	train-error:0.138178
[4]	train-logloss:4.5807	train-error:0.139837
[5]	train-logloss:4.56639	train-error:0.139432
[6]	train-logloss:4.53935	train-error:0.139342
[7]	train-logloss:4.54767	train-error:0.139897
[8]	train-logloss:4.5392	train-error:0.141285
[9]	train-logloss:4.64014	train-error:0.142456
[10]	train-logloss:4.67762	train-error:0.143468
[11]	train-logloss:4.67607	train-error:0.143598
... 

Training completes and I have a model file
sample.model.gz

Hyperparameters:

 {
    "n_gpus": 8,
    "nthread": "96",
    "predictor": "cpu_predictor",
    "eta": 0.2,
    "max_bin": 31,
    "objective": "binary:logistic",
    "num_boost_round": 100,
    "tree_method": "gpu_hist",
    "max_depth": 10
  }

The only difference between the working 5M-row run and this one is the data. Any thoughts on how I might debug this or gather more information on what might be going on?
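
For reference, parameters like these are normally split between the params dict and the xgb.train call (num_boost_round is an argument to xgb.train, not a booster parameter). A minimal sketch under that assumption, reusing only the data path from the log above; the evals list is an assumption added to reproduce the train-logloss/train-error output:

# Sketch: single-node multi-GPU run with the parameters listed above.
# Only the data path comes from the log; everything else is illustrative.
import xgboost as xgb

dtrain = xgb.DMatrix("/data/p19_dense_binary.xgb")

params = {
    "n_gpus": 8,
    "nthread": 96,
    "predictor": "cpu_predictor",
    "eta": 0.2,
    "max_bin": 31,
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",
    "max_depth": 10,
}

# num_boost_round is passed to xgb.train directly rather than through params.
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtrain, "train")])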

@jeffdk commented Feb 28, 2019

> @RAMitchell looks like I'm getting an error in the CI from one of the standard MGPU tests.
>
> *** glibc detected *** /opt/python/bin/python: malloc(): memory corruption: 0x00007fac29861890 ***
>
> Have you seen this before?

FWIW: I get a similar error in the same test, though not at the same precise test parameters:

tests/python-gpu/test_gpu_linear.py Training on dataset: Boston
<about 25 sets of test parameters passed before this>
...
Training on dataset: Digits
Using parameters: {'n_gpus': -1, 'num_class': 10, 'eval_metric': 'merror', 'gpu_id': 1, 'top_k': 10, 'coordinate_selection': 'random', 'eta': 0.5, 'updater': 'gpu_coord_descent', 'objective': 'multi:softmax', 'alpha': 0.005, 'lambda': 0.005, 'tolerance': 1e-05, 'booster': 'gblinear'}
Segmentation fault

@mtjrider (Contributor, Author)

Noting here that I get the same error on XGBoost master.

@hcho3 (Collaborator) commented Feb 28, 2019

@mt-jones @RAMitchell Is there a URL that I can use to install NCCL2 for CUDA 10.1? The official download page requires a login, and the CI system can't answer the login prompt.

@rongou (Contributor) commented Feb 28, 2019

@hcho3

sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev

@hcho3 (Collaborator) commented Feb 28, 2019

@rongou We use CentOS NVIDIA Docker image to build XGBoost-GPU, so we can't use apt. Is there a tarball alternative?

@hcho3 (Collaborator) commented Feb 28, 2019

Currently, we run this script to install NCCL2 for CUDA 8.x and 9.x:

# NCCL2 (License: https://docs.nvidia.com/deeplearning/sdk/nccl-sla/index.html)
RUN \
  export CUDA_SHORT=`echo $CUDA_VERSION | egrep -o '[0-9]+\.[0-9]'` && \
  if [ "${CUDA_SHORT}" != "10.0" ]; then \
    wget https://developer.download.nvidia.com/compute/redist/nccl/v2.2/nccl_2.2.13-1%2Bcuda${CUDA_SHORT}_x86_64.txz && \
    tar xf "nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64.txz" && \
    cp nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64/include/nccl.h /usr/include && \
    cp nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64/lib/* /usr/lib && \
    rm -f nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64.txz && \
    rm -r nccl_2.2.13-1+cuda${CUDA_SHORT}_x86_64; \
  fi

I found out that it doesn't work for CUDA 10.0.

@mtjrider (Contributor, Author)

> @rongou We use CentOS NVIDIA Docker image to build XGBoost-GPU, so we can't use apt. Is there a tarball alternative?

If you navigate to the NCCL download page, you should be able to grab the network installer for CentOS 7, NCCL 2.4.2, CUDA 10.1.

Instructions:

sudo yum install libnccl-2.4.2-1+cuda10.1 libnccl-devel-2.4.2-1+cuda10.1 libnccl-static-2.4.2-1+cuda10.1

For the local installer CentOS7:
https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.4/prod/nccl-repo-rhel7-2.4.2-ga-cuda10.1-1-1.x86_64.rpm

OS agnostic:
https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.4/prod//nccl_2.4.2-1%2Bcuda10.1_x86_64.txz

@hcho3 (Collaborator) commented Feb 28, 2019

@mt-jones Awesome! Thanks a lot for the link. It should help me add a worker for CUDA 10.x.

@mtjrider (Contributor, Author)

@hcho3 it may also make sense to pull the source code from NCCL's GitHub page. It will always be up to date.

https://github.com/NVIDIA/nccl

The build process is pretty simple (for the OS agnostic solution):

$ make pkg.txz.build
$ ls build/pkg/txz/

@mtjrider (Contributor, Author)

@hcho3 I’m updating my response to the links question. Most of those will be behind a login wall.

To get the network RPM:
http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/nvidia-machine-learning-repo-rhel7-1.0.0-1.x86_64.rpm

RHEL7 package index here:
http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/

To install from network RPM:

sudo yum install libnccl-2.4.2-1+cuda10.1 libnccl-devel-2.4.2-1+cuda10.1 libnccl-static-2.4.2-1+cuda10.1

If you want a different CUDA+NCCL combo, just change the versions to match anything you see in the index I linked.

@RAMitchell (Member)

Current multi-GPU test failure is for the coordinate descent updater and seems unrelated to this PR. Let's see if it keeps happening if we rerun the tests.

#!/usr/bin/python
import xgboost as xgb
import time
from collections import OrderedDict
@mtjrider (Contributor, Author) commented on the hunk above:

This import statement may not be required given the correction to urllib3 version in the deployed Dockerfile.

@@ -35,7 +35,7 @@ ENV CPP=/opt/rh/devtoolset-2/root/usr/bin/cpp

 # Install Python packages
 RUN \
-    pip install numpy pytest scipy scikit-learn wheel
+    pip install numpy pytest scipy scikit-learn wheel kubernetes urllib3==1.22
@mtjrider (Contributor, Author) commented on the hunk above:

Necessary additions to fix unsatisfiable package sequence errors. Kubernetes is required for the dmlc_submit script.

@trivialfis (Member)

@RAMitchell

> Current multi-GPU test failure is for the coordinate descent updater and seems unrelated to this PR. Let's see if it keeps happening if we rerun the tests.

That might be related to #4194.

@RAMitchell RAMitchell merged commit 92b7577 into dmlc:master Mar 1, 2019
@mtjrider mtjrider deleted the mnmg branch March 5, 2019 20:37
@hcho3 hcho3 mentioned this pull request Mar 8, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019