[REVIEW] Enable Multi-Node Multi-GPU functionality #4095
Conversation
Thanks for the PR! In addition to my comments there are a bunch of mistakes from the merge. I will comment on these changes after you clean up the merge.
We also need to find a way to write tests for this that can run on our Jenkins multi-GPU system.
@RAMitchell @mt-jones Let me know if you need any assistance regarding the Jenkins environment.
- it now crashes on NCCL initialization, but at least we're attempting it properly
…ent across workers
- now the workers don't go down, but just hang - no more "wild" values of gradients - probably needs syncing in more places
- this improves performance _significantly_ (7x faster for overall training, 20x faster for xgboost proper)
OK, I believe I've removed all of the merge conflict remnants, etc. Thanks for your patience! Could you please re-review? Or apply your comments at the pertinent lines of code in the clean rebased version?
Looks like this PR doesn't build?
@rongou We have a merge conflict against the latest master.
Reminder to close #3499 after this is merged.
Pulling remote changes
I'm getting some SSL connection failures in some of the builds. Can you advise? Note: I've resolved conflicts with master, and properly updated the codebase so that it reflects the core changes we've made.
489e499 fixes test 3. I'm going to give this build a try on some of my large data sets and see how things go in the non-distributed but multi-GPU scenario.
@jeffdk As it happens, there was an issue where worker threads would not complete dumping their model before the file compare logic was entered. This caused a worker to compare a model file that didn't exist, or was only partially available. I added a sleep call after the model dump to equilibrate worker states.
@RAMitchell looks like I'm getting an error in the CI from one of the standard MGPU tests.
Have you seen this before?
@mt-jones Ahh, yeah, a race condition in dumping/reading the models from each worker. I just read the commit message, tried again, and figured it pertained to the failing test :) I was able to successfully train non-distributed multi-GPU on a 5-million-row data set, but running on a scaled-up version (~11M rows) results in nonsensical training:
Training completes and I have a model file. Hyperparameters:
Only difference between the working 5M-row run and this one is the data. Any thoughts on how I might debug this or gather more information on what might be going on?
FWIW: I get a similar error in the same test, though not at the same precise test parameters:
Noting here that I get the same error on XGBoost master.
@mt-jones @RAMitchell Is there a URL that I can use to install NCCL2 for CUDA 10.1? The official download page requires a login, and the CI system can't answer the login prompt.
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
@rongou We use a CentOS NVIDIA Docker image to build XGBoost-GPU, so we can't use apt.
Currently, we run this script to install NCCL2 for CUDA 8.x and 9.x: xgboost/tests/ci_build/Dockerfile.gpu, lines 20 to 29 (at 74009af).
I found out that it doesn't work for CUDA 10.0.
If you navigate to the NCCL download page, you should be able to grab the network installer for CentOS 7 (NCCL 2.4.2, CUDA 10.1) along with its instructions.
There are also links for the local installer for CentOS 7 and for the OS-agnostic package.
@mt-jones Awesome! Thanks a lot for the link. It should help me add a worker for CUDA 10.x.
@hcho3 It may also make sense to pull the source code from NCCL's GitHub page, which will always be up to date: https://github.com/NVIDIA/nccl. The build process is pretty simple (for the OS-agnostic solution):
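A minimal sketch of that from-source route, following the public NCCL README at the time; the CUDA_HOME value is an example, adjust it to your toolkit path:

git clone https://github.com/NVIDIA/nccl.git
cd nccl
# Build the library; point CUDA_HOME at the CUDA toolkit if it is not in /usr/local/cuda
make -j src.build CUDA_HOME=/usr/local/cuda
# Produce the relocatable tarball (the "OS agnostic" artifact)
make pkg.txz.build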
@hcho3 I'm updating my response to the links question. Most of those will be behind a login wall. To get the network RPM, use the link in the RHEL7 package index; installing from the network RPM is sketched below.
If you want a different CUDA+NCCL combo, just change the versions to match anything you see in the index I linked.
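A hedged sketch of the network-RPM route on CentOS 7; the repo URL and the versioned package names are assumptions modeled on NVIDIA's Ubuntu repo shown earlier, since the authoritative links sit behind the login wall:

wget http://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/nvidia-machine-learning-repo-rhel7-1.0.0-1.x86_64.rpm
sudo rpm -i nvidia-machine-learning-repo-rhel7-1.0.0-1.x86_64.rpm
sudo yum update
# Pin the NCCL/CUDA pairing you want (version strings assumed; check the package index)
sudo yum install libnccl-2.4.2-1+cuda10.1 libnccl-devel-2.4.2-1+cuda10.1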
Current multi-GPU test failure is for the coordinate descent updater and seems unrelated to this PR. Let's see if it keeps happening if we rerun the tests.
#!/usr/bin/python
import xgboost as xgb
import time
from collections import OrderedDict
This import statement may not be required given the correction to the urllib3 version in the deployed Dockerfile.
@@ -35,7 +35,7 @@ ENV CPP=/opt/rh/devtoolset-2/root/usr/bin/cpp

 # Install Python packages
 RUN \
-    pip install numpy pytest scipy scikit-learn wheel
+    pip install numpy pytest scipy scikit-learn wheel kubernetes urllib3==1.22
Necessary additions to fix unsatisfiable package sequence errors. Kubernetes is required for the dmlc_submit script.
That might be related to #4194.
To enable multi-node multi-GPU functionality, this PR adjusts dh::AllReducer::Init to query all rabit::GetRank()s for GPU device information (additional node stats). This permits NCCL to construct a unique ID for a communicator clique across many rabit workers. Exposing this unique ID also permits a custom process schema for each GPU in the comm clique (one process per GPU, one process for multiple GPUs, etc.).
In Dask, this would permit a user to construct a collection of XGBoost processes over an inhomogeneous device cluster. It would also permit resource management systems to assign a rank for each GPU in the cluster.
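A minimal sketch of that mechanism, assuming the standard NCCL and rabit C++ APIs; the function name InitNcclAcrossWorkers and the one-GPU-per-worker layout are illustrative, not the actual dh::AllReducer::Init code:

#include <cuda_runtime.h>
#include <nccl.h>
#include <rabit/rabit.h>

// Illustrative sketch: build one NCCL communicator per rabit worker,
// one GPU per worker, keyed by a clique-wide unique ID.
ncclComm_t InitNcclAcrossWorkers(int device_ordinal) {
  int const rank = rabit::GetRank();        // this worker's rank in the clique
  int const world = rabit::GetWorldSize();  // number of rabit workers

  // Rank 0 generates the clique-wide unique ID ...
  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);
  }
  // ... and rabit broadcasts it to every worker.
  rabit::Broadcast(&id, sizeof(id), 0);

  // Each worker attaches its GPU to the clique under its rabit rank.
  cudaSetDevice(device_ordinal);
  ncclComm_t comm;
  ncclCommInitRank(&comm, world, id, rank);
  return comm;
}

With the unique ID exposed this way, the same pattern extends to a single process driving several GPUs, which is the flexibility the description above refers to.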