
Enforce having more threads than GPUs. #4162

Closed
trivialfis opened this issue Feb 19, 2019 · 8 comments

@trivialfis
Member

Brought up in #4076. Having fewer threads than GPUs leads to undefined behavior. Another example is a hang I encountered while working on #3974. #4095 might be able to decouple the number of threads from the number of GPUs, but until then I want to make a last-minute fix: when the user specifies nthread < n_gpus, we either decrease n_gpus to nthread or increase nthread to n_gpus. @RAMitchell WDYT?
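
Roughly what I have in mind (a sketch only; the function and its parameters are hypothetical, not the actual configuration code):

#include <iostream>

// Hypothetical sketch of reconciling the two parameters; the real fix
// would live in XGBoost's parameter validation, not a free function.
void ReconcileThreadsAndGpus(int* nthread, int* n_gpus) {
  if (*nthread < *n_gpus) {
    // Option A: shrink the GPU count to the thread count.
    std::cerr << "nthread (" << *nthread << ") < n_gpus (" << *n_gpus
              << "); reducing n_gpus to " << *nthread << "\n";
    *n_gpus = *nthread;
    // Option B would instead raise nthread: *nthread = *n_gpus;
  }
}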

@mtjrider
Contributor

mtjrider commented Feb 19, 2019

In an effort to provide some additional insight:

Currently #4095 creates two modes of operation:

  1. An NCCL communicator clique where a single process manages a collection of GPUs
  2. An NCCL communicator clique where there is a single process (PID) for each GPU involved in the clique

(2) is triggered by rabit::GetWorldSize(): if, within a node, you assign an MPI rank per physical GPU, you create a process for each GPU.

(1) is triggered by calling an XGBoost process with OpenMP threads, rather than MPI ranks.

(2) is well suited for multi-node operation, whereas (1) avoids unnecessary complication on single-machine setups.

In short, it is possible to adjust the code so that the NCCL communicator clique creates multiple processes and performs a check against nthread < n_gpus, but that may conflict with functionality we likely want in (1).

@trivialfis
Member Author

@mt-jones Thanks for the insight.

it is possible to adjust the code so that the NCCL communicator clique creates multiple processes and performs a check against nthread < n_gpus

Are you saying that, in non-distributed mode, when the user specifies nthread < n_gpus, we fall back to a slightly more complicated setup that uses processes instead of threads, but at any other time we default to using threads?

@mtjrider
Contributor

mtjrider commented Feb 19, 2019

@mt-jones Thanks for the insight.

it is possible to adjust the code so that the NCCL communicator clique creates multiple processes and performs a check against nthread < n_gpus

Are you saying that, in non-distributed mode, when the user specifies nthread < n_gpus, we fall back to a slightly more complicated setup that uses processes instead of threads, but at any other time we default to using threads?

I'm saying that I've removed the flag for distributed mode and instead use rabit::GetWorldSize() to infer whether XGB is being executed in a distributed manner.

Effectively, a check is performed to see if rabit::GetWorldSize() == 1. If so, create an NCCL communicator clique with one process; else, create an NCCL communicator clique with multiple processes. In short, the same code currently in XGB is executed GPU-side for communicator initialization.

More or less:

  void Init(const std::vector<int> &device_ordinals) {
#ifdef XGBOOST_USE_NCCL
    this->device_ordinals = device_ordinals;
    comms.resize(device_ordinals.size());

    if (1 < rabit::GetWorldSize()) {
      // Distributed: each rabit worker joins the clique with its own rank.
      auto id = GetUniqueId();
      dh::safe_nccl(ncclCommInitRank(
          &(comms[0]),
          rabit::GetWorldSize(),
          id, rabit::GetRank()));
    } else {
      // Single process: initialize one communicator per local device.
      dh::safe_nccl(ncclCommInitAll(
          comms.data(),
          static_cast<int>(device_ordinals.size()),
          device_ordinals.data()));
    }
...
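
Here GetUniqueId() is assumed to create the NCCL unique id on the root rank and broadcast it through rabit so every worker joins the same clique; a minimal sketch of that helper, under that assumption:

// Assumed helper: rank 0 generates the NCCL id, rabit broadcasts it
// so all workers initialize against the same communicator clique.
ncclUniqueId GetUniqueId() {
  static const int kRootRank = 0;
  ncclUniqueId id;
  if (rabit::GetRank() == kRootRank) {
    dh::safe_nccl(ncclGetUniqueId(&id));
  }
  rabit::Broadcast(static_cast<void*>(&id), sizeof(ncclUniqueId), kRootRank);
  return id;
}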

@trivialfis
Member Author

@mt-jones Ah, I see. Will look into the details. Thanks!

@mtjrider
Contributor

@mt-jones Ah, I see. Will look into the details. Thanks!

No problem! Let me know if you have questions. My original statement was simply that we could do the same thing based on OpenMP thread rank, but it may cause conflicts with how XGB is currently executed.

@RAMitchell
Member

@trivialfis Thanks, yes, a last-minute fix would be great for 0.82. Another option is to fail with an error; I'll leave it up to you.

@mtjrider
Contributor

@trivialfis See the code below. NCCL requires (see the quote below) that ncclCommInitRank be either

  1. encapsulated by ncclGroupStart and ncclGroupEnd to unblock the internal synchronous call that initializes the rank, or
  2. called by a distinct thread/process to avoid the block.

One potential solution is to use OpenMP threads to initialize each rank, eliminating the for-loop construction and the calls to GroupStart() and GroupEnd().

Code snippet (1)

GroupStart();
for (size_t i = 0; i < device_ordinals.size(); i++) {
  int dev = device_ordinals[i];
  int ndevs = device_ordinals.size();
  int nccl_rank = rabit::GetRank() * ndevs + dev;
  int nccl_nranks = rabit::GetWorldSize() * ndevs;

  dh::safe_cuda(cudaSetDevice(dev));
  dh::safe_nccl(ncclCommInitRank(
      &(comms[i]),
      nccl_nranks, id,
      nccl_rank));
}
GroupEnd();

Code snippet (2)

#pragma omp parallel num_threads(device_ordinals.size()) // ***
{
  int tid = omp_get_thread_num();
  int dev = device_ordinals[tid];
  int ndevs = device_ordinals.size();

  int nccl_rank = rabit::GetRank() * ndevs + dev;
  int nccl_nranks = rabit::GetWorldSize() * ndevs;

  dh::safe_cuda(cudaSetDevice(dev));
  dh::safe_nccl(ncclCommInitRank(
      &(comms[tid]),
      nccl_nranks, id,
      nccl_rank));
}

At the line marked // ***, we could implement a check to avoid initialization and error out, or we could let num_threads override nthread to initialize the communicator clique (a sketch of such a check follows at the end of this comment).

I have tested the above in PR #4095, and it does work (all tests pass).

ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank)

Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, hence it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
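
The check at // *** could look something like this (illustrative only; the exact condition and error handling are up for discussion):

// Illustrative pre-check before the parallel region: fail fast when
// the OpenMP thread budget cannot cover one thread per device.
int const n_devices = static_cast<int>(device_ordinals.size());
if (omp_get_max_threads() < n_devices) {
  // Option A: error out with a clear message.
  LOG(FATAL) << "nthread must be >= the number of GPUs (" << n_devices
             << ") to initialize the NCCL communicator clique.";
  // Option B: let num_threads(device_ordinals.size()) on the pragma
  // override nthread and use one thread per device regardless.
}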

@trivialfis
Member Author

I dropped the idea of manipulating OpenMP threads. The GPU threads must be decoupled from the nthread parameter, so I will try std::thread. Please give me some time to learn its implications.
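
Roughly, the idea would be to spawn one std::thread per device for the blocking ncclCommInitRank call, independent of nthread (a sketch only, reusing the names from the snippets above):

#include <thread>
#include <vector>

// Sketch: one std::thread per device performs the blocking
// ncclCommInitRank, so initialization no longer touches the OpenMP
// thread pool or the nthread parameter.
void InitComms(const std::vector<int>& device_ordinals,
               std::vector<ncclComm_t>* comms, ncclUniqueId id) {
  int const ndevs = static_cast<int>(device_ordinals.size());
  std::vector<std::thread> workers;
  for (int i = 0; i < ndevs; ++i) {
    workers.emplace_back([&, i]() {
      int nccl_rank = rabit::GetRank() * ndevs + device_ordinals[i];
      int nccl_nranks = rabit::GetWorldSize() * ndevs;
      dh::safe_cuda(cudaSetDevice(device_ordinals[i]));
      dh::safe_nccl(ncclCommInitRank(&(*comms)[i], nccl_nranks, id, nccl_rank));
    });
  }
  for (auto& t : workers) { t.join(); }
}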

@lock bot locked as resolved and limited conversation to collaborators Sep 18, 2019