Enforce having more threads than GPUs. #4162
In an effort to provide some additional insight: Currently #4095 creates two modes of operation:
(2) is triggered by (1) is triggered by calling an XGBoost process with OpenMP threads, rather than MPI ranks. (2) is well suited for multi-node operation, whereas (1) avoids unnecessary complication on single-machine setups. In short, it is possible to adjust the code so that multiple processes are created in the NCCL communicator clique which performs a check against
@mt-jones Thanks for the insight.
Are you saying that, in non-distributed mode, when the user specifies
I'm saying that I've removed the flag for Effectively, a check is performed to see if More or less:
@mt-jones Ah, I see. Will look into the details. Thanks!
No problem! Let me know if you have questions. My original statement was simply that we could do the same thing based on OpenMP thread rank, but it may cause conflicts with how XGB is currently executed.
@trivialfis Thanks. Yes, a last-minute fix would be great for 0.82. Another option is to fail with an error. I will leave it up to you.
@trivialfis See the code below. NCCL requires (see below for quote) that
One potential solution is to use OpenMP threads to initialize each rank, eliminating the Code snippet (1)
Code snippet (2)
At line I have tested the above in PR #4095, and it does work (all tests pass).
I dropped the idea of manipulating OpenMP threads. The GPU threads must be decoupled from the `nthread` parameter, so I will try `std::thread`. Please give me some time to learn its implications.
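To make the decoupling idea concrete, here is a minimal sketch in Python rather than C++: `threading.Thread` stands in for `std::thread`, and the names `run_per_device` and `work` are hypothetical, not part of XGBoost. It only illustrates launching exactly one worker per device, independent of whatever the CPU thread setting is; no real GPU call is made.

```python
import threading


def run_per_device(n_gpus, work):
    """Run work(device_id) on one dedicated thread per device.

    The number of threads depends only on n_gpus, never on the CPU
    thread setting. Both names are hypothetical, for illustration only.
    """
    results = [None] * n_gpus

    def runner(device_id):
        # Each thread owns exactly one device and writes to its own slot.
        results[device_id] = work(device_id)

    threads = [threading.Thread(target=runner, args=(d,)) for d in range(n_gpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each device gets its own thread, a per-device blocking call would not depend on how many CPU worker threads the user configured.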
Brought up in #4076. Having fewer threads than GPUs leads to undefined behavior. Another example is a hang I encountered when working on #3974. #4095 might be able to decouple the number of threads from the number of GPUs. But until then, I want to make a last-minute fix so that when the user specifies `nthread < n_gpus`, we either decrease `n_gpus` to `nthread` or increase `nthread` to `n_gpus`. @RAMitchell WDYT?
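The options discussed in this thread (lower the GPU count, raise the thread count, or fail with an error) can be sketched as a small check. This is an illustration in Python, not the actual XGBoost change: the function `reconcile` and the policy names are hypothetical, and it assumes both values have already been resolved to concrete counts.

```python
from typing import Tuple


def reconcile(nthread: int, n_gpus: int, policy: str = "raise_threads") -> Tuple[int, int]:
    """Return an (nthread, n_gpus) pair that satisfies nthread >= n_gpus.

    Hypothetical helper; the policy names are made up for this sketch.
    """
    if nthread < 1 or n_gpus < 0:
        raise ValueError("expected nthread >= 1 and n_gpus >= 0")
    if nthread >= n_gpus:
        # Already enough threads: nothing to adjust.
        return nthread, n_gpus
    if policy == "lower_gpus":
        # Use only as many GPUs as there are threads.
        return nthread, nthread
    if policy == "raise_threads":
        # Bump the thread count up to the GPU count.
        return n_gpus, n_gpus
    if policy == "error":
        raise ValueError(
            "nthread (%d) must be at least n_gpus (%d)" % (nthread, n_gpus)
        )
    raise ValueError("unknown policy: %r" % (policy,))
```

For example, `reconcile(2, 4, "lower_gpus")` gives `(2, 2)`, while `reconcile(2, 4, "raise_threads")` gives `(4, 4)`. Silently adjusting is convenient, but the error policy makes the mismatch visible to the user.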