Marian hangs on multiple GPUs, GPU utilization remains at constant 100% #597
Comments
Likely a NCCL bug, not Marian, unfortunately. Can you share the complete log, please?
Here
Hm, I am wondering if that is an instance of the integer overflow problem we have been seeing in other places. Can you try reducing the workspace by half for testing?
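For concreteness, Marian's workspace size is controlled by its -w/--workspace option (in megabytes), so a halved test run might look roughly like the sketch below; the config path, device list, and sizes are placeholders rather than values from this issue:

```sh
# Sketch only: same training command as before, but with the workspace halved.
# Replace 12000/6000 with whatever value the failing run actually used.
./build/marian -c config.yml --devices 0 1 2 3 -w 6000   # previously -w 12000
```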
I have tried running it with
Yeah, pretty sure it's a NCCL bug, you can try
This is very similar to the problem we had with Piz Daint, where NCCL wouldn't scale to more than 2 nodes.
What's "Piz Daint"?
I opened a bug with NCCL for that. We cannot really find a minimal case where it happens. Switching to a newer NCCL version doesn't fix it; it actually makes it worse. I'm a bit at my wits' end here.
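As a general aid for pinning down NCCL-side hangs (not something taken from this thread), NCCL's standard environment variables can be used to get more logging out of a stalled run and to toggle the peer-to-peer path; the Marian command line here is only a placeholder:

```sh
# Sketch: rerun the hanging job with NCCL debug output enabled.
export NCCL_DEBUG=INFO             # print NCCL init and collective setup details
export NCCL_DEBUG_SUBSYS=INIT,P2P  # limit logging to init and peer-to-peer paths
./build/marian -c config.yml --devices 0 1 2 3

# If the hang disappears with peer-to-peer transfers disabled, that points at a
# P2P/topology problem rather than at Marian itself.
NCCL_P2P_DISABLE=1 ./build/marian -c config.yml --devices 0 1 2 3
```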
That has been a problem for a long time now.
I'll try that, but it might not happen any time soon because the GPUs are currently in active use in the virtualised setup.
Closing this due to inactivity. Feel free to re-open when you have new insights.
Original issue report: I tried to run Marian on a new system. When running on a single GPU everything is fine, but when using 2 or more GPUs the training process stalls during or after the memory reservation phase and the GPU utilization remains stuck at 100% indefinitely. This happens on the current master; when I try to reproduce the issue on another system with 2 GPUs, everything is fine.

To investigate, I compiled Marian in debug mode, ran it with gdb, and paused the execution during the stall. Here's the stacktrace I got. The GPUs remain stuck at 100% even after I had paused Marian in gdb. It seems that the CPU thread is waiting to synchronize with GPUs which have entered an infinite loop or something.

Another interesting detail is that the above issue was encountered on a bare-metal setup (running Ubuntu), but everything was fine on the same hardware when running in an Ubuntu virtual machine (guest) with PCIe passthrough for the GPUs on a Windows Server host. The Nvidia driver/CUDA versions were the same on both systems.