
Marian hangs on multiple GPUs, GPU utilization remains at constant 100% #597

Closed
rihardsk opened this issue Feb 7, 2020 · 13 comments

@rihardsk
Contributor

rihardsk commented Feb 7, 2020

I tried to run Marian on a new system. When running on a single GPU, everything's fine, but when using 2 or more GPUs, the training process stalls during or after the memory reservation phase and the GPU utilization remains stuck at 100% indefinitely. This happens on the current master. When I try to reproduce the issue on another system with 2 GPUs, everything's fine.

To investigate, I compiled Marian in debug mode, ran it with gdb, and paused the execution during the stall. Here's the stack trace I got:

[2020-02-07 14:47:54] [comm] NCCLCommunicator constructed successfully.
[2020-02-07 14:47:54] [training] Using 2 GPUs
[Thread 0x7fffb07fc700 (LWP 24955) exited]
[2020-02-07 14:47:54] Training started
[2020-02-07 14:47:54] [data] Shuffling data
[2020-02-07 14:48:04] [data] Done reading 12469417 sentences
[2020-02-07 14:49:07] [data] Done shuffling 12469417 sentences to temp files
[2020-02-07 14:49:12] [training] Batches are processed as 1 process(es) x 2 devices/process
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu3
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu2
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu3
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu2
[2020-02-07 14:49:13] [memory] Reserving 108 MB, device gpu2
[2020-02-07 14:49:13] [memory] Reserving 108 MB, device gpu3
^C
Thread 1 "marian" received signal SIGINT, Interrupt.
0x00007ffff7ffab62 in clock_gettime ()
(gdb) where
#0  0x00007ffff7ffab62 in clock_gettime ()
#1  0x00007fffe73c1ea6 in __GI___clock_gettime (clock_id=4, tp=0x7fffffffc620)
    at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007fffe443e37e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffe45024f7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffe44264ac in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffe4426609 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffe432616d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fffe44ad739 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00005555577a19e0 in cudart::cudaApiStreamSynchronize(CUstream_st*) ()
#9  0x00005555577dcaad in cudaStreamSynchronize ()
#10 0x000055555624cac2 in marian::NCCLCommunicator::synchronizeAll (this=0x55555f687660)
    at /home/rihards/src/marian-dev/src/training/communicator_nccl.h:40
#11 0x00005555562530e4 in marian::NCCLCommunicator::scatterReduceAndResetGrads (this=0x55555f687660)
    at /home/rihards/src/marian-dev/src/training/communicator_nccl.h:239
#12 0x00005555561e30ea in marian::SyncGraphGroup::update (this=0x55556064c140,
    subBatches=std::vector of length 16, capacity 16 = {...}, numReadBatches=1)
    at /home/rihards/src/marian-dev/src/training/graph_group_sync.cpp:399
#13 0x00005555561e1d88 in marian::SyncGraphGroup::update (this=0x55556064c140,
    newBatch=std::shared_ptr<marian::data::Batch> (use count 2, weak count 0) = {...})
    at /home/rihards/src/marian-dev/src/training/graph_group_sync.cpp:286
#14 0x0000555555d3ec01 in marian::Train<marian::SyncGraphGroup>::run (this=0x55555f543360)
    at /home/rihards/src/marian-dev/src/training/training.h:85
#15 0x0000555555cc9a6b in mainTrainer (argc=93, argv=0x7fffffffd308)
    at /home/rihards/src/marian-dev/src/command/marian_train.cpp:50
#16 0x0000555555ccc142 in main (argc=93, argv=0x7fffffffd308)
    at /home/rihards/src/marian-dev/src/command/marian_main.cpp:52
(gdb) 

The GPUs remain stuck at 100% even after I pause Marian in gdb. It seems that the CPU thread is waiting to synchronize with GPUs that have entered an infinite loop or something similar.

Every 1.0s: nvidia-smi                                                                                                                                                           burbulis: Fri Feb  7 15:27:30 2020

Fri Feb  7 15:27:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   28C    P8     3W / 260W |     12MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:1B:00.0 Off |                  Off |
| 33%   25C    P8    13W / 260W |     12MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:3D:00.0 Off |                  Off |
| 33%   44C    P2    97W / 260W |  22041MiB / 24220MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:3E:00.0 Off |                  Off |
| 33%   52C    P2   109W / 260W |  22041MiB / 24220MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
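
For context, frames #7–#10 of the backtrace show the main thread blocked in cudaStreamSynchronize inside NCCLCommunicator::synchronizeAll. As a rough illustration (a hypothetical sketch, not the actual Marian source), that kind of per-device synchronization loop looks something like the following; if a kernel or NCCL collective queued on one of the streams never completes, the CPU thread blocks forever while the GPU keeps reporting 100% utilization:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch of a per-device synchronization loop along the lines of
// NCCLCommunicator::synchronizeAll (frame #10 above). cudaStreamSynchronize
// blocks the calling CPU thread until all work queued on the given stream has
// finished; if that work never finishes, the thread waits indefinitely.
void synchronizeAllSketch(const std::vector<int>& deviceIds,
                          const std::vector<cudaStream_t>& streams) {
  for(size_t i = 0; i < deviceIds.size(); ++i) {
    cudaSetDevice(deviceIds[i]);        // make device i current
    cudaStreamSynchronize(streams[i]);  // block until the stream drains
  }
}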

Another interesting detail is that the above issue was encountered on a bare-metal setup (running Ubuntu), but everything was fine on the same hardware when running in an Ubuntu virtual machine (guest) with PCIe passthrough for the GPUs on a Windows Server host. The Nvidia driver and CUDA versions were the same on both systems.

@emjotde
Member

emjotde commented Feb 7, 2020

Likely an NCCL bug, not Marian, unfortunately. Can you share the complete log, please?

@rihardsk
Contributor Author

Here
train.log

@emjotde
Member

emjotde commented Feb 11, 2020

Hm, I am wondering if that is an instance of the integer overflow problem we have been seeing in other places. Can you try to reduce workspace by half for testing?
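
To illustrate the overflow concern with purely hypothetical numbers (not taken from Marian's actual code): a workspace size converted to a byte count stops fitting into a signed 32-bit integer as soon as it exceeds 2047 MiB, so a multi-GB workspace could silently wrap around if it ever passes through a 32-bit type:

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical illustration: a workspace of 9000 MiB (the value tried below)
  // is about 9.4e9 bytes, well beyond INT_MAX (about 2.1e9).
  int64_t workspaceMiB = 9000;
  int64_t bytes = workspaceMiB * 1024 * 1024;  // computed in 64 bits, no overflow here
  std::printf("bytes = %lld, INT_MAX = %d, exceeds 32-bit range: %s\n",
              (long long)bytes, INT_MAX, bytes > INT_MAX ? "yes" : "no");
  return 0;
}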

@rihardsk
Contributor Author

I have tried running it with --workspace 9000; the result was the same. Besides, in the VM scenario that I described above, I used the same setup as in the log file, and it worked fine there.

@emjotde
Member

emjotde commented Feb 11, 2020

Yeah, pretty sure it's an NCCL bug. You can try --no-nccl; you will lose some speed, though.

@kpu
Member

kpu commented Feb 11, 2020

This is very similar to the problem we had with Piz Daint where NCCL wouldn't scale to more than 2 nodes.

@emjotde
Member

emjotde commented Feb 11, 2020

What's "Piz Daint"?

@kpu
Member

kpu commented Feb 11, 2020

#451

@kpu
Member

kpu commented Feb 11, 2020

@emjotde
Member

emjotde commented Feb 11, 2020

I opened a bug with NCCL for that. We cannot really find a minimal case where it happens. Switching to a newer NCCL version doesn't fix it; it actually makes things worse. I'm a bit at my wits' end here.

@emjotde
Member

emjotde commented Feb 11, 2020

That has been a problem for a long time now.

@rihardsk
Contributor Author

> Yeah, pretty sure it's an NCCL bug. You can try --no-nccl; you will lose some speed, though.

I'll try that, but it might not happen any time soon because the GPUs are currently being actively used in the virtualised setup.

@emjotde
Member

emjotde commented Apr 13, 2020

Closing this due to inactivity. Feel free to re-open when you have new insights.

emjotde closed this as completed Apr 13, 2020