
Marian hangs on multiple GPUs, GPU utilization remains at constant 100% #597

Closed
rihardsk opened this issue Feb 7, 2020 · 13 comments

@rihardsk
Contributor

rihardsk commented Feb 7, 2020

I tried to run Marian on a new system. When running on a single GPU, everything's fine, but when using 2 or more GPUs, the training process stalls during or after the memory reservation phase and the GPU utilization remains stuck at 100% indefinitely. This happens on the current master. When I try to reproduce the issue on another system with 2 GPUs, everything's fine.

To investigate, I compiled Marian in debug mode, ran it with gdb, and paused the execution during the stall. Here's the stack trace I got:

[2020-02-07 14:47:54] [comm] NCCLCommunicator constructed successfully.
[2020-02-07 14:47:54] [training] Using 2 GPUs
[Thread 0x7fffb07fc700 (LWP 24955) exited]
[2020-02-07 14:47:54] Training started
[2020-02-07 14:47:54] [data] Shuffling data
[2020-02-07 14:48:04] [data] Done reading 12469417 sentences
[2020-02-07 14:49:07] [data] Done shuffling 12469417 sentences to temp files
[2020-02-07 14:49:12] [training] Batches are processed as 1 process(es) x 2 devices/process
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu3
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu2
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu3
[2020-02-07 14:49:12] [memory] Reserving 217 MB, device gpu2
[2020-02-07 14:49:13] [memory] Reserving 108 MB, device gpu2
[2020-02-07 14:49:13] [memory] Reserving 108 MB, device gpu3
^C
Thread 1 "marian" received signal SIGINT, Interrupt.
0x00007ffff7ffab62 in clock_gettime ()
(gdb) where
#0  0x00007ffff7ffab62 in clock_gettime ()
#1  0x00007fffe73c1ea6 in __GI___clock_gettime (clock_id=4, tp=0x7fffffffc620)
    at ../sysdeps/unix/clock_gettime.c:115
#2  0x00007fffe443e37e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffe45024f7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffe44264ac in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffe4426609 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffe432616d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fffe44ad739 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00005555577a19e0 in cudart::cudaApiStreamSynchronize(CUstream_st*) ()
#9  0x00005555577dcaad in cudaStreamSynchronize ()
#10 0x000055555624cac2 in marian::NCCLCommunicator::synchronizeAll (this=0x55555f687660)
    at /home/rihards/src/marian-dev/src/training/communicator_nccl.h:40
#11 0x00005555562530e4 in marian::NCCLCommunicator::scatterReduceAndResetGrads (this=0x55555f687660)
    at /home/rihards/src/marian-dev/src/training/communicator_nccl.h:239
#12 0x00005555561e30ea in marian::SyncGraphGroup::update (this=0x55556064c140,
    subBatches=std::vector of length 16, capacity 16 = {...}, numReadBatches=1)
    at /home/rihards/src/marian-dev/src/training/graph_group_sync.cpp:399
#13 0x00005555561e1d88 in marian::SyncGraphGroup::update (this=0x55556064c140,
    newBatch=std::shared_ptr<marian::data::Batch> (use count 2, weak count 0) = {...})
    at /home/rihards/src/marian-dev/src/training/graph_group_sync.cpp:286
#14 0x0000555555d3ec01 in marian::Train<marian::SyncGraphGroup>::run (this=0x55555f543360)
    at /home/rihards/src/marian-dev/src/training/training.h:85
#15 0x0000555555cc9a6b in mainTrainer (argc=93, argv=0x7fffffffd308)
    at /home/rihards/src/marian-dev/src/command/marian_train.cpp:50
#16 0x0000555555ccc142 in main (argc=93, argv=0x7fffffffd308)
    at /home/rihards/src/marian-dev/src/command/marian_main.cpp:52
(gdb) 

The GPUs remain stuck at 100% even after I pause Marian in gdb. It seems that the CPU thread is waiting to synchronize with GPUs that have entered an infinite loop or something similar.

Every 1.0s: nvidia-smi                                                                                                                                                           burbulis: Fri Feb  7 15:27:30 2020

Fri Feb  7 15:27:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   28C    P8     3W / 260W |     12MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:1B:00.0 Off |                  Off |
| 33%   25C    P8    13W / 260W |     12MiB / 24220MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:3D:00.0 Off |                  Off |
| 33%   44C    P2    97W / 260W |  22041MiB / 24220MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     On   | 00000000:3E:00.0 Off |                  Off |
| 33%   52C    P2   109W / 260W |  22041MiB / 24220MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
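
For context, frames #7–#10 of the backtrace show the main thread blocked in cudaStreamSynchronize inside NCCLCommunicator::synchronizeAll. As a rough illustration (a hypothetical sketch, not the actual Marian source), that kind of per-device synchronization loop looks something like the following; if a kernel or NCCL collective queued on one of the streams never completes, the CPU thread blocks forever while the GPU keeps reporting 100% utilization:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch of a per-device synchronization loop along the lines of
// NCCLCommunicator::synchronizeAll (frame #10 above). cudaStreamSynchronize
// blocks the calling CPU thread until all work queued on the given stream has
// finished; if that work never finishes, the thread waits indefinitely.
void synchronizeAllSketch(const std::vector<int>& deviceIds,
                          const std::vector<cudaStream_t>& streams) {
  for(size_t i = 0; i < deviceIds.size(); ++i) {
    cudaSetDevice(deviceIds[i]);        // make device i current
    cudaStreamSynchronize(streams[i]);  // block until the stream drains
  }
}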

Another interesting detail is that the above issue was encountered on a bare-metal setup (running Ubuntu), but everything was fine on the same hardware when running in an Ubuntu virtual machine (guest) with PCIe passthrough for the GPUs on a Windows Server host. The Nvidia driver and CUDA versions were the same on both systems.

@emjotde
Member

emjotde commented Feb 7, 2020

Likely an NCCL bug, not Marian, unfortunately. Can you share the complete log, please?

@rihardsk
Contributor Author

Here
train.log

@emjotde
Member

emjotde commented Feb 11, 2020

Hm, I am wondering if that is an instance of the integer overflow problem we have been seeing in other places. Can you try to reduce workspace by half for testing?
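
To illustrate the overflow concern with purely hypothetical numbers (not taken from Marian's actual code): a workspace size converted to a byte count stops fitting into a signed 32-bit integer as soon as it exceeds 2047 MiB, so a multi-GB workspace could silently wrap around if it ever passes through a 32-bit type:

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical illustration: a workspace of 9000 MiB (the value tried below)
  // is about 9.4e9 bytes, well beyond INT_MAX (about 2.1e9).
  int64_t workspaceMiB = 9000;
  int64_t bytes = workspaceMiB * 1024 * 1024;  // computed in 64 bits, no overflow here
  std::printf("bytes = %lld, INT_MAX = %d, exceeds 32-bit range: %s\n",
              (long long)bytes, INT_MAX, bytes > INT_MAX ? "yes" : "no");
  return 0;
}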

@rihardsk
Contributor Author

I have tried running it with --workspace 9000; the result was the same. Besides, in the VM scenario that I described above, I used the same setup as in the log file, and it worked fine there.

@emjotde
Member

emjotde commented Feb 11, 2020

Yeah, pretty sure it's an NCCL bug. You can try --no-nccl; you will lose some speed, though.

@kpu
Member

kpu commented Feb 11, 2020

This is very similar to the problem we had with Piz Daint where NCCL wouldn't scale to more than 2 nodes.

@emjotde
Member

emjotde commented Feb 11, 2020

What's "Piz Daint"?

@kpu
Member

kpu commented Feb 11, 2020

#451

@kpu
Member

kpu commented Feb 11, 2020

@emjotde
Member

emjotde commented Feb 11, 2020

I opened a bug with NCCL for that. We cannot really find a minimal case where it happens. Switching to a newer NCCL version doesn't fix it; it actually makes things worse. I'm a bit at my wits' end here.

@emjotde
Member

emjotde commented Feb 11, 2020

That has been a problem for a long time now.

@rihardsk
Contributor Author

> Yeah, pretty sure it's an NCCL bug. You can try --no-nccl; you will lose some speed, though.

I'll try that, but it might not happen any time soon because the GPUs are currently being actively used in the virtualised setup.

@emjotde
Member

emjotde commented Apr 13, 2020

Closing this due to inactivity. Feel free to re-open when you have new insights.

emjotde closed this as completed Apr 13, 2020