DDP error after upgrading to v1.0.1 #4171
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)
🐛 Bug
I'll start by saying that before upgrading to v1.0.1, I used v0.9.0 with no apparent DDP issues.
After upgrading to v1.0.1, I'm having trouble training on multiple GPUs with the DDP backend.
Initiating multi-GPU training in either of the following scenarios results in an error:
The first GPU (i.e. ID 0) is not included in the GPU list. For example:
python main.py --distributed_backend 'ddp' --gpus 1,2,3
The GPU list is not sequential. For example:
python main.py --distributed_backend 'ddp' --gpus 0,2,3
Both of the above result in the following error message:
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp:59
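For what it's worth, the error itself looks consistent with a raw (global) GPU ID being passed as a CUDA device ordinal after CUDA_VISIBLE_DEVICES has already restricted and re-indexed the visible devices. A minimal sketch that reproduces the same error outside Lightning (my own reproduction attempt, assuming a machine where physical GPUs 1-3 exist; this is not Lightning's actual code):

# Once CUDA_VISIBLE_DEVICES masks devices, CUDA re-indexes the visible
# ones from 0, so a global ID like 3 becomes an out-of-range ordinal.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # set before CUDA is initialized

import torch

torch.cuda.set_device(0)  # OK: ordinal 0 now maps to physical GPU 1
torch.cuda.set_device(3)  # RuntimeError: invalid device ordinal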
Initiating multi-GPU training in the following scenarios works as expected:
python main.py --distributed_backend 'ddp' --gpus 2
or
python main.py --distributed_backend 'ddp' --gpus 0,1,2
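Until this is resolved, a possible workaround (my assumption, not a confirmed fix) is to mask the devices via CUDA_VISIBLE_DEVICES at launch, so that Lightning only ever sees a zero-based, sequential device list:

CUDA_VISIBLE_DEVICES=1,2,3 python main.py --distributed_backend 'ddp' --gpus 3

Here --gpus 3 requests three GPUs, and the environment variable maps ordinals 0-2 onto physical GPUs 1-3.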
Environment
- CUDA:
    - GPU:
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
    - available: True
    - version: 10.2
- numpy: 1.19.1
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 1.0.1
- tqdm: 4.50.2
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.8
- version: #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020