DDP error after upgrading to v1.0.1 #4171
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)
🐛 Bug
I'll start by saying that before upgrading to v1.0.1, I used v0.9.0 with no apparent DDP issues.
After upgrading to v1.0.1, I'm having trouble training on multiple GPUs with the DDP backend.
Initiating multi-GPU training in either of the following scenarios results in an error:
The first GPU (i.e. ID 0) is not included in the GPU list. For example:
python main.py --distributed_backend 'ddp' --gpus 1,2,3
The GPU list is not sequential. For example:
python main.py --distributed_backend 'ddp' --gpus 0,2,3
Both of the above result in the following error message:
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp:59
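For what it's worth, the error itself looks consistent with a raw (global) GPU ID being passed as a CUDA device ordinal after CUDA_VISIBLE_DEVICES has already restricted and re-indexed the visible devices. A minimal sketch that reproduces the same error outside Lightning (my own reproduction attempt, assuming a machine where physical GPUs 1-3 exist; this is not Lightning's actual code):

# Once CUDA_VISIBLE_DEVICES masks devices, CUDA re-indexes the visible
# ones from 0, so a global ID like 3 becomes an out-of-range ordinal.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # set before CUDA is initialized

import torch

torch.cuda.set_device(0)  # OK: ordinal 0 now maps to physical GPU 1
torch.cuda.set_device(3)  # RuntimeError: invalid device ordinal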
Initiating multi-GPU training in the following scenarios works as expected:
python main.py --distributed_backend 'ddp' --gpus 2
or
python main.py --distributed_backend 'ddp' --gpus 0,1,2
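Until this is resolved, a possible workaround (my assumption, not a confirmed fix) is to mask the devices via CUDA_VISIBLE_DEVICES at launch, so that Lightning only ever sees a zero-based, sequential device list:

CUDA_VISIBLE_DEVICES=1,2,3 python main.py --distributed_backend 'ddp' --gpus 3

Here --gpus 3 requests three GPUs, and the environment variable maps ordinals 0-2 onto physical GPUs 1-3.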
Environment
- CUDA:
    - GPU:
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
        - GeForce GTX 1080 Ti
    - available: True
    - version: 10.2
- numpy: 1.19.1
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 1.0.1
- tqdm: 4.50.2
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.8
- version: #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020