trainer.fit() stuck with accelerator set to "ddp" #5961
Comments
I answered in the discussion post about the usage of ddp in a Jupyter environment.
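For context, `accelerator="ddp"` re-launches the training script in subprocesses, which a Jupyter kernel cannot do, while `ddp_spawn` starts its workers via `torch.multiprocessing` and is the option usually suggested for notebooks in the 1.1.x API. A minimal sketch (the `gpus` value is only illustrative):

```python
import pytorch_lightning as pl

# "ddp" relaunches the script in new processes, which fails inside a notebook;
# "ddp_spawn" spawns workers with torch.multiprocessing and works interactively.
# gpus=4 is an illustrative value, not taken from the report.
trainer = pl.Trainer(gpus=4, accelerator="ddp_spawn", max_epochs=1)
```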
The boring model runs fine for multi-GPU (current code from this repo); I can confirm that on master and also with your version 1.1.8. Please post your full boring model as one script so that I can run it.
The only modification I made is just the one in the "Define the test" section, as described in the issue.
Have you tried pytorch-lightning 1.1.6? For me, after 1.1.7, ddp training is stuck. I also wonder what the cause is.
Dear @ifsheldon,
Best,
I have had this kind of issue (to note, I'm working in a terminal on a server, so I'm not in a notebook). The training just got stuck after two epochs when using ddp. I tried a couple of things that didn't work, including reducing the number of workers in the data loader. This only happened when I used 3 GPUs; with two GPUs it didn't happen. CUDA version is 10.2. I also tried on another server, and the same issue repeated itself.
I have tried uninstalling 1.1.7 and installing 1.1.6, and it worked without any issue!
Hi @IhabBendidi, can you check if #5604 (comment) fixes your problem? That might be a more appropriate issue than this one.
That solved it, thanks!
🐛 Bug
The problem is that `trainer.fit()` with `accelerator` set to `ddp` takes an extremely long time before it gets the CPUs and GPUs working, and I cannot interrupt the kernel but have to restart it.

Please reproduce using the BoringModel
To Reproduce
I tried the Boring Model, and I can reproduce the issue.
The only modification I made is in the "Define the test" section. The code is below.
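A minimal, self-contained sketch of the modified test as one script; the model and dataset mirror the BoringModel notebook, and `gpus=4` plus the other Trainer arguments are assumptions based on the 4x V100 setup described under Environment:

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random data, as in the BoringModel notebook."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def test_x(tmpdir):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    # The only change from the notebook: request multiple GPUs with ddp.
    # gpus=4 is an assumption based on the 4x V100 allocation described below.
    trainer = pl.Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        gpus=4,
        accelerator="ddp",
    )
    trainer.fit(model, train_data, val_data)


if __name__ == "__main__":
    # The guard matters for ddp, which re-executes this script in subprocesses.
    test_x(".")
```

Running it as a standalone script (rather than inside a notebook) is what `ddp` expects, since it re-launches the script for each process.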
And my code that initially encountered this issue is in the discussion post.
Expected behavior
The expected behavior is that the training should start within a couple of minutes, but `trainer.fit()` is stuck while the GPUs and CPUs stay idle.

Environment
My environment is below, as detected by the official Python script. I run my code on a shared GPU cluster after I apply for computation resources; I usually apply for 512 GB of memory, 32 cores, and 4 V100s. The environment is managed by my personal `conda`, without touching others' environments. If you want to know more about the configuration, just let me know.

Additional context
If I change the code above in the boring model to use `accelerator='dp'` instead (a sketch follows below), the trainer "works" as expected. The trainer with `accelerator='dp'` takes less than 1 min to get everything set up and keeps the CPUs and GPUs busy, while the one with `accelerator='ddp'` takes 10 min or more and does not successfully get things running before I lose my patience.

By "works" I mean it can get the GPUs running, but later a runtime error is thrown. I think that will be another issue, which may be that the code in the boring model notebook is not runnable in a multi-GPU environment. However, I don't know what the cause is, since I am just transitioning from ordinary PyTorch to pytorch-lightning, and the code in the notebook looks reasonably good to me.
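For reference, a sketch of the `dp` variant mentioned above; it differs from the `ddp` script only in the accelerator argument (again, `gpus=4` is an assumption):

```python
import pytorch_lightning as pl

# Same setup as the ddp sketch in "To Reproduce"; only the accelerator changes.
# DataParallel ("dp") runs in a single process, so no extra workers are launched.
trainer = pl.Trainer(
    max_epochs=1,
    gpus=4,            # assumed value, matching the 4x V100 allocation above
    accelerator="dp",
)
# trainer.fit(model, train_data, val_data)  # model/dataloaders as in the sketch above
```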