Early stopping fails on horovod with cannot unpack non-iterable NoneType object #3381
Comments
I've printed the log dict from inside the
It indeed looks like every worker is doing checkpointing on its own loss value: once it is called from early stopping, once from checkpointing, which I did not even ask for. So far I like Lightning, and would really like to express my appreciation to the developers |
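As an aside, this kind of per-rank inspection can be done with a small custom callback; a minimal sketch is below (the class name `PrintMetricsPerRank` is hypothetical, and the sketch assumes Horovod has already been initialized, e.g. by running under `Trainer(distributed_backend='horovod')`):

```python
# Hypothetical debugging sketch (not the exact code from this thread): print the
# metrics each Horovod rank sees after validation, to check whether every worker
# is monitoring its own local loss value.
import horovod.torch as hvd
import pytorch_lightning as pl


class PrintMetricsPerRank(pl.Callback):
    def on_validation_end(self, trainer, pl_module):
        # callback_metrics is the dict that EarlyStopping and ModelCheckpoint read.
        print(f"rank {hvd.rank()}: {trainer.callback_metrics}")
```

Passing `callbacks=[PrintMetricsPerRank()]` to the `Trainer` then shows, per rank, which value the callbacks are about to act on.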
@tgaddair mind taking a look? |
Sure, let me take a look. |
@tgaddair any update? |
Hey @edenlightning, not yet. I will try to take a look this week. |
Hey @undertherain, sorry for the late response. Tried looking into this earlier but couldn't repro. I suspect there may be a few things going on here:
@undertherain if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting metrics aggregation for Horovod to work in PL, which may also address this issue as a side effect. |
Thanks for looking at this |
Thanks for putting this together @undertherain! I'll take a look and get back to you. |
Hey @undertherain, here's the error I'm getting on a machine with 4 GPUs; it looks like one worker triggers early stopping but the others do not, leading to a segfault:
Is this consistent with the error you're seeing? |
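For context, the usual reason this ends in a crash or hang is that Horovod's collective operations require every rank to participate: once one worker stops training, the others block or fail inside the next allreduce. A minimal sketch of the general remedy, making the stop decision unanimous across workers (this only illustrates the idea; it is not necessarily how the fix below is implemented, and the function name `sync_should_stop` is hypothetical):

```python
# Illustrative sketch only: reduce the per-rank stop flags across all Horovod
# workers so that every rank makes the same early-stopping decision.
import torch
import horovod.torch as hvd


def sync_should_stop(local_should_stop: bool) -> bool:
    flag = torch.tensor(float(local_should_stop))
    # hvd.allreduce averages by default; a positive average means at least
    # one rank decided to stop, so all ranks stop together.
    return hvd.allreduce(flag, name="should_stop").item() > 0
```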
@undertherain can you try #3775 and see if it addresses your issue? With this change, I was able to get your test script to run to completion successfully. Note that this change requires Horovod >= v0.20.2. |
Yes it seems to work! Thanks! |
🐛 Bug
When I do early stopping with Horovod distributed training, it fails with
cannot unpack non-iterable NoneType object
in tqdm. It fails only on some sets of training data. I also see from the logs that early stopping was initiated only three times, while I'm training on 4 workers.
This makes me think the problem is that one of the workers did not initiate early stopping, presumably because each worker decides based on its own local validation loss rather than the averaged one.
As you can see, I'm asking pytorch-lightning to average the validation loss, but as was the case in my previous issue #3338, the problem seems to be related to early stopping using a different dict.
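For illustration, one way to make every worker monitor the same number is to allreduce the local validation loss before logging it. The sketch below assumes the 0.9-era dict-returning `validation_epoch_end` API and a `val_loss` key produced by `validation_step`; the class name is hypothetical and this is not the reporter's actual module:

```python
# Workaround sketch under the assumptions above: average the local validation
# loss across Horovod workers so EarlyStopping and ModelCheckpoint see the
# same value on every rank.
import torch
import horovod.torch as hvd
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # ... training_step, validation_step, configure_optimizers as usual ...

    def validation_epoch_end(self, outputs):
        local_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        # hvd.allreduce averages across ranks by default, so every worker
        # monitors the globally averaged validation loss.
        avg_loss = hvd.allreduce(local_loss, name="avg_val_loss")
        return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}
```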
Here's the full error message: