
Early stopping fails on horovod with cannot unpack non-iterable NoneType object #3381

Closed
undertherain opened this issue Sep 7, 2020 · 11 comments · Fixed by #3775
@undertherain

🐛 Bug

When I do early stopping with Horovod distributed training, it fails with cannot unpack non-iterable NoneType object in tqdm.
It fails only on some sets of training data. I also see from the logs that early stopping was initiated only three times, while I'm training on 4 workers.
This makes me think that one of the workers did not initiate early stopping, presumably because each worker decides based on its local validation loss rather than the averaged one.

        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)

As you can see, I'm asking pytorch-lightning to average the validation loss, but as in my previous issue #3338, the problem seems to be related to early stopping using another dict.
Here's the full error message:

Epoch 7:   0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
  File "main.py", line 72, in <module>       
    main()
  File "main.py", line 68, in main
    trainer.fit(model, data_module)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
    results = self.accelerator_backend.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
    result = self.trainer.run_pretrain_routine(self.trainer.model)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
    opt_closure_result = self.optimizer_closure(
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
    loss.backward()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Without early stopping it works ok. 

### Environment
  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.1rc1
    • tensorboard: 2.2.0
    • tqdm: 4.46.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.2
    • version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had it on lightning 0.9.0 - I actually upgraded to the RC hoping that it would magically fix the problem.
@undertherain added the bug and help wanted labels on Sep 7, 2020
@undertherain
Author

I've printed the logs dict from inside _validate_condition_metric:

{'val_early_stop_on': tensor(0.4890, device='cuda:0'), 'val_checkpoint_on': tensor(0.4890, device='cuda:0')}
{'val_early_stop_on': tensor(0.4890, device='cuda:0'), 'val_checkpoint_on': tensor(0.4890, device='cuda:0')}
{'val_early_stop_on': tensor(0.5420, device='cuda:1'), 'val_checkpoint_on': tensor(0.5420, device='cuda:1')}                                 
{'val_early_stop_on': tensor(0.5420, device='cuda:1'), 'val_checkpoint_on': tensor(0.5420, device='cuda:1')}
{'val_early_stop_on': tensor(1.0173, device='cuda:2'), 'val_checkpoint_on': tensor(1.0173, device='cuda:2')}
{'val_early_stop_on': tensor(0.4354, device='cuda:3'), 'val_checkpoint_on': tensor(0.4354, device='cuda:3')}
{'val_early_stop_on': tensor(1.0173, device='cuda:2'), 'val_checkpoint_on': tensor(1.0173, device='cuda:2')}
{'val_early_stop_on': tensor(0.4354, device='cuda:3'), 'val_checkpoint_on': tensor(0.4354, device='cuda:3')}

It indeed looks like every worker is checkpointing on its own loss value.
How come doing this on the averaged loss is not the default behaviour, and how can I enforce it?

It looks like this is called once from early stopping and once from checkpointing, which I did not even ask for.
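
A workaround I'm considering in the meantime is to average the loss across workers myself before handing it to EvalResult. Just a sketch, and it assumes every rank runs the same number of validation batches so the collective call doesn't hang:

    import horovod.torch as hvd

    def validation_step(self, batch, batch_idx):
        x, y = batch                     # placeholder batch structure
        loss = self.loss_fn(self(x), y)  # placeholder loss computation
        # hvd.allreduce averages across ranks by default, so every worker
        # sees the same value and should reach the same stopping decision.
        loss = hvd.allreduce(loss, name="val_loss")
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss)
        return result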

So far I like Lightning, and I would really like to express my appreciation to the developers.
But funnily, unlike with other frameworks, where I sometimes struggle to make them do something, with Lightning I struggle to make it not do something :D

@edenlightning
Contributor

@tgaddair mind taking a look?

@edenlightning added this to the 0.9.x milestone on Sep 8, 2020
@tgaddair
Contributor

tgaddair commented Sep 8, 2020

Sure, let me take a look.

@edenlightning
Contributor

@tgaddair any update?

@tgaddair
Contributor

Hey @edenlightning, not yet. I will try to take a look this week.

@tgaddair
Contributor

Hey @undertherain, sorry for the late response. Tried looking into this earlier but couldn't repro. I suspect there may be a few things going on here:

  1. Early stopping criteria being applied independently on each worker. We should implement PL metrics aggregation for Horovod so the criteria can be applied consistently on every worker.
  2. Something appears to be wrong with the params in the DistributedOptimizer. Sounds like this is related but not clear to me how without being able to repro myself (see https://discuss.pytorch.org/t/how-to-do-only-one-forward-propagation-per-epoch-and-multiple-backward-propagations-on-graph/65396/9 for more context).

@undertherain if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting the metrics aggregation for Horovod working in PL, which may also address this issue as a side effect.
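
For point 1, the rough idea would be to allreduce the monitored value so that every rank compares against the same number. This is a sketch of the intent only, not the actual Lightning change:

    import horovod.torch as hvd
    import torch

    def aggregate_monitored_metric(metric: torch.Tensor) -> torch.Tensor:
        # Every rank contributes its local value and gets back the mean,
        # so the early-stopping comparison is identical on all workers.
        return hvd.allreduce(metric, name="early_stop_on")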

@undertherain
Author

Thanks for looking at this!
It took a bit of time to make the code not depend on my custom dataset.
Here I basically generate random tensors and predict 0 on Horovod rank 0 and 1 on the rest of the workers:

https://github.com/undertherain/pl_stopping_test/tree/01c8d98ba6136d7a9e1b26f07bb3b8d63eca2724
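
The gist of it is roughly the following (a paraphrased sketch, not the exact code from the repo; the random-tensor DataModule and trainer setup are omitted):

    import torch
    import torch.nn.functional as F
    import horovod.torch as hvd
    import pytorch_lightning as pl

    class ReproModel(pl.LightningModule):
        """Targets are 0 on Horovod rank 0 and 1 everywhere else, so the
        per-rank validation losses diverge and only some ranks hit the
        early-stopping criterion."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(16, 1)

        def forward(self, x):
            return self.layer(x)

        def _loss(self, batch):
            x = batch
            target = torch.full((x.size(0), 1), float(hvd.rank() != 0), device=x.device)
            return F.mse_loss(self(x), target)

        def training_step(self, batch, batch_idx):
            return pl.TrainResult(self._loss(batch))

        def validation_step(self, batch, batch_idx):
            loss = self._loss(batch)
            result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
            result.log("val_loss", loss, sync_dist=True)
            return result

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)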

@tgaddair
Contributor

Thanks for putting this together @undertherain! I'll take a look and get back to you.

@tgaddair
Contributor

tgaddair commented Oct 1, 2020

Hey @undertherain, here's the error I'm getting on a machine with 4 GPUs. It looks like one worker triggers early stopping but the others do not, leading to a segfault:

[1,0]<stderr>:Saving latest checkpoint..
[1,0]<stderr>:Epoch 00003: early stopping triggered.
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "main.py", line 71, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "main.py", line 67, in main
[1,1]<stderr>:    trainer.fit(model, data_module)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
[1,1]<stderr>:    result = fn(self, *args, **kwargs)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1068, in fit
[1,1]<stderr>:    results = self.horovod_train(model)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 242, in horovod_train
[1,1]<stderr>:    result = self.run_pretrain_routine(model)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
[1,1]<stderr>:    self.train()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
[1,1]<stderr>:    self.run_training_epoch()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
[1,1]<stderr>:    batch_output = self.run_training_batch(batch, batch_idx)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 839, in run_training_batch
[1,1]<stderr>:    opt_closure_result = self.optimizer_closure(
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1076, in optimizer_closure
[1,1]<stderr>:    model_ref.backward(self, closure_loss, optimizer, opt_idx)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 324, in backward
[1,1]<stderr>:    loss.backward()
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
[1,1]<stderr>:    torch.autograd.backward(self, gradient, retain_graph, create_graph)
[1,1]<stderr>:  File "/Users/taddair/repos/pytorch-lightning/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
[1,1]<stderr>:    Variable._execution_engine.run_backward(
[1,1]<stderr>:RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.

Is this consistent with the error you're seeing?
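
What seems to be happening is that one rank stops training while the others keep issuing allreduces it never joins, so Horovod shuts down. The general remedy is to synchronize the stop decision across ranks, along these lines (only a sketch of the idea, with a hypothetical helper name, not the actual Lightning change):

    import horovod.torch as hvd
    import torch

    def sync_should_stop(local_should_stop: bool) -> bool:
        # Sum the per-rank stop flags; if any rank wants to stop, all ranks
        # stop together instead of leaving some of them blocked in allreduce.
        flag = torch.tensor(int(local_should_stop))
        total = hvd.allreduce(flag, name="should_stop", op=hvd.Sum)
        return int(total.item()) > 0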

@tgaddair
Contributor

tgaddair commented Oct 1, 2020

@undertherain can you try #3775 and see if it addresses your issue? With this change, I was able to get your test script to run to completion successfully. Note that this change requires Horovod >= v0.20.2.
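
If you're not sure which Horovod you have installed, a quick check:

    import horovod
    print(horovod.__version__)  # needs to be 0.20.2 or newer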

@edenlightning modified the milestones: 0.9.x, 1.0 on Oct 4, 2020
@undertherain
Author

Yes it seems to work! Thanks!

@edenlightning removed the help wanted label on Oct 5, 2020
@Borda modified the milestones: 1.0, 1.0.x on Oct 13, 2020
@edenlightning modified the milestones: 1.0.x, 1.1, 1.0.3 on Oct 14, 2020
@edenlightning added the 3rd party label on Oct 19, 2020