EarlyStopping not working / wrong keys in log #3338

undertherain · 2020-09-03T10:33:27Z

🐛 Bug

I’m trying to implement EarlyStopping so that training stops when the validation loss stops decreasing. I add the callback as follows:

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result

early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.1,
    patience=1,
    verbose=True,
    mode="min")

trainer = pl.Trainer(
    gpus=-1,
    max_epochs=50,
    distributed_backend="ddp",
    early_stop_callback=early_stop_callback,
    logger=wandb_logger)

This does not work: the _validate_condition_metric function returns False.
When I checked what’s in the log dictionary, the values looked like
{'val_early_stop_on': None, 'val_checkpoint_on': tensor(0.5601, device='cuda:0')}, which is slightly confusing. Where does “val_checkpoint_on” come from, and why is it not called “val_loss”?

It feels like it might be connected to the result = pl.EvalResult(checkpoint_on=loss) line.
I read the documentation, but frankly I found
checkpoint_on (Union[Tensor, bool, None]) – Metric to checkpoint on. not very intuitive. What does it mean for a metric to be checkpointed on? And is that really why the keys in the log are renamed in this strange way?

Code sample

https://github.com/matsuokalab/cosmoflow/blob/ac75fe317f8daf3444c96b837bb109064aa81dab/main.py

Expected behavior

EarlyStopping should work, and the log should contain a val_loss key.
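For reference, the stopping rule that the settings above (mode="min", min_delta=0.1, patience=1) are expected to produce can be sketched in plain Python. This is a simplified illustration and not the library's implementation; the function name first_stop_epoch and the sample loss values are made up for the example.

```python
def first_stop_epoch(losses, min_delta=0.1, patience=1):
    """Return the 0-based epoch at which a "min"-mode early stop would
    trigger, or None if it never triggers.

    Simplified sketch (assumption): an epoch counts as an improvement only
    if the loss drops below the best value seen so far by more than min_delta.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without sufficient improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
print(first_stop_epoch([1.0, 0.8, 0.75, 0.5]))
print(first_stop_epoch([1.0, 0.8, 0.6, 0.4]))
```

In the first series the third epoch improves by only 0.05, which is less than min_delta, so with patience=1 the stop triggers there. In the second series every epoch improves by more than 0.1, so no stop is triggered.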

Environment

* CUDA:
	- GPU:
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.6.0
	- pytorch-lightning: 0.9.0
	- tensorboard:       2.2.0
	- tqdm:              4.46.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.2
	- version:           #1 SMP Fri Apr 20 16:44:24 UTC 2018


github-actions · 2020-09-03T10:34:07Z

Hi! Thanks for your contribution! Great first issue!

ydcjeff · 2020-09-03T10:59:14Z

Hi @undertherain, currently the monitor key of EarlyStopping has no effect when you use it together with EvalResult or TrainResult.
So, when using EvalResult, you need to pass the loss (metric) as early_stop_on in validation_step:

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)  # changes here
    result.log("val_loss", loss, sync_dist=True)
    return result

Yes, val_early_stop_on and val_checkpoint_on are from EvalResult.
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/core/step_result.py#L786

~~While using with Result object, the val_loss you logged will be only for logging to the logger, I guess.~~

API docs: https://pytorch-lightning.readthedocs.io/en/latest/results.html#evalresult-api
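The behaviour described in this comment can be illustrated with a toy model. This is only a sketch under assumptions: ToyEvalResult and can_early_stop are hypothetical stand-ins written for this example, not the classes or functions in pytorch-lightning, and the real code in step_result.py may work differently. It only reproduces the dictionary reported in the issue and shows why passing early_stop_on changes the outcome.

```python
class ToyEvalResult:
    """Hypothetical stand-in for pl.EvalResult (not the library class)."""

    def __init__(self, checkpoint_on=None, early_stop_on=None):
        # Reserved metrics that callbacks look at, independent of logged names.
        self.reserved = {
            "early_stop_on": early_stop_on,
            "checkpoint_on": checkpoint_on,
        }
        self.logged = {}

    def log(self, name, value):
        # Values logged here go to the logger only in this toy model.
        self.logged[name] = value

    def callback_metrics(self, prefix="val_"):
        # The reserved keys get a "val_" prefix, which is where names such as
        # "val_checkpoint_on" in the reported dictionary would come from.
        return {prefix + key: value for key, value in self.reserved.items()}


def can_early_stop(metrics, key="val_early_stop_on"):
    """Assumed check: early stopping needs a non-None value under the key."""
    return metrics.get(key) is not None


# As in the original report: only checkpoint_on is set.
broken = ToyEvalResult(checkpoint_on=0.5601)
broken.log("val_loss", 0.5601)
print(broken.callback_metrics(), can_early_stop(broken.callback_metrics()))

# With the suggested change: early_stop_on is set as well.
fixed = ToyEvalResult(checkpoint_on=0.5601, early_stop_on=0.5601)
fixed.log("val_loss", 0.5601)
print(fixed.callback_metrics(), can_early_stop(fixed.callback_metrics()))
```

In this model the "val_loss" name passed to log never reaches the callback, which matches the observation that monitor="val_loss" has no effect.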

undertherain · 2020-09-03T11:59:14Z

Aha, I saw early_stop_on=loss, but thought it was some sort of automated way of stopping without a callback. It is still a bit counter-intuitive that monitor is not working, but at least the stopping works now, thanks!

undertherain · 2020-09-03T12:08:46Z

I guess we can close this, if it is considered "not a bug, but a feature" :)

ydcjeff · 2020-09-03T12:12:24Z

> Aha, I saw early_stop_on=loss, but thought it was some sort of automated way of stopping without a callback. It is still a bit counter-intuitive that monitor is not working, but at least the stopping works now, thanks!

Glad to hear that.

There is also a related discussion going on in #3286.

awaelchli · 2020-09-04T04:05:26Z

OK to close this? You found a workaround, the ignored monitor is a known issue, and the discussion is happening in the other issue, as pointed out by @ydcjeff.

undertherain added the bug (Something isn't working) and help wanted (Open to be worked on) labels Sep 3, 2020

awaelchli added the Result label Sep 4, 2020

awaelchli closed this as completed Sep 4, 2020

This was referenced Sep 4, 2020

On the relationship between Result and Callback monitor #3286

Closed

Early stopping fails on horovod with cannot unpack non-iterable NoneType object #3381

Closed
