Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EarlyStopping not working / wrong keys in log #3338

Closed
undertherain opened this issue Sep 3, 2020 · 6 comments
Closed

EarlyStopping not working / wrong keys in log #3338

undertherain opened this issue Sep 3, 2020 · 6 comments
Labels
bug Something isn't working help wanted Open to be worked on

Comments

@undertherain
Copy link

🐛 Bug

I’m trying to implement EarlyStopping when validation loss stops decreasing. I add callback as follows:

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result

early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.1,
    patience=1,
    verbose=True,
    mode="min")

trainer = pl.Trainer(
    gpus=-1,
    max_epochs=50,
    distributed_backend="ddp",
    early_stop_callback=early_stop_callback,
    logger=wandb_logger)

This does not work - it is returning False at from the _validate_condition_metric function
When I checked what’s in the log dictionary, the values looked like
{'val_early_stop_on': None, 'val_checkpoint_on': tensor(0.5601, device='cuda:0')} - which is slightly confusing. Where does “val_checkpoint_on” come from and why it is not called “val_loss”?

It feels like it might be slightly connected to the result = pl.EvalResult(checkpoint_on=loss) line.
I was reading documentation, but frankly speaking I found
checkpoint_on (Union[Tensor, bool, None]) – Metric to checkpoint on. to be slightly not intuitive. What does it mean for the metric to be checkpoints on? And does it really connect to keys in log being renamed in a strange way?

Code sample

https://github.com/matsuokalab/cosmoflow/blob/ac75fe317f8daf3444c96b837bb109064aa81dab/main.py

Expected behavior

Expecting EarlyStopping to work, log to have val_loss key

Environment

* CUDA:
	- GPU:
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
		- Tesla V100-SXM2-16GB
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.6.0
	- pytorch-lightning: 0.9.0
	- tensorboard:       2.2.0
	- tqdm:              4.46.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.2
	- version:           #1 SMP Fri Apr 20 16:44:24 UTC 2018
@undertherain undertherain added bug Something isn't working help wanted Open to be worked on labels Sep 3, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Sep 3, 2020

Hi! thanks for your contribution!, great first issue!

@ydcjeff
Copy link
Contributor

ydcjeff commented Sep 3, 2020

Hi @undertherain, currently the monitor key of EarlyStopping will have no effect when you use along with EvalResult or TrainResult.
So, you need to provide early_stop_on with the loss (metric) in validation_step if using with EvalResult

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)  # changes here
    result.log("val_loss", loss, sync_dist=True)
    return result

Yes, val_early_stop_on and val_checkpoint_on are from EvalResult.
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/core/step_result.py#L786

While using with Result object, the val_loss you logged will be only for logging to the logger, I guess.

API docs: https://pytorch-lightning.readthedocs.io/en/latest/results.html#evalresult-api

@undertherain
Copy link
Author

undertherain commented Sep 3, 2020

Aha, I saw early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still that monitor is not working, but at least the stopping works now, thanks!

@undertherain
Copy link
Author

I guess can close this, if it is considered "not a bug, but a feature" :)

@ydcjeff
Copy link
Contributor

ydcjeff commented Sep 3, 2020

Aha, I saw early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still that monitor is not working, but at least the stopping works now, thanks!

Glad to hear that

we also have some discussions going on #3286

@awaelchli
Copy link
Contributor

closing it ok? since you found a workaround and ignored monitor is a known issue and discussion happens in other issue as pointed out by @ydcjeff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Open to be worked on
Projects
None yet
Development

No branches or pull requests

3 participants