
[RFC] Remove {running,accumulated}_loss #9372

Closed
carmocca opened this issue Sep 8, 2021 · 8 comments

Comments
@carmocca
Contributor

carmocca commented Sep 8, 2021

Proposed refactoring or deprecation

Remove the following code: a979944

Motivation

The running loss is a running window of loss values returned by the training_step. It has been present since the very beginning of Lightning and has become legacy code.

Problems:

  • Users are sometimes confused by the value when they don't know it's a running window and compare it to the actual loss value they logged with self.log.
  • Often users self.log their actual loss, which makes them see two "loss" values in the progress bar.
  • To disable it, you have to override the get_progress_bar_dict hook, which is inconvenient (see the sketch after this list).
  • The running window configuration is opaque to the user as it's hard-coded in TrainingBatchLoop.__init__.
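
For illustration, the opt-out mentioned above looks roughly like this (a sketch against the 1.4.x LightningModule hook):

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def get_progress_bar_dict(self):
        # override the hook and drop the running-window loss from the progress bar
        items = super().get_progress_bar_dict()
        items.pop("loss", None)
        return items
```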

Alternative:
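
A minimal sketch of the alternative, assuming the user logs the loss explicitly with prog_bar=True (compute_loss is a hypothetical helper):

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # the value shows up in the progress bar because it is logged, not because it is returned
    self.log("train_loss", loss, prog_bar=True)
    return loss
```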

Pitch

Remove the code; I don't think there's anything to deprecate here.

cc @awaelchli @ananthsub



@carmocca carmocca added this to the v1.5 milestone Sep 8, 2021
@SkafteNicki
Member

@carmocca nothing yet, but I just created PR Lightning-AI/torchmetrics#506 in torchmetrics, which implements simple aggregation metrics (sum, mean, max, min, cat) :]
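
For context, a minimal sketch of how such an aggregation metric could replace the running loss, assuming the MeanMetric added in that PR (model and loss here are placeholders):

```python
import torch
from torch import nn
import pytorch_lightning as pl
from torchmetrics import MeanMetric  # aggregation metric from the linked PR

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)
        self.train_loss = MeanMetric()  # mean of the step losses

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.train_loss.update(loss)
        # logging the Metric object lets Lightning handle syncing and epoch resets
        self.log("train_loss", self.train_loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```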

@ananthsub
Contributor

+1 I'm in favor of this! Good find @carmocca!

@awaelchli
Contributor

awaelchli commented Sep 9, 2021

I agree regarding the window accumulation for the regular loss: it's not really needed and the value isn't configurable anyway.
However, I don't think it's good to ask users to log the loss manually just to have it appear in the progress bar. Since automatic optimization is the default and the loss needs to be returned, Lightning should show it in the progress bar. Please keep this. The progress bar callback could take care of that.

What will we do with the loss accumulation for gradient accumulation phases? Will you also remove that?

@tchaton
Contributor

tchaton commented Sep 9, 2021

Yes, sounds like a great idea!

@tchaton tchaton added the "let's do it!" label (approved to implement) Sep 9, 2021
@carmocca
Contributor Author

carmocca commented Sep 9, 2021

I don't think it's good to ask users to log the loss manually just to have it appear in the progress bar.

Almost all users (if not all?) are logging the loss explicitly already to include it in the loggers.

Since automatic optimization is the default and the loss needs to be returned, Lightning should show it in the progress bar

The progress bar and the concept of automatic optimization don't need to be linked like this. It also raises the question: "What about manual optimization? Do I need to return the loss there too?"
On the other hand, there's a clear relation between "logging" and showing values in the progress bar, given the presence of self.log(prog_bar=bool).

This goes with the theme of avoiding multiple ways to do the same thing.

The progress bar callback could take care of that.

It does take care of it already by getting values from progress_bar_metrics, which are managed by the LoggerConnector.

What will we do with the loss accumulation for gradient accumulation phases? Will you also remove that?

Yes, the point is to show in the progress bar the same values that users will see when they open TensorBoard, whatever those are.

@jinyoung-lim
Contributor

I can take a look at this issue if no one has started yet.

@carmocca
Contributor Author

carmocca commented Oct 9, 2021

After some offline discussion, we decided to split this into separate PRs:

  1. (Agreed) Remove the running accumulation mechanism. The difference in how this average is computed becomes confusing for users when they compare it to the values they logged; we should probably use an average metric instead.
  2. (Up for debate) Remove the automatic inclusion in the progress bar of the loss returned from training_step.

The main arguments for (2) are:

  • The loss is returned for optimization, not visualization. Using this mechanism to automatically add it to the progress bar is somewhat of a leak of responsibility. This responsibility should belong to the LoggerConnector, which the user interacts with by calling self.log().
  • The current loss mechanism is easy to opt into (by returning the loss) but hard to opt out of (you need to subclass the progress bar, override a hook, then pass this instance to your trainer). The self.log alternative would be easy to opt into and out of (change a boolean value).
  • Most people self.log their loss already, and self.log is a widely known and used mechanism.
  • Back when the return-to-add mechanism was added, self.log did not exist, which might explain why it was done that way.

Now, if (2) is approved, there are things we could do to improve the experience:
a. if the user logs something with "loss" in the key and prog_bar=True, then we disable the return mechanism
b. if the user logs something with "loss" in the key and prog_bar=False, we print an info message (only once) suggesting to set prog_bar=True
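
A hypothetical sketch of what checks (a) and (b) could look like; the logged-metrics structure and function name here are illustrative only, not the actual Lightning internals:

```python
from pytorch_lightning.utilities import rank_zero_info  # existing Lightning logging helper

def should_add_returned_loss(logged_prog_bar_flags: dict) -> bool:
    """Decide whether the loss returned from training_step should still go to the progress bar.

    `logged_prog_bar_flags` maps each self.log key to the prog_bar flag it was logged with
    (a simplified, hypothetical view of the logged metrics).
    """
    for key, prog_bar in logged_prog_bar_flags.items():
        if "loss" in key:
            if prog_bar:
                return False  # (a) a loss is already shown; disable the return mechanism
            # (b) the real implementation would emit this message only once
            rank_zero_info(
                f"You logged {key!r} without prog_bar=True; "
                "set prog_bar=True to show it in the progress bar."
            )
    return True
```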

@carmocca carmocca modified the milestones: v1.5, v1.6 Nov 2, 2021
@rohitgr7
Contributor

b. if the user logs something with "loss" in the key and prog_bar=False, we print an info message (only once) suggesting to set prog_bar=True

I think an info message is not required. The user chose not to show it in the progress bar, so it seems reasonable that they don't want it there.
