Lightning Migration #837

Merged: 79 commits merged into main from lightning on Nov 12, 2022
Conversation

@karl-richter (Collaborator) commented on Oct 17, 2022

v0 Todos

  • Change TimeNet parent from PyTorch Module to PyTorch Lightning (see the sketch after this list)
  • Add PyTorch Lightning functions to TimeNet (e.g. training_step)
  • Add a Lightning Trainer in the forecaster
  • Change the train and predict functions in the forecaster to use the Lightning Trainer
  • Change the prediction logic in forecaster.py
  • Handle minimal mode, e.g. _train_minimal(...)
  • Return metrics in _train(...)
  • Define epochs, batch size, learning rate, etc. correctly in the trainer (when not provided) and move them to the fit() method
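
A rough sketch of what the first two items amount to, assuming a recent PyTorch Lightning 1.x version; the layer, loss, and optimizer shown here are placeholders, not the actual TimeNet internals:

```python
import pytorch_lightning as pl
import torch

class TimeNet(pl.LightningModule):  # previously inherited from torch.nn.Module
    def __init__(self, n_lags: int = 7, n_forecasts: int = 1):
        super().__init__()
        self.linear = torch.nn.Linear(n_lags, n_forecasts)  # placeholder layer

    def forward(self, inputs):
        return self.linear(inputs)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        predicted = self(inputs)
        loss = torch.nn.functional.smooth_l1_loss(predicted, targets)  # placeholder loss
        return loss  # Lightning runs backward() and optimizer.step() automatically

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)  # placeholder optimizer and lr
```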

v1 Todos

  • Remove outdated code
  • Change model saving and loading to use checkpoints (Checkpointing docs)
  • Add support for all Lightning loggers (e.g. TensorBoard)
  • Learning rate range finder
  • Use correct batch_size
  • Fix regularisation loss
  • Pass denormalization as a function
  • Improve the learning rate finder
  • Early stopping
  • Support the self.metrics.add_specific_target function when highlight_forecast_step_n is defined

v2 Todos (separate PRs)

Changes

Guiding idea: migrate from plain PyTorch to the PyTorch Lightning framework.

Consequent changes:

  • Training: Use the Lightning logic for training, evaluation and prediction (removes most training logic from forecaster.py and moves it to time_net.py) Docs
  • Metrics: Since the intra-epoch data is not available outside of the Lightning module, the metrics need to be calculated within the model. Instead of using the custom Metrics module, we switch to the torchmetrics library (from the Lightning ecosystem) to calculate metrics within training_step Docs
  • Metrics Logger: Since the Lightning default logger does not persist metrics but we want to return a metrics_df from fit(), we need to persist them in an object during runtime. Therefore, we define a custom logger for Lightning that collects metrics in a dictionary; the default logger saves and checkpoints the model automatically, which we don't always want (see the sketch after this list)
  • Progress bar: Since the epoch progress information is no longer available within the forecaster, we switch to the Lightning built-in progress bar for logging training progress (we use the rich progress bar instead of the custom tqdm progress bar and the default Lightning tqdm progress bar due to issues in Jupyter notebooks) Docs
  • Early stopping: Added support for early stopping using the Lightning loss monitor (this allows us to close Early Stopping #289)
  • Learning rate finder: We switch from the custom learning rate finder to the Lightning built-in learning rate finder Docs
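
A minimal sketch of such a metrics-collecting logger, assuming the Logger base class of a recent PyTorch Lightning release; the class name MetricsCollector and the stored dictionary layout are illustrative, not the actual implementation:

```python
from pytorch_lightning.loggers import Logger
from pytorch_lightning.utilities import rank_zero_only

class MetricsCollector(Logger):
    """Collects logged metrics in memory so they can be returned after fit()."""

    def __init__(self):
        super().__init__()
        self.history = []  # one dict of metrics per logging step

    @property
    def name(self):
        return "MetricsCollector"

    @property
    def version(self):
        return "0"

    @rank_zero_only
    def log_hyperparams(self, params):
        pass  # nothing to persist for hyperparameters

    @rank_zero_only
    def log_metrics(self, metrics, step=None):
        self.history.append({"step": step, **metrics})
```

The collected history can then be converted into the metrics_df returned by fit().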

Before

The training logic is contained in forecaster.py, which manually iterates through epochs and batches. The _train_epoch() function directly calls forward() on the TimeNet model, and optimization happens manually in _train_epoch().
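
For illustration, a heavily simplified version of that manual pattern (placeholder names and loss; not the actual forecaster.py code):

```python
import torch

def _train(model, loader, epochs=100, lr=1e-3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            predicted = model.forward(inputs)  # direct call into TimeNet
            loss = torch.nn.functional.smooth_l1_loss(predicted, targets)
            loss.backward()    # manual backward pass
            optimizer.step()   # manual parameter update
```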

flowchart LR
    subgraph TimeNet
    forward
    end
    subgraph NeuralProphet
    fit --> _train
    _train --> _train_epoch
    _train_epoch --> forward
    predict --> _predict_raw
    _predict_raw --> forward
    end

After

The training loop is abstracted away by the Lightning training logic. We initialize a Lightning Trainer object that runs the training loop automatically. Calling fit() on the Lightning Trainer executes the model's training_step() function with the correct epoch and batch; optimization and parameter updates happen automatically after each training_step(). Lightning also provides useful tools such as a progress bar, a learning rate finder, early stopping, GPU support, etc.
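
A rough sketch of what the forecaster side can look like after the migration, assuming PyTorch Lightning 1.x; the callback settings, the MetricsCollector logger from above, the learning_rate attribute, and the dataloader names are illustrative:

```python
import pytorch_lightning as pl

def _train(model, train_loader, epochs=100):
    trainer = pl.Trainer(
        max_epochs=epochs,
        logger=MetricsCollector(),       # custom logger sketched above
        enable_checkpointing=False,      # don't checkpoint by default
        # assumes the model logs a metric named "Loss" via self.log(...)
        callbacks=[pl.callbacks.EarlyStopping(monitor="Loss", patience=10)],
    )
    # tune: run the built-in learning rate finder
    lr_finder = trainer.tuner.lr_find(model, train_dataloaders=train_loader)
    model.learning_rate = lr_finder.suggestion()
    # fit: runs training_step() for every batch and epoch
    trainer.fit(model, train_dataloaders=train_loader)
    return trainer

def _predict_raw(trainer, model, predict_loader):
    # predict: runs predict_step() on each batch
    return trainer.predict(model, dataloaders=predict_loader)
```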

flowchart LR
    subgraph TimeNet
    configure_optimizers
    training_step --> forward
    predict_step --> forward
    end
    subgraph LightningTrainer
    tune --> configure_optimizers
    fit_[fit] --> training_step
    predict_[predict] --> predict_step
    end
    subgraph NeuralProphet
    fit --> _train
    _train --> tune
    _train --> fit_
    predict --> _predict_raw
    _predict_raw --> predict_
    end

@karl-richter mentioned this pull request on Oct 27, 2022
@karl-richter added the status: needs review label (PR needs to be reviewed by Reviewer(s)) on Nov 3, 2022
@alfonsogarciadecorral (Collaborator) commented:

Hi @karl-richter

I have a really quick comment.

In order to visualize the model architecture, we need to make the following change in the training_step method:

instead of

        # Run forward calculation
        predicted = self.forward(inputs, meta_name_tensor)
        # Calculate loss
        loss, reg_loss = self.loss_func(inputs, predicted, targets)
        # Metrics

we need to add:

        # Run forward calculation
        predicted = self.forward(inputs, meta_name_tensor)
        # store predictions in self for later network visualization
        self.train_epoch_prediction = predicted
        # Calculate loss
        loss, reg_loss = self.loss_func(inputs, predicted, targets)
        # Metrics

Also, in the tutorial network_architecture_visualization.ipynb, in the very last cell, we need to make this change:

instead of:

fig = make_dot(m.train_epoch_prediction, params=dict(m.model.named_parameters()))
# fig_glob.render(filename='img/fig_glob')
display(fig)

we need:

fig = make_dot(m.model.train_epoch_prediction, params=dict(m.model.named_parameters()))
# fig_glob.render(filename='img/fig_glob')
display(fig)

github-actions bot commented on Nov 7, 2022

de87e17

Model Benchmark

| Benchmark | Metric | main | current | diff |
| --- | --- | --- | --- | --- |
| AirPassengers | MAE_val | 85.1099 | 15.2698 | -82.06% |
| AirPassengers | RMSE_val | 108.276 | 19.4209 | -82.06% |
| AirPassengers | Loss_val | nan | 0.00195 | 0.0% |
| AirPassengers | RegLoss_val | nan | 0 | 0.0% |
| AirPassengers | epoch | nan | 89 | 0.0% |
| AirPassengers | MAE | 6.35364 | 9.82902 | 54.7% ⚠️ |
| AirPassengers | RMSE | 7.68085 | 11.7005 | 52.33% ⚠️ |
| AirPassengers | Loss | 0.00023 | 0.00056 | 140.91% ⚠️ |
| AirPassengers | RegLoss | 0 | 0 | 0.0% |
| PeytonManning | MAE_val | 0.92518 | 0.64636 | -30.14% |
| PeytonManning | RMSE_val | 1.13074 | 0.79276 | -29.89% |
| PeytonManning | Loss_val | nan | 0.01494 | 0.0% |
| PeytonManning | RegLoss_val | nan | 0 | 0.0% |
| PeytonManning | epoch | nan | 37 | 0.0% |
| PeytonManning | MAE | 0.34839 | 0.42701 | 22.57% ⚠️ |
| PeytonManning | RMSE | 0.48617 | 0.57032 | 17.31% ⚠️ |
| PeytonManning | Loss | 0.00464 | 0.00635 | 36.95% ⚠️ |
| PeytonManning | RegLoss | 0 | 0 | 0.0% |
| YosemiteTemps | MAE_val | 1.71173 | 1.72949 | 1.04% |
| YosemiteTemps | RMSE_val | 2.2758 | 2.27386 | -0.08% |
| YosemiteTemps | Loss_val | nan | 0.00096 | 0.0% |
| YosemiteTemps | RegLoss_val | nan | 0 | 0.0% |
| YosemiteTemps | epoch | nan | 84 | 0.0% |
| YosemiteTemps | MAE | 1.43672 | 1.45189 | 1.06% |
| YosemiteTemps | RMSE | 2.14749 | 2.16631 | 0.88% |
| YosemiteTemps | Loss | 0.00064 | 0.00066 | 1.81% |
| YosemiteTemps | RegLoss | 0 | 0 | 0.0% |

Model Training

Training plots for PeytonManning, YosemiteTemps, and AirPassengers (plots omitted).

@ourownstory (Owner) left a comment:
LGTM. Great work!!
All points that we discussed can be addressed in later PRs.

@ourownstory merged commit de87e17 into main on Nov 12, 2022
@ourownstory deleted the lightning branch on Nov 12, 2022 at 02:36