The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!

Highlights

Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:

Fault-tolerant Training

Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you can resume training mid-epoch on the exact batch and continue as if it had never been interrupted.

PL_FAULT_TOLERANT_TRAINING=1 python train.py
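To make "resume on the exact batch" concrete, here is a minimal, library-free sketch. It is not Lightning's internal mechanism: the `train` function, the `state` dictionary, and the simulated failure are all hypothetical, and a real run would also have to restore model, optimizer, and random-number state.

```python
def train(batches, num_epochs, state=None, fail_at=None):
    """Toy training loop that can resume from a saved (epoch, batch) position.

    `state` plays the role of a checkpoint and `fail_at` simulates a crash.
    All names here are illustrative and not part of Lightning.
    """
    state = state or {"epoch": 0, "batch_idx": 0, "seen": []}
    for epoch in range(state["epoch"], num_epochs):
        # In the interrupted epoch, skip the batches that were already processed.
        start = state["batch_idx"] if epoch == state["epoch"] else 0
        for batch_idx in range(start, len(batches)):
            if fail_at == (epoch, batch_idx):
                # Simulated preemption: return the state recorded so far.
                return state, False
            state["seen"].append((epoch, batches[batch_idx]))
            # Record the position of the next batch to process.
            state["epoch"], state["batch_idx"] = epoch, batch_idx + 1
        # Epoch finished: the next position is the start of the following epoch.
        state["epoch"], state["batch_idx"] = epoch + 1, 0
    return state, True
```

Calling `train(batches, 2, fail_at=(1, 1))` and then passing the returned state back in via `state=` processes the same sequence of batches as one uninterrupted run.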

LightningLite

LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.

With just a few lines of code and no large refactoring, you get support for multi-device and multi-node training, different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision. No special launcher is required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!

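import torch
import torch.nn.functional as F
from torch import optim

from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):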
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()

Loop Customization

The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.

Read our comprehensive introduction to loops
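As a rough illustration of what a swappable loop can look like, the sketch below assumes a loop is described by a `done` condition, a `reset` step, and an `advance` step that a `run` method drives. `SketchLoop` and `RunningMeanLoop` are hypothetical stand-ins written from scratch, not the classes shipped in `pytorch_lightning`; the loops documentation describes the real interface.

```python
from abc import ABC, abstractmethod


class SketchLoop(ABC):
    """Hypothetical minimal loop: reset once, then advance until done."""

    @property
    @abstractmethod
    def done(self) -> bool:
        """Return True when the loop should stop."""

    @abstractmethod
    def reset(self) -> None:
        """Prepare internal state before the first iteration."""

    @abstractmethod
    def advance(self) -> None:
        """Perform one iteration of work."""

    def on_run_end(self):
        """Produce the loop's result; subclasses may override."""
        return None

    def run(self):
        self.reset()
        while not self.done:
            self.advance()
        return self.on_run_end()


class RunningMeanLoop(SketchLoop):
    """Example subclass: consumes one value per iteration and returns the mean."""

    def __init__(self, values):
        self.values = list(values)
        self.index = 0
        self.total = 0.0

    @property
    def done(self) -> bool:
        return self.index >= len(self.values)

    def reset(self) -> None:
        self.index = 0
        self.total = 0.0

    def advance(self) -> None:
        self.total += self.values[self.index]
        self.index += 1

    def on_run_end(self):
        return self.total / len(self.values) if self.values else None
```

Because `run` depends only on the three abstract members, a different paradigm can be plugged in by overriding `advance` without touching the driver.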

New Rich Progress Bar

We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out:

pip install rich

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])

New Trainer Arguments: Strategy and Devices

With the new strategy and devices arguments in the Trainer, it is now easier to switch from one type of hardware to another.

Before	After
`Trainer(accelerator="ddp", gpus=2)`	`Trainer(accelerator="gpu", devices=2, strategy="ddp")`
`Trainer(accelerator="ddp_cpu", num_processes=2)`	`Trainer(accelerator="cpu", devices=2, strategy="ddp")`
`Trainer(accelerator="tpu_spawn", tpu_cores=8)`	`Trainer(accelerator="tpu", devices=8)`

The new devices argument is now agnostic to all accelerators, but the previous arguments gpus, tpu_cores, ipus are still available and work the same as before. In addition, it is now also possible to set devices="auto" or accelerator="auto" to select the best accelerator available on the hardware.

from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")
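The before/after table in this section can also be read as a lookup. The helper below is a hypothetical migration aid, not part of Lightning; it encodes only the three rows of that table and leaves any other combination of arguments untouched.

```python
# (old accelerator value, old device-count argument) -> new keyword arguments.
# Only the three combinations from the before/after table are covered.
_MIGRATION_TABLE = {
    ("ddp", "gpus"): {"accelerator": "gpu", "strategy": "ddp"},
    ("ddp_cpu", "num_processes"): {"accelerator": "cpu", "strategy": "ddp"},
    ("tpu_spawn", "tpu_cores"): {"accelerator": "tpu"},
}


def migrate_trainer_kwargs(old):
    """Translate pre-1.5 Trainer keyword arguments (hypothetical helper)."""
    for (accelerator, device_arg), replacement in _MIGRATION_TABLE.items():
        if old.get("accelerator") == accelerator and device_arg in old:
            # Keep unrelated arguments, drop the two that are being replaced.
            new = {
                key: value
                for key, value in old.items()
                if key not in ("accelerator", device_arg)
            }
            new.update(replacement)
            new["devices"] = old[device_arg]
            return new
    # Combination not covered by the table: return an unchanged copy.
    return dict(old)
```

The resulting dictionary could then be unpacked into the constructor, for example `Trainer(**migrate_trainer_kwargs(old_kwargs))`.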

LightningCLI V2

This release adds support for running not just Trainer.fit but any of the Trainer entry points!

python script.py fit
python script.py test

LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command line experience as only the class names and arguments are required as follows:

python script.py fit \
    --trainer.callbacks=EarlyStopping \
    --trainer.callbacks.patience=5 \
    --trainer.callbacks=LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch \
    --optimizer=Adam \
    --optimizer.lr=0.01 \
    --lr_scheduler=OneCycleLR \
    --lr_scheduler.anneal_strategy=linear

We've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:

from pytorch_lightning.utilities.cli import LightningCLI

cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)

Try out LightningCLI!

CheckpointIO Plugins

As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.

from pytorch_lightning.plugins import CheckpointIO

class CustomCheckpointIO(CheckpointIO):

    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
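To show what the three methods might contain, here is a stand-alone sketch that stores checkpoints with `pickle` on the local filesystem. `PickleCheckpointIO` is a hypothetical name and deliberately does not import Lightning, so the storage logic can be read in isolation. In real use it would subclass CheckpointIO, whose methods may accept additional parameters not shown here.

```python
import os
import pickle
import tempfile


class PickleCheckpointIO:
    """Stand-alone sketch exposing save, load, and remove for checkpoints."""

    def save_checkpoint(self, checkpoint, path):
        # Write to a temporary file first, then rename it into place, so a
        # crash in the middle of a write never leaves a truncated file at `path`.
        directory = os.path.dirname(os.path.abspath(path))
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(checkpoint, f)
        os.replace(tmp_path, path)

    def load_checkpoint(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)

    def remove_checkpoint(self, path):
        # Removing a checkpoint that is already gone is treated as a no-op.
        if os.path.exists(path):
            os.remove(path)
```

The same three-method shape would apply to other backends, such as an object store, by swapping the file operations for the backend's client calls.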

BFloat16 Support

PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (was already supported for TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:

from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")

Enable Auto Parameters Tying

It is pretty common to share parameters within a model. However, TPUs don't retain shared parameters once the model is moved to the device. Lightning now supports automatic detection and re-assignment of shared parameters to alleviate this problem on TPUs.

Infinite Training

Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.

Note: you will want to avoid logging with on_epoch=True in case of max_steps=-1.

DeepSpeed Stage 1

DeepSpeed is a deep learning training optimization library that provides the means to train massive billion-parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol, which partitions your optimizer states across your GPUs to reduce memory usage.

from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)

For even more memory savings and model sharding advice, check out stage 2 & 3 as well in our multi-GPU docs.

Gradient Clipping Customization

By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:

# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm
        )

This means you can now implement state-of-the-art clipping algorithms with Lightning!

Determinism

Added support for torch.use_deterministic_algorithms. Read more about how it works here. You can enable it by setting:

from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)

Anomaly Detection

Lightning makes it easier to debug your code, so we've added support for torch.autograd.set_detect_anomaly. With this, PyTorch detects numerical anomalies like NaN or inf during the forward and backward passes. Read more about anomaly detection here.

from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)

DDP Debugging Improvements

Are you having a hard time debugging DDP on your remote machine? Now you can debug DDP locally on the CPU:

trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)

When everything works, switch back to GPU by changing only the accelerator. Check our documentation for more useful debugging tricks.
Note that this will not provide any speed benefits.

ModelSummary Callback

The ModelSummary callback generates a summary of all layers in a LightningModule. It currently works with the new RichProgressBar callback.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelSummary

trainer = Trainer(callbacks=[ModelSummary(max_depth=1)])

New Hooks

An on_exception Callback hook has been added which allows the user to perform custom exception handling.

from pytorch_lightning.callbacks import Callback


class MyCallback(Callback):
    def on_exception(self, trainer, pl_module, exception):
        # whatever you want!
        ...

Experimental Features

Inter Batch Parallelism

The inter-batch parallelism feature aims to hide the latency of the host-to-device copy of input batches behind computationally intensive operations. In some use cases, it can speed up training. This feature is experimental and subject to change, so it is opt-in through an environment variable.

PL_INTER_BATCH_PARALLELISM=1 python train.py
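The general idea of overlapping a transfer with computation can be sketched with a background thread. This is only an analogy in plain Python and not how Lightning implements the feature: `prefetching` and its `load` argument (a stand-in for a slow host-to-device copy) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching(batches, load):
    """Yield load(batch) for each batch, loading the next one in the background.

    `load` stands in for a slow copy of a batch to the device (hypothetical).
    """
    iterator = iter(batches)
    try:
        first = next(iterator)
    except StopIteration:
        return  # nothing to load
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, first)
        for upcoming in iterator:
            ready = pending.result()
            # Start loading the following batch before handing back this one,
            # so the load overlaps with the caller's work on `ready`.
            pending = pool.submit(load, upcoming)
            yield ready
        yield pending.result()
```

Iterating with `for batch in prefetching(loader, copy_fn): ...` yields the same batches in the same order as loading them one by one, but the load of batch i+1 runs while the loop body works on batch i.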

Training Step With DataLoader Iterator

If your training_step signature takes a dataloader_iter argument, Lightning will pass the dataloader iterator to it directly. This can be useful for recommendation engine optimization.
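Dispatching on a function's signature can be sketched with the standard `inspect` module. The dispatcher and both step functions below are hypothetical illustrations and not Lightning's code; the real hook is a LightningModule method and its exact parameters may differ.

```python
import inspect


def call_training_step(training_step, dataloader, batch_idx=0):
    """Hypothetical dispatcher: choose what to pass based on the signature."""
    iterator = iter(dataloader)
    params = inspect.signature(training_step).parameters
    if "dataloader_iter" in params:
        # Hand over the iterator itself; the step decides when to fetch.
        return training_step(dataloader_iter=iterator, batch_idx=batch_idx)
    # Default flavor: fetch one batch and pass it in.
    return training_step(batch=next(iterator), batch_idx=batch_idx)


def step_with_batch(batch, batch_idx):
    return sum(batch)


def step_with_iterator(dataloader_iter, batch_idx):
    # Pull two batches in a single step, e.g. to pipeline their processing.
    first, second = next(dataloader_iter), next(dataloader_iter)
    return sum(first) + sum(second)
```

With the iterator flavor, the step controls how many batches it consumes and when, which is what makes custom fetching schedules possible.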

Meta Module

PyTorch 1.10 introduces meta tensors, which are tensors without data. Building on this, PyTorch Lightning provides an init_meta_context context manager and a materialize_module function to handle large sharded models.

Backward Incompatible Changes

Here is a selection of important changes that are not backward compatible with versions < 1.5. The full list of changes and removals can be found in the changelog at the bottom.

Parsing of GPU Argument

The interpretation of the gpus Trainer argument when provided as a string has changed: Trainer(gpus="n") (string) no longer selects the GPU index n and instead selects the first n devices. In order to preserve the old behavior, you will have to change your code to Trainer(gpus=[n]) (list of indices) or Trainer(gpus="n,") (string with comma separated indices).
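The new interpretation can be summarized with a small function. `select_gpus` is a hypothetical illustration of the rules described above and not Lightning's parser; it covers only an integer or digit string (a count), a list of indices, and a comma-separated string of indices.

```python
def select_gpus(gpus, available):
    """Return the device indices a `gpus` value would select (illustrative)."""
    if isinstance(gpus, (list, tuple)):
        # A list is always a list of explicit device indices.
        return list(gpus)
    if isinstance(gpus, str):
        if "," in gpus:
            # "n," or "0,2": explicit indices; empty pieces are ignored.
            return [int(piece) for piece in gpus.split(",") if piece.strip()]
        # "n" without a comma now means a count, not an index.
        gpus = int(gpus)
    if gpus < 0:
        raise ValueError("negative counts are outside the scope of this sketch")
    # An integer count selects the first n available devices.
    return list(available[:gpus])
```

On a machine with devices `[0, 1, 2, 3]`, the string `"2"` selects `[0, 1]`, while `"2,"` and `[2]` both select only device 2.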

Distributed Backend

The argument distributed_backend has been removed from the Trainer in favor of the new accelerator and strategy arguments (#10017).

# BEFORE
trainer = Trainer(distributed_backend="ddp_spawn", gpus=2)

# NOW
trainer = Trainer(strategy="ddp_spawn", accelerator="gpu", devices=2)

Trainer Argument Defaults

The default value of the max_steps Trainer argument has changed from None to -1 (#9460). You can no longer specify Trainer(max_steps=None) and if you did, you need to change the code to Trainer(max_steps=-1).
The default value of accumulate_grad_batches has changed from 1 to None (#9652).

Loading Model Weights

The model weights now get loaded in all cases when the checkpoint path is provided in Trainer.{validate,test,predict}, regardless of whether the model instance is provided or not.

# model reference provided:
trainer.test(model, ckpt_path=None) # use provided model
trainer.test(model, ckpt_path="best") # load best model
trainer.test(model, ckpt_path="my_path") # load path

# model reference not provided
trainer.fit(model)
trainer.test(ckpt_path=None) # load best model (NEW BEHAVIOR!)
trainer.test(ckpt_path="my_path") # load path (NEW BEHAVIOR!)

Users who relied on trainer.test(ckpt_path=None) to load the latest model need to change their code to trainer.test(model) and pass the model reference directly.
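The cases listed in this section reduce to a small decision function. `weights_used_by_test` is a hypothetical helper that only restates the behavior described here and is not part of the Trainer; the combination of no model reference with ckpt_path="best" is not listed above and is assumed to load the best checkpoint.

```python
def weights_used_by_test(model_provided, ckpt_path=None):
    """Describe which weights `trainer.test` evaluates (hypothetical helper)."""
    if ckpt_path is None:
        # Without a path, a provided model is used as-is; otherwise the
        # best checkpoint from the preceding fit is loaded.
        return "provided model" if model_provided else "best checkpoint"
    if ckpt_path == "best":
        return "best checkpoint"
    # Any other path is loaded, whether or not a model was provided.
    return f"checkpoint at {ckpt_path}"
```

The two cases marked as new behavior are the ones where `model_provided` is false.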

Lightning CLI

All CLI commands now need to include the Trainer method to run as the first command, i.e., one of fit, validate, test, predict.

# BEFORE
python script.py     --trainer.max_epochs=123

# NOW
python script.py fit --trainer.max_epochs=123

For questions and help regarding CLI, join our Lightning-CLI Slack channel.

Optimizer Hooks

Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360). If you relied on the previous behavior, we recommend switching to Manual Optimization altogether.
The on_before_optimizer_step hook previously ran before the entire optimization closure, including backward. This was unintended behavior; if you rely on it, move your code to the new on_before_backward hook.

Changes in Accelerators and Plugins

Changes in Accelerators and Plugins were made without deprecation due to their experimental state. The API is expected to become stable in 1.6.

Removed attributes and methods:

Accelerator.{call_configure_sharded_model_hook, connect_training_type_plugin, connect_precision_plugin, on_reset_*_dataloader, on_train_epoch_end, on_save, post_optimizer_step, update_global_step}
TrainingTypePlugin.{call_configure_sharded_model_hook, on_reset_*_dataloader, on_save, post_optimizer_step, update_global_step}
PrecisionPlugin.{post_optimizer_step}
ParallelPlugin.teardown

Changed signatures:

The accelerator and training type plugin setup hooks no longer have a model argument.

Other changes:

The base Plugin class has been removed.
HorovodPlugin.all_gather now returns a torch.Tensor instead of a list.
The LightningModule no longer gets wrapped with data-parallel modules when not fitting in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin, DDPSpawnShardedPlugin.

Full Changelog

Added

Added support for monitoring the learning rate without schedulers in LearningRateMonitor (#9786)
Added registration of ShardedTensor state dict hooks in LightningModule.__init__ if the PyTorch version supports ShardedTensor (#8944)
Added error handling including calling of on_keyboard_interrupt() and on_exception() for all entrypoints (fit, validate, test, predict) (#8819)
Added a flavor of training_step that takes dataloader_iter as an argument (#8807)
Added a state_key property to the Callback base class (#6886)
Added progress tracking to loops:
- Integrated TrainingEpochLoop.total_batch_idx (#8598)
- Added BatchProgress and integrated TrainingEpochLoop.is_last_batch (#9657)
- Avoid optional Tracker attributes (#9320)
- Reset current progress counters when restarting an epoch loop that had already finished (#9371)
- Call reset_on_restart in the loop's reset hook instead of when loading a checkpoint (#9561)
- Use completed over processed in reset_on_restart (#9656)
- Renamed reset_on_epoch to reset_on_run (#9658)
Added batch_size and rank_zero_only arguments for log_dict to match log (#8628)
Added a check for unique GPU ids (#8666)
Added ResultCollection state_dict to the Loop state_dict and added support for distributed reload (#8641)
Added DeepSpeed collate checkpoint utility function (#8701)
Added a handles_accumulate_grad_batches property to the training type plugins (#8856)
Added a warning to WandbLogger when reusing a wandb run (#8714)
Added log_graph argument for watch method of WandbLogger (#8662)
LightningCLI additions:
- Added LightningCLI(run=False|True) to choose whether to run a Trainer subcommand (#8751)
- Added support to call any trainer function from the LightningCLI via subcommands (#7508)
- Allow easy trainer re-instantiation (#7508)
- Automatically register all optimizers and learning rate schedulers (#9565)
- Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
- Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
- Support passing lists of callbacks via command line (#8815)
- Support shorthand notation to instantiate models (#9588)
- Support shorthand notation to instantiate datamodules (#10011)
- Added multifile option to LightningCLI to enable/disable config saving to preserve multiple files structure (#9073)
Fault-tolerant training:
- Added FastForwardSampler and CaptureIterableDataset injection to data loading utilities (#8366)
- Added DataFetcher to control fetching flow (#8890)
- Added SharedCycleIteratorState to prevent infinite loop (#8889)
- Added CaptureMapDataset for state management in map-style datasets (#8891)
- Added Fault Tolerant Training to DataFetcher (#8891)
- Replaced old prefetch iterator with new DataFetcher in training loop (#8953)
- Added partial support for global random state fault-tolerance in map-style datasets (#8950)
- Converted state to tuple explicitly when setting Python random state (#9401)
- Added support for restarting an optimizer loop (multiple optimizers) (#9537)
- Added support for restarting within Evaluation Loop (#9563)
- Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
- Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
- Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
Checkpoint saving and loading extensibility:
- Added CheckpointIO plugin to expose checkpoint IO from training type plugin (#8743)
- Refactored CheckpointConnector to offload validation logic to the CheckpointIO plugin (#9045)
- Added remove_checkpoint to CheckpointIO plugin by moving the responsibility out of the ModelCheckpoint callback (#9373)
- Added XLACheckpointIO plugin (#9972)
Loop customization:
- Added Closure and AbstractClosure classes (#8642)
- Refactored TrainingBatchLoop and extracted OptimizerLoop, splitting off automatic optimization into its own loop (#9191)
- Removed TrainingBatchLoop.backward(); manual optimization now calls directly into Accelerator.backward() and automatic optimization handles backward in new OptimizerLoop (#9265)
- Extracted ManualOptimization logic from TrainingBatchLoop into its own separate loop class (#9266)
- Added OutputResult and ManualResult classes (#9437, #9424)
- Marked OptimizerLoop.backward as protected (#9514)
- Marked FitLoop.should_accumulate as protected (#9515)
- Marked several methods in PredictionLoop as protected: on_predict_start, on_predict_epoch_end, on_predict_end, on_predict_model_eval (#9516)
- Marked several methods in EvaluationLoop as protected: get_max_batches, on_evaluation_model_eval, on_evaluation_model_train, on_evaluation_start, on_evaluation_epoch_start, on_evaluation_epoch_end, on_evaluation_end, reload_evaluation_dataloaders (#9516)
- Marked several methods in EvaluationEpochLoop as protected: on_evaluation_batch_start, evaluation_step, evaluation_step_end (#9516)
- Added yielding_training_step example (#9983)
Added support for saving and loading state of multiple callbacks of the same type (#7187)
Added DeepSpeed Stage 1 support (#8974)
Added Python dataclass support for LightningDataModule (#8272)
Added sanitization of tensors when they get logged as hyperparameters in TensorBoardLogger (#9031)
Added InterBatchParallelDataFetcher (#9020)
Added DataLoaderIterDataFetcher (#9020)
Added DataFetcher within Fit / Evaluation Loop (#9047)
Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
Added Rich integration:
- Added Rich progress bar (#8929, #9559)
- Added support for iterable datasets (#9734)
- Added RichModelSummary callback (#9546)
- Added configure_columns method to RichProgressBar (#10288)
- Added leave argument to RichProgressBar (#10301)
Added input validation logic for precision (#9080)
Added support for CPU AMP autocast (#9084)
Added on_exception callback hook (#9183)
Added a warning to DeepSpeed when inferring batch size (#9221)
Added ModelSummary callback (#9344)
Added log_images, log_text and log_table to WandbLogger (#9545)
Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
Added get_device_stats to the Accelerator interface and added its implementation for GPU and TPU (#9586)
Added a warning when an unknown key is encountered in the optimizer configuration, and when OneCycleLR is used with "interval": "epoch" (#9666)
Added DeviceStatsMonitor callback (#9712)
Added enable_progress_bar to the Trainer constructor (#9664)
Added pl_legacy_patch load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166)
Added support for torch.use_deterministic_algorithms (#9121)
Added automatic parameters tying for TPUs (#9525)
Added support for torch.autograd.set_detect_anomaly through Trainer constructor argument detect_anomaly (#9848)
Added enable_model_summary flag to Trainer (#9699)
Added strategy argument to Trainer (#8597)
Added init_meta_context, materialize_module utilities (#9920)
Added TPUPrecisionPlugin (#10020)
Added torch.bfloat16 support:
- Added bfloat16 support for Lightning Trainer (#9049)
- Renamed TPUHalfPrecisionPlugin to TPUBf16PrecisionPlugin (#10026)
- Default to precision=bf16 on CPU when precision=16 is passed (#10033)
- Added support for torch.autocast (#10053)
Added kfold example for loop customization (#9965)
LightningLite:
- Added PrecisionPlugin.forward_context, making it the default implementation for all {train,val,test,predict}_step_context() methods (#9988)
- Added DDPSpawnPlugin.spawn() for spawning new processes of a given function (#10018, #10022)
- Added TrainingTypePlugin.{_setup_model, _setup_optimizer} methods (#9994, #10064)
- Implemented DataParallelPlugin._setup_model (#10010)
- Implemented DeepSpeedPlugin._setup_model_and_optimizers (#10009, #10064)
- Implemented {DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers (#10028, #10064)
- Added optional model argument to the optimizer_step methods in accelerators and plugins (#10023)
- Updated precision attributes in DeepSpeedPlugin (#10164)
- Added the ability to return a result from rank 0 in DDPSpawnPlugin.spawn (#10162)
- Added pytorch_lightning.lite package (#10175)
- Added LightningLite documentation (#10043)
- Added LightningLite examples (#9987)
- Made the _LiteDataLoader an iterator and added support for custom dataloaders (#10279)
Added use_omegaconf argument to save_hparams_to_yaml plugin (#9170)
Added ckpt_path argument for Trainer.fit() (#10061)
Added auto_device_count method to Accelerators (#10222)
Added support for devices="auto" (#10264)
Added a filename argument in ModelCheckpoint.format_checkpoint_name (#9818)
Added support for empty gpus list to run on CPU (#10246)
Added a warning if multiple batch sizes are found from ambiguous batch (#10247)

Changed

Trainer now raises a MisconfigurationException when its methods are called with ckpt_path="best" but a checkpoint callback isn't configured (#9841)
Setting Trainer(accelerator="ddp_cpu") now does not spawn a subprocess if num_processes is kept 1 along with num_nodes > 1 (#9603)
Module imports are now catching ModuleNotFoundError instead of ImportError (#9867)
pytorch_lightning.loggers.neptune.NeptuneLogger is now consistent with the new neptune-client API; the old neptune-client API is supported by NeptuneClient from the neptune-contrib repo (#6867)
Parsing of enum-type hyperparameters to be saved in the hparams.yaml file by the TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
Parsing of the gpus Trainer argument has changed: gpus="n" (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
iteration_count and other index attributes in the loops have been replaced with progress dataclasses (#8477)
The trainer.lightning_module reference is now properly set at the very beginning of a run (#8536)
The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
The model argument of the Trainer functions reset_{train,val,test,predict}_dataloader, reset_train_val_dataloaders, and request_dataloader is now optional (#8536)
Saved checkpoints will no longer use the type of a Callback as the key to avoid issues with unpickling (#6886)
Improved string conversion for ResultCollection (#8622)
LightningCLI changes:
- LightningCLI.init_parser now returns the parser instance (#8721)
- LightningCLI.add_core_arguments_to_parser, LightningCLI.parse_arguments now take a parser argument (#8721)
- LightningCLI.instantiate_trainer now takes a config and a list of callbacks (#8721)
- Split LightningCLI.add_core_arguments_to_parser into LightningCLI.add_default_arguments_to_parser + LightningCLI.add_core_arguments_to_parser (#8721)
The accelerator and training type plugin setup hooks no longer have a model argument (#8536)
The accelerator and training type plugin update_global_step hook has been removed (#8856)
The coverage of self.log-ing in any LightningModule or Callback hook has been improved (#8498)
self.log-ing without a Trainer reference now raises a warning instead of an exception (#9733)
Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader now takes a RunningStage enum instance (#8858)
Changed rank_zero_warn to NotImplementedError in the {train, val, test, predict}_dataloader hooks that Lightning(Data)Module uses (#9161)
Moved block_ddp_sync_behaviour out of TrainingBatchLoop to loop utilities (#9192)
Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360)
Changed logging of LightningModule and LightningDataModule hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
seed_everything now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
The Trainer now calls TrainingTypePlugin collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
The tuner now uses a unique filename to save a temporary checkpoint (#9682)
Changed HorovodPlugin.all_gather to return a torch.Tensor instead of a list (#9696)
Changed Trainer connectors to be protected attributes:
- Configuration Validator (#9779)
The current_epoch and global_step attributes now get restored irrespective of the Trainer task (#9413)
Trainer now raises an exception when requesting amp_level with native amp_backend (#9755)
Updated the logic to check for accumulation steps with DeepSpeed (#9826)
pytorch_lightning.utilities.grads.grad_norm now raises an exception if parameter norm_type <= 0 (#9765)
Updated error message for interactive incompatible plugins (#9896)
Moved the optimizer_step and clip_gradients hook from the Accelerator and TrainingTypePlugin into the PrecisionPlugin (#10143, #10029)
NativeMixedPrecisionPlugin and its subclasses now take an optional GradScaler instance (#10055)
Trainer is now raising a MisconfigurationException instead of a warning if Trainer.{validate/test} is missing required methods (#10016)
Changed default value of the max_steps Trainer argument from None to -1 (#9460)
LightningModule now raises an error when calling log(on_step=False, on_epoch=False) (#10227)
Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
Trainer now raises a MisconfigurationException when the total length of the dataloader across ranks is zero, and gives a warning when the total length is non-zero but the local rank's length is zero (#9827)
Changed the model size calculation using ByteCounter (#10123)
Enabled on_load_checkpoint for LightningDataModule for all trainer_fn (#10238)
Allowed separate config files for parameters with class type when LightningCLI is in subclass_mode=False (#10286)

Deprecated

Deprecated Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
Deprecated Trainer.terminate_on_nan public attribute access (#9849)
Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
Deprecated LightningModule.model_size (#8343)
Deprecated DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
Deprecated add_to_queue, get_from_queue from LightningModule in favor of corresponding methods in the DDPSpawnPlugin (#9118)
Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
Deprecated prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
Deprecated the TestTubeLogger (#9065)
Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
Deprecated on_keyboard_interrupt callback hook in favor of new on_exception hook (#9260)
Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
Deprecated passing flush_logs_every_n_steps as a Trainer argument, instead pass it to the logger init if supported (#9366)
Deprecated LightningLoggerBase.close, LoggerCollection.close in favor of LightningLoggerBase.finalize, LoggerCollection.finalize (#9422)
Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616)
Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
Deprecated Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
Deprecated the LightningModule.on_post_move_to_device method (#9525)
Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
Deprecated log_gpu_memory, gpu_metrics, and util funcs in favor of DeviceStatsMonitor callback (#9921)
Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of DeviceStatsMonitor callback (#9924)
Deprecated setting Trainer(max_steps=None); To turn off the limit, set Trainer(max_steps=-1) (default) (#9460)
Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
Deprecated ClusterEnvironment.creates_children() in favor of ClusterEnvironment.creates_processes_externally (property) (#10106)
Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
Deprecated lr_sch_names from LearningRateMonitor (#10066)
Deprecated ProgressBar callback in favor of TQDMProgressBar (#10134)

Removed

Removed deprecated metrics (#8586)
Removed the deprecated outputs argument in both the LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#8587)
Removed the deprecated TrainerLoggingMixin class (#8609)
Removed the deprecated TrainerTrainingTricksMixin class (#8679)
Removed the deprecated optimizer_idx from training_step as an accepted argument in manual optimization (#8576)
Removed support for the deprecated on_save_checkpoint signature. The hook now takes a checkpoint positional parameter (#8697)
Removed support for the deprecated on_load_checkpoint signature. The hook now takes a pl_module positional parameter (#8697)
Removed the deprecated save_function property in ModelCheckpoint (#8680)
Removed the deprecated model argument from ModelCheckpoint.save_checkpoint (#8688)
Removed the deprecated sync_step argument from WandbLogger (#8763)
Removed the deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#8826)
Removed LightningModule.write_predictions and LightningModule.write_predictions_dict (#8850)
Removed on_reset_*_dataloader hooks in TrainingType Plugins and Accelerators (#8858)
Removed deprecated GradInformation module in favor of pytorch_lightning.utilities.grads (#8831)
Removed TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
Removed {Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step (#9746)
Removed deprecated connect_precision_plugin and connect_training_type_plugin from Accelerator (#9019)
Removed on_train_epoch_end from Accelerator (#9035)
Removed InterBatchProcessor in favor of DataLoaderIterDataFetcher (#9052)
Removed Plugin in base_plugin.py in favor of accessing TrainingTypePlugin and PrecisionPlugin directly (#9066)
Removed teardown from ParallelPlugin (#8943)
Removed deprecated profiled_functions argument from PyTorchProfiler (#9178)
Removed deprecated pytorch_lightning.utilities.argparse_utils module (#9166)
Removed deprecated property Trainer.running_sanity_check in favor of Trainer.sanity_checking (#9209)
Removed the deprecated output_filename argument from BaseProfiler and its descendants in favor of dirpath and filename (#9214)
Removed deprecated property ModelCheckpoint.period in favor of ModelCheckpoint.every_n_epochs (#9213)
Removed deprecated auto_move_data decorator (#9231)
Removed deprecated property LightningModule.datamodule in favor of Trainer.datamodule (#9233)
Removed deprecated properties DeepSpeedPlugin.cpu_offload* in favor of offload_optimizer, offload_parameters and pin_memory (#9244)
Removed deprecated property AcceleratorConnector.is_using_torchelastic in favor of TorchElasticEnvironment.is_using_torchelastic() (#9729)
Removed pytorch_lightning.utilities.debugging.InternalDebugger (#9680)
Removed call_configure_sharded_model_hook property from Accelerator and TrainingTypePlugin (#9612)
Removed TrainerProperties mixin and moved property definitions directly into Trainer (#9495)
Removed a redundant warning with ModelCheckpoint(monitor=None) callback (#9875)
Removed epoch from trainer.logged_metrics (#9904)
Removed should_rank_save_checkpoint property from Trainer (#9433)
Removed deprecated distributed_backend from Trainer (#10017)
Removed process_idx from the {DDPSpawnPlugin,TPUSpawnPlugin}.new_process methods (#10022)
Removed automatic patching of {train,val,test,predict}_dataloader() on the LightningModule (#9764)
Removed pytorch_lightning.trainer.connectors.OptimizerConnector (#10120)
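
Some of the removals above change hook signatures that user callbacks may still rely on. The sketch below shows a callback written against the new signatures. It deliberately does not subclass pytorch_lightning.Callback so that it runs without Lightning installed; in a real project it would. The class name, the `callback_state` parameter name, and the convention that on_save_checkpoint returns the state later handed to on_load_checkpoint are assumptions and should be checked against the 1.5 Callback documentation.

```python
from typing import Any, Dict


class EpochCounter:
    """Illustrative stand-in for a Lightning callback (hypothetical name)."""

    def __init__(self) -> None:
        self.epochs_seen = 0

    # The deprecated `outputs` argument is gone (#8587).
    def on_train_epoch_end(self, trainer: Any, pl_module: Any) -> None:
        self.epochs_seen += 1

    # The hook now takes a `checkpoint` positional parameter (#8697).
    def on_save_checkpoint(
        self, trainer: Any, pl_module: Any, checkpoint: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Assumed convention: the returned dict is this callback's saved state.
        return {"epochs_seen": self.epochs_seen}

    # The hook now takes a `pl_module` positional parameter (#8697).
    def on_load_checkpoint(
        self, trainer: Any, pl_module: Any, callback_state: Dict[str, Any]
    ) -> None:
        self.epochs_seen = callback_state["epochs_seen"]


# Drive the hooks by hand; None stands in for the trainer and module.
counter = EpochCounter()
counter.on_train_epoch_end(None, None)
saved = counter.on_save_checkpoint(None, None, {})

restored = EpochCounter()
restored.on_load_checkpoint(None, None, saved)
```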

Fixed

Fixed ImageNet evaluation in example (#10179)
Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
Fixed move_metrics_to_cpu moving the loss to CPU while training on device (#9308)
Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
Fixed an issue with freeing memory of datafetchers during teardown (#9387)
Fixed a bug where the training step output needed to be deepcopy-ed (#9349)
Fixed an issue with freeing memory allocated by the data iterators in Loop.on_run_end (#9386, #9915)
Fixed BasePredictionWriter not returning the batch indices in a non-distributed setting (#9432)
Fixed an error when running in XLA environments with no TPU attached (#9572)
Fixed the check on logged torchmetrics whose compute() output is a multi-element tensor (#9582)
Fixed gradient accumulation for DDPShardedPlugin (#9122)
Fixed missing DeepSpeed distributed call (#9540)
Fixed an issue with wrapped LightningModule during evaluation. The LightningModule no longer gets wrapped with data-parallel modules when not fitting in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin, DDPSpawnShardedPlugin (#9096)
Fixed trainer.accumulate_grad_batches to be an int on init. The default value for it is now None inside Trainer (#9652)
Fixed broadcast in DDPPlugin and DDPSpawnPlugin to respect the src input (#9691)
Fixed self.log(on_epoch=True, reduce_fx=sum) for the on_batch_start and on_train_batch_start hooks (#9791)
Fixed self.log(on_epoch=True) for the on_batch_start and on_train_batch_start hooks (#9780)
Fixed restoring training state during Trainer.fit only (#9413)
Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
Fixed DeepSpeed GPU device IDs (#9847)
Reset val_dataloader in tuner/batch_size_scaling (#9857)
Fixed use of LightningCLI in computer_vision_fine_tuning.py example (#9934)
Fixed issue with non-init dataclass fields in apply_to_collection (#9963)
Reset val_dataloader in tuner/batch_size_scaling for binsearch (#9975)
Fixed logic to check for spawn in dataloader TrainerDataLoadingMixin._worker_check (#9902)
Fixed train_dataloader getting loaded twice when resuming from a checkpoint during Trainer.fit() (#9671)
Fixed LearningRateMonitor logging with multiple param groups optimizer with no scheduler (#10044)
Fixed undesired side effects being caused by Trainer patching dataloader methods on the LightningModule (#9764)
Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
Fixed on_before_optimizer_step getting called before the optimizer closure (including backward) has run (#10167)
Fixed monitor value in ModelCheckpoint getting moved to the wrong device in a special case where it becomes NaN (#10118)
Fixed creation of dirpath in BaseProfiler if it doesn't exist (#10073)
Fixed incorrect handling of sigterm (#10189)
Fixed bug where log(on_step=True, on_epoch=True, sync_dist=True) wouldn't reduce the value on step (#10227)
Fixed an issue with pl.utilities.seed.reset_seed converting the PL_SEED_WORKERS environment variable to bool (#10099)
Fixed iterating over a logger collection when fast_dev_run > 0 (#10232)
Fixed batch_size in ResultCollection not being reset to 1 on epoch end (#10242)
Fixed distrib_type not being set when training plugin instances are being passed to the Trainer (#10251)

Contributors

@adamjstewart @akihironitta @alessiobonfiglio @ananthsub @aphedges @awaelchli @bamblebam @Benjamin-Etheredge @borchero @Borda @borisdayma @bryant1410 @carmocca @cowwoc @daniellepintz @danielykim @edward-io @eladsegal @EricWiener @ethanwharris @four4fish @gau-nernst @hankyul2 @HansolEom @himanshu-dutta @I-iBot @jjenniferdai @jstjohn @justusschock @kainoj @kaushikb11 @kingyiusuen @Knarik1 @low5545 @lsqshr @mauvilsa @michele-arrival @nasnoisaac @ninginthecloud @popfido @pre-commit-ci @PuneetDabral @qmpzzpmq @rohitgr7 @ronif @roshikouhai @s-rog @samlurye @SeanNaren @shnela @sidml @stancld @stfwn @tangbinh @tchaton @thepurpleowl @Tshimanga @twsl @victorjoos @VirajBagal @wayi1 @weiji14 @yifuwang @yopknopixx

If we forgot someone, let us know :]
