PyTorch Lightning 1.5: LightningLite, Fault-Tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI v2, RichProgressBar, CheckpointIO Plugin, and Trainer Strategy Flag
The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!
Highlights
Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:
Fault-tolerant Training
Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you will be able to restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.
PL_FAULT_TOLERANT_TRAINING=1 python train.py
LightningLite
LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.
With just a few lines of code and no large refactoring, you get support for multi-device, multi-node, running on different accelerators (CPU, GPU, TPU), native automatic mixed precision (`half` and `bfloat16`), and double precision, in just a few seconds. And no special launcher required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!
import torch
import torch.nn.functional as F
from torch import optim
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # your own nn.Module; .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()
Loop Customization
The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.
Read our comprehensive introduction to loops
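To give a flavor of the API, here is a minimal, purely illustrative sketch of a custom loop. It assumes the `Loop` base class from `pytorch_lightning.loops` with its `done`/`reset`/`advance` interface and the experimental `trainer.fit_loop` assignment described in the loops documentation; the body of `advance` is a placeholder for your own optimization logic.

from pytorch_lightning import Trainer
from pytorch_lightning.loops import Loop


class CustomIterationLoop(Loop):
    """Runs a fixed number of custom iterations (illustrative only)."""

    def __init__(self, num_iterations):
        super().__init__()
        self.num_iterations = num_iterations
        self.iteration = 0

    @property
    def done(self):
        # the loop keeps advancing until this returns True
        return self.iteration >= self.num_iterations

    def reset(self):
        self.iteration = 0

    def advance(self):
        # your custom optimization step goes here
        self.iteration += 1


trainer = Trainer()
trainer.fit_loop = CustomIterationLoop(num_iterations=10)  # swap out the default loop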
New Rich Progress Bar
We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out:
pip install rich
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar
trainer = Trainer(callbacks=[RichProgressBar()])
New Trainer Arguments: Strategy and Devices
With the new strategy and devices arguments in the Trainer, it is now easier to switch from one type of hardware to another.
Before | After |
---|---|
`Trainer(accelerator="ddp", gpus=2)` | `Trainer(accelerator="gpu", devices=2, strategy="ddp")` |
`Trainer(accelerator="ddp_cpu", num_processes=2)` | `Trainer(accelerator="cpu", devices=2, strategy="ddp")` |
`Trainer(accelerator="tpu_spawn", tpu_cores=8)` | `Trainer(accelerator="tpu", devices=8)` |
The new `devices` argument is now agnostic to all accelerators, but the previous arguments `gpus`, `tpu_cores`, `ipus` are still available and work the same as before. In addition, it is now also possible to set `devices="auto"` or `accelerator="auto"` to select the best accelerator available on the hardware.
from pytorch_lightning import Trainer
trainer = Trainer(accelerator="auto", devices="auto")
LightningCLI V2
This release adds support for running not just `Trainer.fit` but any of the `Trainer` entry points!
python script.py fit
python script.py test
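For context, the `script.py` used in these commands can be as small as the sketch below. It assumes the Lightning 1.5 import location `pytorch_lightning.utilities.cli`; `MyModel` and `MyDataModule` are placeholders for your own classes.

# script.py
from pytorch_lightning.utilities.cli import LightningCLI

from my_project import MyDataModule, MyModel  # your own LightningModule / LightningDataModule

# exposes the fit / validate / test / predict subcommands shown above
cli = LightningCLI(MyModel, MyDataModule)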
LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules, and LightningDataModules. This greatly improves the command line experience, as only the class names and their arguments are required:
python script.py \
--trainer.callbacks=EarlyStopping \
--trainer.callbacks.patience=5 \
--trainer.callbacks.LearningRateMonitor \
--trainer.callbacks.logging_interval=epoch \
--optimizer=Adam \
--optimizer.lr=0.01 \
--lr_scheduler=OneCycleLR \
--lr_scheduler.anneal_strategy=linear
We've also added support for a manual mode where the CLI takes care of the instantiation, but you have control over the `Trainer` calls:
cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)
CheckpointIO Plugins
As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.
from pytorch_lightning.plugins import CheckpointIO


class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
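As a concrete, hedged example, the sketch below fills the three methods with plain `torch.save`/`torch.load`/`os.remove` and hands the plugin to the Trainer through the `plugins` argument; adapt the storage calls to your own infrastructure.

import os

import torch
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import CheckpointIO


class TorchCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        torch.save(checkpoint, path)  # e.g. write to local disk, object storage, etc.

    def load_checkpoint(self, path):
        return torch.load(path)

    def remove_checkpoint(self, path):
        os.remove(path)


trainer = Trainer(plugins=[TorchCheckpointIO()])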
BFloat16 Support
PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for `torch.bfloat16` on CPU (it was already supported for TPUs), enabling higher performance compared with `torch.float16`. Switch to bfloat16 training by setting the argument:
from pytorch_lightning import Trainer
trainer = Trainer(precision="bf16")
Enable Auto Parameters Tying
It is pretty common to share parameters within a model. However, TPUs do not retain shared parameters once the model is moved to the device. Lightning now supports automatic detection and re-assignment of shared parameters to alleviate this problem on TPUs.
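For illustration, this is the kind of weight sharing the feature targets. The module below is a made-up example; the tying via attribute assignment is the standard PyTorch pattern, and on TPUs Lightning now re-ties `decoder.weight` and `embedding.weight` automatically after the module is moved to the device.

import torch.nn as nn
from pytorch_lightning import LightningModule


class TiedLanguageModel(LightningModule):
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # share (tie) the input embedding and output projection weights
        self.decoder.weight = self.embedding.weight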
Infinite Training
Infinite training is now supported by setting `Trainer(max_epochs=-1)` for an unlimited number of epochs, or `Trainer(max_steps=-1)` for an endless epoch.

Note: you will want to avoid logging with `on_epoch=True` in case of `max_steps=-1`.
DeepSpeed Stage 1
DeepSpeed is a deep learning training optimization library that provides the means to train massive billion-parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol, which partitions your optimizer states across your GPUs to reduce memory.
from pytorch_lightning import Trainer
trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)
For even more memory savings and model-sharding advice, check out stages 2 and 3 as well in our multi-GPU docs.
Gradient Clipping Customization
By overriding the `LightningModule.configure_gradient_clipping` hook, you can customize gradient clipping to your needs:
# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm,
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm,
        )
This means you can now implement state-of-the-art clipping algorithms with Lightning!
Determinism
Added support for `torch.use_deterministic_algorithms`. Read more about how it works here. You can enable it by setting:
from pytorch_lightning import Trainer
trainer = Trainer(deterministic=True)
Anomaly Detection
Lightning makes it easier to debug your code, so we've added support for `torch.autograd.set_detect_anomaly`. With this, PyTorch detects numerical anomalies like NaN or inf during forward and backward. Read more about anomaly detection here.
from pytorch_lightning import Trainer
trainer = Trainer(detect_anomaly=True)
DDP Debugging Improvements
Are you having a hard time debugging DDP on your remote machine? Now you can debug DDP locally on the CPU:
trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)
When everything works, switch back to GPU by changing only the `accelerator`. Check our documentation for more useful debugging tricks.
Note that this will not provide any speed benefits.
ModelSummary Callback
Generates a summary of all layers in a LightningModule. This currently works with the new `RichProgressBar` callback.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelSummary
trainer = Trainer(callbacks=[ModelSummary(max_depth=1)])
New Hooks
An `on_exception` Callback hook has been added, which allows the user to perform custom exception handling.
class MyCallback(Callback):
    def on_exception(self, trainer, pl_module, exception):
        # whatever you want!
        ...
Experimental Features
Inter Batch Parallelism
The inter-batch parallelism feature aims at hiding the latency of the host-to-device copy of input batches behind computationally intensive operations. In some use cases, it can provide a training speed-up. This feature is experimental and subject to change, hence it is opt-in through an environment variable:
PL_INTER_BATCH_PARALLELISM=1 python train.py
Training Step With DataLoader Iterator
If your `training_step` signature takes a `dataloader_iter` argument, Lightning will pass the dataloader iterator to it directly. This can be useful for recommendation engine optimization.
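A minimal sketch of this experimental flavor is shown below; the exact call pattern may differ (see the docs), but the key point is that `dataloader_iter` replaces the usual `batch` argument and you fetch batches yourself. `compute_loss` is a hypothetical helper.

from pytorch_lightning import LightningModule


class MyRecSysModel(LightningModule):
    def training_step(self, dataloader_iter):
        # you decide when (and how many) batches to pull from the iterator
        batch = next(dataloader_iter)
        loss = self.compute_loss(batch)  # hypothetical helper
        return loss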
Meta Module
PyTorch 1.10 introduces meta tensors: tensors that carry shape and dtype information but no data. Building on this, PyTorch Lightning provides an `init_meta_context` context manager and a `materialize_module` function to handle large sharded models.
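A sketch of how the two utilities fit together, assuming they live in `pytorch_lightning.utilities.meta` (as in 1.5) and with `MyHugeModel` as a placeholder for your own module:

from pytorch_lightning.utilities.meta import init_meta_context, materialize_module


with init_meta_context():
    # parameters are created on the "meta" device: shapes only, no memory allocated
    model = MyHugeModel()

# allocate and initialize the real parameters once they are actually needed
materialize_module(model)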
Backward Incompatible Changes
Here is a selection of important changes that are not backward compatible with versions < 1.5. The full list of changes and removals can be found in the changelog at the bottom.
Parsing of GPU Argument
The interpretation of the `gpus` Trainer argument when provided as a string has changed: `Trainer(gpus="n")` (string) no longer selects the GPU index n and instead selects the first n devices. In order to preserve the old behavior, you will have to change your code to `Trainer(gpus=[n])` (list of indices) or `Trainer(gpus="n,")` (string with comma separated indices).
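In code, using n = 3 as an example:

# BEFORE: the string selected the GPU with index 3
trainer = Trainer(gpus="3")

# NOW: the string selects the first 3 devices; use one of these to keep selecting index 3
trainer = Trainer(gpus=[3])
trainer = Trainer(gpus="3,")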
Distributed Backend
The argument `distributed_backend` has been removed from the `Trainer` in favor of the new `accelerator` and `strategy` arguments (#10017).
# BEFORE
trainer = Trainer(distributed_backend="ddp_spawn", gpus=2)
# NOW
trainer = Trainer(strategy="ddp_spawn", accelerator="gpu", devices=2)
Trainer Argument Defaults
- The default value of the `max_steps` Trainer argument has changed from `None` to `-1` (#9460). You can no longer specify `Trainer(max_steps=None)`, and if you did, you need to change the code to `Trainer(max_steps=-1)`.
- The default value of `accumulate_grad_batches` has changed from `1` to `None` (#9652).
Loading Model Weights
The model weights now get loaded in all cases when the checkpoint path is provided in `Trainer.{validate,test,predict}`, regardless of whether the model instance is provided or not.
# model reference provided:
trainer.test(model, ckpt_path=None) # use provided model
trainer.test(model, ckpt_path="best") # load best model
trainer.test(model, ckpt_path="my_path") # load path
# model reference not provided
trainer.fit(model)
trainer.test(ckpt_path=None) # load best model (NEW BEHAVIOR!)
trainer.test(ckpt_path="my_path") # load path (NEW BEHAVIOR!)
Users who relied on `trainer.test(ckpt_path=None)` to load the latest model need to change their code to `trainer.test(model)` and pass the model reference directly.
Lightning CLI
All CLI commands now need to include the Trainer method to run as the first command, i.e., one of `fit`, `validate`, `test`, `predict`.
# BEFORE
python script.py --trainer.max_epochs=123
# NOW
python script.py fit --trainer.max_epochs=123
For questions and help regarding CLI, join our Lightning-CLI Slack channel.
Optimizer Hooks
- Executing the `optimizer_closure` is now required when overriding the `optimizer_step` hook (#9360). If you relied on the previous behavior, we recommend switching to Manual Optimization altogether. See the sketch after this list.
- The `on_before_optimizer_step` hook previously ran before the entire optimization closure, including backward. This was unintended behavior; if you rely on this, move your code to the new `on_before_backward` hook.
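For reference, a minimal sketch of an `optimizer_step` override that satisfies the new requirement; the argument list follows the LightningModule hook in 1.5, with the remaining keyword arguments absorbed by `**kwargs` (check the docs for the full signature):

from pytorch_lightning import LightningModule


class MyModel(LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
        # the closure runs training_step + backward; it must be executed,
        # either by passing it to the optimizer or by calling it yourself
        optimizer.step(closure=optimizer_closure)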
Changes in Accelerators and Plugins
Changes to Accelerators and Plugins were made without a deprecation phase due to their experimental state. The API is expected to become stable in 1.6.
Removed attributes and methods:

- `Accelerator.{call_configure_sharded_model_hook, connect_training_type_plugin, connect_precision_plugin, on_reset_*_dataloader, on_train_epoch_end, on_save, post_optimizer_step, update_global_step}`
- `TrainingTypePlugin.{call_configure_sharded_model_hook, on_reset_*_dataloader, on_save, post_optimizer_step, update_global_step}`
- `PrecisionPlugin.post_optimizer_step`
- `ParallelPlugin.teardown`
Changed signatures:
- The accelerator and training type plugin `setup` hooks no longer have a `model` argument.
Other changes:
- The base `Plugin` class has been removed.
- `HorovodPlugin.all_gather` now returns a `torch.Tensor` instead of a list.
- The LightningModule no longer gets wrapped with data-parallel modules when not fitting in `DDPPlugin`, `DDPSpawnPlugin`, `DDPShardedPlugin`, `DDPSpawnShardedPlugin`.
Full Changelog
Added
- Added support for monitoring the learning rate without schedulers in
LearningRateMonitor
(#9786) - Added registration of
ShardedTensor
state dict hooks inLightningModule.__init__
if the PyTorch version supportsShardedTensor
(#8944) - Added error handling including calling of
on_keyboard_interrupt()
andon_exception()
for all entrypoints (fit, validate, test, predict) (#8819) - Added a flavor of
training_step
that takesdataloader_iter
as an argument (#8807) - Added a
state_key
property to theCallback
base class (#6886) - Added progress tracking to loops:
- Integrated
TrainingEpochLoop.total_batch_idx
(#8598) - Added
BatchProgress
and integratedTrainingEpochLoop.is_last_batch
(#9657) - Avoid optional
Tracker
attributes (#9320) - Reset
current
progress counters when restarting an epoch loop that had already finished (#9371) - Call
reset_on_restart
in the loop'sreset
hook instead of when loading a checkpoint (#9561) - Use
completed
overprocessed
inreset_on_restart
(#9656) - Renamed
reset_on_epoch
toreset_on_run
(#9658)
- Integrated
- Added
batch_size
andrank_zero_only
arguments forlog_dict
to matchlog
(#8628) - Added a check for unique GPU ids (#8666)
- Added
ResultCollection
state_dict to the Loopstate_dict
and added support for distributed reload (#8641) - Added DeepSpeed collate checkpoint utility function (#8701)
- Added a
handles_accumulate_grad_batches
property to the training type plugins (#8856) - Added a warning to
WandbLogger
when reusing a wandb run (#8714) - Added
log_graph
argument forwatch
method ofWandbLogger
(#8662) LightningCLI
additions:- Added
LightningCLI(run=False|True)
to choose whether to run aTrainer
subcommand (#8751) - Added support to call any trainer function from the
LightningCLI
via subcommands (#7508) - Allow easy trainer re-instantiation (#7508)
- Automatically register all optimizers and learning rate schedulers (#9565)
- Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
- Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
- Support passing lists of callbacks via command line (#8815)
- Support shorthand notation to instantiate models (#9588)
- Support shorthand notation to instantiate datamodules (#10011)
- Added
multifile
option toLightningCLI
to enable/disable config saving to preserve multiple files structure (#9073)
- Added
- Fault-tolerant training:
- Added
FastForwardSampler
andCaptureIterableDataset
injection to data loading utilities (#8366) - Added
DataFetcher
to control fetching flow (#8890) - Added
SharedCycleIteratorState
to prevent infinite loop (#8889) - Added
CaptureMapDataset
for state management in map-style datasets (#8891) - Added Fault Tolerant Training to
DataFetcher
(#8891) - Replaced old prefetch iterator with new
DataFetcher
in training loop (#8953) - Added partial support for global random state fault-tolerance in map-style datasets (#8950)
- Converted state to tuple explicitly when setting Python random state (#9401)
- Added support for restarting an optimizer loop (multiple optimizers) (#9537)
- Added support for restarting within Evaluation Loop (#9563)
- Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
- Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
- Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
- Added
- Checkpoint saving and loading extensibility:
- Added
CheckpointIO
plugin to expose checkpoint IO from training type plugin (#8743) - Refactored
CheckpointConnector
to offload validation logic to theCheckpointIO
plugin (#9045) - Added
remove_checkpoint
toCheckpointIO
plugin by moving the responsibility out of theModelCheckpoint
callback (#9373) - Added
XLACheckpointIO
plugin (#9972)
- Added
- Loop customization:
- Added
Closure
andAbstractClosure
classes (#8642) - Refactored
TrainingBatchLoop
and extractedOptimizerLoop
, splitting off automatic optimization into its own loop (#9191) - Removed
TrainingBatchLoop.backward()
; manual optimization now calls directly intoAccelerator.backward()
and automatic optimization handles backward in newOptimizerLoop
(#9265) - Extracted
ManualOptimization
logic fromTrainingBatchLoop
into its own separate loop class (#9266) - Added
OutputResult
andManualResult
classes (#9437, #9424) - Marked
OptimizerLoop.backward
as protected (#9514) - Marked
FitLoop.should_accumulate
as protected (#9515) - Marked several methods in
PredictionLoop
as protected:on_predict_start
,on_predict_epoch_end
,on_predict_end
,on_predict_model_eval
(#9516) - Marked several methods in
EvaluationLoop
as protected:get_max_batches
,on_evaluation_model_eval
,on_evaluation_model_train
,on_evaluation_start
,on_evaluation_epoch_start
,on_evaluation_epoch_end
,on_evaluation_end
,reload_evaluation_dataloaders
(#9516) - Marked several methods in
EvaluationEpochLoop
as protected:on_evaluation_batch_start
,evaluation_step
,evaluation_step_end
(#9516) - Added
yielding_training_step
example (#9983)
- Added
- Added support for saving and loading state of multiple callbacks of the same type (#7187)
- Added DeepSpeed Stage 1 support (#8974)
- Added
Python dataclass
support forLightningDataModule
(#8272) - Added sanitization of tensors when they get logged as hyperparameters in
TensorBoardLogger
(#9031) - Added
InterBatchParallelDataFetcher
(#9020) - Added
DataLoaderIterDataFetcher
(#9020) - Added
DataFetcher
withinFit / Evaluation
Loop (#9047) - Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
- Added Rich integration:
- Added input validation logic for precision (#9080)
- Added support for CPU AMP autocast (#9084)
- Added
on_exception
callback hook (#9183) - Added a warning to DeepSpeed when inferring batch size (#9221)
- Added
ModelSummary
callback (#9344) - Added
log_images
,log_text
andlog_table
toWandbLogger
(#9545) - Added
PL_RECONCILE_PROCESS
environment variable to enable process reconciliation regardless of cluster environment settings (#9389) - Added
get_device_stats
to the Accelerator interface and added its implementation for GPU and TPU (#9586) - Added a warning when an unknown key is encountered in the optimizer configuration, and when
OneCycleLR
is used with"interval": "epoch"
(#9666) - Added
DeviceStatsMonitor
callback (#9712) - Added
enable_progress_bar
to the Trainer constructor (#9664) - Added
pl_legacy_patch
load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166) - Added support for
torch.use_deterministic_algorithms
(#9121) - Added automatic parameters tying for TPUs (#9525)
- Added support for
torch.autograd.set_detect_anomaly
throughTrainer
constructor argumentdetect_anomaly
(#9848) - Added
enable_model_summary
flag to Trainer (#9699) - Added
strategy
argument to Trainer (#8597) - Added
init_meta_context
,materialize_module
utilities (#9920) - Added
TPUPrecisionPlugin
(#10020) - Added
torch.bfloat16
support: - Added
kfold
example for loop customization (#9965) - LightningLite:
- Added
PrecisionPlugin.forward_context
, making it the default implementation for all{train,val,test,predict}_step_context()
methods (#9988) - Added
DDPSpawnPlugin.spawn()
for spawning new processes of a given function (#10018, #10022) - Added
TrainingTypePlugin.{_setup_model, _setup_optimizer}
methods (#9994, #10064) - Implemented
DataParallelPlugin._setup_model
(#10010) - Implemented
DeepSpeedPlugin._setup_model_and_optimizers
(#10009, #10064) - Implemented
{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers
(#10028, #10064) - Added optional
model
argument to theoptimizer_step
methods in accelerators and plugins (#10023) - Updated precision attributes in
DeepSpeedPlugin
(#10164) - Added the ability to return a result from rank 0 in
DDPSpawnPlugin.spawn
(#10162) - Added
pytorch_lightning.lite
package (#10175) - Added
LightningLite
documentation (#10043) - Added
LightningLite
examples (#9987) - Make the
_LiteDataLoader
an iterator and add supports for custom dataloader (#10279)
- Added
- Added
use_omegaconf
argument tosave_hparams_to_yaml
plugin (#9170) - Added
ckpt_path
argument forTrainer.fit()
(#10061) - Added
auto_device_count
method toAccelerators
(#10222) - Added support for
devices="auto"
(#10264) - Added a
filename
argument inModelCheckpoint.format_checkpoint_name
(#9818) - Added support for empty
gpus
list to run on CPU (#10246) - Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
Changed
- Trainer now raises a
MisconfigurationException
when its methods are called withckpt_path="best"
but a checkpoint callback isn't configured (#9841) - Setting
Trainer(accelerator="ddp_cpu")
now does not spawn a subprocess ifnum_processes
is kept1
along withnum_nodes > 1
(#9603) - Module imports are now catching
ModuleNotFoundError
instead ofImportError
(#9867) pytorch_lightning.loggers.neptune.NeptuneLogger
is now consistent with the new neptune-client API; the old neptune-client API is supported byNeptuneClient
from the neptune-contrib repo (#6867)- Parsing of
enums
type hyperparameters to be saved in the hparams.yaml
file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170) - Parsing of the
gpus
Trainer argument has changed:gpus="n"
(str) no longer selects the GPU index n and instead selects the first n devices (#8770) iteration_count
and other index attributes in the loops has been replaced with progress dataclasses (#8477)- The
trainer.lightning_module
reference is now properly set at the very beginning of a run (#8536) - The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
- The
Trainer
functionsreset_{train,val,test,predict}_dataloader
,reset_train_val_dataloaders
, andrequest_dataloader
model
argument is now optional (#8536) - Saved checkpoints will no longer use the type of a
Callback
as the key to avoid issues with unpickling (#6886) - Improved string conversion for
ResultCollection
(#8622) LightningCLI
changes:LightningCLI.init_parser
now returns the parser instance (#8721)LightningCLI.add_core_arguments_to_parser
,LightningCLI.parse_arguments
now take aparser
argument (#8721)LightningCLI.instantiate_trainer
now takes a config and a list of callbacks (#8721)- Split
LightningCLI.add_core_arguments_to_parser
intoLightningCLI.add_default_arguments_to_parser
+LightningCLI.add_core_arguments_to_parser
(#8721)
- The accelerator and training type plugin
setup
hooks no longer have amodel
argument (#8536) - The accelerator and training type plugin
update_global_step
hook has been removed (#8856) - The coverage of
self.log
-ing in anyLightningModule
orCallback
hook has been improved (#8498) self.log
-ing without aTrainer
reference now raises a warning instead of an exception (#9733)- Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader
now takes aRunningStage
enum instance (#8858)- Changed
rank_zero_warn
toNotImplementedError
in the{train, val, test, predict}_dataloader
hooks thatLightning(Data)Module
uses (#9161) - Moved
block_ddp_sync_behaviour
out ofTrainingBatchLoop
to loop utilities (#9192) - Executing the
optimizer_closure
is now required when overriding theoptimizer_step
hook (#9360) - Changed logging of
LightningModule
andLightningDataModule
hyperparameters to raise an exception only if there are colliding keys with different values (#9496) seed_everything
now fails when an invalid seed value is passed instead of selecting a random seed (#8787)- The Trainer now calls
TrainingTypePlugin
collective APIs directly instead of going through the Accelerator reference (#9677, #9901) - The tuner now uses a unique filename to save a temporary checkpoint (#9682)
- Changed
HorovodPlugin.all_gather
to return atorch.Tensor
instead of a list (#9696) - Changed Trainer connectors to be protected attributes:
- Configuration Validator (#9779)
- The
current_epoch
andglobal_step
attributes now get restored irrespective of the Trainer task (#9413) - Trainer now raises an exception when requesting
amp_level
with nativeamp_backend
(#9755) - Update the logic to check for accumulation steps with deepspeed (#9826)
pytorch_lightning.utilities.grads.grad_norm
now raises an exception if parameternorm_type <= 0
(#9765)- Updated error message for interactive incompatible plugins (#9896)
- Moved the
optimizer_step
andclip_gradients
hook from theAccelerator
andTrainingTypePlugin
into thePrecisionPlugin
(#10143, #10029) NativeMixedPrecisionPlugin
and its subclasses now take an optionalGradScaler
instance (#10055)- Trainer is now raising a
MisconfigurationException
instead of a warning ifTrainer.{validate/test}
is missing required methods (#10016) - Changed default value of the
max_steps
Trainer argument fromNone
to -1 (#9460) - LightningModule now raises an error when calling
log(on_step=False, on_epoch=False)
(#10227) - Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
- Raised
MisconfigurationException
when total length ofdataloader
across ranks is zero, and give warning when total length is non-zero, but only local rank length is zero. (#9827) - Changed the model size calculation using
ByteCounter
(#10123) - Enabled
on_load_checkpoint
forLightningDataModule
for alltrainer_fn
(#10238) - Allowed separate config files for parameters with class type when LightningCLI is in
subclass_mode=False
(#10286)
Deprecated
- Deprecated Trainer argument
terminate_on_nan
in favor ofdetect_anomaly
(#9175) - Deprecated
Trainer.terminate_on_nan
public attribute access (#9849) - Deprecated
LightningModule.summarize()
in favor ofpytorch_lightning.utilities.model_summary.summarize()
(#8513) - Deprecated
LightningModule.model_size
(#8343) - Deprecated
DataModule
properties:train_transforms
,val_transforms
,test_transforms
,size
,dims
(#8851) - Deprecated
add_to_queue
,get_from_queue
fromLightningModule
in favor of corresponding methods in theDDPSpawnPlugin
(#9118) - Deprecated
LightningModule.get_progress_bar_dict
andTrainer.progress_bar_dict
in favor ofpytorch_lightning.callbacks.progress.base.get_standard_metrics
andProgressBarBase.get_metrics
(#8985) - Deprecated
prepare_data_per_node
flag on Trainer and set it as a property ofDataHooks
, accessible in theLightningModule
andLightningDataModule
(#8958) - Deprecated the
TestTubeLogger
(#9065) - Deprecated
on_{train/val/test/predict}_dataloader()
fromLightningModule
andLightningDataModule
(#9098) - Deprecated
on_keyboard_interrupt
callback hook in favor of newon_exception
hook (#9260) - Deprecated passing
process_position
to theTrainer
constructor in favor of adding theProgressBar
callback withprocess_position
directly to the list of callbacks (#9222) - Deprecated passing
flush_logs_every_n_steps
as a Trainer argument, instead pass it to the logger init if supported (#9366) - Deprecated
LightningLoggerBase.close
,LoggerCollection.close
in favor ofLightningLoggerBase.finalize
,LoggerCollection.finalize
(#9422) - Deprecated passing
progress_bar_refresh_rate
to theTrainer
constructor in favor of adding theProgressBar
callback withrefresh_rate
directly to the list of callbacks, or passingenable_progress_bar=False
to disable the progress bar (#9616) - Deprecated
LightningDistributed
and moved the broadcast logic toDDPPlugin
andDDPSpawnPlugin
directly (#9691) - Deprecated passing
stochastic_weight_avg
to theTrainer
constructor in favor of adding theStochasticWeightAveraging
callback directly to the list of callbacks (#8989) - Deprecated Accelerator collective API
barrier
,broadcast
, andall_gather
in favor of calling theTrainingTypePlugin
collective API directly (#9677) - Deprecated
checkpoint_callback
from theTrainer
constructor in favor ofenable_checkpointing
(#9754) - Deprecated the
LightningModule.on_post_move_to_device
method (#9525) - Deprecated
pytorch_lightning.core.decorators.parameter_validation
in favor ofpytorch_lightning.utilities.parameter_tying.set_shared_parameters
(#9525) - Deprecated passing
weights_summary
to theTrainer
constructor in favor of adding theModelSummary
callback withmax_depth
directly to the list of callbacks (#9699) - Deprecated
log_gpu_memory
,gpu_metrics
, and util funcs in favor ofDeviceStatsMonitor
callback (#9921) - Deprecated
GPUStatsMonitor
andXLAStatsMonitor
in favor ofDeviceStatsMonitor
callback (#9924) - Deprecated setting
Trainer(max_steps=None)
; To turn off the limit, setTrainer(max_steps=-1)
(default) (#9460) - Deprecated access to the
AcceleratorConnector.is_slurm_managing_tasks
attribute and marked it as protected (#10101) - Deprecated access to the
AcceleratorConnector.configure_slurm_ddp
method and marked it as protected (#10101) - Deprecated passing
resume_from_checkpoint
to theTrainer
constructor in favor oftrainer.fit(ckpt_path=)
(#10061) - Deprecated
ClusterEnvironment.creates_children()
in favor ofClusterEnvironment.creates_processes_externally
(property) (#10106) - Deprecated
PrecisionPlugin.master_params()
in favor ofPrecisionPlugin.main_params()
(#10105) - Deprecated
lr_sch_names
fromLearningRateMonitor
(#10066) - Deprecated
ProgressBar
callback in favor ofTQDMProgressBar
(#10134)
Removed
- Removed deprecated
metrics
(#8586) - Removed the deprecated
outputs
argument in both theLightningModule.on_train_epoch_end
andCallback.on_train_epoch_end
hooks (#8587) - Removed the deprecated
TrainerLoggingMixin
class (#8609) - Removed the deprecated
TrainerTrainingTricksMixin
class (#8679) - Removed the deprecated
optimizer_idx
fromtraining_step
as an accepted argument in manual optimization (#8576) - Removed support for the deprecated
on_save_checkpoint
signature. The hook now takes acheckpoint
positional parameter (#8697) - Removed support for the deprecated
on_load_checkpoint
signature. The hook now takes apl_module
positional parameter (#8697) - Removed the deprecated
save_function
property inModelCheckpoint
(#8680) - Removed the deprecated
model
argument fromModelCheckpoint.save_checkpoint
(#8688) - Removed the deprecated
sync_step
argument fromWandbLogger
(#8763) - Removed the deprecated
Trainer.truncated_bptt_steps
in favor ofLightningModule.truncated_bptt_steps
(#8826) - Removed
LightningModule.write_predictions
andLightningModule.write_predictions_dict
(#8850) - Removed
on_reset_*_dataloader
hooks in TrainingType Plugins and Accelerators (#8858) - Removed deprecated
GradInformation
module in favor ofpytorch_lightning.utilities.grads
(#8831) - Removed
TrainingTypePlugin.on_save
andAccelerator.on_save
(#9023) - Removed
{Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step
(#9746) - Removed deprecated
connect_precision_plugin
andconnect_training_type_plugin
fromAccelerator
(#9019) - Removed
on_train_epoch_end
fromAccelerator
(#9035) - Removed
InterBatchProcessor
in favor ofDataLoaderIterDataFetcher
(#9052) - Removed
Plugin
inbase_plugin.py
in favor of accessingTrainingTypePlugin
andPrecisionPlugin
directly instead (#9066) - Removed
teardown
fromParallelPlugin
(#8943) - Removed deprecated
profiled_functions
argument fromPyTorchProfiler
(#9178) - Removed deprecated
pytorch_lightning.utilities.argparse_utils
module (#9166) - Removed deprecated property
Trainer.running_sanity_check
in favor ofTrainer.sanity_checking
(#9209) - Removed deprecated
BaseProfiler.output_filename
arg from it and its descendants in favor ofdirpath
andfilename
(#9214) - Removed deprecated property
ModelCheckpoint.period
in favor ofModelCheckpoint.every_n_epochs
(#9213) - Removed deprecated
auto_move_data
decorator (#9231) - Removed deprecated property
LightningModule.datamodule
in favor ofTrainer.datamodule
(#9233) - Removed deprecated properties
DeepSpeedPlugin.cpu_offload*
in favor ofoffload_optimizer
,offload_parameters
andpin_memory
(#9244) - Removed deprecated property
AcceleratorConnector.is_using_torchelastic
in favor ofTorchElasticEnvironment.is_using_torchelastic()
(#9729) - Removed
pytorch_lightning.utilities.debugging.InternalDebugger
(#9680) - Removed
call_configure_sharded_model_hook
property fromAccelerator
andTrainingTypePlugin
(#9612) - Removed
TrainerProperties
mixin and moved property definitions directly intoTrainer
(#9495) - Removed a redundant warning with
ModelCheckpoint(monitor=None)
callback (#9875) - Remove
epoch
fromtrainer.logged_metrics
(#9904) - Removed
should_rank_save_checkpoint
property from Trainer (#9433) - Remove deprecated
distributed_backend
fromTrainer
(#10017) - Removed
process_idx
from the{DDPSpawnPlugin,TPUSpawnPlugin}.new_process
methods (#10022) - Removed automatic patching of
{train,val,test,predict}_dataloader()
on theLightningModule
(#9764) - Removed
pytorch_lightning.trainer.connectors.OptimizerConnector
(#10120)
Fixed
- Fixed ImageNet evaluation in example (#10179)
- Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
- Fixed
move_metrics_to_cpu
moving the loss to CPU while training on device (#9308) - Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
- Fixed an issue with freeing memory of datafetchers during teardown (#9387)
- Fixed a bug where the training step output needed to be
deepcopy
-ed (#9349) - Fixed an issue with freeing memory allocated by the data iterators in
Loop.on_run_end
(#9386, #9915) - Fixed
BasePredictionWriter
not returning the batch indices in a non-distributed setting (#9432) - Fixed an error when running in XLA environments with no TPU attached (#9572)
- Fixed check on torchmetrics logged whose
compute()
output is a multielement tensor (#9582) - Fixed gradient accumulation for
DDPShardedPlugin
(#9122) - Fixed missing DeepSpeed distributed call (#9540)
- Fixed an issue with wrapped LightningModule during evaluation; The LightningModule no longer gets wrapped with data-parallel modules when not fitting in
DDPPlugin
,DDPSpawnPlugin
,DDPShardedPlugin
,DDPSpawnShardedPlugin
(#9096) - Fixed
trainer.accumulate_grad_batches
to be an int on init. The default value for it is nowNone
inside Trainer (#9652) - Fixed
broadcast
inDDPPlugin
andDDPSpawnPlugin
to respect thesrc
input (#9691) - Fixed
self.log(on_epoch=True, reduce_fx=sum)
for theon_batch_start
andon_train_batch_start
hooks (#9791) - Fixed
self.log(on_epoch=True)
for theon_batch_start
andon_train_batch_start
hooks (#9780) - Fixed restoring training state during
Trainer.fit
only (#9413) - Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
- Fixed DeepSpeed GPU device IDs (#9847)
- Reset
val_dataloader
intuner/batch_size_scaling
(#9857) - Fixed use of
LightningCLI
in computer_vision_fine_tuning.py example (#9934) - Fixed issue with non-init dataclass fields in
apply_to_collection
(#9963) - Reset
val_dataloader
intuner/batch_size_scaling
for binsearch (#9975) - Fixed logic to check for spawn in dataloader
TrainerDataLoadingMixin._worker_check
(#9902) - Fixed
train_dataloader
getting loaded twice when resuming from a checkpoint duringTrainer.fit()
(#9671) - Fixed
LearningRateMonitor
logging with multiple param groups optimizer with no scheduler (#10044) - Fixed undesired side effects being caused by
Trainer
patching dataloader methods on theLightningModule
(#9764) - Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
- Fixed
on_before_optimizer_step
getting called before the optimizer closure (including backward) has run (#10167) - Fixed monitor value in
ModelCheckpoint
getting moved to the wrong device in a special case where it becomes NaN (#10118) - Fixed creation of
dirpath
inBaseProfiler
if it doesn't exist (#10073) - Fixed incorrect handling of sigterm (#10189)
- Fixed bug where
log(on_step=True, on_epoch=True, sync_dist=True)
wouldn't reduce the value on step (#10227) - Fixed an issue with
pl.utilities.seed.reset_seed
converting thePL_SEED_WORKERS
environment variable tobool
(#10099) - Fixed iterating over a logger collection when
fast_dev_run > 0
(#10232) - Fixed
batch_size
inResultCollection
not being reset to 1 on epoch end (#10242) - Fixed
distrib_type
not being set when training plugin instances are being passed to the Trainer (#10251)
Contributors
@adamjstewart @akihironitta @alessiobonfiglio @ananthsub @aphedges @awaelchli @bamblebam @Benjamin-Etheredge @borchero @Borda @borisdayma @bryant1410 @carmocca @cowwoc @daniellepintz @danielykim @edward-io @eladsegal @EricWiener @ethanwharris @four4fish @gau-nernst @hankyul2 @HansolEom @himanshu-dutta @I-iBot @jjenniferdai @jstjohn @justusschock @kainoj @kaushikb11 @kingyiusuen @Knarik1 @low5545 @lsqshr @mauvilsa @michele-arrival @nasnoisaac @ninginthecloud @popfido @pre-commit-ci @PuneetDabral @qmpzzpmq @rohitgr7 @ronif @roshikouhai @s-rog @samlurye @SeanNaren @shnela @sidml @stancld @stfwn @tangbinh @tchaton @thepurpleowl @Tshimanga @twsl @victorjoos @VirajBagal @wayi1 @weiji14 @yifuwang @yopknopixx
If we forgot someone, let us know :]