v0.13.1

# 🚀 Composer v0.13.1

Introducing the `composer` PyPI package!

Composer v0.13.1 is released! Composer can now also be installed using the new `composer` PyPI package via pip:

```
pip install composer==0.13.1
```

The legacy package name still works via pip:

```
pip install mosaicml==0.13.1
```

**Note**: The `mosaicml==0.13.0` PyPI package was yanked due to some minor packaging issues discovered after release. The package was re-released as Composer v0.13.1, so these release notes contain details for both v0.13.0 and v0.13.1.
# New Features

## 🤙 New and Updated Callbacks

### New `HealthChecker` Callback (#2002)

The callback will log a warning if the GPUs on a given node appear to be in poor health (low utilization). The callback can also be configured to send a Slack message!

```python
from composer import Trainer
from composer.callbacks import HealthChecker

# Warn if GPU utilization difference drops below 10%
health_checker = HealthChecker(
    threshold=10,
)

# Construct Trainer
trainer = Trainer(
    ...,
    callbacks=health_checker,
)

# Train!
trainer.fit()
```
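For intuition, the utilization check boils down to comparing per-GPU readings on a node against the threshold. The helper below is a hypothetical simplification for illustration only, not Composer's internal implementation:

```python
# Hypothetical sketch of a node-level utilization check: flag the node when
# the spread between its busiest and idlest GPU exceeds a threshold (in
# percentage points). The real HealthChecker callback handles sampling and
# alerting; this only illustrates the comparison.
def gpus_look_unhealthy(utilizations, threshold=10):
    """Return True if the GPU utilization spread exceeds `threshold`."""
    return max(utilizations) - min(utilizations) > threshold

print(gpus_look_unhealthy([95, 93, 94, 92]))  # balanced node -> False
print(gpus_look_unhealthy([95, 20, 94, 92]))  # one straggler GPU -> True
```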
### Updated `MemoryMonitor` to use gigabytes (GB) units (#1940)

### New `RuntimeEstimator` Callback (#1991)

Estimate the remaining runtime of your job! Approximates the time remaining by observing the throughput and comparing it to the number of batches remaining.

```python
from composer import Trainer
from composer.callbacks import RuntimeEstimator

# Construct trainer with RuntimeEstimator callback
trainer = Trainer(
    ...,
    callbacks=RuntimeEstimator(),
)

# Train!
trainer.fit()
```
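The estimate itself is simple arithmetic over observed throughput. A minimal sketch of the idea (a hypothetical helper, not the callback's actual code):

```python
# Sketch of the estimation idea: remaining time ~= batches left / throughput,
# where throughput is inferred from progress so far.
def estimate_remaining_seconds(batches_remaining, batches_done, elapsed_seconds):
    """Estimate remaining wall-clock time from observed throughput."""
    throughput = batches_done / elapsed_seconds  # batches per second
    return batches_remaining / throughput

# 400 batches left; the first 100 batches took 50s -> 2 batches/s -> 200s left
print(estimate_remaining_seconds(400, 100, 50.0))  # 200.0
```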
### Updated `SpeedMonitor` throughput metrics (#1987)

Expands throughput metrics to track relative to several different time units and per device:

- `throughput/batches_per_sec` and `throughput/device/batches_per_sec`
- `throughput/tokens_per_sec` and `throughput/device/tokens_per_sec`
- `throughput/flops_per_sec` and `throughput/device/flops_per_sec`
- `throughput/device/samples_per_sec`

Also adds a `throughput/device/mfu` metric to compute per-device MFU (model FLOPs utilization). Simply enable the `SpeedMonitor` callback as usual to log these new metrics! Please see the SpeedMonitor documentation for more information.
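For intuition, per-device MFU is achieved per-device FLOP/s as a fraction of the device's theoretical peak. The sketch below uses a hypothetical peak-FLOPs constant for illustration; SpeedMonitor's actual computation lives in the callback:

```python
# Illustrative MFU arithmetic: fraction of a device's theoretical peak
# FLOP/s that the training run actually achieves.
def mfu(flops_per_sec, n_devices, peak_flops_per_device):
    """Model FLOPs Utilization: achieved per-device FLOP/s / device peak."""
    return flops_per_sec / n_devices / peak_flops_per_device

# e.g. 2.4e15 FLOP/s aggregate over 8 GPUs, with a hypothetical
# 3.12e14 FLOP/s peak per device
print(round(mfu(2.4e15, 8, 3.12e14), 3))  # 0.962
```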
## ⣿ FSDP Sharded Checkpoints (#1902)

Users can now specify the `state_dict_type` in the `fsdp_config` dictionary to enable sharded checkpoints. For example:

```python
from composer import Trainer

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'local',
}

trainer = Trainer(
    ...,
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_filename='ba{batch}_rank{rank}.pt',
    save_interval='10ba',
)
```

Please see the PyTorch FSDP docs and Composer's Distributed Training notes for more information.
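With sharded checkpoints, each rank saves its own shard, which is why the `save_filename` template above includes a `{rank}` field. Standard Python string formatting shows how the template expands per rank (the batch number and world size here are made up for illustration):

```python
# The save_filename template from the example above.
save_filename = 'ba{batch}_rank{rank}.pt'

# At a checkpoint interval, each rank writes its own shard file.
names = [save_filename.format(batch=10, rank=r) for r in range(4)]
print(names)
# ['ba10_rank0.pt', 'ba10_rank1.pt', 'ba10_rank2.pt', 'ba10_rank3.pt']
```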
## 🤗 HuggingFace Improvements

- Update the `HuggingFaceModel` class to support encoder-decoder batches without `decoder_input_ids` (#1950)
- Allow evaluation metrics to be passed to `HuggingFaceModel` directly (#1971)
- Add a utility function to load a Composer checkpoint of a `HuggingFaceModel` and write out the expected `config.json` and `pytorch_model.bin` in the HuggingFace pretrained folder (#1974)
## 🛟 Nvidia H100 Alpha Support: Added `amp_fp8` data type

In preparation for H100's arrival, we've added the `amp_fp8` precision type. Currently, setting `amp_fp8` specifies a new precision context using `transformer_engine.pytorch.fp8_autocast`. For more details, please see Nvidia's new Transformer Engine and the specific fp8 recipe we utilize.

```python
from composer import Trainer

trainer = Trainer(
    ...,
    precision='amp_fp8',
)
```
# API Changes

- The `torchmetrics` package has been upgraded to 0.11.x.

  The `torchmetrics.Accuracy` metric now requires a `task` argument, which can take on a value of `binary`, `multiclass`, or `multilabel`. Please see the Torchmetrics Accuracy docs for details.

  Additionally, since specifying `task='multiclass'` requires an additional `num_classes` field to be specified, we've had to update `ComposerClassifier` to accept the additional `num_classes` argument. Please see PRs #2017 and #2025 for additional details.

- Surgery algorithms used in functional form return a value of `None` (#1543)
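As a pure-Python illustration of what multiclass accuracy computes (top-1 argmax match rate), without depending on `torchmetrics` itself:

```python
# Illustrative sketch of multiclass (top-1) accuracy: the fraction of rows
# whose argmax over class scores matches the target label. torchmetrics'
# Accuracy(task='multiclass', num_classes=...) computes this for plain
# predictions, plus averaging options not shown here.
def multiclass_accuracy(logits, targets):
    """Fraction of rows whose argmax matches the target label."""
    preds = [row.index(max(row)) for row in logits]
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

logits = [[0.1, 0.7, 0.2],   # predicts class 1
          [0.8, 0.1, 0.1],   # predicts class 0
          [0.2, 0.3, 0.5]]   # predicts class 2
print(multiclass_accuracy(logits, [1, 0, 1]))  # 2 of 3 correct
```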
# Deprecations

- Deprecate `HFCrossEntropy` and `Perplexity` (#1857)
- Remove Jenkins CI (#1943, #1954)
- Change deprecation warnings to warnings for specifying `ProgressBarLogger` and `ConsoleLogger` to loggers (#1846)
# Bug Fixes

- Fixed an issue introduced in 0.12.1 where `HuggingFaceModel` crashes if `config.return_dict = False` (#1948)
- Refactor EMA to improve memory efficiency (#1941)
- Make wandb checkpoint logging compatible with the wandb model registry (#1973)
- Fix ICL race conditions (#1978)
- Update `epoch` metric name to `trainer/epoch` (#1986)
- Reset scaler state (#1999)
- Sync the optimizer monitor logger across ranks (#1970)
- Update Docker images to fix vulnerability scan issues (#2007)
- Fix eval duplicate logging issue (#2018)
- Extend test and patch bug (#2028)
- Protect against a missing `slack_sdk` import (#2031)
# Known Issues

- **Docker Image Security Vulnerability**
  - CVE-2022-45907: The `mosaicml/pytorch:1.12.1*`, `mosaicml/pytorch:1.11.0*`, `mosaicml/pytorch_vision:1.12.1*`, and `mosaicml/pytorch_vision:1.11.0*` images are impacted and currently supported for legacy use cases. We recommend users upgrade to images with PyTorch >1.13. The affected images will be removed in the next Composer release.
# What's Changed
- Raise error if max duration is in epochs and dataloader is infinite by @dakinggg in #1942
- Bump traitlets from 5.8.0 to 5.9.0 by @dependabot in #1946
- Deprecate HFCrossEntropy and Perplexity by @dakinggg in #1857
- Change functional surgery method return values to None by @nik-mosaic in #1543
- Retire Jenkins by @bandish-shah in #1943
- Update MCP GHA Name by @mvpatel2000 in #1951
- update memory monitor by @mvpatel2000 in #1940
- Move ffcv up in test order by @dskhudia in #1953
- Fix memory monitor test by @mvpatel2000 in #1957
- Fix model surgery failure due to functional API change by @nik-mosaic in #1949
- Change how we check for forwards args in models for HF models by @bcui19 in #1955
- add return dict false test and bug fix by @dakinggg in #1948
- remove jenkins ci by @mvpatel2000 in #1954
- add support for enc-dec batches without decoder_input_ids by @dakinggg in #1950
- Refactor EMA to improve memory efficiency by @coryMosaicML in #1941
- Add warning for untrusted checkpoints by @mvpatel2000 in #1959
- permit opt tokenizer by @bmosaicml in #1958
- GHA Docker build flow for PR's by @bandish-shah in #1883
- Update download badge link to pepy by @karan6181 in #1966
- Update python version in setup.py and fixed pypi download badge by @karan6181 in #1969
- allow eval metrics to be passed in to HuggingFaceModel directly by @dakinggg in #1971
- Make wandb checkpoint logging compatible with wandb model registry by @growlix in #1973
- Add support for FP8 on H100 using NVidia's TransformerEngine by @dskhudia in #1965
- Util for writing HuggingFace save_pretrained from a composer checkpoint by @dakinggg in #1974
- Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) by @eracah in #1902
- Bump custom-inherit from 2.4.0 to 2.4.1 by @dependabot in #1981
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #1982
- Fix ICL race conditions by @dakinggg in #1978
- add map location to huggingface utils by @dakinggg in #1980
- fix log epoch by @mvpatel2000 in #1986
- GHA release workflow, refactor PR and Daily workflows by @bandish-shah in #1968
- Remove python-version input from Daily CPU tests by @bandish-shah in #1989
- Add some logic to pass the correct github ref to mcp script by @bandish-shah in #1990
- Fix typo in docstring for eval with missing space by @mvpatel2000 in #1992
- Fix failing sharded_checkpoint tests that fail when pytorch 1.13 is not installed by @eracah in #1988
- Add merge_group event trigger to GHA daily workflow by @bandish-shah in #1996
- Runtime estimator by @mvpatel2000 in #1991
- Reset scaler state by @mvpatel2000 in #1999
- Speed monitor refactor by @mvpatel2000 in #1987
- Test hf fsdp by @dakinggg in #1972
- Bug/sync optimization logger across ranks by @bmosaicml in #1970
- Fix optimizer monitor test gating with FSDP by @mvpatel2000 in #2000
- Low precision groupnorm by @mvpatel2000 in #1976
- Bump coverage[toml] from 7.1.0 to 7.2.1 by @dependabot in #2008
- Update docs to include runtime estimator by @mvpatel2000 in #2009
- Tag surgery algorithms LPLN and LPGN by @mvpatel2000 in #2011
- Update SpeedMonitor short-description for docs table by @mvpatel2000 in #2010
- Update Low Precision LayerNorm arguments by @nik-mosaic in #1994
- Medical Segmentation Example Typo by @mvpatel2000 in #2014
- Update wallclock logging to default hours by @mvpatel2000 in #2005
- Add HealthChecker Callback by @hanlint in #2002
- Allow FX graph mode post-training dynamic quantisation of BlurConv2d operations. by @BrettRyland in #1995
- Add multi-gpu testing to test_algorithm_resumption by @eracah in #2016
- Add backwards compatible checkpoint loading for EMA by @coryMosaicML in #2012
- fsdp with custom process groups by @vchiley in #2006
- Patch Speed Monitor MFU by @mvpatel2000 in #2013
- Remove runtime estimator state dict by @mvpatel2000 in #2015
- Update Docker images to fix resolve vulnerability scan issues by @bandish-shah in #2007
- Change Deprecation Warnings to Warnings for specifying ProgressBarLogger and ConsoleLogger to loggers by @eracah in #1846
- Fix eval duplicate logging issue by @mvpatel2000 in #2018
- Add workflow_dispatch trigger to pr-docker workflow by @bandish-shah in #2019
- Bump streaming version to less than 0.4.0 by @karan6181 in #2020
- Upgrade ipython installed in Docker images by @bandish-shah in #2021
- Upgrade torchmetrics by @nik-mosaic in #2017
- Complete upgrade of torchmetrics accuracy by @nik-mosaic in #2025
- Bump version to v0.13.0 by @bandish-shah in #2024
# New Contributors
- @BrettRyland made their first contribution in #1995
**Full Changelog**: v0.12.1...v0.13.1