ATorch

ATorch: Make large model training more efficient and reproducible for everyone.

ATorch is an extension library of PyTorch developed by Ant Group's AI Infrastructure team. By decoupling model definition from training optimization strategy, ATorch provides an efficient and easy-to-use model training experience. The design principle is to disrupt the native PyTorch programming style as little as possible. Through its API, ATorch provides performance optimizations for I/O, preprocessing, computation, and communication (including automatic optimization). ATorch has supported large-scale pretraining of LLMs with over 100 billion parameters on thousands of A100/H100 GPUs.

Features

- Easy-to-use interface
  - auto_accelerate API
  - ATorchTrainer (ongoing work)
- Solutions for large-scale model training
  - Support for efficient large model initialization, checkpoint save/load, and restart with elastic resources
- Automatic/semi-automatic optimization
  - Acceleration Engine for automatic optimization
  - Semi-automatic optimization that supports custom optimization
- Hybrid parallelism support (arbitrary combination of fsdp/zero/ddp/tp/sp/pp)
- High-performance operators
  - Flash attention 2 with custom mask support
  - Transformer ops
  - High-performance MoE
  - Sub-graph compilation
- Checkpointing
- Mixed precision
- Communication optimization
  - Cached sharding
- Effective optimizers for fast training convergence
  - AGD optimizer
  - WSAM optimizer
- IO/Preprocessing
  - CPU/GPU coworker to speed up data preprocessing
  - IO optimization for different datasets
- Elastic training and fault tolerance
  - Hardware error detection and migration (with dlrover)
  - GPU elastic training support
  - HangDetector (detects a hung distributed training job and restarts it automatically)
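The HangDetector item above describes detecting a stalled training job and restarting it. This README does not show how ATorch implements that, so the following is only a generic sketch of the underlying idea: a heartbeat watchdog that flags a job as hung when no training step has reported progress within a timeout. The class name `HeartbeatWatchdog` and its methods are hypothetical and are not part of the ATorch API.

```python
import time


class HeartbeatWatchdog:
    """Generic hang check (hypothetical, not ATorch's HangDetector)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self):
        """Record progress; call this once per completed training step."""
        self._last_beat = self._clock()

    def is_hung(self):
        """True if no progress was recorded within the timeout."""
        return self._clock() - self._last_beat > self.timeout_seconds
```

In a real job, a separate monitor process would poll `is_hung()` and trigger a restart from the latest checkpoint; that restart logic is outside the scope of this sketch.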

Installation

ATorch supports PyTorch version 1.12 or above; version 2.1 or above is preferred. For example, you can use the Docker image registry.cn-hangzhou.aliyuncs.com/atorch/atorch-open-20240430:pt210, which has PyTorch 2.1 installed.
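To check an existing environment against the version requirement above, a small script along these lines may help. It is a sketch, not an official ATorch tool: the function names `parse_major_minor` and `classify_torch_version` are made up for illustration, and the only thresholds used are the two stated in this README (1.12 as the minimum, 2.1 as the preferred version).

```python
import re

MIN_SUPPORTED = (1, 12)  # README: PyTorch >= 1.12 is supported
PREFERRED = (2, 1)       # README: 2.1 or above is preferred


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def classify_torch_version(version):
    """Return 'unsupported', 'supported', or 'preferred'."""
    major_minor = parse_major_minor(version)
    if major_minor < MIN_SUPPORTED:
        return "unsupported"
    if major_minor < PREFERRED:
        return "supported"
    return "preferred"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print(torch.__version__, "->", classify_torch_version(torch.__version__))
```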

Install From PyPI

Install atorch in any PyTorch-preinstalled environment (such as a container created with the docker image above) with pip:

pip install atorch

Install From Source Files

# clone repository
git clone https://github.com/intelligent-machine-learning/dlrover.git
cd dlrover/atorch
# build the package; the version argument is optional.
bash dev/scripts/build.sh [version]
# install the built package from the dist directory. Note that the file name differs if a version is set.
pip install dist/atorch-0.1.0.dev0-py3-none-any.whl

Getting Started

Run Examples

To run auto_accelerate examples:

cd dlrover/atorch/examples/auto_accelerate
# Single-process training
python train.py --model_type toy
# Distributed training
python -m atorch.distributed.run --nproc_per_node 2 train.py --model_type llama --distributed --load_strategy --use_fsdp --use_amp --use_module_replace --use_checkpointing
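The distributed command above selects optimizations through boolean flags. As a rough illustration of how such flags could be turned into an ordered list of optimization methods, here is a self-contained sketch. The flag names are taken from the command; everything else is an assumption made for illustration: the method names (`parallel_mode`, `module_replace`, `fsdp`, `amp_native`, `checkpoint`), the helper functions, and the meaning given to `--load_strategy`. The actual `train.py` in the examples directory may work differently.

```python
import argparse


def build_parser():
    """Parser accepting the flags used in the example commands."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_type", default="toy")
    parser.add_argument("--distributed", action="store_true")
    parser.add_argument("--load_strategy", action="store_true")
    parser.add_argument("--use_fsdp", action="store_true")
    parser.add_argument("--use_amp", action="store_true")
    parser.add_argument("--use_module_replace", action="store_true")
    parser.add_argument("--use_checkpointing", action="store_true")
    return parser


def flags_to_strategy(args):
    """Map flags to an ordered list of method names (names are assumed)."""
    if not args.load_strategy:
        # Assumption: without an explicit strategy, selection is left
        # to automatic optimization.
        return None
    strategy = []
    if args.distributed:
        strategy.append("parallel_mode")
    if args.use_module_replace:
        strategy.append("module_replace")
    if args.use_fsdp:
        strategy.append("fsdp")
    if args.use_amp:
        strategy.append("amp_native")
    if args.use_checkpointing:
        strategy.append("checkpoint")
    return strategy
```

Keeping the strategy as plain data, separate from the model code, reflects the design goal stated in the introduction: the model definition stays ordinary PyTorch while the optimization choices are supplied from outside.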

Llama2 pretrain/finetune examples
Optimizer (AGD, WSAM) Examples

Documentation

auto_accelerate

AGD optimizer

WSAM optimizer

Contributing

Contributions are welcome! If you have any suggestions, ideas, or bug reports, please open an issue or submit a pull request.

CI/CD

We use GitHub Actions to automate our development, release, and deployment workflows. Please check out this documentation to see how the automated workflows operate.
