<p align="center"><a href="https://github.com/NUS-HPC-AI-Lab/OpenDiT">[Homepage]</a> | <a href="https://discord.gg/6UzVWm9a">[Discord]</a> | <a href="./figure/wechat.png">[WeChat]</a> | <a href="https://twitter.com/YangYou1991/status/1762447718105170185">[Twitter]</a> | <a href="https://zhuanlan.zhihu.com/p/684457582">[Zhihu]</a> | <a href="https://mp.weixin.qq.com/s/IBb9vlo8hfYKrj9ztxkhjg">[Media]</a></p>
</p>

### Latest News 🔥

- [2024/03/01] Support DiT-based Latte for text-to-video generation.
- [2024/02/27] Officially release OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.

# About

OpenDiT is an open-source project that provides a high-performance implementation of Diffusion Transformer (DiT) powered by Colossal-AI, designed to enhance the efficiency of training and inference for DiT applications, including text-to-video and text-to-image generation.

OpenDiT achieves its performance through the following techniques:

1. Up to 80% speedup and 50% memory reduction on GPU
   - Kernel optimizations including FlashAttention, fused AdaLN, and fused LayerNorm kernels.
   - Hybrid parallelism methods including ZeRO, Gemini, and DDP. Sharding the EMA model further reduces the memory cost.
2. FastSeq: a novel sequence parallelism method
   - Specially designed for DiT-like workloads, where the activation size is large but the parameter size is small.
   - Up to 48% communication savings for intra-node sequence parallelism.
   - Breaks the memory limitation of a single GPU, reducing overall training and inference time.
3. Ease of use
   - Huge performance gains with only a few lines of code changed.
   - Users do not need to know the implementation of distributed training.
4. Complete pipeline of text-to-image and text-to-video generation
   - Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
   - We verify the accuracy of OpenDiT with text-to-image training on ImageNet and release the checkpoint.

<p align="center">
<img width="600px" alt="end2end" src="./figure/end2end.png">
</p>

More features are coming soon!

## Installation

Prerequisites:

- Python >= 3.10
- PyTorch >= 1.13 (we recommend using version 2.0 or above)
- CUDA >= 11.6

We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:
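
A minimal sketch (the environment name `opendit` is our placeholder):

```
conda create -n opendit python=3.10 -y
conda activate opendit
```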

Install [NVIDIA Apex](https://github.com/NVIDIA/apex) from source for the fused kernels (run inside a cloned `apex` repository; the commit is pinned for compatibility):

```
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
```

## Usage

### Image
<b>Training.</b> You can train the DiT model by executing the following command:

```
torchrun --standalone --nproc_per_node=2 train.py
```

We disable all speedup methods by default. Here are details of some key arguments for training (a combined example follows the list):

- `--nproc_per_node`: The GPU number you want to use for the current node.
- `--plugin`: The booster plugin used by ColossalAI; `zero2` and `ddp` are supported. The default value is `zero2`, which we recommend.
- `--mixed_precision`: The data type for mixed precision training. The default value is `bf16`.
- `--load`: Load a previously saved checkpoint directory and continue training.
- `--num_classes`: Label class number. Should be 10 for CIFAR10 and 1000 for ImageNet. Only used for label-to-image generation.
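
A combined example using only the arguments documented above (a sketch; your run may need additional dataset arguments not shown in this excerpt):

```
# enable ZeRO-2 with bf16 and resume from a saved checkpoint (paths are placeholders)
torchrun --standalone --nproc_per_node=8 train.py \
    --plugin zero2 \
    --mixed_precision bf16 \
    --num_classes 1000 \
    --load /path/to/checkpoint
```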


For more details on the configuration of the training process, please visit our code.

<b>Multi-Node Training.</b>
<b>Inference.</b> You can sample images from a trained model by executing the following command:

```
python sample.py \
    --num_classes 10 \
    --ckpt ckpt_path
```

Here are details of some additional key arguments for inference (an example follows the list):

- `--ckpt`: The weights of the EMA model, `ema.pt`. To check training progress, you can also pass our saved base model `epochXX-global_stepXX/model`, which produces better results than the EMA model in the early stages of training.
- `--num_classes`: Label class number. Should be 10 for CIFAR10 and 1000 for ImageNet (for both the official checkpoint and ours).
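
For example, to inspect an early run with the base model weights (the checkpoint path here is a hypothetical placeholder):

```
# sample with the saved base model instead of EMA
python sample.py \
    --num_classes 10 \
    --ckpt outputs/epoch10-global_step5000/model
```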

### Video

<b>Training.</b> We currently support `VDiT` and `Latte` for video generation. VDiT adopts the DiT structure and uses video as input data. Latte builds on VDiT with more efficient spatial and temporal blocks (not exactly aligned with the original [Latte](https://github.com/Vchitect/Latte)).

Our video training pipeline is a faithful implementation, and we encourage you to explore your own strategies using OpenDiT. You can train the video DiT model by executing a command of the following form:
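
A sketch following the image pipeline's pattern; the `--model` value and flags here are our assumptions rather than the verified command, so please check the training code for the authoritative arguments:

```
# hypothetical: the model flag and value are assumptions
torchrun --standalone --nproc_per_node=2 train.py \
    --model VDiT-XL/2 \
    --batch_size 2
```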
Inference tip: the EMA model requires quite a long time to converge before producing meaningful samples, so to check training progress early you can sample from the saved base model instead (see `--ckpt` above).

## FastSeq

![fastseq_overview](./figure/fastseq_overview.png)

In the realm of visual generation models, such as DiT, sequence parallelism is indispensable for effective long-sequence training and low-latency inference. Two key characteristics summarize the distinctive nature of these tasks:

- The model parameters are smaller than those of LLMs, but the sequences can be very long, making communication a bottleneck.
- As the model size is relatively small, sequence parallelism is only needed within a node.

However, existing methods like DeepSpeed-Ulysses and Megatron-LM Sequence Parallelism face limitations when applied to such tasks. They either introduce excessive sequence communication or lack efficiency in handling small-scale sequence parallelism.

Here are the results of our experiments; more results are coming soon:

![fastseq_exp](./figure/fastseq_exp.png)


## DiT Reproduction Result

We have trained DiT from scratch with OpenDiT, following the original method, to verify our accuracy: 80k steps on ImageNet using 8x A100 GPUs. Here are some samples generated by our trained DiT:
You can reproduce this training with a command of the following form (additional arguments documented in the training section also apply):

```
torchrun --standalone --nproc_per_node=8 train.py \
    --num_classes 1000
```


## Acknowledgement

We extend our gratitude to [Zangwei Zheng](https://zhengzangw.github.io/) for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge [Shenggan Cheng](https://shenggan.github.io/) for his guidance on code optimization and parallelism. Our appreciation also goes to [Fuzhao Xue](https://xuefuzhao.github.io/), [Shizun Wang](https://littlepure2333.github.io/home/), [Yuchao Gu](https://ycgu.site/), [Shenggui Li](https://franklee.xyz/), and [Haofan Wang](https://haofanwang.github.io/) for their invaluable advice and contributions.
This codebase borrows from [Meta's DiT](https://github.com/facebookresearch/DiT).
If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.

## Citation

```
@misc{zhao2024opendit,
  author = {Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, and Yang You},
  title = {OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference},
  year = {2024},
  howpublished = {\url{https://github.com/NUS-HPC-AI-Lab/OpenDiT}},
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=NUS-HPC-AI-Lab/OpenDiT&type=Date)](https://star-history.com/#NUS-HPC-AI-Lab/OpenDiT&Date)
