Vote on new features! #693

tianyu-l · 2024-11-23T00:33:26Z

tianyu-l
Nov 23, 2024
Collaborator

Hi torchtitanists,

Thank you for your interests in torchtitan!

Please upvote on what features you would like to see next, and add one if it's not already there. We'll try to prioritize on the most requested features.

tianyu-l · 2024-11-23T00:33:50Z

tianyu-l
Nov 23, 2024
Collaborator Author

#625 multimodal models (e.g. Llama 3.2) (work in progress)

0 replies

tianyu-l · 2024-11-23T00:34:39Z

tianyu-l
Nov 23, 2024
Collaborator Author

#184 MoE and Expert Parallel (work in progress)

7 replies

mayank31398 Nov 26, 2024

scattermoe might be useful?
https://arxiv.org/abs/2403.08245

SeunghyunSEO Nov 26, 2024

to my best knowledge, scattermoe is compatible with FSDP (maybe 1) and IBM granite is trained with it successfully, but it does not support EP and FSDP2, right?
i want torch native EP haha

mayank31398 Nov 26, 2024

correct, we train Granite with scattermoe.
It works with FSDP-2 + TP as well.
this is the TP implementation: https://github.com/IBM/dolomite-engine/blob/main/dolomite_engine/hf_models/models/moe_dolomite_TP/moe_TP/scatter.py
this is the non-TP implementation: https://github.com/IBM/dolomite-engine/blob/main/dolomite_engine/hf_models/models/moe_dolomite/moe/scatter.py

SeunghyunSEO Nov 26, 2024

correct, we train Granite with scattermoe.

It works with FSDP-2 + TP as well.

this is the TP implementation: https://github.com/IBM/dolomite-engine/blob/main/dolomite_engine/hf_models/models/moe_dolomite_TP/moe_TP/scatter.py

this is the non-TP implementation: https://github.com/IBM/dolomite-engine/blob/main/dolomite_engine/hf_models/models/moe_dolomite/moe/scatter.py

oh, thank you so much for sharing.
ill try scatter moe with torchtitan or lingua :)

casper-hansen Nov 28, 2024

This fork has some code for "InternLM3 MoE" that implements a dropless MoE. It's not a ScatterMoE implementation, but does use Grouped GEMM.
https://github.com/zigzagcai/mytorchtitan

tianyu-l · 2024-11-23T00:35:24Z

tianyu-l
Nov 23, 2024
Collaborator Author

state space models (e.g. Mamba)

0 replies

tianyu-l · 2024-11-23T00:35:30Z

tianyu-l
Nov 23, 2024
Collaborator Author

diffusion models (e.g. DiT)

2 replies

gil2rok Nov 26, 2024

this!

tangjiasheng Nov 26, 2024

maybe this need to open a new repo.

ad8e · 2024-11-23T01:10:39Z

ad8e
Nov 23, 2024

What about varlen ring attention? There are a few variants which are not stable enough for production use; its engineering/testing obstacles make it suitable for the strengths of the PyTorch folks.

0 replies

casper-hansen · 2024-11-23T07:31:00Z

casper-hansen
Nov 23, 2024

Evaluations implementation with support for 4D parallelism

0 replies

tangjiasheng · 2024-11-23T10:42:15Z

tangjiasheng
Nov 23, 2024

I'd like to see a Deepspeed-Ulysses-Style sequence parallel used with DTensor.

0 replies

casper-hansen · 2024-11-23T15:53:50Z

casper-hansen
Nov 23, 2024

Qwen 2.5 support. Features needed on top of llama.py:

safetensors to DCP
attention bias on QKV
YaRN (implemented in qwen2.py in HF transformers)

1 reply

conceptofmind Nov 27, 2024

Definitely should add YaRN

dz1iang · 2024-11-25T12:30:25Z

dz1iang
Nov 25, 2024

Parallel training is non-intrusive to model implementation.

2 replies

tianyu-l Nov 25, 2024
Collaborator Author

Can you elaborate more on this request? I believe our current parallelisms have largely avoided intrusive changes to model implementations.

dz1iang Nov 26, 2024

Can you elaborate more on this request? I believe our current parallelisms have largely avoided intrusive changes to model implementations.

I have an idea to modify parallelize_llama.py into parallelize_model.py, allowing for quick parallel training of HuggingFace models.

samsja · 2024-11-25T15:50:02Z

samsja
Nov 25, 2024

Would be nice to have a native (and tested) transformer export script as well. There is some trick to do with the complex number rope implementation in the conversion. We have a script to do this that should be compatible with the torchtitan implementation as it was originally copied from it : https://github.com/PrimeIntellect-ai/prime/blob/main/scripts/export_dcp.py