
Mixture of experts pretraining benchmark #780

Open
ZhiyuLi-goog wants to merge 13 commits into master

Conversation


@ZhiyuLi-goog commented Jan 6, 2025

Description

Add the MoE benchmark to the mlcommons repo.

todo list

TPU

  • docker image verification
  • run the workload at small scale
  • run the workload at large scale

GPU

General

cc @suexu1025 @ShriyaPalsamudram

@ZhiyuLi-goog requested a review from a team as a code owner on January 6, 2025, 10:57

github-actions bot commented Jan 6, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ZhiyuLi-goog and others added 7 commits January 10, 2025 03:27
* fix(moe): Added weight decay parameter

* fix(moe): Added proper handling of device count per node

* refactor(moe): Data preprocessing cleanup

* fix(moe): This container has more stable convergence

* fix(gpu): data preprocessing

* build(gpu): Fix container image to specific version of NeMo and Megatron
@ZhiyuLi-goog (Author)

Thank you @ShriyaPalsamudram for the review.

Could you help merge this PR once you think it is in good shape? I don't have authorization to merge it myself.

The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR:

  • update the NeMo 2.0 GPU guides: @hXl3s, could you help update them? @JustinPan-goog, could you give them a try and review the updated guides? Thank you both!

@ZhiyuLi-goog changed the title from "[Draft] MoE Benchmark" to "MoE Benchmark" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "MoE Benchmark" to "mixture_of_experts_pretraining" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "mixture_of_experts_pretraining" to "Mixture of experts pretraining benchmark" on Jan 10, 2025
@JustinPan-goog

For sure, I will give the current GPU guide a try over the weekend!


* docs(moe): GPU running and slurm docs

* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.
Contributor

Is there a step to verify checksums of the converted checkpoint to ensure correctness?

Is this converted checkpoint available for download directly from the mlcommons drive? If yes, can those instructions be shared here as well?
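
For illustration, if a sha256 manifest were published alongside the converted checkpoint, the verification step could be a short script along the following lines. The manifest format and file layout here are assumptions for the sketch, not something this PR currently provides.

```
# Illustrative sketch only, not part of this PR: verify a converted checkpoint
# against an assumed sha256 manifest with lines of the form
# "<hexdigest>  <relative/path>".
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(checkpoint_dir: Path, manifest: Path) -> bool:
    ok = True
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = sha256_of(checkpoint_dir / rel_path)
        if actual != expected:
            print(f"MISMATCH {rel_path}: expected {expected}, got {actual}")
            ok = False
    return ok
```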

Reply

There is no checkpoint in any valid form. The one on the mlcommons drive is neither in raw HF format (maybe that's a good point; should we mirror the HF checkpoint in the S3 bucket?) nor NeMo compatible.

Contributor

We should have both the raw HF checkpoint and the NeMo-compatible version in the S3 bucket, so the artifacts live on as long as we need them and submitters can access everything in the same place.

To preprocess the dataset, use the dataset_preprocessing.py script
Contributor

Same questions as for the checkpoint above.

Reply

The preprocessed dataset can be downloaded from the mlcommons bucket. I added a note about that while keeping the manual preprocessing steps in the documentation.

@JustinPan-goog

The checkpoint_download.py script, when using HFMixtralImporter, seems to be experiencing compatibility issues with certain NeMo/Megatron versions:

  1. The script initially failed with an ImportError: cannot import name '__version__' from 'nemo'. A PR was merged yesterday that adds a try-except block around that import (a minimal sketch of the pattern is shown after this list): NVIDIA/NeMo@7d74e71#diff-e5559e6e42d963c2b10dcbb8c739bd185285a21f8c8f1c038f64529b9cf8aff0.

  2. After checking out commit 7d74e71, a new error emerged: ImportError: cannot import name 'AttnBackend' from 'megatron.core.transformer.enums'. I presume I should update the Megatron version as well, but would like to double-check this with @hXl3s.
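
For context, the guard added in the linked NeMo commit follows the usual optional-import pattern; the sketch below illustrates that general pattern and is not the exact NVIDIA/NeMo patch (see the linked commit for the real change).

```
# Minimal sketch of the try-except import guard described in item 1 above.
# Illustration only; see the linked NVIDIA/NeMo commit for the actual change.
try:
    from nemo import __version__ as nemo_version
except ImportError:
    # Some NeMo builds do not expose __version__ at the package root,
    # which is what triggered the original ImportError.
    nemo_version = None
```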
