
Mixture of experts pretraining benchmark #780

Open
ZhiyuLi-goog wants to merge 13 commits into master

Conversation


@ZhiyuLi-goog commented Jan 6, 2025

Description

Add the MoE benchmark to the mlcommons repo.

todo list

TPU

  • docker image verification
  • run the workload at small scale
  • run the workload at large scale

GPU

General

cc @suexu1025 @ShriyaPalsamudram

@ZhiyuLi-goog requested a review from a team as a code owner on January 6, 2025, 10:57

github-actions bot commented Jan 6, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ZhiyuLi-goog and others added 7 commits January 10, 2025 03:27
* fix(moe): Added weight decay parameter

* fix(moe): Added proper handling of device count per node

* refactor(moe): Data preprocessing cleanup

* fix(moe): This container has more stable convergence

* fix(gpu): data preprocessing

* build(gpu): Fix container image to specific version of NeMo and Megatron
@ZhiyuLi-goog (Author)

Thank you @ShriyaPalsamudram for the review.

Could you help merge this PR once you think it is in good shape? I don't have authorization to merge it myself.

The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR:

  • update the NeMo 2.0 GPU guides: @hXl3s, could you help update them? @JustinPan-goog, could you give them a try and review the updated guides? Thank you both!

@ZhiyuLi-goog changed the title from "[Draft] MoE Benchmark" to "MoE Benchmark" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "MoE Benchmark" to "mixture_of_experts_pretraining" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "mixture_of_experts_pretraining" to "Mixture of experts pretraining benchmark" on Jan 10, 2025
@JustinPan-goog

For sure, I will give the current GPU guide a try over the weekend!


* docs(moe): GPU running and slurm docs

* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.
Contributor

Is there a step to verify checksums of the converted checkpoint to ensure correctness?

Is this converted checkpoint available for download directly from the mlcommons drive? If yes, can those instructions be shared here as well?
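
For illustration, if a sha256 manifest were published alongside the converted checkpoint, the verification step could be a short script along the following lines. The manifest format and file layout here are assumptions for the sketch, not something this PR currently provides.

```
# Illustrative sketch only, not part of this PR: verify a converted checkpoint
# against an assumed sha256 manifest with lines of the form
# "<hexdigest>  <relative/path>".
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(checkpoint_dir: Path, manifest: Path) -> bool:
    ok = True
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = sha256_of(checkpoint_dir / rel_path)
        if actual != expected:
            print(f"MISMATCH {rel_path}: expected {expected}, got {actual}")
            ok = False
    return ok
```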

Reply

There is no checkpoint in any valid form. The one on the mlcommons drive is neither in raw HF format (maybe that's a good point; should we mirror the HF checkpoint in the S3 bucket?) nor NeMo compatible.

Contributor

We should have both the raw HF checkpoint and the NeMo-compatible version in the S3 bucket, so the artifacts live on as long as we need them and submitters can access everything in the same place.

To preprocess the dataset, use the dataset_preprocessing.py script
Contributor

Same questions as for the checkpoint above.

Reply

The preprocessed dataset can be downloaded from the mlcommons bucket. I added a note about that while keeping the manual preprocessing steps in the documentation.

@JustinPan-goog

The checkpoint_download.py script, when using HFMixtralImporter, seems to be experiencing compatibility issues with certain NeMo/Megatron versions:

  1. The script initially failed with an ImportError: cannot import name '__version__' from 'nemo'. A PR was merged yesterday that adds a try-except block around that import (a minimal sketch of the pattern is shown after this list): NVIDIA/NeMo@7d74e71#diff-e5559e6e42d963c2b10dcbb8c739bd185285a21f8c8f1c038f64529b9cf8aff0.

  2. After checking out commit 7d74e71, a new error emerged: ImportError: cannot import name 'AttnBackend' from 'megatron.core.transformer.enums'. I presume I should update the Megatron version as well, but would like to double-check this with @hXl3s.
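
For context, the guard added in the linked NeMo commit follows the usual optional-import pattern; the sketch below illustrates that general pattern and is not the exact NVIDIA/NeMo patch (see the linked commit for the real change).

```
# Minimal sketch of the try-except import guard described in item 1 above.
# Illustration only; see the linked NVIDIA/NeMo commit for the actual change.
try:
    from nemo import __version__ as nemo_version
except ImportError:
    # Some NeMo builds do not expose __version__ at the package root,
    # which is what triggered the original ImportError.
    nemo_version = None
```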
