Mixture of experts pretraining benchmark #780
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✅
* fix(moe): Added weight decay parameter
* fix(moe): Added proper handling of device count per node
* refactor(moe): Data preprocessing cleanup
* fix(moe): This container has more stable convergence
* fix(gpu): data preprocessing
* build(gpu): Fix container image to specific version of NeMo and Megatron
Thank you @ShriyaPalsamudram for the review. Could you help merge the PR when you think it is in good shape, since I don't have authorization? The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR.
For sure, I will give the current GPU guide a try over the weekend!
* docs(moe): GPU running and slurm docs
* docs: Fixed markup
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download specified checkpoint from huggingface repository, preprocess it and save
Is there a step to verify checksums of the converted checkpoint to ensure correctness?
Is this converted checkpoint available for download directly from mlcommons drive? If yes, can those instructions be shared here as well?
There is no checkpoint in any valid form. The one inside mlcommons is neither in raw HF form (maybe that's a good point: should we mirror the HF checkpoint inside the S3 bucket?) nor NeMo compatible.
We should have both the raw HF checkpoint and the NeMo-compatible version in the S3 bucket, so that it lives on for as long as we need it and submitters can access all artifacts in the same place.
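On the checksum question above, a minimal sketch of how conversion output could be verified is below. The directory name `ckpt` and file `model.bin` are placeholders, not the benchmark's actual checkpoint layout; only the `sha256sum` workflow is the point.

```shell
# Hypothetical sketch: record checksums after checkpoint conversion,
# then verify them before use. Names below are placeholders.
mkdir -p ckpt
echo "demo-weights" > ckpt/model.bin                  # stand-in for a converted shard
(cd ckpt && sha256sum model.bin > checksums.sha256)   # producer side: record
(cd ckpt && sha256sum -c checksums.sha256)            # consumer side: verify, prints "model.bin: OK"
```

The `checksums.sha256` manifest could be published alongside the converted checkpoint in the S3 bucket so submitters can verify downloads with a single `sha256sum -c`.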
This script will download the specified checkpoint from the HuggingFace repository, preprocess it, and save
it into the specified directory.

To preprocess the dataset, use the dataset_preprocessing.py script
Same as for the checkpoint.
The preprocessed dataset can be downloaded from the mlcommons bucket. I added an annotation while keeping the manual preprocessing step in the documentation.
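For reference, the manual preprocessing step might look like the invocation below. Only the script name `dataset_preprocessing.py` comes from this PR; the flags shown are assumptions mirroring the checkpoint script's interface, not confirmed options.

```shell
# Hypothetical invocation; flag names are assumptions, not confirmed by the PR.
python dataset_preprocessing.py \
  --output_dir <path to save preprocessed dataset> \
  --hf_token <your token to HF repository>
```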
Description
Add MoE benchmark to mlcommons repo.
Todo list:
- TPU
- GPU
- General
cc @suexu1025 @ShriyaPalsamudram