NaN during LLaMA3.1 SFT #437

Closed
duchengyao opened this issue Jan 17, 2025 · 2 comments · Fixed by #439

Comments

@duchengyao

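I followed the instruction fine-tuning example and got NaN when running SFT. The dataset comes from the "small-scale preprocessed data download for trial" section. The script is: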

UB_SKIPMC=1 sh run_mcore_llama3_1.sh  \
dsw  \
8B   \
1    \
4 \
1e-5   \
1e-6   \
128  \
128  \
bf16  \
4   \
1  \
1 \
true \
true   \
true \
true \
false   \
false \
100000  \
/DATA/DATASETS/Pai/qwen-datasets/mmap_qwen2_sft_datasets_text_document   \
/DATA/DATASETS/Pai/qwen-datasets/mmap_qwen2_sft_datasets_text_document   \
/DATA/DATASETS/Pai/llama3-ckpts/Meta-Llama-3.1-8B/mcore-tp4-pp1  \
10000  \
100   \
/DATA/DATASETS/Pai/output_mcore_llama3_1-finetune

[rank3]: AssertionError: Rank 3: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3, node: Du

@lostkevin
Contributor

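mmap data has already been tokenized, so it cannot be mixed across models. The sample data we provide in the README is for the qwen2 model without sequence packing; using it to train other models may cause NaN.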
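As a rough illustration of why pre-tokenized data is tied to one tokenizer: token IDs produced by one model's tokenizer can fall outside another model's vocabulary. The sketch below is not part of this repository. `find_out_of_vocab_ids` is a hypothetical helper, the sample IDs are made up for illustration, and it assumes you have already loaded the token IDs into a plain Python list. The only fixed fact used is the Llama 3.1 tokenizer vocabulary size of 128256. Whether out-of-range IDs are the exact cause of the NaN in this issue is an assumption, not something confirmed above.

```python
def find_out_of_vocab_ids(token_ids, vocab_size):
    """Return the sorted, de-duplicated IDs that fall outside [0, vocab_size)."""
    return sorted({t for t in token_ids if t < 0 or t >= vocab_size})


# Vocabulary size of the Llama 3.1 tokenizer.
LLAMA3_1_VOCAB_SIZE = 128256

# Hypothetical sample: two small IDs plus two IDs above 128256, standing in
# for tokens written by a tokenizer with a larger vocabulary.
sample_ids = [9906, 1917, 151643, 151645]

bad = find_out_of_vocab_ids(sample_ids, LLAMA3_1_VOCAB_SIZE)
if bad:
    print(f"{len(bad)} token ID(s) exceed the target vocabulary: {bad}")
else:
    print("All token IDs fit the target vocabulary.")
```

A check like this only catches IDs that are out of range. IDs that are in range but were produced by a different tokenizer still map to the wrong tokens, so the safe route is to re-tokenize the raw text with the tokenizer of the model being trained.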

@duchengyao
Author

Thanks.
