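Following the instruction fine-tuning example, I get NaN when running SFT. The dataset is the one offered for trial under "small-scale preprocessed data download". The script is: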
UB_SKIPMC=1 sh run_mcore_llama3_1.sh \
dsw \
8B \
1 \
4 \
1e-5 \
1e-6 \
128 \
128 \
bf16 \
4 \
1 \
1 \
true \
true \
true \
true \
false \
false \
100000 \
/DATA/DATASETS/Pai/qwen-datasets/mmap_qwen2_sft_datasets_text_document \
/DATA/DATASETS/Pai/qwen-datasets/mmap_qwen2_sft_datasets_text_document \
/DATA/DATASETS/Pai/llama3-ckpts/Meta-Llama-3.1-8B/mcore-tp4-pp1 \
10000 \
100 \
/DATA/DATASETS/Pai/output_mcore_llama3_1-finetune
[rank3]: AssertionError: Rank 3: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3, node: Du
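mmap data has already been tokenized, so it cannot be shared across models. The sample data we provide in the README is for the qwen2 model without sequence packing; using it to train a different model may cause NaN.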
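For anyone hitting the same error, here is a rough sanity check, not part of the repository. It assumes you can load the token IDs of a few samples into a Python list (how you read the mmap files is up to you) and that you know the target model's vocabulary size from its config. The function name, sample IDs, and vocabulary size below are all hypothetical placeholders.

```python
from typing import Iterable, List, Tuple


def find_out_of_vocab_ids(
    token_ids: Iterable[int], vocab_size: int, limit: int = 10
) -> List[Tuple[int, int]]:
    """Return up to `limit` (position, token_id) pairs outside [0, vocab_size)."""
    bad: List[Tuple[int, int]] = []
    for pos, tok in enumerate(token_ids):
        if tok < 0 or tok >= vocab_size:
            bad.append((pos, tok))
            if len(bad) >= limit:
                break
    return bad


# Hypothetical values for illustration only: replace them with IDs read
# from your dataset and the vocab size from your model's config.
sample_ids = [1, 42, 127999, 151643, 9]
assumed_vocab_size = 128256

print(find_out_of_vocab_ids(sample_ids, assumed_vocab_size))
```

An empty result does not prove the data is compatible: IDs produced by another tokenizer can fall inside the valid range and still map to the wrong tokens. The reliable fix is to re-tokenize the raw data with the tokenizer of the model being trained.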
Thanks.