Cross Contamination in SFT Trainer #204

Open
elichen3051 opened this issue Nov 18, 2024 · 1 comment

Comments

@elichen3051

Dear HuggingFace

I've noticed that run_cpt.py and run_sft.py set packing=True. However, we don't pass DataCollatorForCompletionOnlyLM to SFTTrainer; would this introduce cross-contamination between packed examples during training?

Reference article: Improving Hugging Face Training Efficiency Through Packing with Flash Attention
TRL issue on GitHub: huggingface/trl#805
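
To make the concern concrete, here is a minimal sketch of what packing does to the data. It assumes toy integer token ids, a hypothetical `EOS` separator id, and a hypothetical helper `pack_sequences`; it is not TRL's actual implementation, only an illustration of the general technique.

```python
from typing import List

EOS = 0  # hypothetical end-of-sequence token id used as a separator


def pack_sequences(examples: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples (each followed by EOS) and cut the
    resulting stream into fixed-size blocks."""
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(EOS)
    # Drop the incomplete tail so every block has exactly block_size tokens.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Three toy "tokenized" examples of different lengths.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)
```

In this toy run the second block contains tokens from two different examples. With a plain causal attention mask over the whole block, the later example's tokens can attend to the earlier example's tokens, which is the cross-contamination being asked about.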

@lewtun
Member

lewtun commented Nov 19, 2024

Hello @elichen3051, the task is the same whether one uses packing or not (i.e. next-token prediction). DataCollatorForCompletionOnlyLM is for the special case where you want to mask the inputs/prompts so that the loss is computed only on the completions, which in some cases gives a small performance boost.
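
As a rough illustration of what "masking the inputs / prompts" means, the sketch below replaces the prompt positions in the labels with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The helper `mask_prompt_labels` and the precomputed `prompt_len` are assumptions made for this example; as far as I understand, the real collator finds the prompt/response boundary from a response template instead.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Return labels equal to input_ids, except that the first prompt_len
    positions are set to IGNORE_INDEX so they contribute nothing to the loss."""
    return [
        IGNORE_INDEX if i < prompt_len else tok
        for i, tok in enumerate(input_ids)
    ]


# Toy example: the first three tokens are the prompt, the last two the completion.
input_ids = [5, 6, 7, 8, 9]
labels = mask_prompt_labels(input_ids, prompt_len=3)
print(labels)
```

Without this masking, the labels are simply a copy of `input_ids`, so the model is trained with next-token prediction on prompt and completion alike.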
