
setting of padding_side in Llama tokenizers #34842

Closed
ScottLiao920 opened this issue Nov 21, 2024 · 4 comments

Comments

@ScottLiao920

ScottLiao920 commented Nov 21, 2024

Hi there, just curious about the default setting of padding_side. If I understand this correctly, decoder-only LLM tokenizers should normally have padding_side='right', meaning the padding tokens appear after the actual input text tokens. However, I recently got this warning:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
I am running transformers version 4.46.2. Here's a test example using llama-3.1-8B-instruct; it seems left is the "right" side to go.
[Screenshot of the llama-3.1-8B-instruct test example omitted]

Originally posted by @ScottLiao920 in #25022 (comment)

@zucchini-nlp
Member

@ScottLiao920 the padding side should be left when generating and right when training/tuning. See docs for more info

If you have any further questions, please post them on the forum. We try to preserve GH for bugs and feature requests :)
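
(For concreteness, a minimal sketch of that setup for batched generation, assuming the llama-3.1-8B-instruct checkpoint from the report above and made-up prompts; any causal LM would behave the same way.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # checkpoint from the report above

# Left padding for generation; Llama tokenizers ship without a pad token,
# so reuse the EOS token for padding.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["The capital of France is", "Write one sentence about autumn:"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)

# With left padding, every sequence in the batch ends on a real token,
# so generation continues from the prompt rather than from pad tokens.
outputs = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```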

@ScottLiao920
Author

Thanks! Will note that.

@cgr71ii

cgr71ii commented Dec 9, 2024

> @ScottLiao920 the padding side should be left when generating and right when training/tuning. See docs for more info
>
> If you have any further questions, please post them on the forum. We try to preserve GH for bugs and feature requests :)

Hi!

Sorry for replying here and not in the forum, @zucchini-nlp, but all the posts I've read in the forum or in different GitHub repos specify one configuration or the other for the padding side without further explanation... Your comment is the only information I've found that explains the padding side.

Regarding what you said, wouldn't this affect the performance of the model, since we are modifying the format of the data? The model would face data in a different format than it saw during training. But if I'm wrong about this, do you mean that we train with padding_side = "right", temporarily switch to padding_side = "left" at evaluation time, finish the evaluation, and then switch back to padding_side = "right" to resume training?

Just to clarify, I understand why right padding is frequently applied for training (a common padding strategy to obtain a "square" batch shape) and why left padding should be mandatory for evaluation/inference (with right padding, the generated tokens would be appended after the padding tokens, which is odd and would likely hurt performance badly). What I don't fully understand is whether this exposure bias from the padding side in the data format will affect performance. I think it makes sense, for decoder-only models doing open-ended generation, to always apply left padding (of course, if we are fine-tuning, we are tied to the padding strategy that was originally applied if we don't want to strongly hurt performance or slow down training).

Thank you!

Posts I read:

https://huggingface.co/docs/transformers/en/llm_tutorial#wrong-padding-side (left)
https://discuss.huggingface.co/t/llama2-pad-token-for-batched-inference/48020 (left)
https://discuss.huggingface.co/t/fine-tuning-for-llama2-based-model-with-loftq-quantization/66737 (right)
https://discuss.huggingface.co/t/padding-side-in-instruction-fine-tuning-using-sftt/113549 (similar doubts to the ones I presented here)
#25022 (comment) (right)
#26072 (comment) (right for llama, often left for other models)
#26072 (comment) (left for inference! here it is specified)

@zucchini-nlp
Member

@cgr71ii hey!

You can also take a look at #26569 (comment).

TL;DR: most decoder-only models are pre-trained with packing and thus have no notion of padding in the middle of generation. As you pointed out, the model distribution would be shifted if we padded on the right and had to continue generating from the pad tokens, because the model never learned that way. Even if we try to mask out the padding tokens by using the attention_mask, that might hurt generation quality in some models.
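
(To make the distribution shift concrete, a small sketch with made-up prompts comparing how the same batch is padded on each side; with right padding the shorter sequence ends in pad tokens, which is exactly where generate() would have to continue from.)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tok.pad_token = tok.eos_token  # Llama tokenizers have no pad token by default

batch = ["Hi", "A much longer prompt than the first one in the batch"]

tok.padding_side = "right"
right = tok(batch, padding=True, return_tensors="pt")
# right.input_ids[0] ends in pad tokens: generation would append new tokens
# after the pads, a position the model never saw during pre-training.

tok.padding_side = "left"
left = tok(batch, padding=True, return_tensors="pt")
# left.input_ids[0] ends on the last real token, so generation continues naturally.

print(right.input_ids[0])
print(left.input_ids[0])
print(right.attention_mask[0], left.attention_mask[0])
```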

For training, decoder-only models are usually pre-trained without padding, but for SFT we can also use padding and feed each data element separately. It is advised to pad on the right side for SFTTrainer because there were some issues with left padding and Llama models when training in half precision (see https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da?permalink_comment_id=4636728#gistcomment-4636728). But apart from that, I don't think there is a strong reason for choosing one side over the other, as the loss is calculated only on non-pad tokens anyway.
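
(And a rough sketch of the training-side convention: right padding with the loss masked on pad positions. The -100 ignore index and the manual label masking follow common practice for causal-LM fine-tuning, not any specific trainer's internals; the example texts are made up.)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", padding_side="right")
tok.pad_token = tok.eos_token

texts = [
    "Question: What is 2 + 2? Answer: 4",
    "Question: What is the capital of France? Answer: Paris",
]
enc = tok(texts, padding=True, return_tensors="pt")

# Loss is computed only on real tokens: copy input_ids into labels and set
# every pad position to -100 so the cross-entropy loss ignores it.
labels = enc.input_ids.clone()
labels[enc.attention_mask == 0] = -100

batch = {"input_ids": enc.input_ids, "attention_mask": enc.attention_mask, "labels": labels}
# `batch` can be fed to a causal-LM forward pass or a Trainer-style loop; with this
# masking, the padding side does not change which tokens contribute to the loss.
```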
