setting of padding_side in Llama tokenizers #34842
@ScottLiao920 the padding side should be `left` for generation. If you have any further questions, please post them on the forum. We try to preserve GH for bugs and feature requests :)
Thanks! Will note that.
Hi! Sorry for replying here and not in the forum, @zucchini-nlp, but all the posts I've read in the forum or in different GitHub repos specify one or the other configuration for the padding side without further explanation... Your comment is the only information I've found that explains the padding side.

Regarding what you said, wouldn't this affect the performance of the model, since we are modifying the format of the data? The model will face different data compared to what it saw during training. But if I'm wrong about this, do you mean that when we train we configure right padding and then simply switch to left padding for inference, without this mismatch hurting performance?

Just to clarify, I understand why right padding is frequently applied for training (a common padding strategy to obtain a "square" batch shape) and why left padding should be mandatory for evaluation/inference (if we were using right padding, the generated tokens would be appended after the padding tokens, which is odd and would likely hurt the performance strongly). What I don't totally understand is whether this exposure bias due to the padding side in the format of the data will affect the performance. I think it makes sense, in the case of decoder-only models used for open-ended generation, to always apply left padding (of course, if we are fine-tuning, we are tied to the padding strategy that was originally applied if we don't want to strongly hurt the performance or slow down the training).

Thank you!

Posts I read: https://huggingface.co/docs/transformers/en/llm_tutorial#wrong-padding-side (left)
@cgr71ii hey! You can also take a look at #26569 (comment). TL;DR: most decoder-only models are pre-trained with packing and thus have no idea what to do with padding in the middle of a sequence during generation. As you pointed out, the model distribution will be shifted if we padded on the right and had to continue generating from the pad tokens, because the model never learned that way. Even if we try to mask out the padding tokens by using the attention mask, generation can still be affected, since the model never saw pad tokens in those positions during pre-training.

For training, decoder-only models are usually pre-trained without padding, but for SFT we can also use padding and feed each data element separately. It is advised to pad on the right side for SFTTrainer because there were some issues with left padding and Llama models when training in half precision (see https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da?permalink_comment_id=4636728#gistcomment-4636728). But apart from that, I don't think there is a strong reason for choosing one side over the other, as the loss is calculated only on non-pad tokens anyway.
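To make the convention above concrete, here is a minimal sketch (the checkpoint name and prompts are placeholders; Llama tokenizers ship without a pad token, so one is assigned explicitly): right padding plus loss masking for a training-style batch, left padding for batched generation.

```python
from transformers import AutoTokenizer

checkpoint = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any decoder-only checkpoint works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

texts = ["Short prompt", "A noticeably longer prompt that forces the short one to be padded"]

# Training-style batch: pad on the right and exclude pad positions from the loss,
# so the padding side barely matters for the loss itself.
tokenizer.padding_side = "right"
train_batch = tokenizer(texts, padding=True, return_tensors="pt")
labels = train_batch["input_ids"].clone()
labels[train_batch["attention_mask"] == 0] = -100  # -100 is ignored by the cross-entropy loss

# Generation-style batch: pad on the left so every row ends with real tokens
# and generation continues from them rather than from pad tokens.
tokenizer.padding_side = "left"
gen_batch = tokenizer(texts, padding=True, return_tensors="pt")
# outputs = model.generate(**gen_batch, max_new_tokens=32)  # model loading omitted here
```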
Hi there, just curious about the default setting of padding_side. If I understand this correctly, tokenizers for decoder-only LLMs should normally have padding_side='right', meaning the padding tokens appear after the actual input text tokens. However, I got this warning recently:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
I am running transformers version 4.46.2. I ran a quick test with llama-3.1-8B-instruct, and it seems left is the "right" side to go.
Originally posted by @ScottLiao920 in #25022 (comment)
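For completeness, a sketch that reproduces the warning and the fix described above (the hub id for llama-3.1-8B-instruct is assumed, and `device_map="auto"` requires the `accelerate` package):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed hub id for the checkpoint mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompts = ["Hello, how are", "Write one sentence about tokenizer padding:"]

# With right padding, generate() logs:
#   "A decoder-only architecture is being used, but right-padding was detected! ..."
tokenizer.padding_side = "right"
bad = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
_ = model.generate(**bad, max_new_tokens=20)

# Switching to left padding removes the warning and keeps generations correct.
tokenizer.padding_side = "left"
good = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
out = model.generate(**good, max_new_tokens=20)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```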