
setting of padding_side in Llama tokenizers #34842

Closed
ScottLiao920 opened this issue Nov 21, 2024 · 4 comments

Comments

@ScottLiao920

ScottLiao920 commented Nov 21, 2024

Hi there, just curious about the default setting of padding_side. If I understand this correctly, decoder-only LLM tokenizers should normally have padding_side='right', meaning the padding tokens appear after the actual input text tokens. However, I recently got this warning:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
I am running transformers version 4.46.2. Here's a test example using llama-3.1-8B-instruct; it seems left is the "right" side to go.
[Screenshot of the llama-3.1-8B-instruct test example omitted]

Originally posted by @ScottLiao920 in #25022 (comment)

@zucchini-nlp
Member

@ScottLiao920 the padding side should be left when generating and right when training/tuning. See docs for more info

If you have any further questions, please post them on the forum. We try to preserve GH for bugs and feature requests :)
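
(For concreteness, a minimal sketch of that setup for batched generation, assuming the llama-3.1-8B-instruct checkpoint from the report above and made-up prompts; any causal LM would behave the same way.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # checkpoint from the report above

# Left padding for generation; Llama tokenizers ship without a pad token,
# so reuse the EOS token for padding.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["The capital of France is", "Write one sentence about autumn:"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)

# With left padding, every sequence in the batch ends on a real token,
# so generation continues from the prompt rather than from pad tokens.
outputs = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```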

@ScottLiao920
Author

Thanks! Will note that.

@cgr71ii

cgr71ii commented Dec 9, 2024

> @ScottLiao920 the padding side should be left when generating and right when training/tuning. See docs for more info
>
> If you have any further questions, please post them on the forum. We try to preserve GH for bugs and feature requests :)

Hi!

Sorry for replying here and not in the forum, @zucchini-nlp, but all the posts I've read in the forum or in different GitHub repos specify one configuration or the other for the padding side without further explanation... Your comment is the only information I've found that explains the padding side.

Regarding what you said, wouldn't this affect the performance of the model, since we are modifying the format of the data? The model would face data in a different format than it saw during training. But if I'm wrong about this, do you mean that we train with padding_side = "right", temporarily switch to padding_side = "left" at evaluation time, finish the evaluation, and then switch back to padding_side = "right" to resume training?

Just to clarify, I understand why right padding is frequently applied for training (a common padding strategy to obtain a "square" batch shape) and why left padding should be mandatory for evaluation/inference (with right padding, the generated tokens would be appended after the padding tokens, which is odd and would likely hurt performance badly). What I don't fully understand is whether this exposure bias from the padding side in the data format will affect performance. I think it makes sense, for decoder-only models doing open-ended generation, to always apply left padding (of course, if we are fine-tuning, we are tied to the padding strategy that was originally applied if we don't want to strongly hurt performance or slow down training).

Thank you!

Posts I read:

https://huggingface.co/docs/transformers/en/llm_tutorial#wrong-padding-side (left)
https://discuss.huggingface.co/t/llama2-pad-token-for-batched-inference/48020 (left)
https://discuss.huggingface.co/t/fine-tuning-for-llama2-based-model-with-loftq-quantization/66737 (right)
https://discuss.huggingface.co/t/padding-side-in-instruction-fine-tuning-using-sftt/113549 (similar doubts to the ones I presented here)
#25022 (comment) (right)
#26072 (comment) (right for llama, often left for other models)
#26072 (comment) (left for inference! here it is specified)

@zucchini-nlp
Member

@cgr71ii hey!

You can also take a look at #26569 (comment).

TL;DR: most decoder-only models are pre-trained with packing and thus have no notion of padding in the middle of generation. As you pointed out, the model distribution would be shifted if we padded on the right and had to continue generating from the pad tokens, because the model never learned that way. Even if we try to mask out the padding tokens by using the attention_mask, that might hurt generation quality in some models.
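
(To make the distribution shift concrete, a small sketch with made-up prompts comparing how the same batch is padded on each side; with right padding the shorter sequence ends in pad tokens, which is exactly where generate() would have to continue from.)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tok.pad_token = tok.eos_token  # Llama tokenizers have no pad token by default

batch = ["Hi", "A much longer prompt than the first one in the batch"]

tok.padding_side = "right"
right = tok(batch, padding=True, return_tensors="pt")
# right.input_ids[0] ends in pad tokens: generation would append new tokens
# after the pads, a position the model never saw during pre-training.

tok.padding_side = "left"
left = tok(batch, padding=True, return_tensors="pt")
# left.input_ids[0] ends on the last real token, so generation continues naturally.

print(right.input_ids[0])
print(left.input_ids[0])
print(right.attention_mask[0], left.attention_mask[0])
```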

For training, decoder-only models are usually pre-trained without padding, but for SFT we can also use padding and feed each data element separately. It is advised to pad on the right side for SFTTrainer because there were some issues with left padding and Llama models when training in half precision (see https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da?permalink_comment_id=4636728#gistcomment-4636728). But apart from that, I don't think there is a strong reason for choosing one side over the other, as the loss is calculated only on non-pad tokens anyway.
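
(And a rough sketch of the training-side convention: right padding with the loss masked on pad positions. The -100 ignore index and the manual label masking follow common practice for causal-LM fine-tuning, not any specific trainer's internals; the example texts are made up.)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", padding_side="right")
tok.pad_token = tok.eos_token

texts = [
    "Question: What is 2 + 2? Answer: 4",
    "Question: What is the capital of France? Answer: Paris",
]
enc = tok(texts, padding=True, return_tensors="pt")

# Loss is computed only on real tokens: copy input_ids into labels and set
# every pad position to -100 so the cross-entropy loss ignores it.
labels = enc.input_ids.clone()
labels[enc.attention_mask == 0] = -100

batch = {"input_ids": enc.input_ids, "attention_mask": enc.attention_mask, "labels": labels}
# `batch` can be fed to a causal-LM forward pass or a Trainer-style loop; with this
# masking, the padding side does not change which tokens contribute to the loss.
```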
