
Incorrect padding_side Setting as 'left' in Llama Family Model #25022

Closed

voidful opened this issue Jul 23, 2023 · 5 comments

Comments

@voidful (Contributor) commented Jul 23, 2023

System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.15.0-1041-azure-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.2
  • Safetensors version: 0.3.1

Who can help?

text models: @ArthurZucker and @younesbelkada
generate: @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using a Llama-family model for batch generation, an issue arises from the lack of a padding token: the original model uses pad_id = -1, implying that no padding token exists, which does not work for batch generation in transformers.

Here is our proposed solution:

Firstly, a padding token should be added with tokenizer.add_special_tokens({"pad_token": "<pad>"}), after which the token embeddings must be resized accordingly. It is also essential to set model.config.pad_token_id. The model's embed_tokens layer is initialized with self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx), which ensures that encoding the padding token outputs zeros; therefore, passing the padding index during initialization is recommended.
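
A minimal sketch of these steps, assuming a Llama-2 checkpoint and a `<pad>` token string (both illustrative choices, not prescribed by this issue):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative Llama-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Add a dedicated padding token (the string "<pad>" is an assumption here).
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# 2. Resize the token embeddings so the new token gets a row in embed_tokens.
model.resize_token_embeddings(len(tokenizer))

# 3. Tell the model config which id is the padding token.
model.config.pad_token_id = tokenizer.pad_token_id
```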

Expected behavior

Another important aspect is setting padding_side to 'right', so that padding is applied in the correct direction.
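
Continuing the sketch above, the requested padding side can be set directly on the tokenizer before batch-encoding (the later comments in this thread discuss why generation itself expects left padding):

```python
# Pad on the right so padding tokens come after the real input tokens.
tokenizer.padding_side = "right"

batch = tokenizer(
    ["Hello world", "A somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # trailing zeros mark the right-side padding
```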

@ArthurZucker (Collaborator)

Hey! Indeed, as written in the documentation, a padding token is required. It seems that by default the padding side is set to left. We cannot update the tokenization file (for backward-compatibility reasons), but we can update the tokenizers online to make sure they use padding_side = right by default.

@voidful (Contributor, Author) commented Jul 24, 2023

> Hey! Indeed, as written in the documentation, a padding token is required. It seems that by default the padding side is set to left. We cannot update the tokenization file (for backward-compatibility reasons), but we can update the tokenizers online to make sure they use padding_side = right by default.

Great, it would be nice to update the default padding_side of those models.

@anmolagarwal999

There does not seem to be any documentation regarding what the correct padding_side should be for the CodeLlama family. Is there a way to find this out? @ArthurZucker I also opened a related issue here.

@ArthurZucker (Collaborator) commented Sep 21, 2023

CodeLlama is part of the Llama family, so the same padding side applies. I answered on your issue 🤗

@ScottLiao920 commented Nov 21, 2024

Hi there, just curious about the default setting of padding_side. If I understand this correctly, tokenizers for decoder-only LLMs should normally have padding_side='right', meaning the padding tokens appear after the actual input text tokens. However, I recently got this warning:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
I am running transformers version 4.46.2; here's a test example using llama-3.1-8B-instruct. It seems left is the "right" side to go.
[Screenshot of the test example, 2024-11-21]
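
A rough reconstruction of that kind of test (the exact checkpoint name and the choice of reusing EOS as the pad token are assumptions, not taken from the screenshot):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as pad if none is defined
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = ["The capital of France is", "Write a haiku about padding:"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)

# With left padding, every prompt ends at the same position, so generation
# continues from the last real token of each prompt rather than from padding.
outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```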
