Regarding padding and batched inference for LLAMA-2 and CodeLLAMA #26072
Comments
Hey! The warning is a general warning.
Yes it would help. Should I create a PR for this?
Sure 😉
Hey @anmolagarwal999 👋 Out of curiosity, have you passed the attention mask that came out of the tokenizer to `generate`?
Hi @rafa852 👋 Have a look at this doc section about padding sides: https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side As for the padding token, it's common to set it to the EOS token.
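For concreteness, a minimal sketch of that setup — the checkpoint name and prompts below are illustrative, not taken from this thread:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any LLaMA-2 chat tokenizer behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    padding_side="left",  # decoder-only models should be left-padded for generation
)
# LLaMA-2 ships without a pad token, so reuse the EOS token as suggested above.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Write a haiku about padding.", "Explain attention in one sentence."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```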
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there any solution to close the warning?
If you read the issue you will see that you can simply do …
is it the same as setting …
No, you cannot set a …
sorry, I mean setting …
It should not really affect inference, no; by default it is what is used. Feel free to use the eos token as it is common practice.
@ArthurZucker Hi, just to make sure I understood correctly from this issue, to run batched generation with llama 2 models, is this enough?
I can't be 100% sure from reading either this issue or the tip section.
@gpucce left-padding should be used for batched inference (see this comment)
@gante thank you very much, would this be the case also for T5 models?
@gpucce nope, encoder-decoder models should use right-padding :)
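A hedged end-to-end sketch of batched generation with left padding, following the advice in this thread; the checkpoint, prompts, and generation settings are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    ["Hello, my name is", "The capital of France is"],
    padding=True,
    return_tensors="pt",
).to(model.device)

# Passing the attention mask (not just input_ids) is what lets generate()
# ignore the pad tokens on the left.
outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```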
@gpucce Decoder-only models continue generating from the input prompt and can't have gaps between the end of the prompt and the start of generation. They were not trained to handle these gaps. Encoder-decoder models convert the input prompt into an encoded vector, which is fed to a decoder. In this case, the decoder starts from its own start token and only sees the prompt through cross-attention, so padding on the encoder side does not create a gap in the generated sequence.
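For comparison, a sketch of the encoder-decoder case under the same caveats (the T5 checkpoint and prompts are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"  # illustrative encoder-decoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)  # right padding is the default
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer(
    ["translate English to German: Hello.", "summarize: Padding sides matter for batching."],
    padding=True,
    return_tensors="pt",
)

# The decoder starts from its own start token and reads the encoder output via
# cross-attention, so right-padded encoder inputs do not create gaps in generation.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```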
Hi @gante and @ArthurZucker, your responses above are really helpful! Could you point me to the code that handles left padding in the positional embedding? I am asking because if an absolute positional embedding is used, the positional embedding also needs to be left-padded, i.e., right-shifted, so that the first position is correctly added to the first input token. For instance, the sinusoid embedding in the vanilla Transformer and the RoPE embedding in LLaMA both need this type of shifting. I also found an earlier discussion here which was quite helpful as an illustration. Since I tried both left and right padding in llama2-7b-chat (curious why llama2 also works with right padding, which shouldn't be the case for decoder-only LLMs) and found the output to be quite good, I guess this type of positional shifting is implemented somewhere in the codebase, but I cannot find it. Can you point me to where it is in the code?
Thanks! Oh, I mean when …
@ShengYun-Peng with … If you call …
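To address the position-embedding question above, here is a small sketch of how position ids are typically derived from the attention mask for left-padded inputs. It mirrors the logic used in the generation path of decoder-only models in transformers, written out as an assumption rather than a quote from the thread:

```python
import torch

attention_mask = torch.tensor(
    [[0, 0, 1, 1, 1],   # left-padded sequence: 2 pad tokens, 3 real tokens
     [1, 1, 1, 1, 1]]   # full-length sequence
)

# Count only the real tokens, so the first real token gets position 0 even when
# it is preceded by padding; pad positions get a dummy value (they are masked
# out of attention anyway).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```

If you call the model's forward directly with padded inputs, you may need to build position_ids like this yourself, depending on the transformers version.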
System Info
Platform:
Who can help?
@ArthurZucker @younesbelkada @gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Regarding LLAMA-2 CHAT
I have been using LLAMA-2 13B chat for batched inference. I have followed the steps in the TIPS section here. My question is regarding the padding_side to be chosen. I have tried setting the padding_side to both left and right, and my observations are as follows. With right padding, the following warning is emitted:
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
What is the padding_side to be used?
Regarding CodeLLAMA
There are no guidelines on the CodeLLAMA model page on how the absence of a padding token should be dealt with. It would be good to have some documentation on questions such as which padding token should be set, which padding_side should be used, etc.
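A hedged sketch of the corresponding CodeLLAMA setup, mirroring the advice given earlier in the thread rather than official documentation; the checkpoint name is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-hf",  # illustrative checkpoint
    padding_side="left",          # left padding for decoder-only batched generation
)
# CodeLLAMA also ships without a dedicated pad token, so reuse EOS here — an
# assumption carried over from the LLaMA-2 advice in this thread.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.pad_token, tokenizer.padding_side)
```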
Expected behavior
Consistent behaviour, i.e. better results, in the case when there is no warning.