Regarding padding and batched inference for LLAMA-2 and CodeLLAMA #26072

Closed
anmolagarwal999 opened this issue Sep 10, 2023 · 23 comments

Comments

@anmolagarwal999

anmolagarwal999 commented Sep 10, 2023

System Info

Platform:

  • transformers_version: "4.33.0.dev0"
  • Python: 3.8

Who can help?

@ArthurZucker @younesbelkada @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Regarding LLAMA-2 CHAT

I have been using LLAMA-2 13B chat for batched inference. I have followed the steps in the TIPS section here. My question is about which padding_side to choose. I have tried setting padding_side to both left and right, and my observations are as follows:

  • The results with padding_side = left are really very bad. The results with padding_side = right seem coherent and very good. This also seems to be backed up here.
  • However, on using the model with padding_side = right, I get the warning: A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Which padding_side should be used?

Regarding CodeLLAMA

The model page for CodeLLAMA has no guidelines on how to deal with the absence of a padding token. It would be good to have some documentation on details such as which padding token to set, which padding_side to use, etc.

Expected behavior

Consistent behaviour, i.e. the better results should come from the configuration that does not trigger the warning.

@ArthurZucker
Collaborator

Hey! The warning is a general warning. Left padding is the usual recommendation, but the original Llama codebase (and Code Llama is part of the Llama codebase) uses right padding by default. Our goal is to match the original results out of the box (hence right padding) while still letting users get the best results, so we give a recommendation on the padding side.
There is a guideline: CodeLlama is the same as Llama. Would it be clearer if the tip section is copied over to CodeLlama?
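
As a concrete sketch of that recommendation (the checkpoint name here is only an example), the padding side can be overridden when loading the tokenizer or afterwards:

from transformers import AutoTokenizer

# Illustrative example: override the default padding side for batched generation.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", padding_side="left")
# equivalently, after loading:
tokenizer.padding_side = "left"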

@anmolagarwal999
Author

Would it be clearer if the tip section is copied over to CodeLlama?

Yes it would help. Should I create a PR for this?

@ArthurZucker
Collaborator

Sure 😉

@gante
Member

gante commented Oct 16, 2023

Hey @anmolagarwal999 👋

Out of curiosity, have you passed the attention mask that came out of the tokenizer to model.generate? Not passing it is a common cause of performance degradation and would explain what you're seeing :)
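
A minimal sketch of that setup, assuming a Llama-2 chat checkpoint and left padding (the prompts and generation parameters are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompts = ["Write a haiku about padding.", "Explain attention masks in one sentence."]
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)

# Unpacking `inputs` forwards the attention_mask along with the input_ids,
# which is the point being made above.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))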

@rafa852

rafa852 commented Oct 16, 2023

#26072 (comment)

@gante
Member

gante commented Oct 24, 2023

Hi @rafa852 👋 Have a look at this doc section about padding sides: https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side

As for the padding token, it's common to set tokenizer.pad_token = tokenizer.eos_token :)
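
For completeness, a short sketch of that suggestion (the checkpoint name is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token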


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@HackGiter

Is there any way to silence the warning?

@ArthurZucker
Collaborator

If you read the issue you will see that you can simply do tokenizer.pad_token = tokenizer.eos_token. Could you read the issue?

@jianglong-he-Infrrd

Is it the same as setting tokenizer.pad_token = 0?

@ArthurZucker
Collaborator

No, you cannot set a token to an id. It is the same as tokenizer.pad_token_id = 0 if tokenizer.eos_token_id is 0.

@jianglong-he-Infrrd

jianglong-he-Infrrd commented Dec 18, 2023

No, you cannot set a token to an id. It is the same as tokenizer.pad_token_id = 0 if tokenizer.eos_token_id is 0.

Sorry, I mean setting tokenizer.pad_token_id = 0 while tokenizer.eos_token_id != 0. 0 is actually the id of the <unk> token in the Llama 2 config. Would this affect the inference-time results?

@ArthurZucker
Collaborator

It should not really affect inference; by default that is what is used. Feel free to use the EOS token, as it is common practice.

@gpucce

gpucce commented Jan 16, 2024

@ArthurZucker Hi, just to make sure I understood correctly from this issue: to run batched generation with Llama 2 models, is this enough?

tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-hf") # padding_side will default to "right"
tokenizer.pad_token_id = tokenizer.eos_token_id

I can't be 100% sure either from reading this issue or from the tip section.

@gante
Member

gante commented Jan 16, 2024

@gpucce left-padding should be used for batched inference (see this comment)
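
A corrected sketch of the snippet above might then look like this (left padding plus a pad token; the checkpoint name is kept from the question):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-hf", padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id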

@gpucce

gpucce commented Jan 17, 2024

@gante thank you very much, would this be the case also for T5 models?

@gante
Member

gante commented Jan 18, 2024

@gpucce nope, encoder-decoder models should use right-padding :)
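
For instance, a minimal sketch with a T5 checkpoint, keeping the tokenizer's default right padding (checkpoint and prompts are illustrative):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # pads on the right by default
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompts = [
    "translate English to German: Hello, how are you?",
    "summarize: Padding sides differ between decoder-only and encoder-decoder models.",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))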

@gpucce

gpucce commented Jan 18, 2024

@gpucce nope, encoder-decoder models should use right-padding :)

@gante can you explain a bit why this is the case?

@gante
Member

gante commented Jan 19, 2024

@gpucce Decoder-only models continue generating from the input prompt and can't have gaps between the end of the prompt and the start of generation. They were not trained to handle these gaps.

Encoder-decoder models convert the input prompt into an encoded vector, which is fed to a decoder. In this case, the decoder starts with an embedded input and input_ids = [bos_token_id]. The encoder was trained to handle padding on the right, but not padding on the left.
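
As a rough illustration of the layout (token strings rather than real ids):

# Decoder-only: generation continues from the last position of the batch,
# so right padding leaves pad tokens between the prompt and the continuation.
right_padded = ["Hello", "world", "<pad>", "<pad>"]   # continuation starts after the pads
left_padded  = ["<pad>", "<pad>", "Hello", "world"]   # continuation starts right after "world"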

@ShengYun-Peng

ShengYun-Peng commented Apr 16, 2024

Hi @gante and @ArthurZucker, your responses above are really helpful! Could you point me to the code showing how the positional embedding deals with left padding?

I am asking because, if absolute positional embeddings are used, the positional embedding also needs to be left-padded, i.e. right-shifted, so that the first position is correctly added to the first input token. For instance, the sinusoidal embedding in the vanilla Transformer and the RoPE embedding in Llama both need this kind of shifting. I also found an earlier discussion here which was quite helpful as an illustration. I tried both left and right padding with llama2-7b-chat (curious why Llama 2 also works with right padding, which shouldn't generally be the case for decoder-only LLMs) and found the output quite good, so I guess this kind of positional shifting is implemented somewhere in the codebase, but I cannot find it. Can you point me to where it is in the code?

@ArthurZucker
Collaborator

Do you mean this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L132-L145 ?

@ShengYun-Peng

ShengYun-Peng commented Apr 21, 2024

Do you mean this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L132-L145 ?

Thanks! Oh, I mean: when the attention_mask generated by the tokenizer is provided as shown here, how does the RoPE positional embedding deal with the padding in the attention_mask, e.g. by right-shifting the embedding?

@gante
Member

gante commented May 8, 2024

@ShengYun-Peng with generate, we derive the correct position indexes from the attention_mask, regardless of padding (here). The position ids are what's actually used to get the position embeddings.

If you call forward (e.g. if you're building your own custom generation loop), you need to provide position_ids if there is left-padding. If there is no padding or right-padding, we default to the correct position_ids (here).
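
A rough sketch of that derivation from the attention mask (it mirrors the idea in the linked code rather than reproducing it exactly):

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],    # left-padded row
                               [1, 1, 1, 1, 1]])   # unpadded row

position_ids = attention_mask.long().cumsum(-1) - 1   # real tokens get 0, 1, 2, ...
position_ids.masked_fill_(attention_mask == 0, 1)     # padded slots get a dummy index
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
# These position_ids can then be passed to forward() so the rotary embeddings line up.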
