Left-padded inputs passed to Mistral-7B-Instruct with FlashAttention-2 cause garbage outputs for the padded sequences #29075
Comments
Update: so downgrading to […]. This is still a problem though with the […].
Having a look right now, but the padding and the attention mask should not be manually changed; the tokenizer is supposed to take care of that.
Yes, I know, of course; this is just an example to replicate what's happening and to visualize the bug for you. In my actual code, the tokenizer takes care of the padding and the attention mask (my batch size is > 1 there, but this is one example I've scoped down to showcase the issue). Generally, the trend is that in any batched input, the only sample with coherent output is the one without padding.
That would be very surprising, as we try to make sure padding influences the output as little as possible. Of course there is some effect, but it should be limited, and we test left-padded generation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Left padding is required for batched generation with decoder-only models.
tokenizer_kwargs = {
    "add_bos_token": True,
    "add_eos_token": False,
    "padding_side": "left",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="balanced",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_kwargs)
tokenizer.pad_token_id = tokenizer.eos_token_id  # Mistral has no pad token by default.

# The shorter prompt in the batch is left-padded by the tokenizer.
inputs = tokenizer(["Hey! How are you doing?", "My favorite condiment is definitely:"], return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, num_beams=2, no_repeat_ngram_size=3, max_new_tokens=256, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs))
I used the reproduction snippet above, which generates coherent, good text for the padded sequence.
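For a side-by-side check, the same batched generation can be run with and without FlashAttention-2 and the decoded outputs compared directly. This is a minimal sketch, assuming the same model, tokenizer settings, and prompts as the snippet above; the "sdpa" baseline and greedy decoding are illustrative choices, not taken from the thread.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True, add_eos_token=False, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id
prompts = ["Hey! How are you doing?", "My favorite condiment is definitely:"]

def generate_with(attn_implementation):
    # Load the model with the requested attention backend and decode greedily
    # so the two runs are directly comparable.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_implementation,
        device_map="auto",
    )
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=64, pad_token_id=tokenizer.pad_token_id)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

fa2_outputs = generate_with("flash_attention_2")
baseline_outputs = generate_with("sdpa")  # or "eager"
for fa2_text, baseline_text in zip(fa2_outputs, baseline_outputs):
    print("MATCH" if fa2_text == baseline_text else "DIFFERS")

If the bug is present, the shorter (left-padded) prompt in the batch is the one expected to diverge between the two runs.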
System Info
transformers version: 4.36.2
PyTorch version: 2.2.0
Platform: Rocky Linux release 8.8 (Green Obsidian), 4.18.0-477.27.1.el8_8.x86_64
Python version: 3.9.18
Accelerate version: 0.26.1
FlashAttention-2 version: 2.5.3
Who can help?
@ArthurZucker, @younesbelkada
Reproduction
Inference with Mistral-7B varies wildly between padded and unpadded inputs when FlashAttention-2 is used.
The behavior also seems to depend on the complexity of the task: in my case I'm doing multi-document summarization, and my example is a multi-document one. I didn't try too hard to find a simpler example because a short, simple input didn't seem to exhibit the same issue.
For the reproduction, I've also attached the text I use (the examples below read in this text for debugging):
text.txt
Example (minimal reproduction):
With FlashAttention-2
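A minimal sketch of what a left-padded, batched FA-2 generation of this shape can look like, assuming the long document comes from the attached text.txt and is batched with a shorter prompt; the prompts, instruction formatting, and generation settings here are illustrative, not the exact ones from the report.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True, add_eos_token=False, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# One long multi-document prompt plus one short prompt, so the shorter
# sequence in the batch gets left-padded by the tokenizer.
with open("text.txt") as f:
    documents = f.read()
prompts = [
    f"[INST] Summarize the following documents:\n{documents} [/INST]",
    "[INST] Hey! How are you doing? [/INST]",
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))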
The output:
Without FlashAttention-2
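Presumably only the attention backend changes for the comparison run; here is a sketch of the single differing call, assuming the rest of the script above stays the same.

# FlashAttention-2 disabled at load time; omitting attn_implementation
# (or passing "eager") also selects a non-FA-2 backend.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)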
The output:
I discovered this issue by debugging and removing the padding from the beginning of the sequence: if the padding is gone, the behavior with and without FA-2 is similar (see the sketch below). Other debugging attempts: I upgraded FA-2 to the latest version and torch to 2.2.0, but neither fixed the problem. I did the PyTorch upgrade because of pytorch/pytorch#112577, but that didn't seem to be the cause. I also upgraded transformers to 4.37.2 and the problem was still there.
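One way to reproduce that check is to generate each prompt on its own, so no padding is inserted at all; a brief sketch, reusing the prompts, tokenizer, and model from the sketches above (either attention backend):

# Per-prompt (batch size 1) generation never pads, which is the case where
# outputs with and without FA-2 were reported to agree.
for prompt in prompts:
    single = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**single, do_sample=False, max_new_tokens=256, pad_token_id=tokenizer.pad_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True))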
Expected behavior
Inference with FA-2 should behave essentially the same as inference without FA-2, but here the outputs for the padded sequences are wildly different.