FIX: Add safe guards for static cache + llama on transformers latest #401

younesbelkada · 2024-03-18T13:52:11Z

This PR makes AutoAWQ compatible with recent changes to the Llama architecture in transformers. The Llama implementation in transformers pre-allocates the causal mask with shape (bsz, 1, max_seq_len, max_seq_len), so the attention mask has to be sliced in the fused attention module.
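To illustrate why the slicing is needed, here is a minimal, dependency-free sketch. It uses nested lists in place of tensors and models only the last two dimensions of the mask. The names `build_causal_mask`, `slice_mask`, `start_pos`, and `seqlen` are assumptions chosen for illustration; this is not the code changed in attn.py.

```python
def build_causal_mask(max_seq_len):
    """Pre-allocate a max_seq_len x max_seq_len causal mask (True = may attend)."""
    return [[col <= row for col in range(max_seq_len)] for row in range(max_seq_len)]


def slice_mask(mask, start_pos, seqlen):
    """Keep the rows for the current query tokens and the columns for all keys so far.

    start_pos and seqlen are hypothetical names for the number of tokens
    already processed and the number of tokens in the current call.
    """
    kv_len = start_pos + seqlen
    return [row[:kv_len] for row in mask[start_pos:kv_len]]


# The mask is allocated once for the maximum length...
full_mask = build_causal_mask(8)

# ...but a 3-token prompt only needs the top-left 3 x 3 corner.
prefill_mask = slice_mask(full_mask, start_pos=0, seqlen=3)

# Decoding one more token needs a single row covering 4 key positions.
decode_mask = slice_mask(full_mask, start_pos=3, seqlen=1)
```

Without the slice, the mask's trailing dimensions would stay at max_seq_len and would not match attention scores computed for a shorter sequence.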

It also fixes an issue where running the following repro script fails:

from transformers import AutoModelForCausalLM, AwqConfig, AutoTokenizer

awq_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)
model = AutoModelForCausalLM.from_pretrained(
    "casperhansen/tinyllama-1b-awq",
    quantization_config=awq_config,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained("casperhansen/tinyllama-1b-awq")
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt").input_ids.to("cuda")

model.forward(input_ids, use_cache=False)
model.generate(input_ids, max_new_tokens=100)

When use_cache=False, we fall back to not using the caching logic.
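The effect of that guard can be sketched with a toy stand-in for a fused attention module. `ToyFusedAttention`, its `cache` list, and its `start_pos` counter are all hypothetical and only model the bookkeeping, not the real AutoAWQ module or its attention math.

```python
class ToyFusedAttention:
    """Toy model of a module that keeps its own cache position between calls."""

    def __init__(self):
        self.start_pos = 0  # number of tokens already written to the cache
        self.cache = []     # stands in for the key/value cache

    def forward(self, tokens, use_cache=True):
        if not use_cache:
            # Stateless path: attend over the current tokens only and
            # leave the cache and start_pos untouched.
            return list(tokens)

        # Cached path: append the new tokens and advance the position.
        self.cache.extend(tokens)
        self.start_pos += len(tokens)
        return list(self.cache)


attn = ToyFusedAttention()

# A plain forward pass with use_cache=False must not leave state behind...
attn.forward([1, 2, 3], use_cache=False)
state_after_plain_forward = (attn.start_pos, list(attn.cache))

# ...so a later cached call starts from position 0, as generate expects.
attn.forward([1, 2, 3], use_cache=True)
state_after_cached_forward = (attn.start_pos, list(attn.cache))
```

In this toy model, a forward pass that advanced start_pos despite use_cache=False would make the next cached call start from a stale position, which mirrors the forward-then-generate sequence in the repro script.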

cc @casper-hansen

younesbelkada added 2 commits March 18, 2024 14:46

Add safe guards for static cache + llama

78bbb54

Update attn.py

55811ee

younesbelkada requested a review from casper-hansen March 18, 2024 13:52

younesbelkada mentioned this pull request Mar 18, 2024

Running a forward pass before generate with AWQ fused modules breaks it huggingface/transformers#28470

Closed

4 tasks

casper-hansen mentioned this pull request Apr 6, 2024

v0.2.5 issue tracker #425

Closed

13 tasks

casper-hansen merged commit 1f07200 into main Apr 6, 2024

casper-hansen deleted the younesbelkada-patch-1 branch December 30, 2024 21:24
