
If I use the cache in the GPT-2 model from transformers, the logits differ from those of a forward pass from scratch #27040

Closed
juanKersul opened this issue Oct 24, 2023 · 9 comments

Comments

@juanKersul

System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: NO

Who can help?

@ArthurZucker
@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Just run the following code:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

torch.set_default_device("cuda")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
model.to("cuda")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # loaded but not used below

# Full forward pass over the whole sequence, no cache.
seq = torch.tensor([1, 2, 3, 4, 5])
original_out = model(input_ids=seq).logits

# Same sequence split in two: first build the KV cache from the prefix ...
seq2 = torch.tensor([1, 2, 3])
key_values = model(input_ids=seq2, use_cache=True).past_key_values

# ... then feed only the remaining tokens together with the cache.
new_seq = torch.tensor([4, 5])
magic = model(input_ids=new_seq, past_key_values=key_values).logits

# Compare the logits of the last token from both passes.
print(torch.equal(original_out[-1, :], magic[-1, :]))

Expected behavior

I expected it to return True.

@younesbelkada
Contributor

Hi, thanks for the issue!
I think this is related to #25420 (comment) and is expected. @gante can probably confirm whether I am mistaken 🙏 Thanks!

@gante
Member

gante commented Oct 24, 2023

@younesbelkada yes, it's the same expected behavior!

@juanKersul I recommend reading the comment linked above if you'd like to understand why this difference exists :)

@juanKersul
Author

juanKersul commented Oct 24, 2023

@gante @younesbelkada Thanks for the answer.
Do you know how big the error can be in this model?

@ArthurZucker
Collaborator

Mmm, for gpt2 you need to make sure you pass the position ids, otherwise they are not created. See #21080 as well; this seems more like a duplicate than something linked to the KV cache this time.

@gante
Member

gante commented Oct 25, 2023

@ArthurZucker I don't think the position IDs are a problem in the specific example above -- for batches with a single row without padding, when position_ids are not passed, they are correctly inferred in this line (which is present in most, if not all models)
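For reference, that fallback works roughly like this (a paraphrased sketch, not the exact library source; the helper name infer_position_ids is just for illustration):

import torch

# Paraphrased sketch of the fallback: when position_ids is None, the model
# rebuilds them from the cached sequence length, so new tokens get the
# correct absolute positions.
def infer_position_ids(input_ids, past_key_values=None):
    past_length = 0 if past_key_values is None else past_key_values[0][0].size(-2)
    return torch.arange(past_length, past_length + input_ids.size(-1),
                        dtype=torch.long, device=input_ids.device).unsqueeze(0)

# A fake gpt2-shaped cache holding 3 tokens (batch, heads, seq, head_dim):
# the two new tokens then get positions 3 and 4, matching the snippet above.
fake_cache = ((torch.zeros(1, 12, 3, 64), torch.zeros(1, 12, 3, 64)),)
print(infer_position_ids(torch.tensor([[4, 5]]), fake_cache))  # tensor([[3, 4]])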

@gante
Member

gante commented Oct 25, 2023

do you know how big the error can be in this model?

@juanKersul It is model- and input-dependent, but as a rule of thumb it is imperceptible in FP32 and quite small in 16-bit (but big enough to occasionally result in slightly different generated text).
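For concreteness, a minimal way to measure that difference with the same gpt2 setup as the reproduction above (the tolerance is illustrative, not a guarantee):

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

with torch.no_grad():
    # Full pass vs. prefix + cached continuation, as in the original snippet.
    full = model(input_ids=torch.tensor([1, 2, 3, 4, 5])).logits
    past = model(input_ids=torch.tensor([1, 2, 3]), use_cache=True).past_key_values
    cached = model(input_ids=torch.tensor([4, 5]), past_key_values=past).logits

# Bitwise equality usually fails, but the last-token logits agree within
# a small numerical tolerance in FP32.
print((full[-1, :] - cached[-1, :]).abs().max())
print(torch.allclose(full[-1, :], cached[-1, :], atol=1e-4))  # expected: True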

@ArthurZucker
Collaborator

Ah right, no padding, so no problem.

@juanKersul
Author

@gante If I use multiple rows without padding, do I have to do anything else with the position ids?

@gante
Member

gante commented Oct 25, 2023

No, multiple rows without padding is also okay :)

With padding, you must explicitly build the position ids (e.g. from the attention mask), otherwise you will get a performance drop
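A minimal sketch of that recipe for a padded batch (illustrative only; the pad-token choice and example sentences are arbitrary):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

batch = tokenizer(["hello world", "a somewhat longer sentence"],
                  return_tensors="pt", padding=True)

# Each real token gets its position from the cumulative sum of the attention
# mask; padded positions are set to a dummy value since they are masked anyway.
position_ids = batch["attention_mask"].long().cumsum(-1) - 1
position_ids.masked_fill_(batch["attention_mask"] == 0, 1)

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                position_ids=position_ids)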
