
If I use the cache in the GPT-2 model from transformers, the logits differ from those of a forward pass from scratch #27040

Closed
juanKersul opened this issue Oct 24, 2023 · 9 comments

Comments

@juanKersul

System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: NO

Who can help?

@ArthurZucker
@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Just run the following code:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

torch.set_default_device("cuda")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
model.to("cuda")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # loaded but not used below

# Full forward pass over the whole sequence, no cache.
seq = torch.tensor([1, 2, 3, 4, 5])
original_out = model(input_ids=seq).logits

# Same sequence split in two: first build the KV cache from the prefix ...
seq2 = torch.tensor([1, 2, 3])
key_values = model(input_ids=seq2, use_cache=True).past_key_values

# ... then feed only the remaining tokens together with the cache.
new_seq = torch.tensor([4, 5])
magic = model(input_ids=new_seq, past_key_values=key_values).logits

# Compare the logits of the last token from both passes.
print(torch.equal(original_out[-1, :], magic[-1, :]))

Expected behavior

I expected it to return True.

@younesbelkada
Contributor

Hi, thanks for the issue!
I think this is related to #25420 (comment) and is expected. @gante can probably confirm whether I am mistaken 🙏 Thanks!

@gante
Member

gante commented Oct 24, 2023

@younesbelkada yes, it's the same expected behavior!

@juanKersul I recommend reading the comment linked above if you'd like to understand why this difference exists :)

@juanKersul
Author

juanKersul commented Oct 24, 2023

@gante @younesbelkada Thanks for the answer.
Do you know how big the error can be in this model?

@ArthurZucker
Collaborator

Mmm, for gpt2 you need to make sure you pass the position ids, otherwise they are not created. See #21080 as well; this seems more like a duplicate than something linked to the KV cache this time.

@gante
Member

gante commented Oct 25, 2023

@ArthurZucker I don't think the position IDs are a problem in the specific example above -- for batches with a single row without padding, when position_ids are not passed, they are correctly inferred in this line (which is present in most, if not all models)
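For reference, that fallback works roughly like this (a paraphrased sketch, not the exact library source; the helper name infer_position_ids is just for illustration):

import torch

# Paraphrased sketch of the fallback: when position_ids is None, the model
# rebuilds them from the cached sequence length, so new tokens get the
# correct absolute positions.
def infer_position_ids(input_ids, past_key_values=None):
    past_length = 0 if past_key_values is None else past_key_values[0][0].size(-2)
    return torch.arange(past_length, past_length + input_ids.size(-1),
                        dtype=torch.long, device=input_ids.device).unsqueeze(0)

# A fake gpt2-shaped cache holding 3 tokens (batch, heads, seq, head_dim):
# the two new tokens then get positions 3 and 4, matching the snippet above.
fake_cache = ((torch.zeros(1, 12, 3, 64), torch.zeros(1, 12, 3, 64)),)
print(infer_position_ids(torch.tensor([[4, 5]]), fake_cache))  # tensor([[3, 4]])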

@gante
Member

gante commented Oct 25, 2023

do you know how big the error can be in this model?

@juanKersul It is model- and input-dependent, but as a rule of thumb it is imperceptible in FP32 and quite small in 16-bit (but big enough to occasionally result in slightly different generated text).
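For concreteness, a minimal way to measure that difference with the same gpt2 setup as the reproduction above (the tolerance is illustrative, not a guarantee):

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

with torch.no_grad():
    # Full pass vs. prefix + cached continuation, as in the original snippet.
    full = model(input_ids=torch.tensor([1, 2, 3, 4, 5])).logits
    past = model(input_ids=torch.tensor([1, 2, 3]), use_cache=True).past_key_values
    cached = model(input_ids=torch.tensor([4, 5]), past_key_values=past).logits

# Bitwise equality usually fails, but the last-token logits agree within
# a small numerical tolerance in FP32.
print((full[-1, :] - cached[-1, :]).abs().max())
print(torch.allclose(full[-1, :], cached[-1, :], atol=1e-4))  # expected: True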

@ArthurZucker
Collaborator

Ah right, no padding, so no problem.

@juanKersul
Author

@gante If I use multiple rows without padding, do I have to do anything else with the position ids?

@gante
Member

gante commented Oct 25, 2023

No, multiple rows without padding is also okay :)

With padding, you must explicitly build the position ids (e.g. from the attention mask), otherwise you will get a performance drop
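A minimal sketch of that recipe for a padded batch (illustrative only; the pad-token choice and example sentences are arbitrary):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

batch = tokenizer(["hello world", "a somewhat longer sentence"],
                  return_tensors="pt", padding=True)

# Each real token gets its position from the cumulative sum of the attention
# mask; padded positions are set to a dummy value since they are masked anyway.
position_ids = batch["attention_mask"].long().cumsum(-1) - 1
position_ids.masked_fill_(batch["attention_mask"] == 0, 1)

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                position_ids=position_ids)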
