inconsistent generation #35276

Closed
slatter666 opened this issue Dec 14, 2024 · 3 comments

slatter666 commented Dec 14, 2024

System Info

  • transformers version: 4.45.2
  • Python version: 3.8.18
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.32.1
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • GPU type: NVIDIA A10

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I used the same input but changed the code logic slightly, and got different results.

Here is the context of the code (mainly loading the model):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, DynamicCache

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, attn_implementation="flash_attention_2", device_map=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)

encoded_input = tokenizer("what is your name", return_tensors='pt').to(device)
window_size = 1
front_input = {key: value[:, :-window_size] for key, value in encoded_input.items()}
rear_input = {key: value[:, -window_size:] for key, value in encoded_input.items()}

And here is the first generation code:

past_key_values = DynamicCache()
generation = model.generate(**encoded_input, past_key_values=past_key_values, max_new_tokens=32, do_sample=False)
generation = tokenizer.batch_decode(generation)[0]
print(generation)

The output is:

what is your name?" and "what is your occupation?" are not necessary. The form is designed to be as simple and easy to fill out as possible, while still gathering the

And the second generation code is:

past_key_values = DynamicCache()
with torch.no_grad():
  _ = model(**front_input, past_key_values=past_key_values, use_cache=True)
generation = model.generate(**encoded_input, past_key_values=past_key_values, max_new_tokens=32, do_sample=False)
generation = tokenizer.batch_decode(generation)[0]
print(generation)

The output is:

what is your name?" and "what is your occupation?" are not necessary. The form is designed to be as simple and easy to fill out as possible, so that you can

Expected behavior

Well, it's weird. I think these two generation processes should be identical since I do not use sampling, so why are the results different? Is there anything wrong with my usage?
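
For reference, a quick sketch to pinpoint where the two outputs first diverge, assuming the raw token-id outputs of the two generate() calls above are kept as gen1 and gen2 (hypothetical names) before decoding:

min_len = min(gen1.shape[1], gen2.shape[1])
# compare the two greedy outputs token by token (both start with the same prompt)
mismatch = (gen1[0, :min_len] != gen2[0, :min_len]).nonzero()
if mismatch.numel() == 0:
    print("outputs identical up to length", min_len)
else:
    pos = mismatch[0].item()
    print("first divergence at position", pos)
    print(tokenizer.decode([gen1[0, pos].item()]), "vs", tokenizer.decode([gen2[0, pos].item()]))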

slatter666 added the bug label on Dec 14, 2024
slatter666 (Author) commented:

but when I change to use A100, the result is the same, OMG why is that

zucchini-nlp (Member) commented:

Hey @slatter666,

Since in one of the examples you generate with a cache precomputed from all but the last window_size tokens, while in the other you generate from the whole input in a single forward pass, the computation is chunked differently, which can lead to tiny numerical precision errors.

See #25420 (comment) for more on why caching can accumulate numerical precision errors.
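
As a rough illustration (just a sketch, reusing model, tokenizer, encoded_input, front_input and rear_input from the snippet above), you can compare the last-position logits of the two paths directly; on hardware where the kernels behave differently, the maximum absolute difference is small but nonzero, and greedy decoding can flip a token when two candidates are nearly tied:

with torch.no_grad():
    # Path 1: a single forward pass over the whole prompt
    full_logits = model(**encoded_input).logits[:, -1, :]

    # Path 2: prefill all but the last token into a cache, then feed the last token
    cache = DynamicCache()
    out = model(**front_input, past_key_values=cache, use_cache=True)
    cached_logits = model(
        input_ids=rear_input["input_ids"],
        attention_mask=encoded_input["attention_mask"],  # mask must cover cached + new tokens
        past_key_values=out.past_key_values,
        use_cache=True,
    ).logits[:, -1, :]

# a tiny nonzero difference here reflects accumulated numerical precision error, not a bug
print((full_logits - cached_logits).abs().max())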

slatter666 (Author) commented:

Thank you so much, that solves my issue.
