inconsistent generation #35276

Closed
slatter666 opened this issue Dec 14, 2024 · 3 comments

slatter666 commented Dec 14, 2024

System Info

  • transformers version: 4.45.2
  • Python version: 3.8.18
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.1
  • Accelerate version: 0.32.1
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • GPU type: NVIDIA A10

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I used the same input but changed the code logic slightly, and got different results.

Here is the context of the code (mainly loading the model):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, DynamicCache

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, attn_implementation="flash_attention_2", device_map=device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)

encoded_input = tokenizer("what is your name", return_tensors='pt').to(device)
window_size = 1
front_input = {key: value[:, :-window_size] for key, value in encoded_input.items()}
rear_input = {key: value[:, -window_size:] for key, value in encoded_input.items()}

And here is the first generation code:

past_key_values = DynamicCache()
generation = model.generate(**encoded_input, past_key_values=past_key_values, max_new_tokens=32, do_sample=False)
generation = tokenizer.batch_decode(generation)[0]
print(generation)

The output is:

what is your name?" and "what is your occupation?" are not necessary. The form is designed to be as simple and easy to fill out as possible, while still gathering the

And the second generation code is:

past_key_values = DynamicCache()
with torch.no_grad():
  _ = model(**front_input, past_key_values=past_key_values, use_cache=True)
generation = model.generate(**encoded_input, past_key_values=past_key_values, max_new_tokens=32, do_sample=False)
generation = tokenizer.batch_decode(generation)[0]
print(generation)

The output is:

what is your name?" and "what is your occupation?" are not necessary. The form is designed to be as simple and easy to fill out as possible, so that you can

Expected behavior

Well, it's weird. I think these two generation processes should be identical since I do not use sampling, so why are the results different? Is there anything wrong with my usage?
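
For reference, a quick sketch to pinpoint where the two outputs first diverge, assuming the raw token-id outputs of the two generate() calls above are kept as gen1 and gen2 (hypothetical names) before decoding:

min_len = min(gen1.shape[1], gen2.shape[1])
# compare the two greedy outputs token by token (both start with the same prompt)
mismatch = (gen1[0, :min_len] != gen2[0, :min_len]).nonzero()
if mismatch.numel() == 0:
    print("outputs identical up to length", min_len)
else:
    pos = mismatch[0].item()
    print("first divergence at position", pos)
    print(tokenizer.decode([gen1[0, pos].item()]), "vs", tokenizer.decode([gen2[0, pos].item()]))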

slatter666 added the bug label on Dec 14, 2024
slatter666 (Author) commented:

but when I change to use A100, the result is the same, OMG why is that

zucchini-nlp (Member) commented:

Hey @slatter666,

Since in one of the examples you generate with a cache precomputed from all but the last window_size tokens, while in the other you generate from the whole input in a single forward pass, the computation is chunked differently, which can lead to tiny numerical precision errors.

See #25420 (comment) for more on why caching can accumulate numerical precision errors.
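
As a rough illustration (just a sketch, reusing model, tokenizer, encoded_input, front_input and rear_input from the snippet above), you can compare the last-position logits of the two paths directly; on hardware where the kernels behave differently, the maximum absolute difference is small but nonzero, and greedy decoding can flip a token when two candidates are nearly tied:

with torch.no_grad():
    # Path 1: a single forward pass over the whole prompt
    full_logits = model(**encoded_input).logits[:, -1, :]

    # Path 2: prefill all but the last token into a cache, then feed the last token
    cache = DynamicCache()
    out = model(**front_input, past_key_values=cache, use_cache=True)
    cached_logits = model(
        input_ids=rear_input["input_ids"],
        attention_mask=encoded_input["attention_mask"],  # mask must cover cached + new tokens
        past_key_values=out.past_key_values,
        use_cache=True,
    ).logits[:, -1, :]

# a tiny nonzero difference here reflects accumulated numerical precision error, not a bug
print((full_logits - cached_logits).abs().max())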

slatter666 (Author) commented:

Thank you so much, that solves my issue.
