Inconsistent Output with and without Prompt Caching in Llama-3.1-8B-Instruct. #34164

Closed

giulio98 opened this issue Oct 14, 2024 · 2 comments

System Info

  • transformers version: 4.45.1
  • Platform: Linux-6.6.35-amd64-x86_64-with-glibc2.35
  • Python version: 3.11.6
  • Huggingface_hub version: 0.25.2
  • Safetensors version: 0.4.3
  • Accelerate version: 1.0.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: no
    • use_cpu: False
    • debug: False
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0,1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: NO
  • Using GPU in script?: YES
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@gante @ArthurZucker @itaza

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Generate responses with the cache, following the "Re-use Cache to continue generation" example from the documentation:
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Init StaticCache with big enough max-length
prompt_cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16)

INITIAL_PROMPT = "You are a helpful assistant. "
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")

with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]
responses = []
for prompt in prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    past_key_values = copy.deepcopy(prompt_cache)
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    response = tokenizer.batch_decode(outputs)[0]
    responses.append(response)

print(responses)
  2. Observed output:
['<|begin_of_text|>You are a helpful assistant. Help me to write a blogpost about travelling.  I have some ideas, but I’ts not clear how to structure the post.  I',
 '<|begin_of_text|>You are a helpful assistant. What is the capital of France? Paris.  is the capital of the United States? Washington D.C.  is the capital of']
  3. Generate responses without the cache:
responses = []
for prompt in prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**new_inputs, max_new_tokens=20, use_cache=False)
    response = tokenizer.batch_decode(outputs)[0]
    responses.append(response)

print(responses)
  4. Observed output:
['<|begin_of_text|>You are a helpful assistant. Help me to write a blogpost about travelling. Here’s what I need to write about:\nTitle: “The Magic of Exploring New Places:',
 '<|begin_of_text|>You are a helpful assistant. What is the capital of France? Paris.\nWhat is the capital of Australia? Canberra.\nWhat is the capital of Brazil? Brasília']

Expected behavior

The output without cache should be exactly the same as the one that uses the cache.
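For reference, a minimal sanity check of this expectation could look like the sketch below. This is not part of the original report; it reuses the model, tokenizer, INITIAL_PROMPT, and prompts defined in the reproduction, and forces greedy decoding so an exact comparison is meaningful (the model's default generation config may enable sampling).

import torch

for prompt in prompts:
    inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    # Greedy decoding so both runs are deterministic and comparable.
    with_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=True)
    without_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=False)
    # If caching were exactly equivalent, the generated token ids would match.
    print(torch.equal(with_cache, without_cache))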

gante (Member) commented Oct 17, 2024

Hi @giulio98 👋

> The output without cache should be exactly the same as the one that uses the cache.

This statement is not true, see this comment :) However, the results should be similar, which doesn't seem to be the case -- I'm going to have a look.
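To illustrate the point, here is a sketch that is not part of the thread; it reuses model, tokenizer, INITIAL_PROMPT, prompts, and inputs_initial_prompt from the reproduction above. Prefilling part of the prompt into a cache and then continuing is mathematically equivalent to a single forward pass over the full prompt, so the last-token logits should agree up to small floating-point differences rather than bit-for-bit.

import torch
from transformers import DynamicCache

full = tokenizer(INITIAL_PROMPT + prompts[0], return_tensors="pt").to("cuda")
# Split point taken from the cached prefix; any split of the same token ids would do.
prefix_len = inputs_initial_prompt["input_ids"].shape[1]

with torch.no_grad():
    # Whole prompt in one forward pass (no cache reuse).
    logits_full = model(**full).logits[:, -1]

    # Prefill the prefix into a cache, then run the remaining tokens on top of it.
    cache = DynamicCache()
    cache = model(input_ids=full["input_ids"][:, :prefix_len], past_key_values=cache).past_key_values
    logits_split = model(input_ids=full["input_ids"][:, prefix_len:], past_key_values=cache).logits[:, -1]

print("max abs diff:", (logits_full - logits_split).abs().max().item())
print("same next token:", torch.equal(logits_full.argmax(-1), logits_split.argmax(-1)))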


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
