Inconsistent Output with and without Prompt Caching in Llama-3.1-8B-Instruct. #34164

Closed

giulio98 opened this issue Oct 14, 2024 · 2 comments

System Info

  • transformers version: 4.45.1
  • Platform: Linux-6.6.35-amd64-x86_64-with-glibc2.35
  • Python version: 3.11.6
  • Huggingface_hub version: 0.25.2
  • Safetensors version: 0.4.3
  • Accelerate version: 1.0.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: no
    • use_cpu: False
    • debug: False
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0,1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: NO
  • Using GPU in script?: YES
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@gante @ArthurZucker @itaza

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Generate responses with the cache, following the "Re-use Cache to continue generation" example from the documentation:
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Init StaticCache with big enough max-length
prompt_cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16)

INITIAL_PROMPT = "You are a helpful assistant. "
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")

with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]
responses = []
for prompt in prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    past_key_values = copy.deepcopy(prompt_cache)
    outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
    response = tokenizer.batch_decode(outputs)[0]
    responses.append(response)

print(responses)
  2. Observed output:
['<|begin_of_text|>You are a helpful assistant. Help me to write a blogpost about travelling.  I have some ideas, but I’ts not clear how to structure the post.  I',
 '<|begin_of_text|>You are a helpful assistant. What is the capital of France? Paris.  is the capital of the United States? Washington D.C.  is the capital of']
  3. Generate responses without the cache:
responses = []
for prompt in prompts:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**new_inputs, max_new_tokens=20, use_cache=False)
    response = tokenizer.batch_decode(outputs)[0]
    responses.append(response)

print(responses)
  4. Observed output:
['<|begin_of_text|>You are a helpful assistant. Help me to write a blogpost about travelling. Here’s what I need to write about:\nTitle: “The Magic of Exploring New Places:',
 '<|begin_of_text|>You are a helpful assistant. What is the capital of France? Paris.\nWhat is the capital of Australia? Canberra.\nWhat is the capital of Brazil? Brasília']

Expected behavior

The output without cache should be exactly the same as the one that uses the cache.
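For reference, a minimal sanity check of this expectation could look like the sketch below. This is not part of the original report; it reuses the model, tokenizer, INITIAL_PROMPT, and prompts defined in the reproduction, and forces greedy decoding so an exact comparison is meaningful (the model's default generation config may enable sampling).

import torch

for prompt in prompts:
    inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
    # Greedy decoding so both runs are deterministic and comparable.
    with_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=True)
    without_cache = model.generate(**inputs, max_new_tokens=20, do_sample=False, use_cache=False)
    # If caching were exactly equivalent, the generated token ids would match.
    print(torch.equal(with_cache, without_cache))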

gante (Member) commented Oct 17, 2024

Hi @giulio98 👋

> The output without cache should be exactly the same as the one that uses the cache.

This statement is not true, see this comment :) However, the results should be similar, which doesn't seem to be the case -- I'm going to have a look.
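To illustrate the point, here is a sketch that is not part of the thread; it reuses model, tokenizer, INITIAL_PROMPT, prompts, and inputs_initial_prompt from the reproduction above. Prefilling part of the prompt into a cache and then continuing is mathematically equivalent to a single forward pass over the full prompt, so the last-token logits should agree up to small floating-point differences rather than bit-for-bit.

import torch
from transformers import DynamicCache

full = tokenizer(INITIAL_PROMPT + prompts[0], return_tensors="pt").to("cuda")
# Split point taken from the cached prefix; any split of the same token ids would do.
prefix_len = inputs_initial_prompt["input_ids"].shape[1]

with torch.no_grad():
    # Whole prompt in one forward pass (no cache reuse).
    logits_full = model(**full).logits[:, -1]

    # Prefill the prefix into a cache, then run the remaining tokens on top of it.
    cache = DynamicCache()
    cache = model(input_ids=full["input_ids"][:, :prefix_len], past_key_values=cache).past_key_values
    logits_split = model(input_ids=full["input_ids"][:, prefix_len:], past_key_values=cache).logits[:, -1]

print("max abs diff:", (logits_full - logits_split).abs().max().item())
print("same next token:", torch.equal(logits_full.argmax(-1), logits_split.argmax(-1)))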


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
