
Sampled/inconsistent output despite do_sample set to False #448

Closed
tahirazim opened this issue Jan 30, 2024 · 13 comments

@tahirazim

tahirazim commented Jan 30, 2024

Despite do_sample being set to False, we occasionally (1-2% of the time) see TGI running Llama-7B models on Inf2/SageMaker return different outputs for identical inputs. It seems sampling is happening even when TGI is asked not to sample.

The following simple piece of code should reproduce the problem within 300-400 iterations:

import requests

# TGI_LLAMA_7B_URL should point at the deployed TGI endpoint serving the Llama-7B model.
TGI_LLAMA_7B_URL = "<your-tgi-endpoint-url>"

request_parameters = {
    "best_of": None,
    "max_new_tokens": 64,
    "return_full_text": False,
    "do_sample": False,
}

prompt = "Write code to implement the merge sort algorithm."
response = requests.post(TGI_LLAMA_7B_URL, json = {'inputs': prompt, 'parameters': request_parameters})
response_text = response.json()[0]["generated_text"]

i=0
while True:
    response = requests.post(TGI_LLAMA_7B_URL, json = {'inputs': prompt, 'parameters': request_parameters})
    new_response_text = response.json()[0]["generated_text"]
    
    if response_text != new_response_text:
        print("Problem", i)
        print(response_text)
        print(new_response_text)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

I'm running a TGI Docker image built from source from this repository at the following commit: 3b3afa4dad

@tahirazim changed the title from "Sampled/inconsistent output despite do_sampling set to False" to "Sampled/inconsistent output despite do_sample set to False" on Jan 30, 2024
@dacorvo
Collaborator

dacorvo commented Jan 30, 2024

Can you try to reproduce this on a model without using TGI? Just repeatedly call generate like you are doing here. This will make it easier to sort things out.

@jimburtoft
Contributor

I'm using the code below to replicate this without TGI, using CodeLlama. Still running.

from optimum.neuron import NeuronModelForCausalLM, pipeline
from transformers import AutoTokenizer

# num_cores should be changed based on the instance. An inf2.24xlarge has 6 Neuron processors
# (two cores each), so 12 cores total. Larger models need more cores. You can make your model
# smaller by changing fp16 to f8. Some models may require num_cores to be a power of 2.
compiler_args = {"num_cores": 2, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# Put in the model name from Hugging Face. The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt, max_new_tokens=64, do_sample=False, best_of=None)

i = 0
while True:
    response = p(prompt, max_new_tokens=64, do_sample=False, best_of=None)

    if gold_response != response:
        print("Problem", i)
        print(gold_response)
        print(response)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

@gante
Member

gante commented Jan 30, 2024

Hi there 👋

Batching, which TGI does under the hood, may change the output of the models even with do_sample=False (which runs a deterministic algorithm). This property is also present in transformers and in APIs like OpenAI's. I've written up the technical details behind it in this comment 🤗

It explains why @tahirazim's script fails (dynamic batching with TGI) and @jimburtoft's doesn't (static batch size with transformers).
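
For anyone who wants to see the effect outside of TGI, here is a minimal sketch (an illustration only, using vanilla transformers and a small stand-in model rather than the Neuron stack discussed above) comparing greedy generation of a prompt on its own against the same prompt left-padded inside a batch next to a longer request:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # assumption: any small causal LM is enough to illustrate the point
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from the real tokens
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Write code to implement the merge sort algorithm."
longer_request = prompt + " Please include detailed comments and a short complexity analysis."

# 1) Greedy generation with the prompt alone (batch of one, no padding).
single = tokenizer(prompt, return_tensors="pt")
out_single = model.generate(**single, do_sample=False, max_new_tokens=32,
                            pad_token_id=tokenizer.pad_token_id)

# 2) Greedy generation with the same prompt batched next to a longer request,
#    which forces the shorter sequence to be padded.
batch = tokenizer([prompt, longer_request], return_tensors="pt", padding=True)
out_batch = model.generate(**batch, do_sample=False, max_new_tokens=32,
                           pad_token_id=tokenizer.pad_token_id)

print(tokenizer.decode(out_single[0], skip_special_tokens=True))
print(tokenizer.decode(out_batch[0], skip_special_tokens=True))  # may differ from the line above

In full fp32 on CPU the two decodes often match exactly; in fp16/bf16 or on accelerators the padded batch is more likely to diverge, which is the numerical effect described above.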

@dacorvo
Collaborator

dacorvo commented Jan 31, 2024

I agree that dynamic batching will introduce padding, which in turn might lead to subtle differences.
However, @tahirazim's script seems synchronous, which means the TGI server would process one request at a time: hence no batching and no padding.
@tahirazim, can you confirm this (and also that you're the only one accessing the TGI server during your test)?
@jimburtoft, did you reproduce the issue with static batching?

@tahirazim
Author

I tested calling a TGI-hosted Llama-7B model on Inf2 with batch_size=1 and MAX_CONCURRENT_REQUESTS=1, and it always returns identical outputs when given identical inputs with do_sample set to False.

I've also deployed the model to SageMaker, where multiple clients invoke the model concurrently, with random exponential backoff whenever a 429 is returned (MAX_CONCURRENT_REQUESTS is still set to 1). TGI now behaves exactly as expected.

So it does seem like the problem is with TGI's continuous/dynamic batching.
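
For reference, this is roughly the kind of SageMaker deployment configuration being described; a sketch only, with the role ARN, image URI, model id and instance type as placeholders rather than values taken from this thread:

from sagemaker.huggingface import HuggingFaceModel

role = "<your-sagemaker-execution-role-arn>"                       # placeholder
image_uri = "<huggingface-neuronx TGI image URI for your region>"  # placeholder

env = {
    "HF_MODEL_ID": "<your-llama-7b-model-id>",  # placeholder
    "MAX_CONCURRENT_REQUESTS": "1",  # a single in-flight request, so no dynamic batching
    "MAX_BATCH_SIZE": "1",           # assumption: static batch of one on the Neuron device
    "MAX_INPUT_LENGTH": "1024",
    "MAX_TOTAL_TOKENS": "2048",
}

model = HuggingFaceModel(image_uri=image_uri, env=env, role=role)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=1800,
)
print(predictor.predict({
    "inputs": "Write code to implement the merge sort algorithm.",
    "parameters": {"do_sample": False, "max_new_tokens": 64, "return_full_text": False},
}))

Clients then retry with exponential backoff on 429 responses, as described above, since the endpoint only accepts one request at a time.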

@gante
Member

gante commented Jan 31, 2024

So it does seem like the problem is with TGI's continuous/dynamic batching.

I'd rephrase "TGI's continuous/dynamic batching" to "continuous/dynamic batching" 😄

@jimburtoft
Contributor

@dacorvo I tried the code I sent and a few minor variations. I let it run 600-2000 times over multiple runs and never saw a difference. Is there an easy way to test dynamic batching with the pipeline alone to confirm, or do we need multiple simultaneous requests? I think @tahirazim confirmed it with his test, but I'm happy to run anything that might be helpful.

@dacorvo
Collaborator

dacorvo commented Jan 31, 2024

You can easily test the effect of dynamic batching by encoding, in the same batch, the prompt and the prompt plus a truncated gold_response, to simulate a generation in progress (I wonder how many generated tokens it takes before differences start to appear).
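
Something like the following sketch, assuming the model is re-exported with batch_size=2 (the snippets above use batch_size=1) and that the prompt plus a few words of gold_response stands in for a generation in progress:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_to_test = "codellama/CodeLlama-7b-hf"
compiler_args = {"num_cores": 12, "auto_cast_type": "fp16"}
input_shapes = {"batch_size": 2, "sequence_length": 2048}
model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so both sequences end at their last real token

prompt = "Write code to implement the merge sort algorithm."
in_progress = prompt + " Here is a merge sort"  # stand-in for prompt + truncated gold_response

# Encode both requests in the same batch: the bare prompt gets padded up to the longer one.
batch = tokenizer([prompt, in_progress], return_tensors="pt", padding=True)
outputs = model.generate(**batch, do_sample=False, max_new_tokens=64)

# Compare the padded prompt's continuation against the batch-of-one gold_response.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))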

@jimburtoft
Contributor

@dacorvo Wonder no longer. 2 words in, but it comes and goes.

I'm having some problems comparing the results automatically, but thankfully I noticed a difference manually, and it is consistently at the same place. As the prompt gets longer, it eventually goes away.

from optimum.neuron import NeuronModelForCausalLM, pipeline
from transformers import AutoTokenizer

# num_cores should be changed based on the instance. An inf2.24xlarge has 6 Neuron processors
# (two cores each), so 12 cores total. Larger models need more cores. You can make your model
# smaller by changing fp16 to f8. Some models may require num_cores to be a power of 2.
compiler_args = {"num_cores": 12, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# Put in the model name from Hugging Face. The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
p("My favorite place on earth is", max_new_tokens=10, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt, max_new_tokens=65, do_sample=False, best_of=None, return_full_text=True)
print(gold_response)
gold_response = gold_response[0]["generated_text"]

# The generated continuation only (no prompt), used to extend the prompt word by word below.
to_concatenate = p(prompt, max_new_tokens=65, do_sample=False, best_of=None, return_full_text=False)
to_concatenate = to_concatenate[0]["generated_text"]

print(gold_response)

i = 0
problem = 0
while problem == 0:
    concatenated_prompt = prompt
    for word in to_concatenate.split(" "):
        # Simulate a generation in progress by appending the gold continuation one word at a time.
        concatenated_prompt = concatenated_prompt + " " + word
        response = p(concatenated_prompt, max_new_tokens=75, do_sample=False, best_of=None, return_full_text=True)
        print("Word:", word)
        new_response_text = response[0]["generated_text"]
        # Check that the new output still starts with the gold response.
        if gold_response != new_response_text[:len(gold_response)]:
            print("Problem", i)
            print("Original response:")
            print(gold_response)
            print("New response:")
            print(new_response_text[:len(gold_response)])
            problem = 1
        i = i + 1

Results from an inf2.24xlarge:

Word: Approach
Problem 1
Original response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![merge sort](../../assets/merge-sort.jpg)

### Code

```javascript
function merge
New response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![Whiteboard](./assets/merge-sort.jpg)

### Code

```javascript
const mergeSort =

@dacorvo
Collaborator

dacorvo commented Feb 2, 2024

@jimburtoft thank you for checking that out: this confirms that even with a batch size of two and padding of one the problem arises.
The only solution if you want something deterministic for the whole generation would be to forget about continuous batching, use multiple TGI servers each accepting only one request, and do the load-balancing at the SageMaker level (see https://aws.amazon.com/de/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).
cc @philschmid

@dacorvo
Collaborator

dacorvo commented Feb 2, 2024

Another option would be to take a completely different approach and cache the results in a front-end, but that is not really a TGI issue then.
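
A minimal sketch of that caching idea, assuming the same endpoint and request format as in the first snippet (the in-process dict is illustrative only, not a production cache):

import json
import requests

TGI_LLAMA_7B_URL = "<your-tgi-endpoint-url>"  # placeholder
_cache = {}

def cached_generate(prompt, parameters):
    # Key on the prompt plus a canonical serialization of the generation parameters.
    key = (prompt, json.dumps(parameters, sort_keys=True))
    if key not in _cache:
        response = requests.post(
            TGI_LLAMA_7B_URL,
            json={"inputs": prompt, "parameters": parameters},
        )
        _cache[key] = response.json()[0]["generated_text"]
    return _cache[key]

# Repeated identical requests now return the first answer verbatim, whatever the server does.
params = {"do_sample": False, "max_new_tokens": 64, "return_full_text": False, "best_of": None}
print(cached_generate("Write code to implement the merge sort algorithm.", params))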

@dacorvo dacorvo self-assigned this Mar 27, 2024
@HuggingFaceDocBuilderDev

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

