
Sampled/inconsistent output despite do_sample set to False #448

Closed
tahirazim opened this issue Jan 30, 2024 · 13 comments

@tahirazim

tahirazim commented Jan 30, 2024

Despite do_sample being set to False, we occasionally (1-2% of the time) see TGI running Llama-7B models on Inf2/SageMaker return different outputs for identical inputs. It seems sampling is happening even when TGI is asked not to sample.

The following simple piece of code should reproduce the problem within 300-400 iterations:

import requests

# TGI_LLAMA_7B_URL should point at the deployed TGI endpoint serving the Llama-7B model.
TGI_LLAMA_7B_URL = "<your-tgi-endpoint-url>"

request_parameters = {
    "best_of": None,
    "max_new_tokens": 64,
    "return_full_text": False,
    "do_sample": False,
}

prompt = "Write code to implement the merge sort algorithm."
response = requests.post(TGI_LLAMA_7B_URL, json = {'inputs': prompt, 'parameters': request_parameters})
response_text = response.json()[0]["generated_text"]

i=0
while True:
    response = requests.post(TGI_LLAMA_7B_URL, json = {'inputs': prompt, 'parameters': request_parameters})
    new_response_text = response.json()[0]["generated_text"]
    
    if response_text != new_response_text:
        print("Problem", i)
        print(response_text)
        print(new_response_text)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

I'm running a TGI Docker image built from source from this repository at the following commit: 3b3afa4dad

@tahirazim changed the title from "Sampled/inconsistent output despite do_sampling set to False" to "Sampled/inconsistent output despite do_sample set to False" on Jan 30, 2024
@dacorvo
Collaborator

dacorvo commented Jan 30, 2024

Can you try to reproduce this on a model without using TGI? Just repeatedly call generate like you are doing here. This will make it easier to sort things out.

@jimburtoft
Contributor

I'm using the code below to replicate this without TGI, using CodeLlama. Still running.

from optimum.neuron import NeuronModelForCausalLM, pipeline
from transformers import AutoTokenizer

# num_cores should be changed based on the instance. An inf2.24xlarge has 6 Neuron processors
# (two cores each), so 12 cores total. Larger models need more cores. You can make your model
# smaller by changing fp16 to f8. Some models may require num_cores to be a power of 2.
compiler_args = {"num_cores": 2, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# Put in the model name from Hugging Face. The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt, max_new_tokens=64, do_sample=False, best_of=None)

i = 0
while True:
    response = p(prompt, max_new_tokens=64, do_sample=False, best_of=None)

    if gold_response != response:
        print("Problem", i)
        print(gold_response)
        print(response)
        break
    else:
        i = i + 1
        if i % 10 == 0:
            print("Good", i)

@gante
Member

gante commented Jan 30, 2024

Hi there 👋

Batching, which TGI does under the hood, may change the output of the models even with do_sample=False (which runs a deterministic algorithm). This property is also present in transformers and in APIs like OpenAI's. I've written up the technical details behind it in this comment 🤗

It explains why @tahirazim's script fails (dynamic batching with TGI) and @jimburtoft's doesn't (static batch size with transformers).
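
For anyone who wants to see the effect outside of TGI, here is a minimal sketch (an illustration only, using vanilla transformers and a small stand-in model rather than the Neuron stack discussed above) comparing greedy generation of a prompt on its own against the same prompt left-padded inside a batch next to a longer request:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # assumption: any small causal LM is enough to illustrate the point
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from the real tokens
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Write code to implement the merge sort algorithm."
longer_request = prompt + " Please include detailed comments and a short complexity analysis."

# 1) Greedy generation with the prompt alone (batch of one, no padding).
single = tokenizer(prompt, return_tensors="pt")
out_single = model.generate(**single, do_sample=False, max_new_tokens=32,
                            pad_token_id=tokenizer.pad_token_id)

# 2) Greedy generation with the same prompt batched next to a longer request,
#    which forces the shorter sequence to be padded.
batch = tokenizer([prompt, longer_request], return_tensors="pt", padding=True)
out_batch = model.generate(**batch, do_sample=False, max_new_tokens=32,
                           pad_token_id=tokenizer.pad_token_id)

print(tokenizer.decode(out_single[0], skip_special_tokens=True))
print(tokenizer.decode(out_batch[0], skip_special_tokens=True))  # may differ from the line above

In full fp32 on CPU the two decodes often match exactly; in fp16/bf16 or on accelerators the padded batch is more likely to diverge, which is the numerical effect described above.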

@dacorvo
Collaborator

dacorvo commented Jan 31, 2024

I agree that dynamic batching will introduce padding, which in turn might lead to subtle differences.
However, @tahirazim's script seems synchronous, which means the TGI server would process one request at a time: hence no batching and no padding.
@tahirazim, can you confirm this (and also that you're the only one accessing the TGI server during your test)?
@jimburtoft, did you reproduce the issue with static batching?

@tahirazim
Author

I tested calling a TGI-hosted Llama-7B model on Inf2 with batch_size=1 and MAX_CONCURRENT_REQUESTS=1, and it always returns identical outputs when given identical inputs with do_sample set to False.

I've also deployed the model to SageMaker, where multiple clients invoke the model concurrently, with random exponential backoff whenever a 429 is returned (MAX_CONCURRENT_REQUESTS is still set to 1). TGI now behaves exactly as expected.

So it does seem like the problem is with TGI's continuous/dynamic batching.
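
For reference, this is roughly the kind of SageMaker deployment configuration being described; a sketch only, with the role ARN, image URI, model id and instance type as placeholders rather than values taken from this thread:

from sagemaker.huggingface import HuggingFaceModel

role = "<your-sagemaker-execution-role-arn>"                       # placeholder
image_uri = "<huggingface-neuronx TGI image URI for your region>"  # placeholder

env = {
    "HF_MODEL_ID": "<your-llama-7b-model-id>",  # placeholder
    "MAX_CONCURRENT_REQUESTS": "1",  # a single in-flight request, so no dynamic batching
    "MAX_BATCH_SIZE": "1",           # assumption: static batch of one on the Neuron device
    "MAX_INPUT_LENGTH": "1024",
    "MAX_TOTAL_TOKENS": "2048",
}

model = HuggingFaceModel(image_uri=image_uri, env=env, role=role)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=1800,
)
print(predictor.predict({
    "inputs": "Write code to implement the merge sort algorithm.",
    "parameters": {"do_sample": False, "max_new_tokens": 64, "return_full_text": False},
}))

Clients then retry with exponential backoff on 429 responses, as described above, since the endpoint only accepts one request at a time.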

@gante
Member

gante commented Jan 31, 2024

So it does seem like the problem is with TGI's continuous/dynamic batching.

I'd rephrase "TGI's continuous/dynamic batching" to "continuous/dynamic batching" 😄

@jimburtoft
Contributor

@dacorvo I tried the code I sent and a few minor variations. I let it run 600-2000 times over multiple runs and never saw a difference. Is there an easy way to test dynamic batching with the pipeline alone to confirm, or do we need multiple simultaneous requests? I think @tahirazim confirmed it with his test, but I'm happy to run anything that might be helpful.

@dacorvo
Collaborator

dacorvo commented Jan 31, 2024

You can easily test the effect of dynamic batching by encoding, in the same batch, the prompt and the prompt plus a truncated gold_response, to simulate a generation in progress (I wonder how many generated tokens it takes before differences start to appear).
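
Something like the following sketch, assuming the model is re-exported with batch_size=2 (the snippets above use batch_size=1) and that the prompt plus a few words of gold_response stands in for a generation in progress:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_to_test = "codellama/CodeLlama-7b-hf"
compiler_args = {"num_cores": 12, "auto_cast_type": "fp16"}
input_shapes = {"batch_size": 2, "sequence_length": 2048}
model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so both sequences end at their last real token

prompt = "Write code to implement the merge sort algorithm."
in_progress = prompt + " Here is a merge sort"  # stand-in for prompt + truncated gold_response

# Encode both requests in the same batch: the bare prompt gets padded up to the longer one.
batch = tokenizer([prompt, in_progress], return_tensors="pt", padding=True)
outputs = model.generate(**batch, do_sample=False, max_new_tokens=64)

# Compare the padded prompt's continuation against the batch-of-one gold_response.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))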

@jimburtoft
Contributor

@dacorvo Wonder no longer. 2 words in, but it comes and goes.

I'm having some problems comparing the results automatically, but thankfully I noticed a difference manually, and it is consistently at the same place. As the prompt gets longer, it eventually goes away.

from optimum.neuron import NeuronModelForCausalLM, pipeline
from transformers import AutoTokenizer

# num_cores should be changed based on the instance. An inf2.24xlarge has 6 Neuron processors
# (two cores each), so 12 cores total. Larger models need more cores. You can make your model
# smaller by changing fp16 to f8. Some models may require num_cores to be a power of 2.
compiler_args = {"num_cores": 12, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# Put in the model name from Hugging Face. The example model comes from https://huggingface.co/codellama/CodeLlama-7b-hf
model_to_test = "codellama/CodeLlama-7b-hf"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
p("My favorite place on earth is", max_new_tokens=10, do_sample=True, top_k=50)

prompt = "Write code to implement the merge sort algorithm."
gold_response = p(prompt, max_new_tokens=65, do_sample=False, best_of=None, return_full_text=True)
print(gold_response)
gold_response = gold_response[0]["generated_text"]

# The generated continuation only (no prompt), used to extend the prompt word by word below.
to_concatenate = p(prompt, max_new_tokens=65, do_sample=False, best_of=None, return_full_text=False)
to_concatenate = to_concatenate[0]["generated_text"]

print(gold_response)

i = 0
problem = 0
while problem == 0:
    concatenated_prompt = prompt
    for word in to_concatenate.split(" "):
        # Simulate a generation in progress by appending the gold continuation one word at a time.
        concatenated_prompt = concatenated_prompt + " " + word
        response = p(concatenated_prompt, max_new_tokens=75, do_sample=False, best_of=None, return_full_text=True)
        print("Word:", word)
        new_response_text = response[0]["generated_text"]
        # Check that the new output still starts with the gold response.
        if gold_response != new_response_text[:len(gold_response)]:
            print("Problem", i)
            print("Original response:")
            print(gold_response)
            print("New response:")
            print(new_response_text[:len(gold_response)])
            problem = 1
        i = i + 1

Results from an inf2.24xlarge:

Word: Approach
Problem 1
Original response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![merge sort](../../assets/merge-sort.jpg)

### Code

```javascript
function merge
New response:
Write code to implement the merge sort algorithm.

## Approach & Efficiency

### Big O

- Time: O(n log n)
- Space: O(n)

## Solution

![Whiteboard](./assets/merge-sort.jpg)

### Code

```javascript
const mergeSort =

@dacorvo
Collaborator

dacorvo commented Feb 2, 2024

@jimburtoft thank you for checking that out: this confirms that even with a batch size of two and padding of one the problem arises.
The only solution if you want something deterministic for the whole generation would be to forget about continuous batching, use multiple TGI servers each accepting only one request, and do the load-balancing at the SageMaker level (see https://aws.amazon.com/de/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).
cc @philschmid

@dacorvo
Collaborator

dacorvo commented Feb 2, 2024

Another option would be to take a completely different approach and cache the results in a front-end, but that is not really a TGI issue then.
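
A minimal sketch of that caching idea, assuming the same endpoint and request format as in the first snippet (the in-process dict is illustrative only, not a production cache):

import json
import requests

TGI_LLAMA_7B_URL = "<your-tgi-endpoint-url>"  # placeholder
_cache = {}

def cached_generate(prompt, parameters):
    # Key on the prompt plus a canonical serialization of the generation parameters.
    key = (prompt, json.dumps(parameters, sort_keys=True))
    if key not in _cache:
        response = requests.post(
            TGI_LLAMA_7B_URL,
            json={"inputs": prompt, "parameters": parameters},
        )
        _cache[key] = response.json()[0]["generated_text"]
    return _cache[key]

# Repeated identical requests now return the first answer verbatim, whatever the server does.
params = {"do_sample": False, "max_new_tokens": 64, "return_full_text": False, "best_of": None}
print(cached_generate("Write code to implement the merge sort algorithm.", params))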

@dacorvo dacorvo self-assigned this Mar 27, 2024
@HuggingFaceDocBuilderDev

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

