Sampled/inconsistent output despite do_sample set to False #448
Comments
Can you try to reproduce this on a model without using TGI? Just repeatedly call `generate` like you are doing here.
Using this code to replicate without TGI, using CodeLlama. Still running.
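A minimal sketch of that kind of loop (not the original script; the checkpoint, prompt and iteration count are placeholders, and it uses plain transformers rather than the neuron pipeline):

```python
# Repeatedly run greedy generation on the same prompt and check that every output
# is identical. Checkpoint, prompt and iteration count are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed checkpoint
prompt = "def fibonacci(n):"            # placeholder prompt

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
reference = None
for i in range(2000):
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    if reference is None:
        reference = text
    elif text != reference:
        print(f"Mismatch at iteration {i}")
        break
else:
    print("All outputs identical")
```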
Hi there 👋 Batching, which TGI does under the hood, may change the output of the models, regardless of `do_sample`. It explains why @tahirazim's script fails (dynamic batching with TGI) and @jimburtoft's doesn't (static batch size with the pipeline).
I agree dynamic batching will introduce padding, which in turn might lead to subtle differences.
I tested calling a TGI-hosted Llama-7B model on Inf2. I've also deployed the model into Sagemaker, where multiple clients are invoking the model concurrently, with random exponential backoff in case a 429 is returned. So it does seem like the problem is with TGI's continuous/dynamic batching.
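A minimal sketch of that kind of concurrent client (the endpoint name, payload shape and error handling are assumptions, not the original setup):

```python
# Several threads send the same greedy request to a SageMaker endpoint, retry with
# random exponential backoff when a request is rejected, then compare the outputs.
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT = "tgi-llama-7b-inf2"  # hypothetical endpoint name
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"do_sample": False, "max_new_tokens": 64},
}

def invoke_with_backoff(max_retries: int = 6) -> str:
    for attempt in range(max_retries):
        try:
            resp = runtime.invoke_endpoint(
                EndpointName=ENDPOINT,
                ContentType="application/json",
                Body=json.dumps(payload),
            )
            return resp["Body"].read().decode("utf-8")
        except Exception:  # e.g. a 429 / throttling surfaced by the endpoint
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("endpoint kept rejecting the request")

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: invoke_with_backoff(), range(100)))

print(f"{len(set(results))} distinct outputs out of {len(results)} calls")
```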
I'd rephrase "TGI's continuous/dynamic batching" to "continuous/dynamic batching" 😄
@dacorvo I tried the code I sent and a few minor variations. I let it run 600-2000 times over multiple runs and never saw a difference. Is there an easy way to test dynamic batching with the pipeline alone to confirm, or do we need multiple simultaneous requests? I think @tahirazim confirmed it with his test, but I'm happy to run anything that might be helpful.
You can easily test the effect of dynamic batching by encoding in the same batch the prompt and the prompt plus a truncated `gold_response` to simulate a generation in progress (I wonder at what level of generated inputs we start seeing differences).
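A minimal sketch of that padding experiment (the checkpoint, prompt and `gold_response` are placeholders, and it uses plain transformers rather than the neuron pipeline):

```python
# Generate the prompt once on its own and once padded next to a longer "in-progress"
# sequence, then compare the two greedy continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed checkpoint
prompt = "Explain what dynamic batching is."
gold_response = "Dynamic batching groups incoming requests together so that ..."

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def greedy(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    out = model.generate(**enc, do_sample=False, max_new_tokens=64)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in out]

# Reference: the prompt alone, batch size 1 (no padding).
alone = greedy([prompt])[0]

# Same prompt, batched with a simulated generation in progress, so it gets padded.
n_words = 2  # how much of the gold response the other request has already "generated"
in_progress = prompt + " " + " ".join(gold_response.split()[:n_words])
batched = greedy([prompt, in_progress])[0]

print("identical" if alone == batched else "outputs differ")
```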
@dacorvo Wonder no longer. 2 words in, but it comes and goes. I'm having some problems comparing the results automatically, but thankfully I noticed a difference manually, and it is consistently at the same place. As the prompt gets longer, it eventually goes away.
Results from an inf2.24xlarge:
@jimburtoft thank you for checking that out: this confirms that even with a batch size of two and padding of one, the problem arises.
Another option would be to take a completely different approach and cache the results in a front-end, but that is not really a TGI issue then.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
Despite `do_sample` being set to `False`, we are occasionally (1-2% of the time) seeing TGI running Llama-7B models on INF2-Sagemaker return different outputs, despite being passed identical inputs. It seems there is sampling happening even when TGI is being asked not to. The following simple piece of code should reproduce the problem within 300-400 iterations:
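A minimal sketch of such a repro loop (the endpoint URL, prompt and request parameters are assumptions):

```python
# Send the same greedy request to a TGI /generate endpoint repeatedly and report
# the first iteration whose output differs from the first response.
import requests

URL = "http://localhost:8080/generate"  # assumed TGI endpoint
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"do_sample": False, "max_new_tokens": 64},
}

reference = None
for i in range(400):
    text = requests.post(URL, json=payload, timeout=120).json()["generated_text"]
    if reference is None:
        reference = text
    elif text != reference:
        print(f"Different output at iteration {i}:\n{text!r}\nvs\n{reference!r}")
        break
else:
    print("All outputs identical")
```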
I'm running a TGI Docker image built from source from this repository at the following commit: 3b3afa4dad