[Bug]: Different garbage output for the same prompt when inferred as a single sequence vs. concurrent requests on the vLLM OpenAI server, temp=0 (mixed batching in LongRoPE) #10336
Comments
https://docs.vllm.ai/en/latest/serving/faq.html#:~:text=In%20vLLM%2C%20the,divergence%20is%20likely. huggingface/transformers#25921 Weird how I never faced such a problem in older vLLM versions.
Maybe a similar issue: #9567
I've been looking into this some more. What I meant by "different" is that the model sometimes gives garbage output under batching but not under single-sequence inference. I tried running it with max-model-len 4096 and the garbage output issue was gone, so the cause is possibly the RoPE scaling or the FP8 KV cache in that particular model, phi-3-mini-128k-instruct.
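For reference, a minimal sketch of the workaround described above, using vLLM's offline `LLM` API with `max_model_len` capped at 4096 (the same cap can be passed to the OpenAI-compatible server via `--max-model-len 4096`). The model id, prompts, and sampling settings are placeholders:

```python
# Sketch of the workaround described above: cap max_model_len at the model's
# original_max_position_embeddings (4096 for phi-3-mini-128k-instruct) so that
# only one set of RoPE scaling factors is ever in play, even under batching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed model id
    max_model_len=4096,                          # matches original_max_position_embeddings
)
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["First placeholder prompt ...", "Second placeholder prompt ..."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The trade-off is obvious: the 128k context advertised by the model is given up in exchange for deterministic behaviour under concurrent requests.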
@jeejeelee It's because of mixed batching. Even if all sequences in a batch are longer than 4096 there is no garbage, and if all sequences are shorter than 4096 there is also no garbage; it only happens when the batch mixes the two. I think the commit @caiom mentioned describes the same thing.
Is there something we can do to avoid this, or any suggestion from your side?
@bhupendrathore I currently don't have any ideas; perhaps @DarkLight1337 could provide something more insightful.
@WoosukKwon may be more familiar with this part of the code.
I have the same problem with 0.6.3.post1. I run the model like this: `vllm serve neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic --tensor-parallel-size 2 --max-model-len 8192`. The max-model-len has been set to avoid the problem, but it is a shame.
@WoosukKwon any direction for me? The behaviour depends on the model's original_max_position_embeddings (4096 in my case), and mixed batches give garbage, which prevents me from using concurrency. If, at any one time, all prompts being inferred are shorter than 4096 or all are longer than 4096, then no garbage comes out. Is there anything I can change in Phi3LongRoPEScaledRotaryEmbedding to avoid this?
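To make the suspected mechanism concrete, here is an illustrative, hypothetical sketch (not the actual vLLM code) contrasting a batch-wide choice between short and long LongRoPE scaling factors with a per-sequence choice. All function names and the threshold constant are assumptions for illustration only:

```python
import torch

# Illustrative only -- NOT the vLLM implementation. A single batch-wide check for
# "any position beyond original_max_position_embeddings" flips every sequence in
# the batch to the long-context scaling factors, so a short prompt gets different
# rotary embeddings when it happens to be batched together with a long one.
ORIGINAL_MAX_POS = 4096

def pick_factors_batch_wide(positions: torch.Tensor,
                            short_factors: torch.Tensor,
                            long_factors: torch.Tensor) -> torch.Tensor:
    # One long prompt in the batch switches everyone to the long factors
    # (the suspected source of the mixed-batching garbage).
    if torch.any(positions >= ORIGINAL_MAX_POS):
        return long_factors
    return short_factors

def pick_factors_per_sequence(seq_len: int,
                              short_factors: torch.Tensor,
                              long_factors: torch.Tensor) -> torch.Tensor:
    # Each request keeps the factors matching its own length, which is the
    # behaviour a fix would need to preserve under mixed batching.
    return long_factors if seq_len >= ORIGINAL_MAX_POS else short_factors
```

Under this reading, the difference between single-sequence and concurrent serving is simply whether a short prompt ever shares a batch with a long one.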
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
vLLM version (the latest was failing due to other issues, such as not being able to decode):
0.6.1.post1
Hosted the model:
What could be the possible reason for this? Can the model's pad tokens affect it?
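A hypothetical reproduction sketch against an OpenAI-compatible vLLM server, comparing a short prompt served alone versus in flight together with a long prompt, both with greedy decoding. The server URL, model name, and prompts are assumptions, and whether the two in-flight requests actually land in the same batch depends on server-side scheduling:

```python
# Hypothetical repro sketch: same short prompt, temperature 0, once alone and
# once concurrently with a long prompt, against a vLLM OpenAI-compatible server
# assumed to be running at localhost:8000.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "microsoft/Phi-3-mini-128k-instruct"  # assumed model name

def complete(prompt: str) -> str:
    # Greedy decoding so any divergence is not explained by sampling noise.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=128,
    )
    return resp.choices[0].message.content

short_prompt = "Summarize the benefits of unit testing."  # well under 4096 tokens
long_prompt = "word " * 5000                               # pushes past 4096 tokens

# Baseline: the short prompt served on its own.
single_output = complete(short_prompt)

# Mixed load: short and long prompts in flight at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    mixed_short_output, _ = pool.map(complete, [short_prompt, long_prompt])

print("outputs match:", single_output == mixed_short_output)
```

If the two outputs for the short prompt differ (or the concurrent one is garbage), that matches the mixed-batching behaviour described in this issue.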