
[Bug]: different garbage output of same prompt when inferred with single sequence vs concurrent requests on vllm openai server, temp=0 (mixed batching in longrope) #10336

Open
1 task done
bhupendrathore opened this issue Nov 14, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@bhupendrathore

bhupendrathore commented Nov 14, 2024

Your current environment

The output of `python collect_env.py` was not provided.

Model Input Dumps

No response

🐛 Describe the bug

vLLM version (the latest was failing due to other issues, e.g. it could not decode): 0.6.1.post1
The model was hosted with:

CUDA_VISIBLE_DEVICES=0 python3 -m  vllm.entrypoints.openai.api_server --model csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850 --dtype bfloat16 --gpu-memory-utilization 0.9 --disable-log-requests --max-model-len 14000
import requests
import json
import time

VLLM_INFER_URL = "http://0.0.0.0:8000/v1/completions"

def infer_vllm(prompt: str, max_new_tokens=800, temp=0.0) -> str:
    '''Infer from the hosted vLLM server.'''
    payload = json.dumps({
        "model": "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850",
        "prompt": prompt,
        "temperature": temp,
        # "top_k": 50,
        "top_p": 1,
        "max_tokens": max_new_tokens
    })
    headers = {
        'Content-Type': 'application/json'
    }

    try:
        start_time = time.time()  # kept for timing; not used below
        response = requests.request("POST", VLLM_INFER_URL, headers=headers, data=payload)
        if response.status_code == 200:
            resp = json.loads(response.text)["choices"][0]["text"]
            return resp
        else:
            print(response.json())
            return "None"
    except requests.RequestException as e:
        # the original snippet's try block had no except clause; handle request errors here
        print(e)
        return "None"

from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

# `data` is a pandas DataFrame with a "prompt" column
prompts = data.prompt.tolist()

with ThreadPoolExecutor(max_workers=5) as executor:
    list_of_results5 = list(tqdm(executor.map(infer_vllm, prompts[:10]), total=len(prompts[:10])))

# first output sample - let's check the second response
print(list_of_results5[2])

# vs

print(infer_vllm(prompts[2]))

# The outputs are different. I initially thought this might be due to pad tokens, but I don't think so.

What could be the possible reason for that? Can the model's pad tokens affect this?
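To narrow it down, here is a minimal check (a sketch that assumes the infer_vllm helper and the prompts / list_of_results5 objects above): at temp=0, repeated single requests should be identical, so any mismatch only shows up against the concurrently obtained output.

# Sketch: repeated single-sequence runs should agree with each other at temp=0;
# compare them against the output obtained under concurrency for the same prompt.
single_runs = [infer_vllm(prompts[2]) for _ in range(3)]
print("single-sequence runs identical:", len(set(single_runs)) == 1)
print("matches concurrent output:", single_runs[0] == list_of_results5[2])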

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@bhupendrathore added the bug label Nov 14, 2024
@bhupendrathore
Author

bhupendrathore commented Nov 14, 2024

@jeejeelee
Collaborator

Maybe similar issue: #9567

@bhupendrathore
Author

bhupendrathore commented Nov 22, 2024

I've been looking into this some more. What I meant by "different" is that sometimes the model gives garbage output under batching but not under single-sequence inference.

I tried running it with --max-model-len 4096 and the garbage output issue was gone. The reason is possibly the RoPE scaling or the FP8 KV cache in that particular model, phi-3-mini-128k-instruct:
huggingface/transformers#33129 (when I infer with transformers, the output is garbage-free.)
#6135

@bhupendrathore
Author

bhupendrathore commented Nov 26, 2024

@jeejeelee it's because of mixed batching. If all sequences in a batch are longer than 4096 there is no garbage, and if they are all shorter than 4096 there is no garbage either. It only happens with mixed batches. I think the commit also mentions the same thing, where @caiom noted:

when a batch contains long and short sequences, it will always use long factor, even for short samples. Currently we don't support such mixed batches.

#4298 (comment)
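To illustrate the behavior described in that comment, here is a minimal sketch (hypothetical pseudocode for a LongRoPE-style factor selection, not vLLM's actual implementation) of why a short prompt in a mixed batch ends up rotated with the long factor:

# Hypothetical sketch of LongRoPE-style factor selection (not vLLM's actual code).
# original_max_position_embeddings is 4096 for phi-3-mini-128k-instruct.
ORIGINAL_MAX_POS = 4096

def pick_rescale_factor(seq_lens, short_factor, long_factor):
    """Pick the rescale factor for one batch of sequence lengths.

    If any sequence exceeds ORIGINAL_MAX_POS, the long factor is applied to the
    whole batch, including the short sequences -- the 'mixed batch' case that
    produces garbage for the short prompts.
    """
    if max(seq_lens) > ORIGINAL_MAX_POS:
        return long_factor  # applied to every sequence in the batch
    return short_factor

print(pick_rescale_factor([1000, 2000], "short", "long"))  # short: homogeneous short batch
print(pick_rescale_factor([8000, 9000], "short", "long"))  # long: homogeneous long batch
print(pick_rescale_factor([1000, 9000], "short", "long"))  # long, even for the 1000-token prompt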

Is there something we can do to avoid this, or any suggestion from your side?

@bhupendrathore changed the title from [Bug]: different output of same prompt when inferred with single sequence vs concurrent requests on vllm openai server, temp=0 to [Bug]: different garbage output of same prompt when inferred with single sequence vs concurrent requests on vllm openai server, temp=0 (mixed batching in longrope) Nov 26, 2024
@jeejeelee
Collaborator

@bhupendrathore I currently don't have any ideas - perhaps @DarkLight1337 could provide something more insightful

@DarkLight1337
Member

@WoosukKwon may be more familiar with this part of the code.

@Galigator

I have the same problem with 0.6.3.post1. I run the model like this: vllm serve neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic --tensor-parallel-size 2 --max-model-len 8192

The max-model-len has been set to avoid the problem... but it is a shame.

@bhupendrathore
Author

bhupendrathore commented Nov 29, 2024

@WoosukKwon any direction for me? It depends on model.original_max_position_embeddings (4096 in my case), and mixed batches produce garbage, which prevents me from using concurrency. If I infer only prompts < 4096 or only prompts > 4096 at a time, no garbage comes out. Is there anything I can change in Phi3LongRoPEScaledRotaryEmbedding to avoid this?
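In the meantime, one possible client-side workaround is to make sure short and long prompts are never in flight at the same time, so this client never contributes a mixed batch. A sketch (hypothetical helper, assuming the infer_vllm function above and that the merged model directory also contains the tokenizer):

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

ORIGINAL_MAX_POS = 4096  # original_max_position_embeddings of phi-3-mini-128k
tokenizer = AutoTokenizer.from_pretrained(
    "csp-phi-3-mini-128k-ft-outputs/qlora_merged_model_csp_phi-ckp-23850")

def infer_grouped(prompts, max_workers=5):
    """Run all short prompts first, then all long prompts, so short and long
    sequences from this client never share a decoding batch."""
    lengths = [len(tokenizer(p).input_ids) for p in prompts]
    short_idx = [i for i, n in enumerate(lengths) if n < ORIGINAL_MAX_POS]
    long_idx = [i for i, n in enumerate(lengths) if n >= ORIGINAL_MAX_POS]
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # the first map is fully consumed before the long prompts are submitted,
        # so the two waves never overlap
        for i, out in zip(short_idx, executor.map(infer_vllm, [prompts[j] for j in short_idx])):
            results[i] = out
        for i, out in zip(long_idx, executor.map(infer_vllm, [prompts[j] for j in long_idx])):
            results[i] = out
    return results

This only prevents mixed batches coming from a single client; other concurrent clients could still mix lengths on the server side.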

4 participants