llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592
Comments
How did you get the response? I'm struggling to figure out how to post a request to the running llama cpp server. Would you be able to provide an example? E.g. which URL (/v1/chat/completions?)
The URL I used is host:port/completions. What errors are you getting?
I got it to work! Thank you. For reference for anyone who finds this thread: the parts I was messing up were the PNG buffer format and the endpoint I was posting to.
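For anyone else landing here, a minimal sketch of such a request, assuming a server listening on localhost:8080, the /completion endpoint, and the image_data/prompt fields used in the code later in this thread (test.png and the image id 10 are placeholders):

import base64

import requests

# Read the image and send it as raw base64 in the "image_data" field;
# the [img-10] tag in the prompt refers to the entry whose id is 10.
with open("test.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "USER: [img-10] Describe this image.\nASSISTANT:",
    "image_data": [{"id": 10, "data": img_b64}],
    "n_predict": 256,
    "temperature": 0.1,
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
print(resp.json()["content"])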
@ggerganov Any updates here? When setting the images to the same id, the effect is basically the same as what you posted in #3677 (the PR for llava and batch processing in server). After setting different ids, it looks like only one slot (in my case, slot 0 again) has image data included in its inference. Here is an example:
Inference code:

import asyncio
import base64
import copy

from httpx import AsyncClient
from objprint import objprint

client = AsyncClient(timeout=3600)

URL = "http://127.0.0.1:8080/completion"
DATA = {
    "image_data": [],
    "n_predict": 400,
    "prompt": "",
    "repeat_last_n": 128,
    "repeat_penalty": 1.2,
    "slot_id": -1,
    "stop": ["</s>", "ASSISTANT:", "USER:"],
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.1,
}
SLOTS = 4
rq_count = 0


def construct_data(prompt, image, slot_id):
    # Pin the request to a slot and give its image a unique id (10 + slot id),
    # so the [img-N] tag in the prompt points at that slot's image.
    if slot_id == -1:
        slot_id = rq_count % SLOTS
    img_id = 10 + slot_id
    prompt = prompt.replace("<img>", f"[img-{img_id}]")
    with open(image, "rb") as f:
        img_str = base64.b64encode(f.read()).decode("utf-8")
    data = copy.deepcopy(DATA)
    data["image_data"] = [{"id": img_id, "data": img_str}]
    data["prompt"] = prompt
    data["slot_id"] = slot_id
    return data


async def rq_img(image):
    global rq_count
    data = construct_data(
        "USER: <img> Describe this Image with short sentence.\nASSISTANT:",
        image,
        -1,
    )
    rq_count += 1
    resp = await client.post(URL, json=data)
    try:
        resp = resp.json()
    except Exception:
        resp = resp.text
    return resp, data["slot_id"]


async def main():
    image = "./test.jpg"
    results = await asyncio.gather(*(rq_img(image) for _ in range(4)))
    for res in results:
        print(f"slot={res[1]}")
        objprint(res[0]["content"])
        print("\n\n")


asyncio.run(main())

And the result:
Server log is here:
No updates. Short term, we will drop multimodal support from the server.
I'm using a custom model and observe different results. Batching did provide the image to each slot, but it interfered with generation: slot 0 was as expected, while all the other slots responded in Simplified Chinese (which was very unexpected). At first I thought it was garbled output, but then I realized it translates to roughly the expected answer. So batch generation is working to some extent; it just somehow disrupts the generation process.
Thanks for the information!
Thank you for the update. I will try to fix the issue in my own time and let you know if there are any changes. Thank you for your work on llama cpp, it is amazing!
This issue was closed because it has been inactive for 14 days since being marked as stale.
Has multimodal support been re-introduced for the server?
When I try to do parallel inference for multimodal on the llama cpp server, I get the correct output for slot 0 but not for the other slots. Does that mean CLIP is only being loaded on one slot? I can see some CLIP layers failing to load.
Here is the llama cpp server command I use:
./server -m ../models/llava13b1_5/llava13b1_5_f16.gguf -c 40960 --n-gpu-layers 41 --port 8001 --mmproj ../models/llava13b1_5/llava13b1_5_mmproj_f16.gguf -np 10 -cb --host 0.0.0.0 --threads 24
The model I am using -
https://huggingface.co/mys/ggml_llava-v1.5-13b/tree/main
I am using the F16 model with the mmproj file.
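For context, a minimal sketch of the kind of parallel client used to exercise several slots at once, assuming the /completion endpoint and the slot_id/image_data request fields shown in the code earlier in this thread (the port matches the server command above; test.jpg is a placeholder):

import base64
import copy
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8001/completion"
BASE = {
    "n_predict": 128,
    "temperature": 0.1,
    "stop": ["</s>", "USER:"],
}


def ask(slot_id, image_path):
    # Pin each request to its own slot and give its image a unique id,
    # so the [img-N] tag cannot collide with another slot's image.
    img_id = 10 + slot_id
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    data = copy.deepcopy(BASE)
    data["slot_id"] = slot_id
    data["prompt"] = f"USER: [img-{img_id}] Describe this image.\nASSISTANT:"
    data["image_data"] = [{"id": img_id, "data": img_b64}]
    return requests.post(URL, json=data, timeout=600).json()["content"]


# Fire one request per slot to check whether every slot sees its image.
with ThreadPoolExecutor(max_workers=4) as pool:
    for slot, text in enumerate(pool.map(lambda s: ask(s, "test.jpg"), range(4))):
        print(f"slot {slot}: {text}")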
Documentation reference
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
My GPU specs
My CPU specs
Loading the llama cpp server for llava and using slot 0 for inference:
When using the other slots, i.e. parallel inferencing:
Prompt
The model_type parameter in my payload is only for a proxy server that reroutes all the requests.
The image looks like this: