server : parallel decoding and multimodal #3589
Conversation
You should rebase and adopt the latest sampling changes from: #3543 |
@KerfuffleV2 In the new way of working with samplers, each slot may have different sampling parameters. Because of that, instead of a function like:

llama_sampling_context llama_sampling_context_init(
        const struct gpt_params & params,
        llama_grammar * grammar) {
    llama_sampling_context result;
    result.params = params.sampling_params;
    result.grammar = grammar;
    return result;
}

it should be the following, to reduce headaches:

llama_sampling_context llama_sampling_context_init(
        const struct llama_sampling_params & params,
        llama_grammar * grammar) {
    llama_sampling_context result;
    result.params = params;
    result.grammar = grammar;
    return result;
}
|
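For illustration, a minimal sketch of why the second signature matters for this PR: with parallel slots, each slot can carry its own llama_sampling_params, so the init function should not be tied to a global gpt_params. This is an assumption-laden sketch, not the PR's actual code; the slot struct, field, and helper names are hypothetical, and only the proposed llama_sampling_context_init signature comes from the snippet above.

#include <vector>
// Assumes common/sampling.h from this branch, with the proposed
// llama_sampling_context_init(const llama_sampling_params &, llama_grammar *).

// Hypothetical per-slot state: each client/slot keeps its own sampling settings.
struct server_slot_sketch {
    llama_sampling_params  sparams;      // per-slot sampling parameters (hypothetical field)
    llama_sampling_context ctx_sampling; // per-slot sampling context
};

// Hypothetical helper: initialize every slot with its own parameters.
static void init_slot_samplers(std::vector<server_slot_sketch> & slots) {
    for (auto & slot : slots) {
        // With the llama_sampling_params overload there is no need to build a
        // throwaway gpt_params just to reach params.sampling_params.
        slot.ctx_sampling = llama_sampling_context_init(slot.sparams, /*grammar =*/ nullptr);
    }
}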
The |
Yes, I agree with that. Looks like there are also likely more changes coming: #3601
Maybe try compiling with address sanitizing turned on? (Support recently got added to normal |
It was a null pointer; I was accessing an object outside of a lambda function, which always confuses me because the compiler never throws any error in that case. |
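As a standalone illustration of that pitfall (not the server's actual code): capturing an object in a lambda and dereferencing a pointer that is still null compiles without any diagnostic, and the failure only shows up at runtime.

#include <functional>
#include <iostream>

struct task_state {
    int * result = nullptr; // meant to be set before the callback runs
};

int main() {
    task_state t; // t.result is still nullptr here
    std::function<void()> callback = [&t]() {
        // Compiles cleanly: the compiler cannot know the captured pointer is null,
        // so the error only appears as a crash (or worse) when the lambda runs.
        std::cout << *t.result << "\n";
    };
    callback(); // null dereference happens here, at runtime
    return 0;
}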
@ggerganov I need to know what to do when the context gets crowded, given that the server is being used by multiple clients who share the entire context window. I assume the context swap should only occur when a single slot has fully used its portion of the context, or else there should be a per-slot context limit. Separately, the embeddings don't return the same values as the embedding example for some reason. I've already checked the embedding pipeline and compared it with the server implementation, but I'm not getting good results. |
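A rough sketch of the first option (names like n_past and n_ctx_slot are assumptions here, not the PR's actual fields): trigger a context swap only when that particular slot has exhausted its own share of the context, rather than when the shared KV cache as a whole is full.

// Hypothetical per-slot bookkeeping; field names are illustrative only.
struct slot_context_state {
    int n_past     = 0; // tokens currently cached for this slot
    int n_ctx_slot = 0; // this slot's share of the total context window
};

// A swap (dropping this slot's oldest tokens) is justified only when adding the
// next batch would overflow this slot's own budget; other clients are untouched.
static bool slot_needs_context_swap(const slot_context_state & slot, int n_tokens_to_add) {
    return slot.n_past + n_tokens_to_add > slot.n_ctx_slot;
}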
Am I wrong?

Content:
{
    "content": "Which is your dog?"
}

PR server:
{
"embedding": [
0.0013885498046875,
0.02568817138671875,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
0.0013885498046875,
-0.02291107177734375,
-0.02291107177734375,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
]
}

master server:
{
"embedding": [
0.0013885498046875,
0.02568817138671875,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
0.0013885498046875,
-0.02291107177734375,
-0.02291107177734375,
-0.02291107177734375,
0.0013885498046875,
0.0013885498046875,
]
}

Embedding example:
2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022 -2.255088 0.348395 -0.843356 0.270982 -1.236908 -0.836289 -1.600230 -0.213995 -0.724451 1.241112 0.571537 -1.612993 -0.924750 -1.141107 0.301533 0.476756 0.925662 -0.778871 0.644059 -2.298353 -1.250245 0.888944 1.218905 1.417424 -1.571156 0.380106 1.282110 1.155357 0.254458 -0.391138 4.462999 -2.220449 -0.675895 2.643535 -0.059323 -0.518516 -1.783109 1.390705 1.135090 2.257272 ....

Edit: This is a problem when I offload the model to CUDA; related issue: #3625 |
@ggerganov @monatis The server example now supports the LLaVA model.
Params:
{
"stream": false,
"n_predict": -1,
"temperature": 0.1,
"prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: [(img)]describe the image in detail.\nASSISTANT:",
"image_data": "image data in base64",
"stop": ["</s>"]
}

Response:
The image features a desk with two computer monitors on it. One of the monitors has an unusual decoration: a small toy lamb figurine is sitting on top of it, adding a playful touch to the workspace. A keyboard and mouse are also present on the desk, indicating that this is likely a computer setup.

In addition to the main desk area, there's another smaller desk visible in the background. The room appears to be well-equipped for work or study purposes, with various items such as books and a TV placed around the space. |
Awesome work! Thanks @FSSRepo for taking this |
@Seikho Can you test now? |
At the time it was a silent error, although I didn't run with |
I'll make a review tomorrow. First will merge #3624 and then will push changes to this PR to adapt to the new sampling API.
There should be a max context per-client (
Might be fixed via #3657
Enable by default |
I'm somewhat confused about this. Is context_per_slot a parameter for each slot or a general one? The sum of the per-slot contexts should not exceed the model's n_ctx: a user might assign 512 to one slot and 768 to another, but if the established context is only 1024, then 256 of the requested tokens cannot be allocated (see the sketch below). Also, I have no idea how the |
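To make that concern concrete, here is a small validation sketch; the helper name and the printed message are mine, not the PR's, and only the 512/768/1024 numbers come from the comment above.

#include <cstdio>
#include <vector>

// Hypothetical check: the per-slot contexts must fit inside the model context.
static bool validate_slot_contexts(const std::vector<int> & ctx_per_slot, int n_ctx_model) {
    int total = 0;
    for (int c : ctx_per_slot) {
        total += c;
    }
    if (total > n_ctx_model) {
        std::printf("requested %d tokens, but the model context is %d (%d cannot be allocated)\n",
                    total, n_ctx_model, total - n_ctx_model);
        return false;
    }
    return true;
}

int main() {
    // The example from the comment: 512 + 768 = 1280 > 1024, so 256 tokens are short.
    validate_slot_contexts({512, 768}, 1024);
    return 0;
}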
@ggerganov Are you reviewing this now? I will try to adapt to the new changes in the sampling API |
I didn't find the time today, but tomorrow this will be first thing. Until then, you can make changes if you want to. I'll mark the PR when I start reviewing. |
@FSSRepo I can't push into your branch:

$ git push FSSRepo HEAD:master
error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/FSSRepo/llama.cpp'

Let's continue in #3677 - sent you a collaborator invite |
Based on the server I've already created, I'm working on this PR. There are commented-out sections to save myself from unnecessary headaches. I haven't even tested the results of this PR yet.
Here's the plan for a complete implementation:
- completion: work in streaming mode.
- embedding endpoint: inherent issue with the CUDA backend (Bug: Invalid Embeddings if GPU offloaded (CUDA) #3625).
- completion.
- infill function and tests: it would make my life easier if someone could provide a way to test that endpoint.
- master branch.