Regression in prompt processing speed using a batch size of 1024 #6075
Hello,

I've noticed a significant reduction in prompt processing speed when comparing the latest llama.cpp builds to slightly older ones.

I think it has something to do with the batch size. The speed at a batch size of 512 is the same as it has always been, but with -b 1024 it is significantly slower.

Comparison with the latest llama.cpp: -n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024, Mixtral IQ4_XS, Core i7 9750H, 32 GB RAM, RTX 2060

version: 2431 (4755afd)
version: 2405 (5cdb371)

@slaren Do you think there is a commit that could have caused this? Listening to the coil whine of my laptop while processing the prompt, there is a very noticeable difference in the sound. With the recent commit, it sounds as if it is processing two 512-token batches instead of one 1024-token batch (there is a noticeable pause in the coil whine at some point), even though the terminal still reports the usual 1024 batch size. With the older commit, there is no such pause and the sound is continuous for the whole 1024 tokens.

The speed difference is quite stark (20 ms/t vs 14 ms/t). I hope you can take a look at this! Thank you.

Comments

Probably happened after #6017. From that PR: …

What happens if you run …

Yes, I can confirm this fixes it, although I have the feeling it uses more VRAM than before. Needs more testing. Edit: Nope, my testing shows no increase in VRAM. All is good.

Looks like you already figured it out; the parameter to change the physical batch size is now …
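For anyone comparing the two behaviours, here is a rough reproduction sketch. After #6017, -b controls the logical batch size while the physical batch size is set by a separate parameter; the exact flag is cut off in the comment above, so it is assumed here to be -ub / --ubatch-size, and the model path and binary name are placeholders:

    # Default physical batch: a 1024-token prompt is split into two 512-token micro-batches
    ./main -m mixtral-iq4_xs.gguf -n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024

    # Raising the physical batch to match -b should restore the single 1024-token pass
    ./main -m mixtral-iq4_xs.gguf -n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024 -ub 1024

If the second run matches the old ~14 ms/t timing while the first stays around 20 ms/t, that would point to the new logical/physical batch split rather than a genuine slowdown.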