Add support for batch size to --perplexity
#407
Conversation
Ok, very interesting result. From #406 (reply in thread), there was a delta between 10 and 32 threads. So I tried rerunning my experiment with 1 thread. It's perfectly consistent across the different batch sizes now! Interestingly, the 1-thread results both match the batch size 512 result (with 32 threads).
I confirm the observation - looking into this.

Edit: The source of the variation looks to be in the "self-attention" section of …

Edit2: Pretty sure I've pin-pointed the cause: this matrix multiplication (line 737 in 9ea43d4) …
@glinscott Seems like the "transposed X" branch is more efficient, but not numerically stable. Will try to see if I can resolve it, and if not, I will simply remove it altogether from …
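For context on why two mathematically equivalent mul_mat branches (or different thread counts, as observed above) can disagree: floating-point addition is not associative, so kernels that accumulate the same products in a different order produce slightly different results even when both are correct. A minimal, self-contained illustration (not llama.cpp code):

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Dot product accumulated left-to-right in a single float accumulator.
float dot_forward(const std::vector<float> & a, const std::vector<float> & b) {
    float acc = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
    return acc;
}

// Same dot product accumulated block-wise (a different summation order),
// similar in spirit to what a tiled or transposed kernel ends up doing.
float dot_blocked(const std::vector<float> & a, const std::vector<float> & b, size_t block = 32) {
    float acc = 0.0f;
    for (size_t i0 = 0; i0 < a.size(); i0 += block) {
        float part = 0.0f;
        const size_t i1 = std::min(i0 + block, a.size());
        for (size_t i = i0; i < i1; ++i) part += a[i] * b[i];
        acc += part;
    }
    return acc;
}

int main() {
    std::mt19937 rng(123);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> a(4096), b(4096);
    for (auto & x : a) x = dist(rng);
    for (auto & x : b) x = dist(rng);

    // Both results are "correct", but they typically differ in the last bits,
    // and those differences accumulate over a full forward pass.
    std::printf("forward: %.9f\nblocked: %.9f\n", dot_forward(a, b), dot_blocked(a, b));
}
```

The same effect explains the thread-count sensitivity noted above: each thread owns a partial sum, so the reduction order changes with the number of threads.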
Kind of colliding with #438. Also, the batch size in your case has the opposite meaning of what it normally does (in non-perplexity mode), and is thus very deceptive.
@ggerganov awesome! Thank you, very nice find.

@Green-Sky ah, I must admit, I don't quite understand the main-mode batch size parameter. I thought it does evaluation after …
@ggerganov …
Ok, so after 483bab2, I see results are consistent at a given batch_size, but different across batch_sizes. E.g. tested …, then tested …, then did a few more: …
… if the memory management would actually take it into account.

See #438 for a competing attempt.
@Green-Sky I think I know how to fix the memory issues and reduce token memory usage drastically. Will try to do this later today.

@glinscott Are you on a Mac? I think at batch size >= 32 the BLAS Accelerate path is used (lines 5722 to 5729 in 3cd8dde). If you disable BLAS with …
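For readers following along: the decision to route a matrix multiplication through BLAS is made by a small size heuristic in ggml (the ggml_compute_forward_mul_mat_use_blas function mentioned later in the thread). A simplified sketch of that kind of gate, with the threshold taken from the "batch size >= 32" observation above rather than quoted from the source:

```cpp
// Simplified sketch of the size heuristic that decides whether a mul_mat is
// routed to BLAS (Accelerate/OpenBLAS/MKL). The exact condition in ggml may
// differ; the 32 threshold matches the "batch size >= 32" observation above.
static bool mul_mat_use_blas_sketch(int ne0, int ne1, int ne10, bool contiguous) {
    // BLAS only pays off for large enough operands, because the quantized
    // weights first have to be dequantized to F32 before calling sgemm.
    return contiguous && ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```

Hardcoding such a gate to always return true (as tested further down with Intel MKL and OpenBLAS) forces the BLAS path even for small batches, which helps separate "BLAS vs non-BLAS numerics" from "batch size" as the cause of the perplexity differences.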
@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with …
I'm doing a run to compare batch size 8 vs 512 at the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise, I'd probably switch the batch size to min(512, context_size) automatically for people.
Some very interesting results here. I'm building with …
Indeed, hardcoding …
Hi @glinscott, I have tested "hardcoding ggml_compute_forward_mul_mat_use_blas to return true" using Intel MKL:

…

And OpenBLAS:

…

Also, perplexity is slightly better (…).
Yes, using BLAS during perplexity computation can be deceiving (I think I noted this somewhere earlier). Let's have matrix multiplication …

When using BLAS, I think the BLAS computation will be more precise, because we lose precision when quantizing …
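To make the precision argument concrete: once the weights are quantized, both paths start from the same lossy values; what differs afterwards is whether the multiply runs in F32 (BLAS, after dequantizing) or through the quantized kernels. A toy 4-bit block quantizer, loosely modeled on the Q4-style formats but not the actual ggml code, shows how much error the quantization step alone introduces into a dot product:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Toy 4-bit block quantizer: one scale per 32 values, signed integer levels
// in [-7, 7]. Loosely modeled on ggml's Q4-style formats, not the real code.
struct BlockQ4 { float d; int8_t q[32]; };

static BlockQ4 quantize_block(const float * x) {
    BlockQ4 b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b.d = amax / 7.0f;  // scale so the largest value maps to +/-7
    for (int i = 0; i < 32; ++i) {
        b.q[i] = (int8_t) std::lround(b.d > 0.0f ? x[i] / b.d : 0.0f);
    }
    return b;
}

static void dequantize_block(const BlockQ4 & b, float * y) {
    for (int i = 0; i < 32; ++i) y[i] = b.d * b.q[i];
}

int main() {
    const int n = 4096;
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> x(n), y(n), xq(n);
    for (auto & v : x) v = dist(rng);
    for (auto & v : y) v = dist(rng);

    // Quantize x and dequantize it back: this is where precision is lost,
    // regardless of which kernel performs the multiplication afterwards.
    for (int i = 0; i < n; i += 32) {
        BlockQ4 b = quantize_block(&x[i]);
        dequantize_block(b, &xq[i]);
    }

    double dot_f32 = 0.0, dot_deq = 0.0;
    for (int i = 0; i < n; ++i) {
        dot_f32 += (double) x[i] * y[i];
        dot_deq += (double) xq[i] * y[i];
    }
    std::printf("fp32 dot       : %.6f\n", dot_f32);
    std::printf("dequantized dot: %.6f  (difference comes from quantization alone)\n", dot_deq);
}
```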
It would be nice to get this PR finalized.
Sorry for the delay, I've updated it so the batch size defaults to 512, which is much faster. Ready to go!
Adds support for batch size to `perplexity`. This allows you to run `perplexity` on context sizes of 2048 (although still limited to a batch size of 512 there, as that's the max currently supported). We set the batch size to the min of the user-defined batch size and the context size.

Note though, small batch sizes like 8 are significantly slower than 512, and perplexity also takes a hit (from switching off BLAS):

…

For larger batch sizes, things match well though:

…

Also, this fixes the batch size passed into `llama_eval` so it's not off by one, although I didn't notice much speed difference for large sizes.
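For reference, here is a rough sketch of what the batched evaluation inside the perplexity loop amounts to, written against the llama_eval() C API of that period; the function name, parameter names, and surrounding structure are assumptions for illustration, not the PR's exact code:

```cpp
#include "llama.h"
#include <algorithm>
#include <vector>

// Evaluate one context-sized chunk of tokens in batches of n_batch.
// Sketch only: error handling and logit bookkeeping are omitted.
void eval_chunk_batched(llama_context * ctx, const std::vector<llama_token> & tokens,
                        int n_ctx, int n_batch, int n_threads) {
    // Batch size is clamped to the context size (min of user batch and n_ctx).
    n_batch = std::min(n_batch, n_ctx);

    for (int i = 0; i < (int) tokens.size(); i += n_batch) {
        // Pass the true number of tokens in this batch (the off-by-one fix
        // mentioned in the description).
        const int n_eval = std::min(n_batch, (int) tokens.size() - i);
        if (llama_eval(ctx, tokens.data() + i, n_eval, /*n_past=*/i, n_threads) != 0) {
            return; // eval failed
        }
        // Per-token logits for this batch are then read via llama_get_logits(ctx).
    }
}
```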