Quantized: Use cublas for prompt #238
Conversation
Code Metrics Report
===============================================================================
 Language            Files    Lines     Code  Comments    Blanks
===============================================================================
 Dockerfile              1       34       25         0         9
 Happy                   1      442      369         0        73
 JSON                    5        9        9         0         0
 Python                 21      741      622        21        98
 TOML                   16      419      378         1        40
-------------------------------------------------------------------------------
 Jupyter Notebooks       1        0        0         0         0
 |- Markdown             1       60       30        22         8
 |- Python               1       96       87         1         8
 (Total)                        156      117        23        16
-------------------------------------------------------------------------------
 Markdown               16     1026        0       758       268
 |- BASH                 6      205      192         0        13
 |- Python               6      121      110         0        11
 |- Rust                 3      185      172         9         4
 (Total)                       1537      474       767       296
-------------------------------------------------------------------------------
 Rust                   81    26376    24282       334      1760
 |- Markdown            38      359        0       354         5
 (Total)                      26735    24282       688      1765
===============================================================================
 Total                 143    29047    25685      1114      2248
===============================================================================
After direct f16 dequant:
[benchmark results: This (PR) vs. Master]
@lucasavila00, that's amazing! I am looking forward to merging this. Can you please add the special matmul function to the layer.rs file so we can use it in all models? Ideally, we can also implement it in QLinear so ISQ can benefit, too.
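For context, a minimal sketch of what such a shared helper in layer.rs could look like, assuming candle's Tensor API; the name matmul_via_f16, the [out, in] weight layout, and the 2-D activation shape are illustrative assumptions, not the code this PR actually adds:

// Sketch only: route a prompt-sized matmul through F16 so it dispatches to
// cuBLAS GEMM instead of the custom quantized kernels.
use candle_core::{DType, Result, Tensor};

pub fn matmul_via_f16(x: &Tensor, w_f16: &Tensor) -> Result<Tensor> {
    // Cast activations to F16; w_f16 is assumed to be a dequantized F16 copy
    // of the weight with shape [out, in].
    let x_f16 = x.to_dtype(DType::F16)?;
    let y = x_f16.matmul(&w_f16.t()?)?;
    // Cast back so the rest of the model keeps its working dtype.
    y.to_dtype(x.dtype())
}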
Force-pushed from 1e7f3c8 to e069667
I need to integrate these candle changes: huggingface/candle#2141. Once that is done, this PR should be ready. I'll try to do it tonight.
@EricLBuehler would you mind updating your fork? It does not include the precision changes.
@lucasavila00 sure, I just updated it.
I did not implement it for the other models. I can try to do it later in another PR; I don't have the bandwidth to work with different models and setups today.
Thanks for implementing this, I'm looking forward to merging it. Once I do, I'll implement the F16 gemm for the rest of the models.
    None
} else {
    Some(self.mask(seq_len, x.device())?)
};

let via_f16 = if is_prompt { seq_len > 32 } else { false };
Suggested change:
-    let via_f16 = if is_prompt { seq_len > 32 } else { false };
+    let via_f16 = is_prompt && seq_len > 32;
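The two forms are equivalent; as a small self-contained sketch (parameter names taken from the surrounding code, the 32-token threshold from the diff above):

fn via_f16(is_prompt: bool, seq_len: usize) -> bool {
    // Only take the F16 GEMM path when processing a prompt long enough
    // (more than 32 tokens) for cuBLAS to pay off.
    is_prompt && seq_len > 32
}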
Hi @lucasavila00, I'm looking forward to merging this PR! I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.
There were changes to the … I'm sorry, but I don't have time to benchmark it soon.
Can we check if …
This looks good already. If I understand correctly, the Candle backend already does not support GPUs that lack F16/BF16, since there is no gating on those dtypes, so I think this is good to merge. If in the future we decide to add more compatibility for those devices, it can be easily changed here.
I think this is ready to merge.
@lucasavila00 thank you!
[benchmark results: This (PR) vs. Master vs. Llama.cpp]
Honestly, I don't think it's worth it to merge it just for this small win. This breaks usage for GPUs that don't support F16...
But it's an improvement...
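One way to keep those GPUs working, sketched here as an assumption rather than something this PR does, would be to gate the F16 path on the device in addition to the prompt length:

// Sketch only: assumes candle's Device::is_cuda(); backends without F16
// support keep the existing quantized kernels.
use candle_core::Device;

fn use_f16_gemm(device: &Device, is_prompt: bool, seq_len: usize) -> bool {
    device.is_cuda() && is_prompt && seq_len > 32
}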