Marlin Mixed Input Kernel Productionization #3008
Conversation
This pull request was exported from Phabricator. Differential Revision: D61408771
Summary: Pull Request resolved: pytorch#3008

X-link: facebookresearch/FBGEMM#103

This diff does quite a bit of facelifting to our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x I4 kernels. These improvements include:

* Upgrades the kernel with the latest improvements from vLLM. This helps quite a bit with stability and fixes issues with group scaling.
* Adds template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
* Fixes a BF16 dequantization issue.
* Exposes a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, and a convenient helper for quantizing to the Marlin format, `marlin_quantize` (see the usage sketch below).
* Adds these new ops to our quantize benchmarks.
* Adds new tests and a better directory structure.

One downside of this work is that we have diverged a bit from vLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D61408771
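For anyone picking this up downstream, here is a minimal sketch of how the new op and helper fit together. The import path, exact signatures, and the `group_size` value below are assumptions for illustration only; the benchmarks and tests added in this diff are the authoritative reference.

```python
import torch

# NOTE: the import path below is an assumption; this PR summary does not
# state which FBGEMM module exports marlin_quantize.
from fbgemm_gpu.experimental.gen_ai.quantize import marlin_quantize

M, K, N = 16, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # BF16 activations
w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")  # BF16 weights

# Quantize the weight into the packed Marlin int4 format. Assumed to return
# the packed weight plus per-group scales; group_size=128 is illustrative.
w_marlin, scales = marlin_quantize(w, group_size=128)

# Mixed-input GEMM: BF16 activations x int4 Marlin weights -> BF16 output.
# The simplified 3-argument signature is an assumption based on the summary.
y = torch.ops.marlin.marlin_gemm(x, w_marlin, scales)
```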
This pull request has been merged in 162cc69.