Marlin Mixed Input Kernel Productionization (pytorch#3008)
Summary:
Pull Request resolved: pytorch#3008

X-link: facebookresearch/FBGEMM#103

This diff does quite a bit of facelifting to our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x I4 kernels. The improvements include:

* Upgrades the kernel with the latest improvements from vLLM. This helps considerably with stability and fixes issues with group scaling.
* Adds template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
* Fixes a BF16 dequantization issue.
* Exposes a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, along with convenient helpers for quantizing to the Marlin format, such as `marlin_quantize` (see the usage sketch below).
* Adds these new ops to our quantize benchmarks.
* Adds new tests and a better directory structure.

One downside of this work is that we have diverged somewhat from vLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.

Reviewed By: jianyuh

Differential Revision: D61408771
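For reference, a minimal usage sketch of the new custom op and quantize helper. The names `torch.ops.marlin.marlin_gemm` and `marlin_quantize` come from the summary above, but the import path, argument order, and `group_size` value are assumptions for illustration, not the actual API from this diff.

```python
# Hedged usage sketch; not taken from the diff itself. The import path,
# signatures, and group size below are assumptions.
import torch

# Hypothetical import path; adjust to wherever your FBGEMM build exposes
# the Marlin quantize helper.
from fbgemm_gpu.experimental.gen_ai.quantize_ops import marlin_quantize

M, K, N = 16, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")

# Assumed behavior: pack the BF16 weight into the Marlin INT4 layout and
# return the packed weight plus per-group scales.
w_packed, scales = marlin_quantize(w, group_size=128)

# Assumed signature: activation, packed INT4 weight, group scales.
y = torch.ops.marlin.marlin_gemm(x, w_packed, scales)
print(y.shape)  # expected: (M, N)
```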