Marlin Mixed Input Kernel Productionization (#3008)
Summary:
Pull Request resolved: #3008
X-link: facebookresearch/FBGEMM#103

This diff does quite a bit of facelifting to our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x I4 kernels. The improvements include:

* Upgrading the kernel with the latest improvements from VLLM, which helps considerably with stability and fixes issues with group scaling.
* Adding template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
* Fixing a BF16 dequantization issue.
* Exposing a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, along with a convenient helper for quantizing to the Marlin format, `marlin_quantize` (see the usage sketch below).
* Adding these new ops to our quantize benchmarks.
* Adding new tests and a better directory structure.

One downside of this work is that we have diverged a bit from VLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D61408771

fbshipit-source-id: 66b651ce794309a408f30244cac20a3c9ab0ce5a
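As a rough illustration of the new surface, the sketch below quantizes a weight with `marlin_quantize` and runs the GEMM through `torch.ops.marlin.marlin_gemm`. The summary does not spell out the signatures, so the import path, argument names, group size, and workspace handling here are assumptions rather than the library's documented API.

```python
# Hypothetical usage sketch, not the documented API: the import path,
# argument order, group_size, and workspace handling are assumptions.
import torch

# The summary names marlin_quantize as the helper that packs weights into the
# Marlin format; this import path is a guess and may differ in the repo.
from fbgemm_gpu.experimental.gen_ai.quantize import marlin_quantize  # assumed path

M, K, N = 16, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # BF16 activation
w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")  # weight to quantize

# Assume the helper returns the int4-packed weight plus per-group scales.
w_packed, scales = marlin_quantize(w, group_size=128)

# Scratch buffer for the kernel; the required size here is only a placeholder.
workspace = torch.zeros(N // 64 * 16, dtype=torch.int32, device="cuda")

# The simplified custom op exposed by this diff; the trailing size/workspace
# arguments are assumed by analogy with other Marlin GEMM interfaces.
y = torch.ops.marlin.marlin_gemm(x, w_packed, scales, workspace, M, N, K)
print(y.shape)  # expected (M, N) BF16 output
```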