
Marlin Mixed Input Kernel Productionization #3008

Closed · wants to merge 1 commit

Commits on Aug 19, 2024

  1. Marlin Mixed Input Kernel Productionization (pytorch#3008)

    Summary:
    Pull Request resolved: pytorch#3008
    
    X-link: facebookresearch/FBGEMM#103
    
    This diff gives our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x I4 kernels a substantial overhaul. The improvements include:
    
    * Upgrades the kernel with the latest improvements from vLLM, which helps considerably with stability and fixes issues with group scaling.
    * Adds template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
    * Fixes a BF16 dequantization issue.
    * Exposes a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, and a convenient helper for quantizing to the Marlin format, `marlin_quantize` (see the usage sketch after this list).
    * Adds these new ops to our quantize benchmarks.
    * Adds new tests and a better directory structure.
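
    To make the new surface concrete, here is a minimal usage sketch. The op name `torch.ops.marlin.marlin_gemm` and the helper name `marlin_quantize` come from this diff; the import path, argument list, and return values below are illustrative assumptions, not the confirmed FBGEMM API.

    ```python
    import torch

    # Assumed import path: this diff does not state where marlin_quantize lives.
    from fbgemm_gpu.experimental.gen_ai.quantize import marlin_quantize

    M, K, N = 16, 4096, 4096
    x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")

    # Assumed signature: pack a BF16 weight matrix into the Marlin I4 layout,
    # returning the packed int4 weights and per-group scales.
    w_marlin, scales = marlin_quantize(w)

    # The op name is confirmed by this diff; the argument order
    # (activation, packed weight, scales) is an assumption.
    y = torch.ops.marlin.marlin_gemm(x, w_marlin, scales)
    ```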
    
    One downside of this work is that we have diverged somewhat from vLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.
    
    Reviewed By: jianyuh, jiawenliu64
    
    Differential Revision: D61408771
    jwfromm authored and facebook-github-bot committed Aug 19, 2024
    Commit d5db962