
Marlin Mixed Input Kernel Productionization #3008

Closed · wants to merge 1 commit

Conversation

@jwfromm (Contributor) commented Aug 17, 2024

Summary:
This diff does quite a bit of facelifting to our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x INT4 kernels. The improvements include:

* Upgrades the kernel with the latest improvements from VLLM, which helps considerably with stability and fixes issues with group scaling.
* Adds template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
* Fixes a BF16 dequantization issue.
* Exposes a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, and a convenient helper for quantizing to the Marlin format, `marlin_quantize` (see the usage sketch below).
* Adds these new ops to our quantize benchmarks.
* Adds new tests and a better directory structure.

One downside of this work is that we have diverged a bit from VLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.
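For reference, here is a minimal usage sketch of the new ops. The names `torch.ops.marlin.marlin_gemm` and `marlin_quantize` come from this diff; the import path, argument names, and return values shown below are assumptions and may not match the final FBGEMM API.

```python
import torch

# Assumed import path for the quantization helper added in this diff (not confirmed).
from fbgemm_gpu.experimental.gen_ai.quantize import marlin_quantize

M, K, N = 16, 4096, 4096
group_size = 128  # assumed group-scaling granularity

x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # BF16 activations
w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")  # full-precision weights

# Pack the weight into the Marlin INT4 layout with per-group scales.
# The (packed weight, scales) return signature is an assumption.
wq, scales = marlin_quantize(w, group_size)

# Mixed-input GEMM: BF16 activations x packed INT4 weights.
# The exact argument list of the custom op is also an assumption.
y = torch.ops.marlin.marlin_gemm(x, wq, scales)
```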

Differential Revision: D61408771


netlify bot commented Aug 17, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: d5db962
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66c3d09495815f0008d169be
Deploy Preview: https://deploy-preview-3008--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D61408771

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Aug 19, 2024

Summary: same as the PR description above.

Pull Request resolved: pytorch#3008
X-link: facebookresearch/FBGEMM#103
Reviewed By: jianyuh
Differential Revision: D61408771
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Aug 19, 2024

Summary: same as the PR description above.

Pull Request resolved: pytorch#3008
X-link: facebookresearch/FBGEMM#103
Reviewed By: jianyuh, jiawenliu64
Differential Revision: D61408771

@facebook-github-bot (Contributor) commented

This pull request has been merged in 162cc69.
