
Marlin Mixed Input Kernel Productionization #3008

Closed · wants to merge 1 commit

Conversation

@jwfromm (Contributor) commented Aug 17, 2024

Summary:
This diff does quite a bit of facelifting to our [Marlin](https://github.com/IST-DASLab/marlin) BF16 x INT4 kernels. The improvements include:

* Upgrades the kernel with the latest improvements from VLLM, which helps considerably with stability and fixes issues with group scaling.
* Adds template specializations so that the Marlin kernel supports both BF16 and FP16 with a single implementation.
* Fixes a BF16 dequantization issue.
* Exposes a simplified torch custom op, `torch.ops.marlin.marlin_gemm`, and a convenient helper for quantizing to the Marlin format, `marlin_quantize` (see the usage sketch below).
* Adds these new ops to our quantize benchmarks.
* Adds new tests and a better directory structure.

One downside of this work is that we have diverged a bit from VLLM, so it may be harder to stay in sync going forward. However, I think the benefits of the improvements in this diff outweigh the potential sync costs.
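For reference, here is a minimal usage sketch of the new ops. The names `torch.ops.marlin.marlin_gemm` and `marlin_quantize` come from this diff; the import path, argument names, and return values shown below are assumptions and may not match the final FBGEMM API.

```python
import torch

# Assumed import path for the quantization helper added in this diff (not confirmed).
from fbgemm_gpu.experimental.gen_ai.quantize import marlin_quantize

M, K, N = 16, 4096, 4096
group_size = 128  # assumed group-scaling granularity

x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")  # BF16 activations
w = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")  # full-precision weights

# Pack the weight into the Marlin INT4 layout with per-group scales.
# The (packed weight, scales) return signature is an assumption.
wq, scales = marlin_quantize(w, group_size)

# Mixed-input GEMM: BF16 activations x packed INT4 weights.
# The exact argument list of the custom op is also an assumption.
y = torch.ops.marlin.marlin_gemm(x, wq, scales)
```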

Differential Revision: D61408771


netlify bot commented Aug 17, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: d5db962
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66c3d09495815f0008d169be
Deploy Preview: https://deploy-preview-3008--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D61408771

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Aug 19, 2024

Summary: same as the PR description above.

Pull Request resolved: pytorch#3008
X-link: facebookresearch/FBGEMM#103
Reviewed By: jianyuh
Differential Revision: D61408771
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Aug 19, 2024

Summary: same as the PR description above.

Pull Request resolved: pytorch#3008
X-link: facebookresearch/FBGEMM#103
Reviewed By: jianyuh, jiawenliu64
Differential Revision: D61408771

@facebook-github-bot (Contributor) commented

This pull request has been merged in 162cc69.
