
Triton MX4 Quantize Rounding Mode Support. #2821

Closed

jwfromm wants to merge 4 commits

Conversation

@jwfromm (Contributor) commented Jul 10, 2024

Summary: This diff adds the `rounding_mode` argument to Triton quantize. We support (almost) all of the rounding modes described in the best practices doc. Stochastic rounding is coming soon.

Differential Revision: D59562029
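For intuition, the rounding modes come down to how `log2` of each group's max magnitude is snapped to an integer shared exponent. The sketch below is illustrative only: the enum values and helper name are assumptions for exposition, not the fbgemm_gpu API, and the stochastic branch is included just to show the idea even though the PR description defers it.

```python
# Illustrative sketch only; names and semantics are assumptions, not fbgemm_gpu.
from enum import Enum

import torch


class RoundingMode(Enum):
    NEAREST = "nearest"
    FLOOR = "floor"
    CEIL = "ceil"
    STOCHASTIC = "stochastic"


def round_shared_exponent(group_amax: torch.Tensor, mode: RoundingMode) -> torch.Tensor:
    """Round log2 of each group's absolute max to an integer shared exponent."""
    # Clamp away zeros so log2 stays finite.
    raw_exp = torch.log2(group_amax.clamp_min(torch.finfo(torch.float32).tiny))
    if mode is RoundingMode.FLOOR:
        return torch.floor(raw_exp)
    if mode is RoundingMode.CEIL:
        return torch.ceil(raw_exp)
    if mode is RoundingMode.STOCHASTIC:
        # Stochastic rounding: floor(x + uniform[0, 1)) rounds up with
        # probability equal to the fractional part of x.
        return torch.floor(raw_exp + torch.rand_like(raw_exp))
    # Default: round to nearest (torch.round breaks ties to even).
    return torch.round(raw_exp)


# Example: one group of 32 values, rounded with each mode.
group_amax = torch.randn(32).abs().max().unsqueeze(0)
for m in RoundingMode:
    print(m.value, round_shared_exponent(group_amax, m).item())
```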

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D59562029

netlify bot commented Jul 10, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | a044115 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66944289a1fb13000870a244 |
| 😎 Deploy Preview | https://deploy-preview-2821--pytorch-fbgemm-docs.netlify.app |

Josh Fromm added 3 commits July 12, 2024 13:04
Summary:
Pull Request resolved: pytorch#2836

Rather than trying to reshape inputs into 2D matrices with each thread operating on one row, this refactor uses 1D inputs and has each thread operate on an offset into the array.

The main benefit is that it avoids ragged tensors when an input cannot be divided into evenly sized rows. This should make us compatible with more shapes.

Differential Revision: D59653809

Reviewed By: sryap
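For reference, the 1D offset pattern described above looks roughly like the standard Triton elementwise idiom below. This is a minimal sketch (a simple scaling kernel with assumed names), not the actual MX4 kernel; the point is that masked, offset-based loads handle any input length without reshaping into rows.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _scale_1d_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE slice of the flat input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Masking handles ragged tails (lengths not divisible by BLOCK_SIZE)
    # without reshaping the input into even-sized rows.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)


def scale_1d(x: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # Flatten to 1D: any input shape works because indexing is offset based.
    x_flat = x.contiguous().view(-1)
    out = torch.empty_like(x_flat)
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(x_flat.numel(), BLOCK_SIZE),)
    _scale_1d_kernel[grid](x_flat, out, x_flat.numel(), scale, BLOCK_SIZE=BLOCK_SIZE)
    return out.view_as(x)


# Usage (requires a GPU): y = scale_1d(torch.randn(1_000_003, device="cuda"))
```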
Summary:
We previously had to use Python to unravel values from exponents and feed them to Triton as two separate tensors. This introduced a lot of overhead due to the large copies involved.

This diff does a bunch of fancy indexing to operate directly on a tensor with mixed elements and exponents. The result is that Triton dequantize is now slightly faster than the CUDA kernel. My hope is that this allows us to standardize on a single implementation.

Differential Revision: D59661776
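Concretely, a mixed buffer like this can be addressed with index arithmetic such as the following. The layout here (16 packed int4 bytes followed by a 1-byte shared exponent per group of 32) is an assumption for illustration; the actual FBGEMM packing may differ.

```python
import torch

# Assumed layout for illustration only (the real FBGEMM packing may differ):
# each group of 32 MX4 values is stored as 16 bytes of packed int4 pairs
# followed by 1 byte of shared exponent, all in one flat uint8 tensor.
GROUP_SIZE = 32
PACKED_BYTES = GROUP_SIZE // 2       # two int4 elements per byte
BYTES_PER_GROUP = PACKED_BYTES + 1   # plus the shared-exponent byte


def value_byte_offsets(num_groups: int) -> torch.Tensor:
    """Offsets of every packed-value byte, skipping over exponent bytes."""
    group_starts = torch.arange(num_groups).unsqueeze(1) * BYTES_PER_GROUP
    within_group = torch.arange(PACKED_BYTES).unsqueeze(0)
    return (group_starts + within_group).flatten()


def exponent_byte_offsets(num_groups: int) -> torch.Tensor:
    """Offsets of each group's shared-exponent byte."""
    return torch.arange(num_groups) * BYTES_PER_GROUP + PACKED_BYTES


# Gather values and exponents straight out of the mixed buffer, with no
# Python-side unravel into two separate tensors.
num_groups = 4
packed = torch.randint(0, 256, (num_groups * BYTES_PER_GROUP,), dtype=torch.uint8)
values = packed[value_byte_offsets(num_groups)]
shared_exps = packed[exponent_byte_offsets(num_groups)].to(torch.int32) - 127
```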
Summary:
We apply the same technique used for dequantize in D59661776 to MX4 quantization. Specifically, we use fancy indexing to write both exponents and values to the same output tensor within the Triton kernel. This lets us allocate a single output and avoid extra copies, giving a sizeable 40% performance boost; see the sketch after the benchmarks below.

Before this change:
```
INFO:root:input size: 1073741824 group size: 32
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 quantized time per iter: 7563us
input_size=1073741824 MX4 dequantized time per iter: 2756us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 triton quantized time per iter: 5110us
input_size=1073741824 MX4 triton dequantized time per iter: 2417us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 FP8 quantized time per iter: 6274us
input_size=1073741824 FP8 dequantized time per iter: 4223us
```

After this change:
```
INFO:root:input size: 1073741824 group size: 32
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 quantized time per iter: 7560us
input_size=1073741824 MX4 dequantized time per iter: 2758us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 triton quantized time per iter: 3138us
input_size=1073741824 MX4 triton dequantized time per iter: 2418us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 FP8 quantized time per iter: 6274us
input_size=1073741824 FP8 dequantized time per iter: 4226us
```

Differential Revision: D59688150
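The quantize side mirrors the dequantize indexing: one uint8 output holds both the packed values and each group's shared exponent, so only a single allocation is needed. A minimal PyTorch sketch under the same assumed layout as above (the kernel does the equivalent with per-thread offsets):

```python
import torch

GROUP_SIZE = 32
PACKED_BYTES = GROUP_SIZE // 2       # two int4 elements per byte
BYTES_PER_GROUP = PACKED_BYTES + 1   # plus the shared-exponent byte


def pack_groups(value_bytes: torch.Tensor, exponents: torch.Tensor) -> torch.Tensor:
    """Scatter packed int4 bytes and per-group exponents into one output buffer.

    value_bytes: (num_groups, 16) uint8 tensor of already-packed int4 pairs.
    exponents:   (num_groups,) uint8 tensor of biased shared exponents.
    """
    num_groups = value_bytes.shape[0]
    out = torch.empty(num_groups * BYTES_PER_GROUP, dtype=torch.uint8)
    group_starts = torch.arange(num_groups).unsqueeze(1) * BYTES_PER_GROUP
    value_idx = (group_starts + torch.arange(PACKED_BYTES)).flatten()
    exp_idx = torch.arange(num_groups) * BYTES_PER_GROUP + PACKED_BYTES
    out[value_idx] = value_bytes.flatten()
    out[exp_idx] = exponents
    return out


# Example: four groups of random packed nibbles with exponent bias 127.
vals = torch.randint(0, 256, (4, PACKED_BYTES), dtype=torch.uint8)
exps = torch.full((4,), 127, dtype=torch.uint8)
mx4 = pack_groups(vals, exps)  # single 68-byte buffer, no extra copies
```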
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Jul 12, 2024
Summary:
Pull Request resolved: pytorch#2821

X-link: facebookresearch/FBGEMM#22

This diff adds the `rounding_mode` argument to Triton quantize. We support all of the rounding modes described in the [best practices doc](https://docs.google.com/document/d/156Du0hBRH6umG_i-OrYC574XhpQMUU5SJYG0RTS2tTg/edit#heading=h.akfcp7xpg8cr).

Reviewed By: summerdengfb

Differential Revision: D59562029
@facebook-github-bot (Contributor)

This pull request has been merged in 5bf8ce9.
