Zyphra Mamba2 and Flash Attention backward kernel blog #25
First pass review, @Ehud-Sharlin please take a look.
@BerenMillidge Thank you for the contribution. I will work to guide this blog to publication. This process may take until mid next week.
@saadrahim Thanks for reviewing. Would it be possible to commit to a firm date for publication, such as the 10th? We will try to be as responsive as possible on our side.
@saadrahim and @Ehud-Sharlin, thanks for guiding this blog to publication. Are we able to get a date locked in? Possibly next Monday or Tuesday (December 9th or 10th)? ~Cheers
A December 10 release sounds reasonable.
Adding revisions and suggestions to the blog’s title, thumbnail image, snippet text, introductory title and text, and summary.
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
    "keywords": "Mamba, PyTorch, S4, S6, Mamba2, Transformer, Flash Attention, Optimization, Hardware-aware, Transformer, Attention, ROCm, Mi210, MI250, MI300, AI/ML, Generative AI"
Please add a thumbnail image, format:
thumbnail: ' image name.jpg'
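For reference, a minimal sketch of how the thumbnail field could sit in the blog's MyST front matter, based on the fields quoted elsewhere in this review; the file name below is hypothetical, since the actual image was supplied separately:

```
---
blogpost: true
date: 10 December 2024
language: English
# Hypothetical file name for illustration only; the real thumbnail was sent over email.
thumbnail: 'mamba2-fa2-kernels.jpg'
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
---
```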
Will add this today
I added the path in 327f685 and we sent the thumbnail over email. Does anything else need to be done for the thumbnail?
*By Quentin Anthony and Beren Millidge from Zyphra*

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market, significantly outperforming its competitor the Nvidia H100 GPU. The key hardware specs where the MI300X surpasses the H100 are High Bandwidth Memory (HBM) capacity and bandwidth.
Suggested change: add a section title before this paragraph:

## Harnessing the MI300 Superior Hardware Specs
I created a subsection with the more neutral wording of ## Introduction
language: English
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
See my suggestion for the blog snippet text:
"description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm" | |
"description lang=en": "This blog presents Zyphra’s vision of training transformers and hybrid models at a lower cost, and its realization by utilizing and optimizing the superior hardware specs of the MI300x." |
This was handled and is now outdated
@saadrahim -- Does anything else remain? Do we need to update index.md?
index.md is mostly autogenerated. I am checking what else is left.
We can take care of:
blogs/authors/data/Quentin-Anthony.jpg
blogs/authors/quentin-anthony.md
blogs/ecosystems-and-partners/zyphra/README.md
blogs/ecosystems-and-partners/zyphra/images/Flash_attention_AMD_kernel_blog.png
blogs/ecosystems-and-partners/zyphra/images/Mamba2_kernel_backward_AMD_blog.png
There are some other minor linting issues that @Danny213123 and I can deal with tomorrow. No further blockers from my perspective.
Final touch-ups
language: English
myst:
  html_meta:
    "description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators."
This snippet is capped at 150 characters. I would like it to start with the Zyphra and AMD work, with the message of surpassing others taking a back seat. Please see my revision:
"description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators." | |
"description lang=en": "This blog presents the training of Zyphra's backwards kernels for transformers and hybrid models on AMD Instinct MI300X accelerators, suppressing the H100s performance" |
resolved by f069a05
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Please use the same (simpler to understand...) title we already use in the blog's body:
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators" | |
blog_title: "Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators" |
Discussed over email earlier today, but we prefer to keep the original title
# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Introduction
Please remove the "Introduction" section title. We follow a magazine-like (non-academic) approach in our blogs where each post starts with a brief introductory text, not explicitly titled Introduction.
resolved by f069a05
Zyphra is designing MaiaOS, a multimodal agent system that combines next-gen neural network architectures (SSM hybrids), long-term memory, and reinforcement learning.

In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300x hardware for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.
Please add a new section title here:

## Harnessing the MI300 Hardware Specs
Resolved by f069a05
New title suggestion
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Reverting to an earlier suggested title; please use:
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators" | |
blog_title: "Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators" |
We prefer the previous title because:
- We want readers to know we've achieved state of the art ("frontier")
- We wrote kernels for both transformers and SSMs
- We understand and resolved the main blocker for ROCm training: backward kernels
No worries, let's work with your preferred title!
---

# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators
Reverting to an earlier suggested title; please use:
# Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators
We prefer the previous title for this one as well
Hey @BerenMillidge, can you please take a look at BerenMillidge#1?
Just merged this!
Going with your preferred title @Quentin-Anthony! :-)
But, the "Harnessing the MI300 Hardware Specs" section title needs to be move downwards, thanks!
# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Harnessing the MI300 Hardware Specs
No section title here, please. The post just starts with the brief intro text, with no section title (this section title goes in, but only a little further, please see below). Thanks!
Suggested change: remove the ## Harnessing the MI300 Hardware Specs line from this location.
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
No worries, let's work with your preferred title!
In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300x hardware for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
Section title goes here, please:
## Harnessing the MI300 Hardware Specs
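Putting the agreed pieces together, a minimal sketch of the ordering this review converges on; the front-matter values come from the diffs quoted above, other front-matter fields (description, keywords, thumbnail) are omitted, and the body text is abbreviated with ellipses:

```
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
---

# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

Zyphra is designing MaiaOS, a multimodal agent system that combines next-gen neural network architectures (SSM hybrids), long-term memory, and reinforcement learning.

In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. ...

## Harnessing the MI300 Hardware Specs

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. ...
```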
@Danny213123 @Ehud-Sharlin @saadrahim -- What remains before this can be merged?
@Danny213123 is fixing it. Shouldn't be much longer.
Objective of the new blog:
To describe Zyphra's work on developing optimized and highly performant ROCm kernels for Flash-Attention backward and Mamba2.
Signoff section must be completed prior to publishing.