Zyphra Mamba2 and Flash Attention backward kernel blog #25
First pass review, @Ehud-Sharlin please take a look.
@BerenMillidge Thank you for the contribution. I will work to guide this blog to publication. This process may take until mid next week.
@saadrahim Thanks for reviewing. Would it be possible to commit to a firm date for publication, such as the 10th? We will try to be as responsive as possible on our side.
@saadrahim and @Ehud-Sharlin, thanks for guiding this blog to publication. Are we able to get a date locked in? Possibly next Monday or Tuesday (December 9th or 10th)? ~Cheers
A December 10 release sounds reasonable.
Adding revisions and suggestions to the blog’s title, thumbnail image, snippet text, introductory title and text, and summary.
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
    "keywords": "Mamba, PyTorch, S4, S6, Mamba2, Transformer, Flash Attention, Optimization, Hardware-aware, Transformer, Attention, ROCm, Mi210, MI250, MI300, AI/ML, Generative AI"
Please add a thumbnail image, format:
thumbnail: ' image name.jpg'
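For reference, a minimal sketch of how the thumbnail field could sit in the blog's MyST front matter, based on the fields quoted elsewhere in this review; the file name below is hypothetical, since the actual image was supplied separately:

```
---
blogpost: true
date: 10 December 2024
language: English
# Hypothetical file name for illustration only; the real thumbnail was sent over email.
thumbnail: 'mamba2-fa2-kernels.jpg'
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
---
```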
Will add this today
I added the path in 327f685 and we sent the thumbnail over email. Does anything else need to be done for the thumbnail?
*By Quentin Anthony and Beren Millidge from Zyphra*

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market, significantly outperforming its competitor the Nvidia H100 GPU. The key hardware specs where the MI300X surpasses the H100 are High Bandwidth Memory (HBM) capacity and bandwidth.
Suggested change: add a section title before this paragraph:

## Harnessing the MI300 Superior Hardware Specs
I created a subsection with the more neutral wording of ## Introduction
language: English
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
See my suggestion for the blog snippet text:
"description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm" | |
"description lang=en": "This blog presents Zyphra’s vision of training transformers and hybrid models at a lower cost, and its realization by utilizing and optimizing the superior hardware specs of the MI300x." |
This was handled and is now outdated
@saadrahim -- Does anything else remain? Do we need to update index.md?
index.md is mostly autogenerated. I am checking what else is left.
We can take care of:
blogs/authors/data/Quentin-Anthony.jpg
blogs/authors/quentin-anthony.md
blogs/ecosystems-and-partners/zyphra/README.md
blogs/ecosystems-and-partners/zyphra/images/Flash_attention_AMD_kernel_blog.png
blogs/ecosystems-and-partners/zyphra/images/Mamba2_kernel_backward_AMD_blog.png
There are some other minor linting issues that @Danny213123 and I can deal with tomorrow. No further blockers from my perspective.
Final touch-ups
language: English
myst:
  html_meta:
    "description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators."
This snippet is capped at 150 characters. I would like it to start with the Zyphra and AMD work, with the message of surpassing others taking a back seat. Please see my revision:
"description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators." | |
"description lang=en": "This blog presents the training of Zyphra's backwards kernels for transformers and hybrid models on AMD Instinct MI300X accelerators, suppressing the H100s performance" |
resolved by f069a05
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Please use the same (simpler to understand...) title we already use in the blog's body:
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators" | |
blog_title: "Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators" |
Discussed over email earlier today, but we prefer to keep the original title
# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Introduction
Please remove the "Introduction" section title. We follow a magazine-like (non-academic) approach in our blogs where each post starts with a brief introductory text, not explicitly titled Introduction.
resolved by f069a05
Zyphra is designing MaiaOS, a multimodal agent system that combines next-gen neural network architectures (SSM hybrids), long-term memory, and reinforcement learning.

In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300x hardware for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.
Please add a new section title here:

## Harnessing the MI300 Hardware Specs
Resolved by f069a05
New title suggestion
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Reverting to an earlier suggested title; please use:
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators" | |
blog_title: "Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators" |
We prefer the previous title because:
- We want readers to know we've achieved state of the art ("frontier")
- We wrote kernels for both transformers and SSMs
- We understand and resolved the main blocker for ROCm training: backward kernels
No worries, let's work with your preferred title!
---

# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators
Reverting to an earlier suggested title; please use:
# Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators
We prefer the previous title for this one as well
Hey @BerenMillidge, can you please take a look at BerenMillidge#1?
Just merged this!
Going with your preferred title @Quentin-Anthony! :-)
But, the "Harnessing the MI300 Hardware Specs" section title needs to be move downwards, thanks!
# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Harnessing the MI300 Hardware Specs
No section title here, please. The post just starts with the brief intro text, with no section title (this section title goes in, but only a little further, please see below). Thanks!
Suggested change: remove the ## Harnessing the MI300 Hardware Specs line from this location.
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
No worries, let's work with your preferred title!
In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300x hardware for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
Section title goes here, please:
## Harnessing the MI300 Hardware Specs
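Putting the agreed pieces together, a minimal sketch of the ordering this review converges on; the front-matter values come from the diffs quoted above, other front-matter fields (description, keywords, thumbnail) are omitted, and the body text is abbreviated with ellipses:

```
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
---

# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

Zyphra is designing MaiaOS, a multimodal agent system that combines next-gen neural network architectures (SSM hybrids), long-term memory, and reinforcement learning.

In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. ...

## Harnessing the MI300 Hardware Specs

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. ...
```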
@Danny213123 @Ehud-Sharlin @saadrahim -- What remains before this can be merged?
@Danny213123 is fixing it. Shouldn't be much longer.
Objective of the new blog:
To describe Zyphra's work on developing optimized and highly performant ROCm kernels for Flash-Attention backward and Mamba2.
Signoff section must be completed prior to publishing.