
Zyphra Mamba2 and Flash Attention backward kernel blog #25

Merged
merged 28 commits into ROCm:release on Dec 10, 2024

Conversation

BerenMillidge
Contributor

Objective of the new blog:

To describe Zyphra's work on developing optimized and highly performant ROCm kernels for Flash-Attention backward and Mamba2.

Signoff section must be completed prior to publishing.

  • Technical reviewer approves publishing: (edit and replace with @githubid)
  • Editorial team approved publishing: (edit and replace with @githubid)
  • Add a thumbnail image for your blog if one is available
  • Text nugget summarizing your article. 2-3 lines to draw the reader's attention. Possibly the opening paragraph can be used.
  • Blog author team signoffs
    • Licenses file included for content is correct: (edit and replace with @githubid)
    • Changes from technical review and editorial team are acceptable: (edit and replace with @githubid)

@BerenMillidge BerenMillidge requested review from saadrahim and a team as code owners December 4, 2024 07:05
@saadrahim saadrahim (Member) left a comment

First pass review, @Ehud-Sharlin please take a look.

@saadrahim
Member

@BerenMillidge Thank you for the contribution. I will work to guide this blog to publication. This process may take until mid next week.

@BerenMillidge
Contributor Author

@saadrahim Thanks for reviewing. Would it be possible to commit to a firm publication date, such as the 10th? We will try to be as responsive as possible on our side.

@dannymartinelli1

@saadrahim and @Ehud-Sharlin, thanks for guiding this blog to publication. Are we able to get a date locked in? Possibly next Monday or Tuesday (December 9th or 10th)?

Cheers

@saadrahim
Member

A December 10 release sounds reasonable.

@Ehud-Sharlin Ehud-Sharlin (Contributor) left a comment

Adding revisions and suggestions to the blog’s title, thumbnail image, snippet text, introductory title and text, and summary.

myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
    "keywords": "Mamba, PyTorch, S4, S6, Mamba2, Transformer, Flash Attention, Optimization, Hardware-aware, Attention, ROCm, MI210, MI250, MI300, AI/ML, Generative AI"
Contributor

Please add a thumbnail image, using this format:
thumbnail: 'image name.jpg'
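
For reference, a minimal sketch of where the thumbnail entry would sit in the post's front matter, alongside the fields already quoted elsewhere in this thread; the filename below is only a placeholder, not the actual image:

```yaml
---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
thumbnail: 'zyphra-kernels-thumbnail.jpg'  # placeholder filename
---
```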

Contributor

Will add this today

Contributor

I added the path in 327f685 and we sent the thumbnail over email. Does anything else need to be done for the thumbnail?


*By Quentin Anthony and Beren Millidge from Zyphra*

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market, significantly outperforming its competitor the Nvidia H100 GPU. The key hardware specs where the MI300X surpasses the H100 are High Bandwidth Memory (HBM) capacity and bandwidth.
Contributor

Suggested change
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market, significantly outperforming its competitor the Nvidia H100 GPU. The key hardware specs where the MI300X surpasses the H100 are High Bandwidth Memory (HBM) capacity and bandwidth.
## Harnessing the MI300 Superior Hardware Specs
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market, significantly outperforming its competitor the Nvidia H100 GPU. The key hardware specs where the MI300X surpasses the H100 are High Bandwidth Memory (HBM) capacity and bandwidth.

Contributor

I created a subsection with the more neutral wording of ## Introduction

language: English
myst:
  html_meta:
    "description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
Contributor

See my suggestion for the blog snippet text:

Suggested change
"description lang=en": "Mamba2 and Flash Attention Backward Kernels on AMD MI300x with ROCm"
"description lang=en": "This blog presents Zyphra’s vision of training transformers and hybrid models at a lower cost, and its realization by utilizing and optimizing the superior hardware specs of the MI300x."

Contributor

This was handled and is now outdated

@Quentin-Anthony
Contributor

@saadrahim -- Does anything else remain? Do we need to update index.md?

@saadrahim
Member

> @saadrahim -- Does anything else remain? Do we need to update index.md?

index.md is mostly autogenerated. I am checking what else is left.

@saadrahim
Member

We can take care of:

blogs/authors/data/Quentin-Anthony.jpg
blogs/authors/quentin-anthony.md
blogs/ecosystems-and-partners/zyphra/README.md
blogs/ecosystems-and-partners/zyphra/images/Flash_attention_AMD_kernel_blog.png
blogs/ecosystems-and-partners/zyphra/images/Mamba2_kernel_backward_AMD_blog.png
Checking metadata in blogs/authors/quentin-anthony.md
blogs/authors/quentin-anthony.md is missing a metadata field: blog_title author category language thumbnail date tags with error 1, please take a look at guide-to-blogs-metadata.md
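
For reference, a minimal sketch of what the missing front matter in blogs/authors/quentin-anthony.md might look like, using only the field names reported by the metadata check above; the values here (category, tags, thumbnail filename) are illustrative placeholders rather than the actual file contents, and guide-to-blogs-metadata.md remains the authoritative reference:

```yaml
---
blog_title: "Zyphra Mamba2 and Flash Attention backward kernel blog"
author: Quentin Anthony
category: Ecosystems and Partners   # placeholder, inferred from the blogs/ecosystems-and-partners/ path
language: English
thumbnail: 'zyphra-kernels-thumbnail.jpg'   # placeholder filename
date: 10 December 2024
tags: Mamba2, Flash Attention, ROCm, MI300X   # placeholder tags drawn from the blog's keywords
---
```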

@saadrahim
Member

Some other minor linting issues that @Danny213123 and I can deal with tomorrow. No further blockers from my perspective.

@Ehud-Sharlin Ehud-Sharlin (Contributor) left a comment

Final touch-ups

language: English
myst:
  html_meta:
    "description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators."
Contributor

This snippet is capped at 150 characters. I would like it to start with the Zyphra & AMD work, with the message of surpassing others taking a back seat. Please see my revision:

Suggested change
"description lang=en": "In this blog, we demonstrate the first backwards kernels to surpass H100s for both transformers (Flash Attention v2) and hybrid models (Mamba2), which enables training foundation models on AMD Instinct MI300X accelerators."
"description lang=en": "This blog presents the training of Zyphra's backwards kernels for transformers and hybrid models on AMD Instinct MI300X accelerators, suppressing the H100s performance"

Contributor

resolved by f069a05

---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Contributor

Please use the same (simpler to understand...) title we already use in the blog's body:

Suggested change
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
blog_title: "Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators"

Contributor

Discussed over email earlier today, but we prefer to keep the original title


# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Introduction
Contributor

Please remove the "Introduction" section title. We follow a magazine-like (non-academic) approach in our blogs where each post starts with a brief introductory text, not explicitly titled Introduction.  ```suggestion

Contributor

resolved by f069a05

Zyphra is designing MaiaOS, a multimodal agent system that combines next-gen neural network architectures (SSM hybrids), long-term memory, and reinforcement learning.

In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300X for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.

Contributor

Please add a new section title here

Suggested change
## Harnessing the MI300 Hardware Specs

Contributor

Resolved by f069a05

@Ehud-Sharlin Ehud-Sharlin (Contributor) left a comment

New title suggestion

---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Contributor

Reverting to an earlier suggested title, please use:

Suggested change
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
blog_title: "Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators"

Contributor

We prefer the previous title because:

  1. We want readers to know we've achieved state of the art ("frontier")
  2. We wrote kernels for both transformers and SSMs
  3. We understood and resolved the main blocker for ROCm training: backward kernels

Contributor

No worries, let's work with your preferred title!

---


# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators
Contributor

Reverting to an earlier suggested title, please use:

Suggested change
# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators
# Zyphra Speeding Up Training on AMD Instinct MI300X Accelerators

Contributor

We prefer the previous title for this one as well

@Danny213123
Collaborator

Hey @BerenMillidge, can you please take a look at BerenMillidge#1?

@Quentin-Anthony
Contributor

> Hey @BerenMillidge, can you please take a look at BerenMillidge#1?

Just merged this!

@Ehud-Sharlin Ehud-Sharlin (Contributor) left a comment

Going with your preferred title, @Quentin-Anthony! :-)
But the "Harnessing the MI300 Hardware Specs" section title needs to be moved down, thanks!


# Training Transformers and Hybrid models on AMD Instinct MI300X Accelerators

## Harnessing the MI300 Hardware Specs
Contributor

No section title here, please. The post just starts with the brief intro text, with no section title (this section title does go in, just a little further down; please see below). Thanks!

Suggested change
## Harnessing the MI300 Hardware Specs

---
blogpost: true
date: 10 December 2024
blog_title: "Zyphra Introduces Frontier Training Kernels for Transformers and SSMs on AMD Instinct MI300X Accelerators"
Contributor

No worries, let's work with your preferred title!


In this blog we motivate our vision of training transformers and hybrid models at a lower cost using AMD technology. We explain how Zyphra harnessed the hardware advantages of the MI300X for training both dense transformers and Zyphra's hybrid models. Specifically, the model blocks of interest are Mamba2 and Flash Attention v2. We conclude the blog by sharing benchmark results showing the speedups we achieved on the MI300X using ROCm, compared to the competition.

On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
Contributor

Section title goes here, please:

Suggested change
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.
## Harnessing the MI300 Hardware Specs
On paper, the AMD Instinct MI300X GPU accelerators contain some of the best hardware specifications on the market. The key hardware specs where the MI300X surpasses its main competitor, the NVIDIA H100 GPU, are High Bandwidth Memory (HBM) capacity and bandwidth.

@Quentin-Anthony
Contributor

@Danny213123 @Ehud-Sharlin @saadrahim -- What remains before this can be merged?

@saadrahim saadrahim merged commit 425e6c9 into ROCm:release Dec 10, 2024
5 of 6 checks passed
@Quentin-Anthony
Contributor

Yay, we're merged! Looks like my profile picture didn't upload properly?

@saadrahim
Member

@Danny213123 is fixing it. Shouldn't be much longer.
