[AMD] Triton Backend for ROCm #1203

Open. micmelesse wants to merge 6 commits into main.

@micmelesse micmelesse commented Sep 4, 2024

Hi, this PR adds a Triton backend to Flash Attention on ROCm. We hope it will be the first in a series of PRs to that end. Triton has supported ROCm for a while now, and a Flash Attention Triton backend will let us support Flash Attention on both our CDNA (MI200 and MI300) and RDNA machines.

Below is the state of features in this PR.

These features are supported in both Fwd and Bwd:

  1. Causal masking
  2. Variable sequence lengths
  3. Arbitrary Q and KV sequence lengths
  4. Arbitrary head sizes
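
To make items 1 and 3 concrete, here is a minimal pure-Python reference for single-head attention with different Q and KV lengths. It is only a sketch, not the Triton kernel: the name `attention_ref` is hypothetical, and it assumes the bottom-right causal alignment that upstream flash-attn documents for `seqlen_q != seqlen_k` (query `i` sees keys `j <= i + seqlen_k - seqlen_q`). Returning zeros for fully masked rows is also an assumption.

```python
import math


def attention_ref(q, k, v, causal=False):
    """Single-head attention on nested lists.

    q: [seqlen_q][head_dim], k: [seqlen_k][head_dim], v: [seqlen_k][v_dim].
    Hypothetical reference helper for illustration only.
    """
    seqlen_q, seqlen_k = len(q), len(k)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Assumption: causal mask is aligned to the bottom-right corner.
    offset = seqlen_k - seqlen_q
    out = []
    for i, q_row in enumerate(q):
        last_key = i + offset if causal else seqlen_k - 1
        scores = [
            scale * sum(a * b for a, b in zip(q_row, k[j]))
            for j in range(last_key + 1)
        ]
        if not scores:
            # Fully masked row (only possible when seqlen_q > seqlen_k).
            out.append([0.0] * len(v[0]))
            continue
        # Numerically stable softmax over the visible keys.
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out
```

With `seqlen_q = 2` and `seqlen_k = 3`, the first query row sees the first two keys and the second row sees all three.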

These features are supported in Fwd only for now. We will add them to Bwd soon.

  1. Multi and grouped query attention
  2. ALiBi
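
For readers less familiar with these two features, the sketch below shows the bookkeeping involved. It is illustrative only and does not reflect the kernel code in this PR: `kv_head_for` and `alibi_bias` are hypothetical names. The head mapping and the bias formula `-slope * |i + seqlen_k - seqlen_q - j|` follow the conventions described in the upstream flash-attn docstrings, which we assume the Triton backend matches.

```python
def kv_head_for(q_head, nheads_q, nheads_k):
    """Map a query head to the KV head it reads in MQA/GQA.

    Hypothetical helper. nheads_k == 1 is multi-query attention;
    1 < nheads_k < nheads_q is grouped-query attention.
    """
    if nheads_q % nheads_k != 0:
        raise ValueError("nheads_q must be divisible by nheads_k")
    group_size = nheads_q // nheads_k
    return q_head // group_size


def alibi_bias(slope, seqlen_q, seqlen_k):
    """ALiBi bias added to the score of query i and key j for one head.

    Hypothetical helper; assumes the bottom-right aligned distance
    |i + seqlen_k - seqlen_q - j|.
    """
    offset = seqlen_k - seqlen_q
    return [
        [-slope * abs(i + offset - j) for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]
```

For example, with 6 query heads and 2 KV heads, query heads 0 to 2 read KV head 0 and query heads 3 to 5 read KV head 1.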

These features are in development:

  1. Paged Attention
  2. Sliding Window
  3. Rotary embeddings
  4. Dropout
  5. Performance Improvements

We have created a test file, tests/test_flash_attn_triton_amd.py, which is a subset of tests/test_flash_attn.py. The tests are the same as in the main test file, with the configs that are not yet supported disabled. All sequence lengths and head sizes are the same as in the original. The file currently contains the following tests, and they all pass on an MI200 machine.

  1. test_flash_attn_qkvpacked
  2. test_flash_attn_varlen_qkvpacked
  3. test_flash_attn_output
  4. test_flash_attn_varlen_output
  5. test_flash_attn_causal
  6. test_flash_attn_varlen_causal
  7. test_flash_attn_kvcache
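
The varlen tests above exercise packed batches, where sequences of different lengths are concatenated and located by cumulative offsets (the `cu_seqlens` argument in the upstream varlen API). The sketch below shows that bookkeeping in plain Python. The helper names are hypothetical, and a real call would pass the offsets as an int32 tensor.

```python
from itertools import accumulate


def build_cu_seqlens(seqlens):
    """Cumulative sequence lengths with a leading 0 (length batch_size + 1)."""
    return [0] + list(accumulate(seqlens))


def sequence_slices(cu_seqlens):
    """(start, end) token ranges of each sequence in the packed batch."""
    return [
        (cu_seqlens[b], cu_seqlens[b + 1])
        for b in range(len(cu_seqlens) - 1)
    ]


seqlens = [3, 1, 4]                      # three sequences, 8 tokens in total
cu_seqlens = build_cu_seqlens(seqlens)   # offsets into the packed token axis
max_seqlen = max(seqlens)                # also required by the varlen entry points
```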

There is clearly more work to be done, but we hope this makes a good start. We have included instructions for running the Triton backend in the README; the main point is to run export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" with Triton installed.
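
If you launch from Python rather than a shell, the flag can be set in the process environment instead. The snippet below is a sketch under assumptions: `triton_amd_enabled` is a hypothetical helper, not part of this PR, and both the exact way the backend parses the value and the need to set it before importing `flash_attn` are assumptions.

```python
import os


def triton_amd_enabled(environ=None):
    """Return True if the Triton AMD backend flag is set to "TRUE".

    Hypothetical helper for illustration. The case-insensitive
    comparison is an assumption about how the value is parsed.
    """
    env = os.environ if environ is None else environ
    return env.get("FLASH_ATTENTION_TRITON_AMD_ENABLE", "FALSE").upper() == "TRUE"


# Assumption: the flag should be set before flash_attn is imported.
os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"
print("Triton AMD backend requested:", triton_amd_enabled())
```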

Please let us know what we can do on our end to help with this process.

Finally, this PR includes work from multiple people besides myself. Special thanks to @vgokhale, @scxiao and @jlgreathouse.

@micmelesse micmelesse marked this pull request as ready for review September 4, 2024 14:31
@unclemusclez

The Gods are Gracious

Enable Fwd and Backward

Enable fwd and varlen_fwd on AMD  (#63)

* flash_attn_func works

Compress

This is a combination of 12 commits.

add scripts

save

add our kernel

import our kernel

round trip

use bshd layout

figure out segfault

fix

show backward failure with prints

save backward work

run forward only

test smallest config on everything

add test

fix

remove pre commit

install triton

skip dropout

pin d

32 factor d

just run power of 2

remove timeout

run serially

clean up

clean up 2

* Varlen works

This is a combination of 6 commits.

save

some tests passing

enable more

enable everything

move around

alibi works

* keep interface and kernel separate

* clean up

enable flash_attn_with_kvcache (#68)

* Compress kvcache work

This is a combination of 11 commits.

kvcache work

This is a combination of 4 commits.

kvcache is not supported

save

save decode

save

clean up merge

save cases

save

save

save

save

key mask on triton side

fix q size issue

test combos

save

* fix causal. use cache_seqlens

* clean and test what works

* some configs work on new_kv but fails on 1,8

* cache overwrite correct

* new_kv works more or less

* test local

* work on paged kv attention

* prefill paged attention

* fix has_batch_idx and skip local and rotary emb

* save

* save

* save

* save

* handle new_kv when paged kv cache

* all except has_batch_idx works

* major options are green

* test all

* add tests

* save

* clean up

* minor clean up

* simplest config

* save debug true

* save

* refactor slightly

* save work

* need key masking

* force hip

* use is_hip

* save

* fix cache_seq_len issue

* work on new_kv

* pass new_kv data

* save

* benchmark fwd only

* disable debug

* pandas pdf

* save

* set methods

* record number of heads

* use configs

* flexible dim, n-heads, headofdim

* better benchmarking

* basic inplace update working

* works up to 64

* new_kv supported!

* test case for has_batch_idx

* has_batch_idx works!

* save

* save

* save

* save ref

* fix mqa and gqa by duplicating

* GQA and MQA working by kernel modifications

* fix new_kv with gqa

* cache index

* deal with nans on fwd_splitk

* save

* causal working on basic case

* causal works!

* alibi works!

* clean up

* clean prefill changes

* remove bwd stuff

* limit decode test to test_op_fwd

* add ref

* use bfloat

Fixes after rebase

rebase fixes

deal with kvcache failure

new run for branch

cancel-in-progress

fix varlen_fwd bug

enable packed layouts and all configs (#72)

Clean up for Upstream (#81)

* Clean

Clean

This is a combination of 4 commits.

clean 1

clean 2

clean more

match main

typo fix

* use is_hip()

* clean up more

* skip odd d only

* fix bug

* skip randomly

* use Flag

* update readme

* remove quantization

* remove bwd

* minor

* print

* remove verbose print

* quantize zeros out the d stride

Enable Vanilla Bwd and Refactor (#86)

* Vanilla BWD

Vanilla BWD

This is a combination of 79 commits.

save test_flash_attn_output

use impl functions

pass layout

add ref

move around impls

fix stride issue

save oai kernel

add baseline impl

save bwd kernel working

remove old impl

remove block_ptrs from bwd

pass padded dmodel and apply masking. the old test cases work but cases with small d don't work

save

save

more prints

rename to M to L

save

add notes

add old_bwd back

fa failure fails in kernels too

isolate new bwd and keep old bwd in place

clean up

softmax_lse does not match reference

LOG flag

softmax_lse with LN2

move qk_scale to loop

pass ln2 to fwd

just print kernel input

test softmax output from forward

test exp_scores_triton

save all the ref

create ref USE_EXP2 path

return scores

mask scores when returning them. Basic impl test passes

scores and output match

show max_diff

return score needs to be adjusted as we find new maxes

all good outputs. old style RCP2 example

prep bwd_impl test

save

try openai

save

fix softmax_lse bug

test_op_bwd_impl starting to work!

new kernel. exp2 works but exp is failing

fix bwd exp2

add m and n masks. small cases still don't work

match old and new kernel prints

compare old and new

print inputs

save

old kernel match on dv

dq works

compare to pytorch including softmax in forward

fix bwd impl bug

small sizes in bwd impl work

old bwd test pass. Moving on to kernel tests

dq, dk and dv are filled in place if given. Need to match cast to match fa

fix non bug

fix dv mismatch. use_exp2 was set to true in fwd

fix case up 128

refactor and clean up a bit more

issue is that dq and dk are not zeros

dq must be zeroed out

ignore segfaults

fa ref and my ref match!

all tests run

use tolerance 1e-3

we need to figure out preprocessing

save

clean up

save

test delta diff

move old impl out

new preprocess function

preprocessing_use_o flag

working _bwd_preprocess_use_p

basic cases pass

all green

fwd exp2 usage is done right before exp

* refactor

* refactor 2

* refactor 3

* fix bug

* try ci

* add flag

* rename to utils

* skip test_op_fwd_decode_int4_kv

* reduce head size

* try again

* go back to old head sizes

* Use Strides

Use Strides

This is a combination of 11 commits.

use strides in bwd

add layout test in forward

fix shape layout function

smaller tests

save

fix varlen error

no headsize passed to bwd

deal with varlen layout

save

save

save

save

* use gen scripts

* varlen fwd passing

* core fwd ref impl

* fix minor bugs

* wrap varlen- launcher attention_forward_pytorch_ref_impl

* varlen backward ref added

* add offsets for varlen

* fix delta bug

* varlen bwd working

* save

* runs on MI200

* just test basics

* save

* fix bug

* fix varlen in64 bug

* add ref

* test_impl working with causal

* fix qkvpacked issue

* qkvpacked run tests

* remove test_backward

* save

* just test output

* dump into tensors

* softmaxlse layout for varlen

* small cases working

* bwd thd green. although maybe some oom

* forward out and lse are good. Something wrong with backward ref

* make varlen ref work

* save work, ref is working mostly

* 91 failed, 6542 passed, 6336 skipped, 1 warning

* ref is all green

* debug flag in utils

* found bad softmax_lse in varlen fwd

* fix bug in softmax lse. strides in varlen were not right

* add causal tests and 32*32 bwd does not have segfault

* save

* fix oom by reducing block size for small heads

* bwd ref with causal working

* test impl

* causal test passes

* causal working

* fix tests

* nicer bench

* fix qvpacked error

* fix varlen qvpacked bug

* fix minor bug

* bench prefill and prefill_old using the same script

* autotune configs for fwd

* autotune flag

* clean up decode impl

* clean up

* clean up more

* bench everything by default and return time

* clean up readmes

REBASE: fix interface changes in rebase

rename test to test_flash_attn_triton_amd

REBASE: fix unpad diffs

minor clean up in setup

FLASH_ATTENTION_TRITON_AMD flags

bench fwd and bwd

fix sequence_parallel
@micmelesse micmelesse changed the title from "[AMD] Triton Backend for ROCm #1" to "[AMD] Triton Backend for ROCm" on Oct 29, 2024
@unclemusclez

will this work with CDNA 1?

@micmelesse
Author

> will this work with CDNA 1?

The kernels work on any architecture supported by the Triton compiler. Right now the Triton compiler does not officially support the MI100 series, but most cases should work. We are focused on MI300 and MI200 on the CDNA side.

* sequence_parallel working on bwd_impl test

* fix qkv error

* save

* save

* save

* bwd 3 times faster

* clean up

* fix varlen bug

* use copy back dict

* fix qkvpacked bug

* reduce bench sizes

* print copy back
@micmelesse
Author

Hi @tridao

Hope you are doing well. I wanted to check if you have any feedback or suggestions regarding this PR. I've refreshed it to include support for the backward pass and have refactored it to be more modular and easier to review.

We would be happy to add more features or work on performance improvements if needed. If you have any fundamental reservations about adding a Triton backend, please let us know, and we will do everything we can to address them.

Thank you for your time.

@dtrifiro

Is there anything holding this back?

@micmelesse
Author

> Is there anything holding this back?

We are just waiting for feedback.

3 participants