[AMD] Triton Backend for ROCm #1203

micmelesse · 2024-09-04T12:32:16Z

Hi, this is a PR to add a Triton backend to Flash Attention on ROCm. We hope it will be the first in a series of PRs to that end. Triton has supported ROCm for a while now, and a Flash Attention Triton backend will allow us to support Flash Attention on both our CDNA (MI200 and MI300) and RDNA machines.

Below is the state of features in this PR.

These features are supported in Fwd and Bwd:

Fwd and Bwd with causal masking
Variable sequence lengths
Arbitrary Q and KV sequence lengths
Arbitrary head sizes
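For readers wondering how causal masking interacts with arbitrary Q and KV sequence lengths, here is a minimal pure-Python sketch. It assumes the Triton backend follows the convention documented for the main flash-attn interface, where the causal mask is aligned to the bottom-right corner of the attention matrix. The function name `causal_mask` is hypothetical and only for illustration; it is not part of this PR.

```python
def causal_mask(seqlen_q, seqlen_k):
    """Build a causal mask (1 = keep, 0 = masked out).

    Assumption: the mask is aligned to the bottom-right corner, so the
    last query row can see every key position.
    """
    offset = seqlen_k - seqlen_q
    return [
        [1 if j <= i + offset else 0 for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


if __name__ == "__main__":
    for row in causal_mask(2, 5):
        print(row)
```

When `seqlen_q == seqlen_k` this reduces to the usual lower-triangular mask.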

These features are supported in Fwd only for now. We will add them to Bwd soon.

Multi and grouped query attention
ALiBi
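As a rough illustration of what multi and grouped query attention mean for head indexing, the sketch below maps each query head to the KV head it reads. It assumes the grouping convention documented for the main flash-attn interface (consecutive query heads share one KV head); `kv_head_for` is a hypothetical helper, not a function from this PR.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Return the KV head index used by a given query head.

    Assumption: query heads are grouped consecutively, so with 6 query
    heads and 2 KV heads, query heads 0-2 use KV head 0 and 3-5 use KV head 1.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    # Grouped query attention: 6 query heads, 2 KV heads.
    print([kv_head_for(h, 6, 2) for h in range(6)])
    # Multi query attention: every query head shares the single KV head.
    print([kv_head_for(h, 4, 1) for h in range(4)])
```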

These features are in development

Paged Attention
Sliding Window
Rotary embeddings
Dropout
Performance Improvements

We have created a test file, tests/test_flash_attn_triton_amd.py, which is a subset of tests/test_flash_attn.py. It currently contains the tests listed below. They are the same as in the main test file, with the configs that are not yet supported disabled. All sequence lengths and head sizes are the same as in the original. They all pass on an MI200 machine.

test_flash_attn_qkvpacked
test_flash_attn_varlen_qkvpacked
test_flash_attn_output
test_flash_attn_varlen_output
test_flash_attn_causal
test_flash_attn_varlen_causal
test_flash_attn_kvcache

There is clearly more work to be done, but we hope this makes a good start. We have included instructions for running the Triton backend in the README; the main point is to set export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" with Triton installed.
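A minimal sketch of doing the same thing from Python, for example to launch the new test file with the flag set. The flag name and test path come from this PR; the helper names (`triton_amd_env`, `pytest_command`, `run_tests`) are hypothetical, and the sketch assumes the flag must be present in the environment before flash_attn is imported.

```python
import os
import subprocess
import sys

FLAG = "FLASH_ATTENTION_TRITON_AMD_ENABLE"


def triton_amd_env(base=None):
    """Return a copy of the environment with the Triton AMD flag set to TRUE."""
    env = dict(os.environ if base is None else base)
    env[FLAG] = "TRUE"
    return env


def pytest_command(test_file="tests/test_flash_attn_triton_amd.py"):
    """Build the command line that runs the Triton AMD test file with pytest."""
    return [sys.executable, "-m", "pytest", test_file]


def run_tests():
    """Run the test file in a child process that has the flag set.

    Requires a checkout of the repository plus Triton and pytest installed.
    """
    return subprocess.run(pytest_command(), env=triton_amd_env(), check=False)
```

Passing the environment to a child process keeps the flag out of the parent shell, which makes it easy to compare runs with and without the Triton backend.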

Please let us know what we can do on our end to help with this process.

Finally, this PR includes work from multiple people besides myself. Special thanks to @vgokhale, @scxiao and @jlgreathouse.


unclemusclez · 2024-09-15T21:07:35Z

The Gods are Gracious

Enable Fwd and Backward

Enable fwd and varlen_fwd on AMD (#63) * flash_attn_func works Compress This is a combination of 12 commits. add scripts save add our kernel import our kernel round trip use bshd layout figure out segfault fix show backward failure with prints save backward work run forward only test smallest config on everything add test fix remove pre commit install triton skip dropout pin d 32 factor d just run power of 2 remove timeout run serially clean up clean up 2 * Varlen works This is a combination of 6 commits. save some tests passing enable more enable everything move around alibi works * keep interface and kernel seperate * clean up

enable flash_attn_with_kvcache (#68) * Compress kvcache work This is a combination of 11 commits. kvcache work This is a combination of 4 commits. kvcache is not supported save save decode save clean up merge save cases save save save save key mask on triton side fix q size issue test combos save * fix causal. use cache_seqlens * clean and test what works * some configs work on new_kv but fails on 1,8 * cache overwrite correct * new_kv works more or less * test local * work on paged kv attention * prefill paged attention * fix has_batch_idx and skip local and rotatary emb * save * save * save * save * handle new_kv when paged kv cache * all except has_batch_idx works * major options are green * test all * add tests * save * clean up * minor clean up * simplest config * save debug true * save * refactor slightly * save work * need key masking * force hip * use is_hip * save * fix cache_seq_len issue * work on new_kv * pass new_kv data * save * benchmark fwd only * disable debug * pandas pdf * save * set methods * record number of heads * use configs * flexiable dim, n-heads, headofdim * better benchmarking * basic inplace update working * works upto 64 * new_kv supported! * test case for has_batch_idx * has_batch_idx works! * save * save * save * save ref * fix mqa and gqa by duplicating * GQA and MQA working by kernel modifications * fix new_kv with gqa * cache index * deal with nans on fwd_splitk * save * causal working on basic case * causal works! * alibi works! * clean up * clean prefill changes * remove bwd stuff * limit decode test to test_op_fwd * add ref * use bfloat

Fixes after rebase Fixes after rebase rebase fixes deal with kvcache failure new run for branch cancel-in-progress fix varlen_fwd bug

enable packed layouts and all configs (#72)

Clean up for Upstream (#81) * Clean Clean This is a combination of 4 commits. clean 1 clean 2 clean more match main typo fix * use is_hip() * clean up more * skip odd d only * fix bug * skip randomly * use Flag * update readme * remove quantization * remove bwd * minor * print * remove verbose print * qunatize zero's out the d stride

Enable Vanilla Bwd and Refactor (#86) * Vanilla BWD Vanilla BWD This is a combination of 79 commits. save test_flash_attn_output use impl functions pass layout add ref move arround impls fix stride issue save oai kernel add baseline impl save bwd kernel working remove old impl remove block_ptrs from bwd pass padded dmodel and apply masking. the old test cases work but cases with small d don't work save save more prints rename to M to L save add notes add old_bwd back fa failure fails in kernels too isolate new bwd and keep old bwd in place clean up softmax_lse doesnot match refernce LOG flag softmax_lse with LN2 move qk_scale to loop pass ln2 to fwd just print kernel input test softmax output from forward test exp_scores_triton save all the ref create ref USE_EXP2 path return scores mask scores when returning them. Basic impl test passes scores and output match show max_diff return score needs to be adjusted as we find new maxes all good outputs. old style RCP2 example prep bwd_impl test save try openai save fix softmax_lse bug test_op_bwd_impl starting to work! new kernel. exp2 works but exp is faliing fix bwd exp2 add m and n masks. small cases still don't work match old and new kernel prints compare old and new print inputs save old kernel match on dv dq works compare to pytorch including softmax in forward fix bwd impl bug small sizes in bwd impl work old bwd test pass. Moving on to kernel tests dq, dk and dv are filled in place if given. Need to match cast to match fa fix non bug fix dv mismatch. use_exp2 was set to true in fwd fix case up 128 refactor and clean up a bit more issue is that dq and dk are not zeros dq must be zeroed out ignore segfaults fa ref and my ref match! all tests run use tolerance 1e-3 we need to figure out preprocessing save clean up save test delta diff move old impl out new preprocess function preprocessing_use_o flag working _bwd_preprocess_use_p basic cases pass all green fwd exp2 usage is done right before exp * refactor * refactor 2 * refactor 3 * fix bug * try ci * add flag * rename to utils * skip test_op_fwd_decode_int4_kv * reduce head size * try again * go back to old head sizes * Use Strides Use Strides This is a combination of 11 commits. use strides in bwd add layout test in forward fix shape layout function smaller tests save fix varlen error no headsize passed to bwd deal with varlen layout save save save save * use gen scripts * varlen fwd passing * core fwd ref impl * fix minor bugs * wrap varlen- launcher attention_forward_pytorch_ref_impl * varlen backward ref added * add offsets for varlen * fix delta bug * varlen bwd working * save * runs on Mi200 * just test basics * save * fix bug * fix varlen in64 bug * add ref * test_impl working with causal * fix qkvpacked issue * qkvpacked run tests * remove test_backward * save * just test output * dump into tensors * softmaxlse layout for varlen * small cases working * bwd thd green. although maybe some oom * forward out and lse are good. Something wrong with backward ref * make varlen ref work * save work, ref is working mostly * 91 failed, 6542 passed, 6336 skipped, 1 warning * ref is all green * debug flag in utils * found bad softmax_lse in varlen fwd * fix bug in softmax lse. strides in varlen werenot right * add causal tests and 32*32 bwd doesnot have segfault * save * fix oom by reducing block size for small heads * bwd ref with causal working * test impl * causal test passes * causal working * fix tests * nicer bench * fix qvpacked error * fix varlen qvpacked bug * fix minor bug * bench prefill and prefill_old using the same script * autotune configs for fwd * autotune flag * clean up decode impl * clean up * clean up more * bench everything by default and return time * clean up readmes

REBASE: fix interface changes in rebase rename test to test_flash_attn_triton_amd REBASE: fix unpad diffs minor clean up in setup FLASH_ATTENTION_TRITON_AMD flags bench fwd and bwd fix sequence_parallel

unclemusclez · 2024-10-30T02:52:32Z

will this work with CDNA 1?

micmelesse · 2024-10-30T15:46:07Z

> will this work with CDNA 1?

The kernels work on any architecture supported by the Triton compiler. Right now the Triton compiler does not officially support the MI100 series, but most cases should work. We are focused on MI300 and MI200 on the CDNA side.
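To check which category a given card falls into, something like the sketch below may help. The gfx names in the table are my own assumption about how the CDNA generations are usually reported (gfx908 for MI100, gfx90a for MI200, gfx942 for MI300) and are not stated in this thread; `classify_arch` and `current_arch` are hypothetical helpers, and `gcnArchName` is only expected to be present on ROCm builds of PyTorch.

```python
# Assumed mapping from gfx architecture name to CDNA product family.
CDNA_FAMILIES = {
    "gfx908": "MI100",
    "gfx90a": "MI200",
    "gfx942": "MI300",
}
PRIMARY_TARGETS = {"MI200", "MI300"}


def classify_arch(gcn_arch_name):
    """Classify an architecture string such as 'gfx90a:sramecc+:xnack-'."""
    base = gcn_arch_name.split(":")[0]
    family = CDNA_FAMILIES.get(base)
    if family is None:
        return "not covered by this sketch"
    if family in PRIMARY_TARGETS:
        return "primary target"
    return "not officially supported by Triton, most cases should work"


def current_arch(device_index=0):
    """Read the architecture name from PyTorch (ROCm builds only)."""
    import torch  # imported lazily so classify_arch works without PyTorch

    props = torch.cuda.get_device_properties(device_index)
    return getattr(props, "gcnArchName", "")
```

RDNA cards are deliberately left out of the table, since the thread does not list specific RDNA targets.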

micmelesse · 2024-10-30T16:02:00Z

Hi @tridao

Hope you are doing well. I wanted to check if you have any feedback or suggestions regarding this PR. I've refreshed it to include support for the backward pass and have refactored it to be more modular and easier to review.

We would be happy to add more features or work on performance improvements if needed. If you have any fundamental reservations about adding a Triton backend, please let us know, and we will do everything we can to address them.

Thank you for your time.

dtrifiro · 2024-11-22T09:52:39Z

Is there anything holding this back?

micmelesse · 2024-11-22T14:28:19Z

> Is there anything holding this back?

We are just waiting for feedback

micmelesse marked this pull request as ready for review September 4, 2024 14:31

dtrifiro reviewed Sep 12, 2024

Review comment on setup.py (outdated, resolved)

dtrifiro mentioned this pull request Sep 12, 2024

amd build improvements opendatahub-io/vllm#156

Merged

mirh mentioned this pull request Sep 14, 2024

Merge to upstream flash-attention repo ROCm/flash-attention#35

Open

Beinsezii mentioned this pull request Sep 15, 2024

Improve Backward Performance and Navi31 Support ROCm/aotriton#39

Merged

LunNova mentioned this pull request Oct 9, 2024

AMD ROCm Card can not use flash attention ollama/ollama#6953

Open

micmelesse force-pushed the micmelesse/upstream_pr branch from 675844e to c119315 on October 14, 2024 16:07

micmelesse force-pushed the micmelesse/upstream_pr branch from c119315 to 730d260 on October 29, 2024 15:50

clean up

da9c36a

micmelesse changed the title from "[AMD] Triton Backend for ROCm #1" to "[AMD] Triton Backend for ROCm" on Oct 29, 2024

micmelesse added 2 commits October 30, 2024 21:27

Enable sequence_parallel in bwd (#89)

b76eb08

* sequence_parallel working on bwd_impl test * fix qkv error * save * save * save * bwd 3 times faster * clean up * fix varlen bug * use copy back dict * fix qkvpacked bug * reduce bench sizes * print copy back

clean more

849023c

micmelesse added 2 commits October 31, 2024 15:18

Autotune off by default

f6099ac

update Triton commit readme (#92)

21cf529

jamesxu2 mentioned this pull request Nov 5, 2024

Fused Attention Kernel with gfx1030? ROCm/composable_kernel#886

Closed
