FA3 kvcache + split kv + gqa parallelization #1236
Conversation
hopper/flash_attn_interface.py (outdated)

```diff
@@ -174,7 +175,8 @@ def forward(
             causal,
             descale_q=descale_q,
             descale_k=descale_k,
             descale_v=descale_v,
+            gqa_decoding=False,
```
I wonder, does it make sense to give the user an option to enable the GQA optimization for general use cases outside of decoding?
For example, it's generally useful for small-seq_len prefill. In that case we don't really need split-KV, but we do want each threadblock to handle multiple Q heads that share the same KV head.
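To make the idea concrete, here is a NumPy reference sketch of the grouping being discussed (purely illustrative, not the FA3 kernel): the Q heads that share one KV head are stacked into a single "tall" query matrix, so the K/V tiles for that head are loaded once per group rather than once per Q head.

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Reference GQA attention (illustrative sketch, not the FA3 kernel).

    q: (n_q_heads, seqlen_q, d); k, v: (n_kv_heads, seqlen_k, d).
    Each KV head serves a contiguous group of n_q_heads // n_kv_heads
    query heads -- analogous to one threadblock handling all Q heads
    that share a KV head, amortizing a single K/V load across them.
    """
    n_q_heads, seqlen_q, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h_kv in range(n_kv_heads):
        # Stack the group's Q heads into one (group * seqlen_q, d) matrix.
        q_grp = q[h_kv * group:(h_kv + 1) * group].reshape(-1, d)
        s = q_grp @ k[h_kv].T / np.sqrt(d)
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        o = p @ v[h_kv]
        out[h_kv * group:(h_kv + 1) * group] = o.reshape(group, seqlen_q, d)
    return out
```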
Furthermore, does it make sense to just enable the GQA optimization by default when the input is GQA? I feel it won't cause perf regressions even for long sequence lengths.
I feel it might slow things down a bit, but I haven't tried.
KV cache functionality not added yet.
…ded template params
…ly matters for fp8 support
…sion using smem boolean
This PR adds split-KV ("Flash decoding") and GQA parallelization improvements for FA3. Some essential parts of the KV cache API are added as well, including the `cache_seqlens` and `cache_batch_idx` arguments. Up to 15x improvement over FA2 measured on my H100 PCIe in exceptional cases, e.g.
Times given in microseconds. GB/s is measured in terms of loading the KV cache. Note that theoretical max bandwidth is 2 TB/s for H100 PCIe.
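The GB/s figures can be reproduced by dividing the bytes of KV cache loaded by the kernel time. A small helper for that arithmetic, using hypothetical shapes (the actual benchmark shapes are in the attached log, not repeated here):

```python
def kv_cache_bandwidth_gbs(batch, cache_seqlen, n_kv_heads, head_dim,
                           time_us, dtype_bytes=2):
    """Achieved bandwidth (GB/s) from loading the KV cache once.

    K and V each contribute batch * cache_seqlen * n_kv_heads * head_dim
    elements; time_us is the measured kernel time in microseconds.
    """
    kv_bytes = 2 * batch * cache_seqlen * n_kv_heads * head_dim * dtype_bytes
    return kv_bytes / (time_us * 1e-6) / 1e9

# Hypothetical example (NOT from the PR's benchmark):
# batch=4, cache_seqlen=8192, 8 KV heads, head_dim=128, fp16, 100 us
print(kv_cache_bandwidth_gbs(4, 8192, 8, 128, 100.0))  # ~1342 GB/s
```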
TODO on this PR before merge: add split kv heuristic, implement for FP8.
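For readers unfamiliar with the split-KV scheme: the KV cache is partitioned into chunks processed in parallel, each producing a chunk-normalized partial output plus its log-sum-exp, and the partials are then merged with an LSE rescale. A minimal NumPy sketch of that combine for a single query vector (illustrative only; the kernel performs this across CTAs):

```python
import numpy as np

def attn_decode_splitkv(q, k, v, n_splits):
    """Split-KV ("Flash decoding") attention for one query (NumPy sketch).

    q: (d,); k, v: (seqlen_k, d). Each of the n_splits KV chunks yields
    a softmax-normalized partial output and its log-sum-exp; the
    partials are merged with a numerically stable LSE-weighted sum.
    """
    d = q.shape[0]
    chunks = np.array_split(np.arange(k.shape[0]), n_splits)
    outs, lses = [], []
    for idx in chunks:
        s = k[idx] @ q / np.sqrt(d)      # partial attention scores
        m = s.max()
        p = np.exp(s - m)
        z = p.sum()
        outs.append(p @ v[idx] / z)      # chunk-normalized partial output
        lses.append(m + np.log(z))       # chunk log-sum-exp
    lses = np.array(lses)
    w = np.exp(lses - lses.max())        # per-chunk combine weights
    return np.sum([wi * oi for wi, oi in zip(w, outs)], axis=0) / w.sum()
```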
fa3-decoding-times-091724.log