-
I've just done a bit of straightforward porting of FlexAttention over here: https://github.com/zinccat/flaxattention
-
PyTorch 2.5 was recently released, with one of the headline features being a prototype of FlexAttention ("The Flexibility of PyTorch with the Performance of FlashAttention"), which seems to be an interesting combination of (i) deep implementation wizardry and (ii) UX improvements, the latter being the main selling point for end-user adoption. What is the state of flexible, performance-optimized attention implementations in the JAX ML ecosystem?
Perhaps it's possible to achieve similar performance in a cross-platform, even more flexible way with Pallas kernels (the "implementation wizardry"), but as an end-user interested in attention variants (there are dozens of us! dozens!) I admit it would be nice to have a unified API that already exists, similar to what FlexAttention is promising for the PyTorch community.
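For concreteness, here is a minimal sketch in plain `jax.numpy` of the score_mod-style interface FlexAttention exposes: the user writes a small function that edits each pre-softmax attention score, and the attention routine itself stays generic. The names here (`flex_attention`, `score_mod`, `causal_with_relative_bias`) are illustrative, not an existing JAX or flaxattention API, and this naive version materializes the full score matrix, so it has none of the fused-kernel performance a Pallas/Triton implementation would provide.

```python
import jax
import jax.numpy as jnp
from functools import partial

def flex_attention(q, k, v, score_mod):
    # q, k, v: [heads, seq_len, head_dim]; naive reference semantics only.
    h, s, d = q.shape
    scores = jnp.einsum("hqd,hkd->hqk", q, k) / jnp.sqrt(d)
    # Expose (head, q_idx, kv_idx) so the user function can modify each score.
    head_idx = jnp.arange(h)[:, None, None]
    q_idx = jnp.arange(s)[None, :, None]
    kv_idx = jnp.arange(s)[None, None, :]
    scores = score_mod(scores, head_idx, q_idx, kv_idx)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("hqk,hkd->hqd", weights, v)

def causal_with_relative_bias(score, head, q_idx, kv_idx):
    # Example variant: ALiBi-style relative bias plus a causal mask.
    score = score - 0.1 * (head + 1) * jnp.abs(q_idx - kv_idx)
    return jnp.where(q_idx >= kv_idx, score, -jnp.inf)

q = k = v = jnp.ones((4, 128, 64))
out = jax.jit(partial(flex_attention, score_mod=causal_with_relative_bias))(q, k, v)
```

The appeal of the FlexAttention approach is that the user-facing code stays this simple while the compiler lowers the `score_mod` into a fused FlashAttention-style kernel instead of materializing the full score matrix as above.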
Related: jax.nn.dot_product_attention (discussion in #21371 recognizes that attention is a sufficiently fundamental operation in modern ML practice to be addressed in core JAX, not delegated to downstream ML frameworks), #18121, #18314
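For reference, `jax.nn.dot_product_attention` can already be called roughly like this (a sketch; keyword names and available backends may differ across JAX versions, so check the docs for your release):

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
# Shapes follow (batch, seq_len, num_heads, head_dim).
q = jax.random.normal(key, (2, 128, 4, 64), dtype=jnp.bfloat16)
k = jax.random.normal(key, (2, 128, 4, 64), dtype=jnp.bfloat16)
v = jax.random.normal(key, (2, 128, 4, 64), dtype=jnp.bfloat16)

# implementation=None lets JAX pick a backend; "cudnn" requests the fused
# (FlashAttention-style) kernel on supported GPUs, "xla" the reference path.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation=None)
print(out.shape)  # (2, 128, 4, 64)
```

This covers fixed options like bias, mask, and causal attention, but not arbitrary per-score modifications, which is exactly the gap a FlexAttention-style API would fill.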