Another case, like recurrence, where the cuDNN API doesn't match the semantics we use to define the layer. One thing the Metalhead PR made me consider was adding the ability to thread Parallel when the branches are expensive and roughly equal in computational cost.
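For a rough idea of what that could look like (the name `tparallel` is made up here, this is not how `Flux.Parallel` works today, and AD is ignored entirely):

```julia
using Flux

# Hypothetical sketch: run each branch of a Parallel-style layer on its own
# task so expensive, similarly sized branches execute concurrently.
function tparallel(connection, branches, x)
    tasks = [Threads.@spawn branch(x) for branch in branches]
    connection(map(fetch, tasks)...)
end

# Usage: two comparably expensive branches combined with +
m1 = Chain(Dense(128 => 256, relu), Dense(256 => 128))
m2 = Chain(Dense(128 => 256, relu), Dense(256 => 128))
x  = rand(Float32, 128, 32)
y  = tparallel(+, (m1, m2), x)
```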
That's an interesting idea. I think you could extend it to using separate CUDA streams for GPU tasks too. Making AD cooperate would be its own challenge, of course 😅
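Since CUDA.jl gives each Julia task its own stream, the stream side might fall out of plain task-based concurrency; a sketch (the `gpu_branches` helper is just illustrative):

```julia
using CUDA

# Each @async block runs in a new task, and CUDA.jl assigns a distinct
# stream per task, so independent branches can overlap on the GPU.
function gpu_branches(f, g, x::CuArray)
    t1 = @async f(x)   # new task ⇒ its own stream
    t2 = @async g(x)   # another task ⇒ another stream
    fetch(t1), fetch(t2)
end
```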
Now that we're seeing models with attention layers in more than one downstream package (e.g. FluxML/Metalhead.jl#105 and Transformers.jl), it may be time to consider pulling some building blocks into NNlib. CUDA.jl already wraps cuDNN's MHA too; see https://github.com/JuliaGPU/CUDA.jl/blob/27c87a6f261aa7964d797e8fe4bf33b46c1a185e/test/cudnn/multiheadattn.jl#L55.
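As a sketch of the kind of building block that could move down into NNlib (name, signature, and array layout here are made up for illustration, not an existing interface):

```julia
using NNlib  # softmax, batched_mul, batched_transpose

# Scaled dot-product attention over (features, length, batch) arrays.
function scaled_dot_attention(q, k, v)
    dk = size(k, 1)
    scores = batched_mul(batched_transpose(k), q) ./ sqrt(Float32(dk))  # (kv_len, q_len, batch)
    α = softmax(scores; dims = 1)                                       # normalise over keys
    batched_mul(v, α)                                                   # (features, q_len, batch)
end
```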