Suboptimal GroupNorm Implementation on GPUs #10

avik-pal · 2022-04-24T13:58:03Z

As observed in SciML/DeepEquilibriumNetworks.jl#45 (comment), we get a 2x speedup by moving from GroupNorm to BatchNorm, which uses cuDNN kernels.

ToucheSir · 2022-04-24T19:12:57Z

cuDNN should support this, so it's mostly a matter of hooking up NNlib + NNlibCUDA (unless you're fine with calling the CUDA.jl routines directly here).

avik-pal · 2022-05-27T06:07:44Z

Actually, I don't think cuDNN supports this (at least I couldn't find it in the documentation). PyTorch uses its own kernel: https://github.com/pytorch/pytorch/blob/35d4a805ebc3b6eca1bafb2d332dffa8d0c1fc54/aten/src/ATen/native/cuda/group_norm_kernel.cu

ToucheSir · 2022-05-27T17:02:54Z

I must've hallucinated a mention of groups in the docs for cudnnNormalizationForward* then. The PyTorch kernel is quite a beast, so unless someone's up to the task of translating it, I think we're stuck with the slower vectorized variant for now. Ideally we would figure out why https://triton-lang.org/master/getting-started/tutorials/05-layer-norm.html is so fast, tweak it to run GroupNorm instead, and port it to KernelAbstractions or the like. @vchuravy is KA sufficiently high-level to handle such a translation?
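
For anyone picking this up, here is a minimal reference sketch of what GroupNorm computes, in plain Python. It is not a translation of the PyTorch or Triton kernels. The function name `group_norm`, the nested-list layout (batch, channels, spatial) and the default `eps` are assumptions made for illustration, and the learnable scale and bias are left out.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize x[n][c][s] over each (sample, group) slice.

    x is a nested list shaped (batch, channels, spatial). Channels are
    split into `num_groups` contiguous groups. Hypothetical helper for
    illustration only; no affine scale/bias is applied.
    """
    channels = len(x[0])
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    per_group = channels // num_groups

    out = []
    for sample in x:
        normed = [None] * channels
        for g in range(num_groups):
            chans = range(g * per_group, (g + 1) * per_group)
            # Gather every value belonging to this (sample, group) slice.
            values = [v for c in chans for v in sample[c]]
            mean = sum(values) / len(values)
            # Biased (population) variance over the slice.
            var = sum((v - mean) ** 2 for v in values) / len(values)
            inv_std = 1.0 / math.sqrt(var + eps)
            for c in chans:
                normed[c] = [(v - mean) * inv_std for v in sample[c]]
        out.append(normed)
    return out
```

Each (sample, group) slice is reduced independently of the others, so in principle a custom kernel could assign one slice per block and fuse the mean, variance and normalization steps. Whether that closes the gap to BatchNorm would need benchmarking.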

avik-pal mentioned this issue May 10, 2022

Better CUDNN Dispatches #16

Merged

avik-pal added the performance label Jun 26, 2022

This was referenced Sep 17, 2022

Introducing LuxLib.jl: Effectively pullout some of the custom layer implementations from Lux.jl #154

Merged

Update to use LuxLib #156

Merged

avik-pal closed this as completed in #156 Sep 25, 2022

avik-pal added a commit that referenced this issue Nov 3, 2024

Merge pull request #10 from LuxDL/dependabot/github_actions/peter-eva…

91b38bb

…ns/create-pull-request-5 Bump peter-evans/create-pull-request from 4 to 5

avik-pal added a commit that referenced this issue Nov 3, 2024

Merge pull request #10 from LuxDL/ap/downstream

ea649d5

GPU Downstream testing

avik-pal added a commit that referenced this issue Nov 3, 2024

Merge pull request #10 from LuxDL/ap/generalize

6a2cc47

Generalize the generators to complex numbers
