
Optimize LayerNormalization for better cache efficiency + SIMD #465

Merged — robertknight merged 3 commits into main from simd-layer-norm on Dec 17, 2024

Conversation

robertknight (Owner)

Optimize the LayerNormalization implementation by:

  1. Re-organizing the process to be more cache-efficient. Instead of applying each step to the whole input before moving on to the next, apply the whole normalization process to each normalized slice individually. This means each slice is loaded into cache only once (assuming each slice fits in L1).
  2. Fusing the steps that normalize the variance and apply the bias and scale into one vectorized pass (see the sketch below).

Tested on docvqa, this speeds up LayerNormalization in the encoder by 2.5-3x.
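
For illustration, here is a minimal sketch of the per-slice, fused approach described above. It is not the actual rten implementation: the function name, signature, and the assumption of a contiguous `f32` input (with `slice_len` equal to the product of the normalized axes) are all illustrative, and the plain scalar loop at the end stands in for the explicitly vectorized pass mentioned in point 2.

```rust
/// Sketch of per-slice layer normalization (illustrative, not rten's code).
///
/// `input` is a contiguous buffer viewed as rows of `slice_len` elements,
/// where `slice_len` is the product of the normalized axes. `scale` and
/// `bias` both have length `slice_len`.
fn layer_norm(
    input: &mut [f32],
    scale: &[f32],
    bias: &[f32],
    slice_len: usize,
    epsilon: f32,
) {
    assert!(slice_len > 0);
    assert_eq!(input.len() % slice_len, 0);
    assert_eq!(scale.len(), slice_len);
    assert_eq!(bias.len(), slice_len);

    // Process one normalized slice at a time so that each slice is loaded
    // into cache once and every step runs while it is still resident.
    for slice in input.chunks_exact_mut(slice_len) {
        // Pass 1: mean of the slice.
        let mean = slice.iter().sum::<f32>() / slice_len as f32;

        // Pass 2: variance of the mean-centered values.
        let variance = slice
            .iter()
            .map(|&x| {
                let d = x - mean;
                d * d
            })
            .sum::<f32>()
            / slice_len as f32;
        let inv_std = 1.0 / (variance + epsilon).sqrt();

        // Pass 3: fused normalization plus elementwise scale and bias.
        // Each element is independent, so this loop vectorizes well.
        for ((x, &s), &b) in slice.iter_mut().zip(scale).zip(bias) {
            *x = (*x - mean) * inv_std * s + b;
        }
    }
}
```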

@robertknight force-pushed the simd-layer-norm branch 4 times, most recently from b9041a6 to 6ed9b72, on December 17, 2024 at 10:27
@robertknight marked this pull request as ready for review on December 17, 2024 at 15:35
Instead of performing each step of normalization on the whole input before
moving on to the next, perform the full normalization over each input slice
before moving on to the next slice. This is more cache-efficient. Also fuse and
vectorize the steps that scale the input to normalize the variance and apply
elementwise scales.

With these changes the operator is ~2.5-3x faster on x64, assuming the input is
already contiguous.

The `LayerNormalization` operator specification allows the `bias` and
`scale` values to have any shape that can be broadcast to the input shape.
However, actual models seen so far always set these shapes to match the
normalized axes of the input, so this change drops support for other
bias/scale input shapes for the time being.
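
To illustrate the restriction, a validation step along these lines could reject unsupported bias/scale shapes. The function and parameter names here are hypothetical and do not reflect rten's actual API.

```rust
// Hypothetical check for the restricted bias/scale shapes: both must
// exactly match the trailing (normalized) dimensions of the input.
fn check_scale_bias_shape(
    input_shape: &[usize],
    normalized_shape: &[usize], // trailing axes being normalized over
    param_shape: &[usize],      // shape of `scale` or `bias`
) -> Result<(), String> {
    if !input_shape.ends_with(normalized_shape) {
        return Err("normalized axes do not match input shape".into());
    }
    if param_shape != normalized_shape {
        // The spec would allow any shape broadcastable to the input, but
        // only an exact match with the normalized axes is accepted here.
        return Err(format!(
            "scale/bias shape {:?} must equal normalized shape {:?}",
            param_shape, normalized_shape
        ));
    }
    Ok(())
}
```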
@robertknight merged commit 0f9eac6 into main on Dec 17, 2024
2 checks passed
@robertknight deleted the simd-layer-norm branch on December 17, 2024 at 15:52