Parallelize LayerNormalization operator #334

robertknight · 2024-08-26T20:12:15Z

This operation appears in most transformer models, although some more recent ones use relatives such as RMSNorm. It can be parallelized by splitting the input over a non-normalized axis and applying normalization to each chunk separately.
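As a rough illustration of the splitting idea (not the implementation in this repository), here is a minimal sketch using only the standard library. It assumes a contiguous row-major `[rows, d]` buffer normalized over the last axis. The names `layer_norm_row` and `layer_norm_parallel` and the `n_threads` parameter are hypothetical.

```rust
use std::thread;

/// Normalize a single row in place: subtract the mean, divide by
/// sqrt(variance + epsilon), then apply the per-element scale and bias.
fn layer_norm_row(row: &mut [f32], scale: &[f32], bias: &[f32], epsilon: f32) {
    let n = row.len() as f32;
    let mean = row.iter().sum::<f32>() / n;
    let var = row.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + epsilon).sqrt();
    for ((x, s), b) in row.iter_mut().zip(scale).zip(bias) {
        *x = (*x - mean) * inv_std * s + b;
    }
}

/// Hypothetical helper: split the rows (the non-normalized axis) into
/// contiguous chunks and normalize each chunk on its own scoped thread.
fn layer_norm_parallel(
    data: &mut [f32],
    d: usize,
    scale: &[f32],
    bias: &[f32],
    epsilon: f32,
    n_threads: usize,
) {
    assert!(d > 0 && data.len() % d == 0);
    assert!(scale.len() == d && bias.len() == d);
    let rows = data.len() / d;
    if rows == 0 {
        return;
    }
    // Never use more threads than there are rows.
    let n_threads = n_threads.clamp(1, rows);
    // Ceiling division so every row is assigned to some chunk.
    let rows_per_chunk = (rows + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for chunk in data.chunks_mut(rows_per_chunk * d) {
            s.spawn(move || {
                for row in chunk.chunks_mut(d) {
                    layer_norm_row(row, scale, bias, epsilon);
                }
            });
        }
    });
}
```

Each row is normalized independently, so the chunks share no mutable state and the result matches the single-threaded version. A real implementation would more likely reuse an existing thread pool than spawn threads per call.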

From a quick experiment on a 4-core system I can get a ~2x speedup quite easily.

robertknight · 2024-12-17T15:54:05Z

#465 improved the efficiency of LayerNormalization by ~2.5x, not through parallelization but through better cache efficiency and vectorization. Parallelization can still be added on top.

robertknight · 2024-12-30T09:57:51Z

After vectorizing LayerNormalization, I only see a benefit to adding parallelism when the amount of normalized data starts to exceed about 192 KB (on a BERT model with d_embed=768, sequence_len=256, batch_size=1). On the same system (i5-1038NG7) the speedup from using multiple threads tops out at about 2.5x. In the context of overall inference time this is about a 1% win.
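A size-based cutoff like this could be expressed as a small heuristic. The sketch below is only an illustration: the function name `layer_norm_threads`, the constant, and the `max_threads` parameter are hypothetical. The 192 KB value comes from the single measurement above and would likely need tuning per system.

```rust
use std::mem::size_of;

/// Assumed cutoff, taken from the one measurement above (i5-1038NG7).
/// The right value is hardware dependent.
const PARALLEL_THRESHOLD_BYTES: usize = 192 * 1024;

/// Hypothetical helper: choose how many threads to use for a `[rows, d]`
/// f32 input, staying single-threaded at or below the threshold.
fn layer_norm_threads(rows: usize, d: usize, max_threads: usize) -> usize {
    let bytes = rows * d * size_of::<f32>();
    if bytes <= PARALLEL_THRESHOLD_BYTES {
        1
    } else {
        // Never use more threads than rows, and always use at least one.
        max_threads.min(rows).max(1)
    }
}
```

For example, a 256 x 768 f32 input is 768 KB, so this heuristic would use multiple threads. A 32 x 768 input is 96 KB and would stay single-threaded.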
