Refactor GEMM prepacking to depend on fewer blocking parameters #482

robertknight · 2024-12-24T12:03:37Z

In preparation for longer-lived pre-packed matrices, reduce the number of parameters that affect the packed layout and have to be kept the same at the time of pre-packing and later usage.

Previously the packed layout depended on the NC, KC, MC, MR and NR blocking parameters, plus the size of the input matrix. Change it so that it only depends on the MR and NR parameters for packed A and B matrices respectively. The NC parameter varied depending on the thread pool size, so prepacked buffers would become unusable if the thread pool size changed.

This is achieved by revising the packed layout so that packed A matrices with shape `M*K` are laid out as row panels of height `MR` and width `K`. Packed B matrices with shape `K*N` are laid out as column panels of width `NR` and height `K`. Compared to the previous layout this means there may be a gap between each of the `MR*KC` and `NR*KC`-shaped panels used by each call to the kernel. This is handled by adding a `panel_stride` field to the packed block which is used when setting up the kernel inputs in `gemm_block`.
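As a rough illustration of the layout described above, here is a minimal sketch for the B matrix. It assumes a row-major `f32` input and zero padding for the last partial panel; `pack_b` and `panel_block` are hypothetical names used only for this example and are not the functions in this PR. Only `panel_stride` and the `NR`/`KC` roles come from the description.

```rust
/// Pack a row-major K x N matrix into column panels of width `nr` and
/// height `k`. Returns the packed buffer and the stride (in elements)
/// between the starts of consecutive panels.
///
/// Hypothetical helper for illustration; columns past N are zero-padded.
fn pack_b(b: &[f32], k: usize, n: usize, nr: usize) -> (Vec<f32>, usize) {
    let n_panels = (n + nr - 1) / nr;
    // The layout depends only on `nr` and the matrix shape, not on KC or NC.
    let panel_stride = k * nr;
    let mut packed = vec![0.0; n_panels * panel_stride];
    for panel in 0..n_panels {
        for row in 0..k {
            for col in 0..nr {
                let src_col = panel * nr + col;
                if src_col < n {
                    packed[panel * panel_stride + row * nr + col] = b[row * n + src_col];
                }
            }
        }
    }
    (packed, panel_stride)
}

/// Return the `kc * nr` slice of panel `panel` covering depth range
/// `k_start..k_start + kc`. This is the region a single kernel call reads.
fn panel_block(
    packed: &[f32],
    panel_stride: usize,
    nr: usize,
    panel: usize,
    k_start: usize,
    kc: usize,
) -> &[f32] {
    let start = panel * panel_stride + k_start * nr;
    &packed[start..start + kc * nr]
}
```

With `k = 4`, `nr = 2` and `kc = 2`, the blocks for panel 0 and panel 1 at the same depth start `panel_stride = 8` elements apart, while each block is only `kc * nr = 4` elements long. That difference is the gap mentioned above.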

A downside of this change is that panels accessed by consecutive invocations of the kernel are no longer guaranteed to be contiguous in memory and are therefore less likely to have been prefetched by hardware prefetchers. Initial tests suggest the effect is small, and if it turns out to matter I plan to work around it by adding explicit software prefetches.
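For reference, one way such a software prefetch could look is sketched below. This is an assumption about a possible approach, not code from this PR: `prefetch_read` and `sum_blocks_with_prefetch` are hypothetical names, the loop body is a stand-in for a kernel call, and the hint is only issued on x86_64 (it is a no-op elsewhere).

```rust
/// Hint that the cache line at `ptr` will be read soon (x86_64 only).
#[cfg(target_arch = "x86_64")]
#[inline]
fn prefetch_read(ptr: *const f32) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SSE is part of the x86_64 baseline, so no runtime detection is needed.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
}

/// Fallback for other architectures: do nothing.
#[cfg(not(target_arch = "x86_64"))]
#[inline]
fn prefetch_read(_ptr: *const f32) {}

/// Visit the block at `k_offset..k_offset + block_len` within each panel,
/// prefetching the start of the next panel's block before processing the
/// current one. Summing the block stands in for the real kernel call.
///
/// Assumes `k_offset + block_len <= panel_stride`.
fn sum_blocks_with_prefetch(
    packed: &[f32],
    panel_stride: usize,
    block_len: usize,
    k_offset: usize,
) -> f32 {
    let n_panels = packed.len() / panel_stride;
    let mut total = 0.0;
    for panel in 0..n_panels {
        let start = panel * panel_stride + k_offset;
        if panel + 1 < n_panels {
            // The next block is `panel_stride` elements ahead, not adjacent.
            prefetch_read(packed[start + panel_stride..].as_ptr());
        }
        total += packed[start..start + block_len].iter().sum::<f32>();
    }
    total
}
```

This only touches the first cache line of the next block. Whether that is enough, or whether prefetching helps at all, would need to be measured.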


robertknight added 2 commits December 24, 2024 18:36

Correct height <-> width mixup in Kernel::{mr, nr} docs

4e22fa8

robertknight force-pushed the gemm-panel-stride branch from 9fdaedc to 29ddbb7 Compare December 24, 2024 18:36

robertknight marked this pull request as ready for review December 24, 2024 18:41

robertknight merged commit 996d062 into main Dec 24, 2024
2 checks passed

robertknight deleted the gemm-panel-stride branch December 24, 2024 18:41
