Refactor GEMM prepacking to depend on fewer blocking parameters #482

Merged: 2 commits, Dec 24, 2024

@robertknight robertknight commented Dec 24, 2024

In preparation for longer-lived pre-packed matrices, reduce the number of parameters that affect the packed layout and have to be kept the same at the time of pre-packing and later usage.

Previously the packed layout depended on the NC, KC, MC, MR and NR blocking parameters, plus the size of the input matrix. Change it so that it only depends on the MR and NR parameters for packed A and B matrices respectively. The NC parameter varied depending on the thread pool size, so prepacked buffers would become unusable if the thread pool size changed.

This is achieved by revising the packed layout so that packed A matrices with shape `M*K` are laid out as row panels of height `MR` and width `K`. Packed B matrices with shape `K*N` are laid out as column panels of width `NR` and height `K`. Compared to the previous layout this means there may be a gap between each of the `MR*KC` and `NR*KC`-shaped panels used by each call to the kernel. This is handled by adding a `panel_stride` field to the packed block which is used when setting up the kernel inputs in `gemm_block`.
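To illustrate the layout described above, here is a minimal sketch for the packed A case. It is not the code from this PR: the `PackedA` struct, `pack_a`, `kernel_input`, and the column-by-column ordering within a panel are assumptions made for illustration. Only the `panel_stride` name and the panel shape (height `MR`, width `K`) come from the description.

```rust
/// Hypothetical packed A buffer. Only `panel_stride` is named in the PR;
/// the other fields are illustrative.
struct PackedA {
    data: Vec<f32>,
    /// Distance in elements between the starts of consecutive row panels.
    panel_stride: usize,
    /// Panel height (the MR blocking parameter).
    mr: usize,
}

/// Pack a row-major `m x k` matrix into row panels of height `mr` and width `k`.
///
/// Assumption: within a panel, elements are stored column by column
/// (`mr` values per column). Rows past the end of the matrix are zero-padded.
fn pack_a(a: &[f32], m: usize, k: usize, mr: usize) -> PackedA {
    assert_eq!(a.len(), m * k);
    let n_panels = (m + mr - 1) / mr;
    // The layout depends only on `mr` and the matrix size, not on KC, MC or NC.
    let panel_stride = mr * k;
    let mut data = vec![0.0; n_panels * panel_stride];
    for row in 0..m {
        let panel = row / mr;
        let row_in_panel = row % mr;
        for col in 0..k {
            data[panel * panel_stride + col * mr + row_in_panel] = a[row * k + col];
        }
    }
    PackedA { data, panel_stride, mr }
}

impl PackedA {
    /// Return the `mr * kc` slice covering columns `k_start..k_start + kc`
    /// of one row panel, as a kernel call would consume it.
    fn kernel_input(&self, panel: usize, k_start: usize, kc: usize) -> &[f32] {
        let start = panel * self.panel_stride + k_start * self.mr;
        &self.data[start..start + self.mr * kc]
    }
}
```

In this sketch, the `MR*KC` blocks for the same depth range in adjacent panels are `panel_stride` elements apart rather than back to back, which is the gap mentioned above. The value of `KC` is chosen only when `kernel_input` is called, so it can change without repacking.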

A downside of this change is that panels accessed by consecutive invocations of the kernel are no longer guaranteed to be contiguous in memory and are therefore less likely to have been prefetched by hardware prefetchers. Initial tests suggest the effect is small, and if necessary I plan to work around this by adding explicit software prefetches.

@robertknight robertknight marked this pull request as ready for review December 24, 2024 18:41
@robertknight robertknight merged commit 996d062 into main Dec 24, 2024
2 checks passed
@robertknight robertknight deleted the gemm-panel-stride branch December 24, 2024 18:41