Refactor GEMM prepacking to depend on fewer blocking parameters #482
In preparation for longer-lived pre-packed matrices, reduce the number of parameters that affect the packed layout and have to be kept the same at the time of pre-packing and later usage.
Previously the packed layout depended on the NC, KC, MC, MR and NR blocking parameters, plus the size of the input matrix. Change it so that it only depends on the MR and NR parameters for packed A and B matrices respectively. The NC parameter varied depending on the thread pool size, so prepacked buffers would become unusable if the thread pool size changed.
This is achieved by revising the packed layout so that packed A matrices with shape `M*K` are laid out as row panels of height MR and width K. Packed B matrices with shape `K*N` are laid out as column panels of width NR and height K. Compared to the previous layout, this means there may be a gap between each of the `MR*KC` and `NR*KC`-shaped panels used by each call to the kernel. This is handled by adding a `panel_stride` field to the packed block, which is used when setting up the kernel inputs in `gemm_block`.
A downside of this change is that panels accessed by consecutive invocations of the kernel are no longer guaranteed to be contiguous in memory and are therefore less likely to have been prefetched by hardware prefetchers. Initial tests suggest the effect is small, and I plan to work around this by adding explicit software prefetches if necessary.
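To illustrate the revised layout for packed A, here is a minimal sketch, not the code from this PR. The `PackedA` struct, `pack_a` and `kernel_block` are hypothetical names, and the column-by-column ordering within a panel and the zero padding of the last panel are assumptions. Only the row-panel shape (height MR, width K) and the `panel_stride` field come from the description above.

```rust
/// Hypothetical packed-A buffer: row panels of height `mr`, each spanning
/// the full K dimension. Names and element order are illustrative assumptions.
struct PackedA {
    data: Vec<f32>,
    mr: usize,
    /// Distance in elements between the starts of consecutive row panels.
    panel_stride: usize,
}

/// Pack a row-major `m x k` matrix into row panels of height `mr`.
/// Within a panel, elements are stored one column at a time (`mr` values
/// per column). Rows past `m` in the last panel are left as zero padding.
fn pack_a(a: &[f32], m: usize, k: usize, mr: usize) -> PackedA {
    assert_eq!(a.len(), m * k);
    let n_panels = (m + mr - 1) / mr;
    // The layout depends only on `mr` and the matrix size, not on MC/KC/NC.
    let panel_stride = mr * k;
    let mut data = vec![0.0; n_panels * panel_stride];
    for row in 0..m {
        let panel = row / mr;
        let row_in_panel = row % mr;
        for col in 0..k {
            data[panel * panel_stride + col * mr + row_in_panel] = a[row * k + col];
        }
    }
    PackedA { data, mr, panel_stride }
}

/// Return the `MR*KC` block of panel `panel` covering the depth range
/// `k_start..k_start + kc`. Blocks for the same depth range in adjacent
/// panels are `panel_stride` elements apart, so they are not contiguous
/// unless `kc` equals K.
fn kernel_block(p: &PackedA, panel: usize, k_start: usize, kc: usize) -> &[f32] {
    let start = panel * p.panel_stride + k_start * p.mr;
    &p.data[start..start + kc * p.mr]
}
```

In this sketch the KC block size is chosen only when reading the buffer, so the same packed data could be reused with different values of KC, MC or NC. The cost is the gap between the blocks read by consecutive kernel calls, which is what `panel_stride` accounts for.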