CpuMath Enhancement: Double-compute input elements in hardware intrinsics #836
Labels
• enhancement: New feature or request
• P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
• perf: Performance and Benchmarking related
• up-for-grabs: A good issue to fix if you are trying to contribute to the project
Style changes needed to solve part of #823
Implementing "double-compute" is expected to make the hardware intrinsics more efficient.
Details (mostly from @tannergooding)
In `src\Microsoft.ML.CpuMath\SseIntrinsics.cs` and `src\Microsoft.ML.CpuMath\AvxIntrinsics.cs`, change the last loop of the existing 3-loop code pattern into the following:
• Save the stored result (`dstVector`) from the last iteration of the vectorized code
• Move `pDstCurrent` back such that `pDstCurrent + elementsPerIteration == pEnd`
This generally results in more performant code, depending on the exact algorithm and the number of remaining elements.
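The tail handling described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from `SseIntrinsics.cs` or `AvxIntrinsics.cs`: it uses the portable `System.Numerics.Vector<T>` type instead of the `Sse`/`Avx` intrinsics so that it runs without `unsafe` code, it reloads the already-stored values from memory rather than keeping `dstVector` in a register, and the names `TailBlendSketch` and `ScaleInPlace` are hypothetical.

```csharp
using System;
using System.Numerics;

public static class TailBlendSketch
{
    // Multiplies every element of dst by scale, in place.
    public static void ScaleInPlace(float[] dst, float scale)
    {
        int width = Vector<float>.Count; // plays the role of elementsPerIteration
        int length = dst.Length;

        if (length < width)
        {
            // Too short for even one full vector: plain scalar loop.
            for (int i = 0; i < length; i++) dst[i] *= scale;
            return;
        }

        var scaleVector = new Vector<float>(scale);
        int current = 0;

        // Main vectorized loop over all full vectors.
        for (; current + width <= length; current += width)
        {
            var dstVector = new Vector<float>(dst, current) * scaleVector;
            dstVector.CopyTo(dst, current);
        }

        int remainder = length - current;
        if (remainder == 0) return;

        // Move back so that current + width == length.
        current = length - width;

        // The first (width - remainder) lanes were already scaled by the main loop.
        var stored = new Vector<float>(dst, current);
        var recomputed = stored * scaleVector;

        // Mask with all bits set in the lanes that must keep their stored value.
        var maskValues = new int[width];
        for (int lane = 0; lane < width - remainder; lane++) maskValues[lane] = -1;
        var alreadyDone = new Vector<int>(maskValues);

        // Mix: stored value where already processed, recomputed value elsewhere.
        var mixed = Vector.ConditionalSelect(alreadyDone, stored, recomputed);
        mixed.CopyTo(dst, current);
    }
}
```

The blend matters because the operation is in place: without it, the overlapping lanes would be scaled twice. For an operation that writes to a separate destination, the overlapping store could simply rewrite the same values.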
For some algorithms (like `Sum`), it is possible to “double-compute” a few elements in the beginning and end to have better overall performance. See the following pseudo-code:

So, your overall algorithm will probably look like:
If you can’t “double-compute” for some reason, then you generally do the “software” processing for the beginning (to become aligned) and end (to catch stray elements).
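For an algorithm like `Sum`, one way to read the "double-compute" idea is sketched below. It is a minimal illustration under assumptions, not the issue's original pseudo-code: it uses `System.Numerics.Vector<T>` rather than the `Sse`/`Avx` intrinsics, it only treats the end of the buffer (managed arrays give no control over alignment at the beginning), and the names `SumSketch` and `Sum` are hypothetical. The last full vector is loaded a second time, and the lanes that were already summed are masked to zero so nothing is counted twice.

```csharp
using System;
using System.Numerics;

public static class SumSketch
{
    public static float Sum(float[] src)
    {
        int width = Vector<float>.Count;
        int length = src.Length;

        if (length < width)
        {
            // "Software" path for inputs shorter than one vector.
            float small = 0f;
            for (int i = 0; i < length; i++) small += src[i];
            return small;
        }

        var acc = Vector<float>.Zero;
        int current = 0;

        // Main vectorized loop over all full vectors.
        for (; current + width <= length; current += width)
        {
            acc += new Vector<float>(src, current);
        }

        int remainder = length - current;
        if (remainder > 0)
        {
            // Only the last `remainder` lanes of the final window are new.
            var maskValues = new int[width];
            for (int lane = width - remainder; lane < width; lane++) maskValues[lane] = -1;
            var notYetSummed = new Vector<int>(maskValues);

            // Overlapping load that ends exactly at the end of the array.
            var tail = new Vector<float>(src, length - width);
            acc += Vector.ConditionalSelect(notYetSummed, tail, Vector<float>.Zero);
        }

        // Horizontal add of the accumulator lanes.
        float total = 0f;
        for (int lane = 0; lane < width; lane++) total += acc[lane];
        return total;
    }
}
```

This replaces the scalar loop over stray elements with a single extra vector load and a mask, which is where the expected gain comes from.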
• `AvxLimit` is generally a number that takes into account the “downclocking” that can occur for heavy 256-bit instruction usage
• `SseLimit` is generally 128-bits for algorithms where you can “double-compute” and some profiled number for other algorithms

cc: @tannergooding since he suggested this approach.
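A dispatch based on `AvxLimit` and `SseLimit` might look like the sketch below. The concrete threshold values are placeholders chosen for illustration (the issue says they should come from profiling), and `DispatchSketch`, `CodePath`, and `Choose` are hypothetical names. Hardware support is passed in as parameters to keep the function easy to test.

```csharp
using System;

public static class DispatchSketch
{
    public enum CodePath { Avx, Sse, Software }

    // Assumed thresholds, measured in float elements.
    public const int SseLimit = 4;   // 128 bits / 32-bit float, for "double-compute" algorithms
    public const int AvxLimit = 16;  // placeholder; should reflect profiling and downclocking cost

    public static CodePath Choose(bool avxSupported, bool sseSupported, int length)
    {
        if (avxSupported && length >= AvxLimit) return CodePath.Avx;
        if (sseSupported && length >= SseLimit) return CodePath.Sse;
        return CodePath.Software; // scalar fallback
    }
}
```

In real code the two flags would typically be `Avx.IsSupported` and `Sse.IsSupported` from `System.Runtime.Intrinsics.X86`.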