Listed below are CpuMath enhancement suggestions that were raised during PR reviews for the SSE and AVX intrinsics and documented here for future follow-up.
Style
Large-scale renaming of the input arguments of Microsoft.ML.CpuMath\Sse.cs and CpuMathUtils.netcoreapp.cs, e.g. for MatTimesSrc, rename tran to transpose.
Change 0 < count to count > 0, and offset < dst.Length - count to offset < (dst.Length - count).
Split asserts joined with && into two separate asserts.
In CpuMathUtils.netcoreapp.cs, Span<T> might already do the checks (Contracts.Assert) in the public functions.
(minor) In SseIntrinsics.cs, it may make sense to add some Debug.Asserts to check that the src and dst lengths match. However, these are internal functions that are only called from functions that guarantee the arguments are checked, so I don't believe this is a blocking issue. It would mainly serve as documentation of the expectations of these methods, and as a safeguard in case they get new call sites in the future.
In SsePerformanceTests.cs, regarding the line private float[] src, dst, original, src1, src2;: all these non-constant private fields should be prefixed with _, but this can be updated in the next PR so as not to block the current one.
Functionality
Use aligned loads and stores where possible. For large arrays, especially those that cross cache line or page boundaries, doing this should save some measurable amount of time. Reference: https://github.com/dotnet/machinelearning/pull/562/files/f0f81a5019a3c8cbd795a970e40d633e9e1770c1#r204061074
In SseIntrinsics.cs, do the following two things to make the while loops more efficient:
Bound checking: var remainder = count % elementsPerIteration; and then float* pEnd = pdst + (count - remainder);. Your loop check then just does while (pDstCurrent < pEnd)
Double-computing:
Finish off the remaining indices in one of two ways:
drop down to a scalar algorithm, or
"double compute" a couple of the indices, which involves:
Saving the stored result (dstVector) from the last iteration of the vectorized code
Moving pDstCurrent back such that pDstCurrent + elementsPerIteration == pEnd
Doing a single iteration for the remaining elements
Mixing the saved result from the last iteration of the vectorized code with the result from the remaining elements
Writing the result
This generally results in more performant code, depending on the exact algorithm and the number of remaining elements.
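For concreteness, here is a minimal sketch of both patterns. It is not the ML.NET code: it assumes a dst[i] = src[i] * scale kernel (each lane depends only on src, so recomputing overlapped lanes is harmless), a count of at least 4, and illustrative names such as ScaleSrcU and ElementsPerIteration.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class LoopSketch
{
    private const int ElementsPerIteration = 4; // Vector128<float> lane count

    // dst[i] = src[i] * scale; assumes count >= 4.
    public static void ScaleSrcU(float scale, float* psrc, float* pdst, int count)
    {
        Vector128<float> scaleVector = Vector128.Create(scale);

        // Bound checking: hoist the remainder math out of the loop so the
        // loop condition is a single pointer comparison against pEnd.
        int remainder = count % ElementsPerIteration;
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + (count - remainder);

        while (pDstCurrent < pEnd)
        {
            Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
            Sse.Store(pDstCurrent, Sse.Multiply(srcVector, scaleVector));
            pSrcCurrent += ElementsPerIteration;
            pDstCurrent += ElementsPerIteration;
        }

        if (remainder != 0)
        {
            // Double-computing: move the pointers back so one extra vector
            // iteration covers the tail, recomputing the lanes that were
            // already written. A read-modify-write kernel would instead have
            // to mix in the saved dstVector from the last iteration, as the
            // list above describes.
            pSrcCurrent = psrc + (count - ElementsPerIteration);
            pDstCurrent = pdst + (count - ElementsPerIteration);
            Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
            Sse.Store(pDstCurrent, Sse.Multiply(srcVector, scaleVector));
        }
    }
}
```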
In SseIntrinsics.cs, use FusedMultiplyAdd to replace srcVector = Sse.Multiply(srcVector, scaleVector); in AddScaleU. It would be part of any AVX-related code work.
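For illustration, here is a hedged sketch of what that inner loop could look like with the Fma class from System.Runtime.Intrinsics.X86, whose Fma.MultiplyAdd computes (a * b) + c per lane. The pointer and vector names are assumed from the discussion above, and the path must be gated on Fma.IsSupported.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class FmaSketch
{
    // One 4-float chunk of dst[i] += scale * src[i]; the caller has already
    // verified Fma.IsSupported.
    public static void AddScaleChunk(float* pSrcCurrent, float* pDstCurrent, Vector128<float> scaleVector)
    {
        Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
        Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);
        // One fused instruction replaces the separate Sse.Multiply + Sse.Add pair:
        dstVector = Fma.MultiplyAdd(srcVector, scaleVector, dstVector);
        Sse.Store(pDstCurrent, dstVector);
    }
}
```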
In the original code of SseIntrinsics.cs, even though VectorSum is inlined, the codegen is not optimized due to register spill and reload. It seems the JIT has an optimization opportunity there. Do you mind opening an issue to discuss it on the CoreCLR GitHub repo? (from Intel partners)
(probably) Try the default static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args); in Program.cs of perf tests. It will work, but we will need to decide how to parse arguments from the command line.
Add unit tests for software fallback implementations, particularly for MatTimesSrc.
Change while (pDstCurrent + 4 <= pDstEnd) for the loop bound checking into while (pDstCurrent <= pDstEnd - 4) to save an instruction (ref: dotnet/machinelearning#668 (comment)).
(@tannergooding) On handling unpadded parts in AVX intrinsics:
It can be that simple, but that is often not the best way to handle it.
For some algorithms (like Sum), it is possible to “double-compute” a few elements in the beginning and end to have better overall performance. See the following pseudo-code:
if addr not aligned
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements after the first aligned address
    result = tmp
    move addr forward to the first aligned address
while addr is aligned and remaining bits >= 128
    result += aligned load
    addr += 128-bits
if any remaining
    addr = endAddr - 128
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements already processed
    result += tmp
Sum the elements in result (using "horizontal add" or "shuffle and add")
So, your overall algorithm will probably look like:
if (Avx.IsSupported && (Length >= AvxLimit))
{
    // Process 256-bits, we have a limit since 256-bit
    // AVX instructions can cause a downclock in the CPU
    // Algorithm would be similar to the SSE pseudo-code
}
else if (Sse.IsSupported && (Length >= SseLimit))
{
    // Pseudo-code algorithm given above
    // 128-bit instructions operate at full frequency
    // and don't downclock the CPU, we can only use
    // them for more than 128-bits so we don't AV
}
else
{
    // Software Implementation
}
If you can’t “double-compute” for some reason, then you generally do the “software” processing for the beginning (to become aligned) and end (to catch stray elements).
• AvxLimit is generally a number that takes into account the “downclocking” that can occur for heavy 256-bit instruction usage
• SseLimit is generally 128-bits for algorithms where you can “double-compute” and some profiled number for other algorithms
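For reference, here is a sketch of what the masked SSE path from the pseudo-code above could look like in C#. It assumes a float-aligned psrc and a count large enough (say at least 8) that this guarded path is taken, and it builds the masks from lane-index compares rather than the precomputed mask tables a production implementation would more likely use.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class SumSketch
{
    // Sums count floats starting at psrc; assumes count >= 8 and psrc
    // aligned to sizeof(float).
    public static float Sum(float* psrc, int count)
    {
        Vector128<float> laneIndices = Vector128.Create(0f, 1f, 2f, 3f);
        Vector128<float> result = Vector128<float>.Zero;

        float* pCurrent = psrc;
        float* pEnd = psrc + count;

        int misalignedBytes = (int)((ulong)pCurrent % 16);
        if (misalignedBytes != 0)
        {
            // Head: unaligned load, keep only the lanes before the first
            // 16-byte aligned address, then jump to that address.
            int leadingLanes = (16 - misalignedBytes) / sizeof(float);
            Vector128<float> mask =
                Sse.CompareLessThan(laneIndices, Vector128.Create((float)leadingLanes));
            result = Sse.And(Sse.LoadVector128(pCurrent), mask);
            pCurrent += leadingLanes;
        }

        // Loop condition written as pEnd - 4 per the earlier suggestion.
        while (pCurrent <= pEnd - 4)
        {
            result = Sse.Add(result, Sse.LoadAlignedVector128(pCurrent));
            pCurrent += 4;
        }

        if (pCurrent < pEnd)
        {
            // Tail: unaligned load of the last 128 bits, zeroing the lanes
            // the aligned loop already summed.
            int remaining = (int)(pEnd - pCurrent);
            Vector128<float> mask = Sse.CompareGreaterThanOrEqual(
                laneIndices, Vector128.Create((float)(4 - remaining)));
            result = Sse.Add(result, Sse.And(Sse.LoadVector128(pEnd - 4), mask));
        }

        // "Shuffle and add" horizontal reduction of the four lanes.
        Vector128<float> partial = Sse.Add(result, Sse.MoveHighToLow(result, result));
        partial = Sse.AddScalar(partial, Sse.Shuffle(partial, partial, 0x55));
        return partial.ToScalar();
    }
}
```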