
Speed improvements to resize kernel (w/ SIMD) #1513

Merged · 8 commits · Jan 21, 2021
@@ -4,6 +4,11 @@
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
#if SUPPORTS_RUNTIME_INTRINSICS
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
#endif

namespace SixLabors.ImageSharp.Processing.Processors.Transforms
{
@@ -66,21 +71,94 @@ public Vector4 Convolve(Span<Vector4> rowSpan)
[MethodImpl(InliningOptions.ShortMethod)]
public Vector4 ConvolveCore(ref Vector4 rowStartRef)
{
#if SUPPORTS_RUNTIME_INTRINSICS
    if (Fma.IsSupported)
    {
        float* bufferStart = this.bufferPtr;
        float* bufferEnd = bufferStart + (this.Length & ~3);
        Vector256<float> result256_0 = Vector256<float>.Zero;
        Vector256<float> result256_1 = Vector256<float>.Zero;
        ReadOnlySpan<byte> maskBytes = new byte[]
        {
            0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0,
            1, 0, 0, 0, 1, 0, 0, 0,
            1, 0, 0, 0, 1, 0, 0, 0,
        };
        Vector256<int> mask = Unsafe.ReadUnaligned<Vector256<int>>(ref MemoryMarshal.GetReference(maskBytes));

        while (bufferStart < bufferEnd)
        {
            // It is important to use a single expression here so that the JIT will correctly use vfmadd231ps
            // for the FMA operation, executing it directly on the target register and reading the first
            // parameter directly from memory. This skips initializing an extra SIMD register and an extra copy.
            // The code below should compile to the following assembly on .NET 5 x64:
            //
            // vmovsd xmm2, [rax]           ; load *(double*)bufferStart into xmm2 as [ab, _]
            // vpermps ymm2, ymm1, ymm2     ; permute as a float YMM register to [a, a, a, a, b, b, b, b]
            // vfmadd231ps ymm0, ymm2, [r8] ; result256_0 = FMA(pixels, factors) + result256_0
            //
            // For tracking the codegen issue with FMA, see: https://github.com/dotnet/runtime/issues/12212.
            // Additionally, we unroll two computations per loop iteration to leverage the fact that most
            // CPUs have two ports to schedule multiply operations for FMA instructions.
            result256_0 = Fma.MultiplyAdd(
                Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
                Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
                result256_0);
Member:

According to what I learned from @saucecontrol, moving the permutes out of an operation (dependency) chain and running them as a separate sequence might help performance.

Thinking about it further: it would make the code trickier, but maybe we could try to process two Vector256<float> values in the loop body, so we can run 4 permutes in a row, then do 4+4 FMAs.
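A rough sketch of that restructuring (hypothetical, not the merged code; it assumes the loop is widened to eight weights per iteration, so the loop bound and remainder handling would have to change to match):

    // Issue all permutes first, then the FMAs, keeping the shuffles out of the
    // FMA dependency chain. Names reuse the locals from the diff above.
    Vector256<float> f0 = Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask);
    Vector256<float> f1 = Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)(bufferStart + 2)).AsSingle(), mask);
    Vector256<float> f2 = Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)(bufferStart + 4)).AsSingle(), mask);
    Vector256<float> f3 = Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)(bufferStart + 6)).AsSingle(), mask);

    result256_0 = Fma.MultiplyAdd(Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef), f0, result256_0);
    result256_1 = Fma.MultiplyAdd(Unsafe.As<Vector4, Vector256<float>>(ref Unsafe.Add(ref rowStartRef, 2)), f1, result256_1);
    result256_0 = Fma.MultiplyAdd(Unsafe.As<Vector4, Vector256<float>>(ref Unsafe.Add(ref rowStartRef, 4)), f2, result256_0);
    result256_1 = Fma.MultiplyAdd(Unsafe.As<Vector4, Vector256<float>>(ref Unsafe.Add(ref rowStartRef, 6)), f3, result256_1);

    bufferStart += 8;
    rowStartRef = ref Unsafe.Add(ref rowStartRef, 8);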

Member Author:

There was an issue with using locals here (documented in the code comments above): the JIT was picking the wrong instruction for the FMA operation and adding extra, unnecessary memory copies. Writing this inline instead picked the right instruction, which loads the first argument directly from memory, resulting in much better assembly. I'm worried that moving things around will make the codegen worse again. Also, from what we discussed on Discord, there are usually only 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop, other than the general marginal improvements from more unrolling.

I think @saucecontrol is doing only two ops per iteration in his own lib for this reason as well? 🤔
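For reference, the locals-based variant that triggered the bad codegen would have looked roughly like this (illustrative sketch only, reconstructed from the comment above):

    // With the operands in locals, the JIT emitted extra copies instead of
    // folding the pixel load into the FMA's memory operand (dotnet/runtime#12212).
    Vector256<float> pixels = Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef);
    Vector256<float> factors = Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask);
    result256_0 = Fma.MultiplyAdd(pixels, factors, result256_0);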

Contributor:

Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster, so I dropped it to reduce complexity.

The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with Vector<T>, allowing AVX processing on netfx.
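A minimal sketch of that pre-duplication idea (hypothetical helper, not MagicScaler's or ImageSharp's actual code): each weight is stored once per RGBA channel when the kernel map is built, so the inner loop needs no permutes at all.

    // Expand a scalar kernel so each weight appears four times, one per channel.
    static float[] ExpandWeights(ReadOnlySpan<float> kernel)
    {
        float[] expanded = new float[kernel.Length * 4];
        for (int i = 0; i < kernel.Length; i++)
        {
            expanded[i * 4] = kernel[i];
            expanded[(i * 4) + 1] = kernel[i];
            expanded[(i * 4) + 2] = kernel[i];
            expanded[(i * 4) + 3] = kernel[i];
        }

        return expanded;
    }

    // The inner loop then becomes two pixel loads and two FMAs, with the weights
    // read straight from the expanded buffer (no PermuteVar8x32 needed), e.g.:
    //
    //   result256_0 = Fma.MultiplyAdd(pixels0, Avx.LoadVector256(weights), result256_0);
    //   result256_1 = Fma.MultiplyAdd(pixels1, Avx.LoadVector256(weights + 8), result256_1);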

            result256_1 = Fma.MultiplyAdd(
                Unsafe.As<Vector4, Vector256<float>>(ref Unsafe.Add(ref rowStartRef, 2)),
                Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)(bufferStart + 2)).AsSingle(), mask),
                result256_1);

            bufferStart += 4;
            rowStartRef = ref Unsafe.Add(ref rowStartRef, 4);
        }

        result256_0 = Avx.Add(result256_0, result256_1);

        if ((this.Length & 3) >= 2)
        {
            result256_0 = Fma.MultiplyAdd(
                Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
                Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
                result256_0);

            bufferStart += 2;
            rowStartRef = ref Unsafe.Add(ref rowStartRef, 2);
        }

        Vector128<float> result128 = Sse.Add(result256_0.GetLower(), result256_0.GetUpper());

        if ((this.Length & 1) != 0)
        {
            result128 = Fma.MultiplyAdd(
                Unsafe.As<Vector4, Vector128<float>>(ref rowStartRef),
                Vector128.Create(*bufferStart),
                result128);
        }

        return *(Vector4*)&result128;
    }
    else
#endif
    {
        // Destination color components
        Vector4 result = Vector4.Zero;
        float* bufferStart = this.bufferPtr;
        float* bufferEnd = this.bufferPtr + this.Length;

        while (bufferStart < bufferEnd)
        {
            // Vector4 v = offsetedRowSpan[i];
            result += rowStartRef * *bufferStart;

            bufferStart++;
            rowStartRef = ref Unsafe.Add(ref rowStartRef, 1);
        }

        return result;
    }
}

/// <summary>
@@ -139,7 +139,7 @@ public void WorkingBufferSizeHintInBytes_IsAppliedCorrectly<TPixel>(
        testOutputDetails: workingBufferLimitInRows,
        appendPixelTypeToFileName: false);
    image.CompareToReferenceOutput(
        ImageComparer.TolerantPercentage(0.004f), // was 0.001f
        provider,
        testOutputDetails: workingBufferLimitInRows,
        appendPixelTypeToFileName: false);