Speed improvements to resize kernel (w/ SIMD) #1513
Conversation
Assembly for loading in the loop went from:

```asm
vmovss xmm2, [rax]
vbroadcastss xmm2, xmm2
vmovss xmm3, [rax+4]
vbroadcastss xmm3, xmm3
vinsertf128 ymm2, ymm2, xmm3, 1
```

to:

```asm
vmovsd xmm3, [rax]
vbroadcastsd ymm3, xmm3
vpermps ymm3, ymm1, ymm3
```
Looking good on Skylake 👍

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  Job-XNQYOQ : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT
  Job-ALEINB : .NET Core 2.1.24 (CoreCLR 4.6.29518.01, CoreFX 4.6.29518.01), X64 RyuJIT
  Job-LVTEAU : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
```

Before:
After:
@saucecontrol That's awesome, thank you for running the benchmarks on your machine! 🚀 Here are mine; it looks like this PR is actually ever so slightly slower than master.

Master:

PR:
According to the test failures, there is some noticeable (but visually still insignificant) difference between the
```csharp
// fact that most CPUs have two ports to schedule multiply operations for FMA instructions.
result256_0 = Fma.MultiplyAdd(
    Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
    Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
```
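For reference, the mask `(0, 0, 0, 0, 1, 1, 1, 1)` used here selects source lane 0 for the low four floats and source lane 1 for the high four. A scalar C emulation of the `vpermps` lane-selection semantics (my own sketch, not library code) makes the data movement explicit:

```c
#include <assert.h>

/* Scalar emulation of AVX2 vpermps: dst[i] = src[mask[i]].
   With mask {0,0,0,0,1,1,1,1}, this broadcasts weight 0 across the
   low 4 lanes and weight 1 across the high 4 lanes, so one 256-bit
   register holds both weights duplicated per RGBA channel. */
static void permute8(const float *src, const int *mask, float *dst)
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[mask[i]];
}
```

After the permute, a single FMA can multiply two whole pixels by their two weights at once.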
According to what I learned from @saucecontrol, moving permutes out of an operation (dependency) chain and running them in a separate sequence might help performance.

Thinking about it further: it would make the code more tricky, but maybe we can try to process two Vector256<float> values in the loop body, so we can run 4 permutes in a row, then do 4+4 FMAs.
There was an issue with using locals here (I documented that in the comments here), where the JIT was picking the wrong instruction for the FMA operation and adding extra unnecessary memory copies. Doing this inline instead picked the right one that directly loaded the first argument from memory, which resulted in much better assembly. I'm worried that moving things around will make the codegen worse again there. Also, from what we discussed on Discord, there are usually 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop? I mean, other than the general marginal improvements just due to more unrolling, possibly.
I think @saucecontrol is doing only two ops per iteration as well in his own lib because of this? 🤔
Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster so I dropped it to reduce complexity.
The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with Vector<T>, allowing AVX processing on netfx.
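A hypothetical C sketch of that layout (my own illustration of the technique, not MagicScaler's actual code): each scalar weight is stored once per channel in the kernel map ahead of time, so the convolution loop multiplies a whole pixel by a contiguous weight vector with no shuffle or permute.

```c
#define CHANNELS 4  /* RGBA */

/* Build a "pre-duplicated" kernel map: each scalar weight is repeated
   once per channel, so weights line up 1:1 with interleaved pixel
   components in memory. */
static void duplicate_weights(const float *w, int n, float *out)
{
    for (int i = 0; i < n; i++)
        for (int c = 0; c < CHANNELS; c++)
            out[i * CHANNELS + c] = w[i];
}

/* Inner loop: with the duplicated layout, each iteration is just a
   contiguous pixel read times a contiguous weight read, accumulated
   per channel; this maps directly onto vector FMAs. */
static void convolve(const float *pixels, const float *wdup, int n,
                     float *acc /* CHANNELS floats */)
{
    for (int c = 0; c < CHANNELS; c++)
        acc[c] = 0.0f;
    for (int i = 0; i < n; i++)
        for (int c = 0; c < CHANNELS; c++)
            acc[c] += pixels[i * CHANNELS + c] * wdup[i * CHANNELS + c];
}
```

The trade-off is a larger kernel map (4x the weight storage) in exchange for a shuffle-free inner loop.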
```csharp
float* bufferEnd = bufferStart + (this.Length & ~3);
Vector256<float> result256_0 = Vector256<float>.Zero;
Vector256<float> result256_1 = Vector256<float>.Zero;
var mask = Vector256.Create(0, 0, 0, 0, 1, 1, 1, 1);
```
Should be Vector.Load with a ROS.
The codegen in this very specific case seems to actually be just a vxorps (which is actually... weird), so I'm looking into this. It seems to be ok even though we're not using a ROS, or actually better than that too. Will update in a bit 🙂

EDIT: this might actually be a JIT bug, investigating...

EDIT 2: yeah, it's an inlining bug that only repros on .NET 5 with precisely these arguments 🤣 Opening an issue.
Regarding my #1513 (comment), I think Wikipedia answered my concerns:

If I get it right, the FMA code is actually more accurate.
Yeah, in theory the FMA path should be more accurate. From what @tannergooding said on Discord:
So in theory the difference should mean the FMA is ever so slightly better, as you mentioned.
See Vector256.Create issue: dotnet/runtime#47236
Yes, this also improves precision. Take the following example:

```csharp
var rand = new Random(101);
var vals = new float[128];
for (int i = 0; i < vals.Length; i++)
    vals[i] = (float)rand.NextDouble();

float fa = 0;
for (int i = 0; i < vals.Length; i++)
    fa += vals[i];

double da = 0;
for (int i = 0; i < vals.Length; i++)
    da += vals[i];

float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
for (int i = 0; i < vals.Length / 4; i++)
{
    int j = i * 4;
    a0 += vals[j];
    a1 += vals[j + 1];
    a2 += vals[j + 2];
    a3 += vals[j + 3];
}
float pa = a0 + a1 + a2 + a3;

Console.WriteLine("float accumulator:    " + fa.ToString("f12"));
Console.WriteLine("double accumulator:   " + da.ToString("f12"));
Console.WriteLine("4x float accumulator: " + pa.ToString("f12"));
```

Outputs:
Just as FMA reduces the number of rounding steps, using 4 accumulators rounds 1/4 as many times.
I looked at the Zen+ perf numbers at uops.info, and they're not great. On Intel processors, FMA gives you the add for free after the multiply, saving 4 cycles of latency. On AMD processors, FMA is only 1 cycle faster than MUL+ADD, but it runs on the multiply ports the whole time, whereas ADD can use a different set of ports if they are split. This was the point @tannergooding was making on Discord yesterday: if there were contention on the MUL ports, the FMA version ties them up longer, so it may not be a win. That's particularly true of Zen+, where a 256-bit FMA ties up both ports because it's split into 2 128-bit uops. In this specific case, there is no other work to be done, so it's not a negative, but it's also not much of a win at all.

The permute, however, is particularly bad on Zen+. From the perf numbers, it appears AMD processors emulate this instruction with microcode, and the Zen+ version is extra slow. Since the throughput is so low there, you can't really take advantage of the parallel FMAs. Looks like you're lucky not to be too much slower than the Vector4 code on that machine.

So, on Intel, this is strictly a win. On Zen 2/3, it should be a win but less so (I don't have access to one at the moment). And Zen+ would appear to be a loss, but one worth taking for the gains everywhere else.
Thanks for all the extra info @saucecontrol, and also for the perf analysis; that's super interesting! 😄
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1513      +/-   ##
==========================================
- Coverage   83.53%   83.53%   -0.01%
==========================================
  Files         742      742
  Lines       32732    32772     +40
  Branches     3665     3669      +4
==========================================
+ Hits        27344    27375     +31
- Misses       4672     4680      +8
- Partials      716      717      +1
```
LGTM. Great work (and code comments) @Sergio0694, and thanks @saucecontrol for the additional input.
For anyone curious, here's what the benchmarks against other libraries look like currently on my SB2.
Looking good! I reckon that with some more codec work, Vips is within reach. I've just started working on codecs myself and should be able to help out with that this year.
My takeaways from the benchmarks that James shared:
Speed improvements to resize kernel (w/ SIMD)
Prerequisites
Description
Related to #1476, this PR includes some speed improvements to the resize kernels.
In particular, it introduces a new vectorized path using AVX2/FMA operations to perform the convolutions.
For those interested, here is a sharplab link with the current codegen for the ConvolveCore method when using SIMD operations.

Still running benchmarks; work in progress...