
Speed improvements to resize kernel (w/ SIMD) #1513

Merged
8 commits merged into master from sp/simd-resize-convolve on Jan 21, 2021

Conversation

@Sergio0694 (Member) commented Jan 19, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my code follows the existing coding patterns and practices as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Related to #1476, this PR includes some speed improvements to the resize kernels.
In particular, it introduces a new vectorized path using AVX2/FMA operations to perform the convolutions.
For those interested, here is a sharplab link with the current codegen for the ConvolveCore method, when using SIMD operations.

Still running benchmarks; work in progress...

Assembly for loading in the loop went from:
```asm
vmovss xmm2, [rax]
vbroadcastss xmm2, xmm2
vmovss xmm3, [rax+4]
vbroadcastss xmm3, xmm3
vinsertf128 ymm2, ymm2, xmm3, 1
```
To:
```asm
vmovsd xmm3, [rax]
vbroadcastsd ymm3, xmm3
vpermps ymm3, ymm1, ymm3
```
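For context, here is a condensed sketch of the loop shape that produces the new sequence, pieced together from the snippets quoted in the review comments below. Treat it as an approximation rather than the PR's exact code; `rowStartRef`, `bufferStart`, and `mask` are the names used in the PR.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Approximate shape of the AVX2/FMA convolution loop: two RGBA pixels
// (one Vector256<float>) and two weights are processed per iteration.
static unsafe Vector256<float> ConvolveTwoPixels(ref Vector4 rowStartRef, float* bufferStart, float* bufferEnd)
{
    Vector256<float> result256_0 = Vector256<float>.Zero;

    // Expands (w0, w1, ...) into (w0 w0 w0 w0 w1 w1 w1 w1): one weight per channel.
    var mask = Vector256.Create(0, 0, 0, 0, 1, 1, 1, 1);

    while (bufferStart < bufferEnd)
    {
        // Load two adjacent float weights as one double (vmovsd), broadcast it
        // (vbroadcastsd), then do a single cross-lane permute (vpermps).
        result256_0 = Fma.MultiplyAdd(
            Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
            Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
            result256_0);

        bufferStart += 2;
        rowStartRef = ref Unsafe.Add(ref rowStartRef, 2);
    }

    return result256_0;
}
```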
@Sergio0694 added this to the 1.1.0 milestone on Jan 19, 2021
@saucecontrol (Contributor)

Looking good on Skylake 👍

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  Job-XNQYOQ : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT
  Job-ALEINB : .NET Core 2.1.24 (CoreCLR 4.6.29518.01, CoreFX 4.6.29518.01), X64 RyuJIT
  Job-LVTEAU : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
```

Before:

| Method | Job | Runtime | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| SystemDrawing | Job-ILLXJH | .NET 4.7.2 | 17.320 ms | 0.0918 ms | 0.0814 ms | 1.00 | - | - | - | 256 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-ILLXJH | .NET 4.7.2 | 9.572 ms | 0.0701 ms | 0.0656 ms | 0.55 | - | - | - | 40624 B |
| SystemDrawing | Job-MQEJZE | .NET Core 2.1 | 17.535 ms | 0.2673 ms | 0.2500 ms | 1.00 | - | - | - | 96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-MQEJZE | .NET Core 2.1 | 8.645 ms | 0.0660 ms | 0.0618 ms | 0.49 | - | - | - | 40472 B |
| SystemDrawing | Job-BXMGKS | .NET Core 3.1 | 17.498 ms | 0.1888 ms | 0.1766 ms | 1.00 | - | - | - | 140 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-BXMGKS | .NET Core 3.1 | 8.008 ms | 0.0335 ms | 0.0314 ms | 0.46 | - | - | - | 40472 B |

After:

| Method | Job | Runtime | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| SystemDrawing | Job-KIBPBM | .NET 4.7.2 | 17.318 ms | 0.1773 ms | 0.1572 ms | 1.00 | - | - | - | 256 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-KIBPBM | .NET 4.7.2 | 9.562 ms | 0.0685 ms | 0.0607 ms | 0.55 | - | - | - | 40624 B |
| SystemDrawing | Job-AOLXLZ | .NET Core 2.1 | 17.362 ms | 0.2023 ms | 0.1892 ms | 1.00 | - | - | - | 96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-AOLXLZ | .NET Core 2.1 | 8.621 ms | 0.0707 ms | 0.0662 ms | 0.50 | - | - | - | 40472 B |
| SystemDrawing | Job-INXCIO | .NET Core 3.1 | 17.418 ms | 0.1830 ms | 0.1712 ms | 1.00 | - | - | - | 140 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-INXCIO | .NET Core 3.1 | 6.545 ms | 0.0714 ms | 0.0668 ms | 0.38 | 7.8125 | - | - | 40424 B |

@Sergio0694 (Member Author)

@saucecontrol That's awesome, thank you for running the benchmarks on your machine! 🚀

Here are mine; it looks like this PR is actually ever so slightly slower than master.
I'm thinking it might just be that my Ryzen 2700X (Zen+ arch) is trash at FMA stuff, wouldn't surprise me 🤔

Master

| Method | Runtime | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| ImageSharp | .NET Core 3.1 | 7.506 ms | 0.0110 ms | 0.0098 ms | 0.44 | 7.8125 | - | - | 40772 B |

PR

| Method | Runtime | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| ImageSharp | .NET Core 3.1 | 7.646 ms | 0.0986 ms | 0.1174 ms | 0.45 | 7.8125 | - | - | 40712 B |

@antonfirsov (Member)

According to the test failures, there is some noticeable (but visually still insignificant) difference between the Vector4 and the FMA output. @Sergio0694 I wonder if we can get a better understanding of this before changing tolerances?

```csharp
// fact that most CPUs have two ports to schedule multiply operations for FMA instructions.
result256_0 = Fma.MultiplyAdd(
    Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
    Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
```
(Member)

According to what I learned from @saucecontrol, moving permutes out of an operation (dependency) chain and running them as a separate sequence might help performance.

Thinking about it further: it would make the code more tricky, but maybe we could process two Vector256<float> values in the loop body, so we can run 4 permutes in a row and then do 4+4 FMAs.

@Sergio0694 (Member Author)

There was an issue with using locals here (I documented that in the comments), where the JIT was picking the wrong instruction for the FMA operation and adding extra unnecessary memory copies. Doing this inline instead picked the right one, which loads the first argument directly from memory, and that resulted in much better assembly. I'm worried that moving things around will make the codegen worse again. Also, from what we discussed on Discord, there are usually 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop? I mean, other than the general marginal improvements just from more unrolling, possibly.
I think @saucecontrol is doing only two ops per iteration in his own lib for this reason as well? 🤔

@saucecontrol (Contributor)

Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster so I dropped it to reduce complexity.

The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with Vector<T>, allowing AVX processing on netfx.
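A hypothetical sketch of that pre-duplicated layout (an illustration of the idea as described, not MagicScaler's actual code): each scalar weight is stored four times in the kernel map, matching the RGBA channel layout, so the inner loop needs no shuffle or permute at all.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// 'weights' holds every weight duplicated 4x (w0 w0 w0 w0 w1 w1 w1 w1 ...),
// so the weight vectors can be loaded directly; 'taps' is assumed to be a
// multiple of 4 here for simplicity.
static unsafe Vector256<float> ConvolvePreDuplicated(float* pixels, float* weights, int taps)
{
    Vector256<float> acc0 = Vector256<float>.Zero;
    Vector256<float> acc1 = Vector256<float>.Zero;

    // 4 taps (16 floats) per iteration: 2 pixel loads and 2 FMAs, with the
    // weight loads free to fold into the FMA's memory operand.
    for (int i = 0; i < taps; i += 4)
    {
        acc0 = Fma.MultiplyAdd(Avx.LoadVector256(pixels), Avx.LoadVector256(weights), acc0);
        acc1 = Fma.MultiplyAdd(Avx.LoadVector256(pixels + 8), Avx.LoadVector256(weights + 8), acc1);
        pixels += 16;
        weights += 16;
    }

    return Avx.Add(acc0, acc1);
}
```

The trade-off is a kernel map that is 4x larger, in exchange for a shorter dependency chain in the hot loop.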

```csharp
float* bufferEnd = bufferStart + (this.Length & ~3);
Vector256<float> result256_0 = Vector256<float>.Zero;
Vector256<float> result256_1 = Vector256<float>.Zero;
var mask = Vector256.Create(0, 0, 0, 0, 1, 1, 1, 1);
```
(Member)

Should be Vector.Load with a ROS.

@Sergio0694 (Member Author) commented Jan 20, 2021

The codegen in this very specific case seems to actually be just a vxorps (which is... weird), so I'm looking into this. It seems to be fine even without using a ROS though, or actually even better than that.
Will update in a bit 🙂

EDIT: this might actually be a JIT bug, investigating...

EDIT 2: yeah it's an inlining bug that only repros on .NET 5 with precisely these arguments 🤣
Opening an issue.
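For reference, the ROS suggestion refers to the C# compiler optimization where a `static ReadOnlySpan<byte>` property backed by a constant array is baked into the assembly's data section, so the mask can be loaded with a single unaligned read instead of being built with `Vector256.Create` at runtime. A minimal sketch (the names are made up for illustration):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class PermuteMasks
{
    // Little-endian bytes of the eight int32 permute indices 0,0,0,0,1,1,1,1.
    private static ReadOnlySpan<byte> MaskBytes => new byte[]
    {
        0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
        1, 0, 0, 0,  1, 0, 0, 0,  1, 0, 0, 0,  1, 0, 0, 0,
    };

    // Loads the mask straight from the binary's data section (no allocation).
    public static Vector256<int> Load() =>
        Unsafe.ReadUnaligned<Vector256<int>>(ref MemoryMarshal.GetReference(MaskBytes));
}
```

Per the reply above, the codegen for the `Vector256.Create` call turned out to be acceptable here anyway (apart from the .NET 5 inlining bug), so this is only for reference.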

@antonfirsov (Member)

Regarding my #1513 (comment), I think Wikipedia answered my concerns:

> a fused multiply–add would compute the entire expression a+b×c to its full precision before rounding the final result down to N significant bits.

If I understand it correctly, the FMA code is actually more accurate.

@Sergio0694 (Member Author)

Yeah, in theory the FMA path should be more accurate. From what @tannergooding said on Discord:

"So, for reference, an fma(x, y, z) vs x * y + z should differ by no more than a single bit (ignoring the special case around infinity)
this can compound over many operations to produce even more significant differences
this falls out because the former rounds once while the latter rounds twice"

So in theory the difference should mean the FMA version is ever so slightly better, as you mentioned.
It's not quite guaranteed in theory, though: if the number of pixels per row is not evenly divisible, the last pixels are computed at a slightly different precision than the previous ones, and we're also using 4 different accumulators in parallel and summing them all together at the end, which I'd expect alters the overall precision a bit too. But all things considered, this new version should be at least as precise as the original, I think 🤔
Also, if we can't spot any differences when looking at the images, I'd say we're probably good 😄
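To make the one-rounding-vs-two point concrete, here is a small standalone example (not from the PR) using `MathF.FusedMultiplyAdd`. The float closest to 1/3, multiplied by 3, is exactly 1 + 2⁻²⁵; the separate multiply rounds that back to 1.0f and loses the residue, while the fused version keeps it:

```csharp
using System;

float x = 1f / 3f;             // nearest float to 1/3: 11184811 * 2^-25
float product = x * 3f;        // exact value is 1 + 2^-25, which rounds to 1.0f
float separate = product - 1f; // the residue is gone: 0

float fused = MathF.FusedMultiplyAdd(x, 3f, -1f); // rounds once: exactly 2^-25

Console.WriteLine(separate);   // 0
Console.WriteLine(fused);      // 2.9802322E-08
```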

@saucecontrol (Contributor) commented Jan 20, 2021

> using 4 different accumulators in parallel and then just summing them all together at the end, so I'd expect that would alter the overall precision a bit too

Yes, this also improves precision. Take the following example:

```csharp
var rand = new Random(101);
var vals = new float[128];
for (int i = 0; i < vals.Length; i++)
    vals[i] = (float)rand.NextDouble();

float fa = 0;
for (int i = 0; i < vals.Length; i++)
    fa += vals[i];

double da = 0;
for (int i = 0; i < vals.Length; i++)
    da += vals[i];

float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
for (int i = 0; i < vals.Length / 4; i++)
{
    int j = i * 4;
    a0 += vals[j];
    a1 += vals[j + 1];
    a2 += vals[j + 2];
    a3 += vals[j + 3];
}
float pa = a0 + a1 + a2 + a3;

Console.WriteLine("float accumulator:    " + fa.ToString("f12"));
Console.WriteLine("double accumulator:   " + da.ToString("f12"));
Console.WriteLine("4x float accumulator: " + pa.ToString("f12"));
```

Outputs:

```
float accumulator:    57.172065734863
double accumulator:   57.172055415809
4x float accumulator: 57.172058105469
```

Just as FMA reduces the number of rounding steps, 4 accumulators round a quarter as many times.

@saucecontrol (Contributor)

> I'm thinking it might just be that my Ryzen 2700X (Zen+ arch) is trash at FMA stuff

I looked at the Zen+ perf numbers at uops.info, and they're not great. On Intel processors, FMA gives you the add for free after the multiply, saving 4 cycles of latency. On AMD processors, FMA is only 1 cycle faster than MUL+ADD, but it occupies the multiply ports the whole time, whereas a separate ADD can use a different set of ports. This was the point @tannergooding was making on Discord yesterday: if there's contention on the MUL ports, the FMA version ties them up longer, so it may not be a win. That's particularly true of Zen+, where a 256-bit FMA ties up both ports because it's split into 2 128-bit uops. In this specific case, there is no other work to be done, so it's not a negative, but it's also not much of a win at all.

The permute, however, is particularly bad on Zen+. From the perf numbers, it appears AMD processors emulate this instruction with microcode, and the Zen+ version is extra slow. Since the throughput is so low there, you can't really take advantage of the parallel FMAs. Looks like you're lucky not to be much slower than the Vector4 code on that machine.

So, on Intel, this is strictly a win. On Zen 2/3, it should be a win but less of one (I don't have access to one at the moment). And Zen+ would appear to be a loss, but one worth taking for the gains everywhere else.

@Sergio0694 (Member Author)

Thanks for all the extra info and the perf analysis, @saucecontrol; that's super interesting! 😄
Will bump the image comparison threshold to 0.004% for those resize tests then, to make the CI happy.

@codecov (bot) commented Jan 20, 2021

Codecov Report

Merging #1513 (e2211c3) into master (eab04e4) will decrease coverage by 0.00%.
The diff coverage is 80.85%.


```diff
@@            Coverage Diff             @@
##           master    #1513      +/-   ##
==========================================
- Coverage   83.53%   83.53%   -0.01%
==========================================
  Files         742      742
  Lines       32732    32772      +40
  Branches     3665     3669       +4
==========================================
+ Hits        27344    27375      +31
- Misses       4672     4680       +8
- Partials      716      717       +1
```

| Flag | Coverage Δ |
|---|---|
| unittests | 83.53% <80.85%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...ssing/Processors/Transforms/Resize/ResizeKernel.cs | 85.24% <80.85%> (-14.76%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@JimBobSquarePants (Member)

LGTM. Great work (and code comments) @Sergio0694 and thanks @saucecontrol for the additional input.

@JimBobSquarePants merged commit 7eb5cc0 into master on Jan 21, 2021
@JimBobSquarePants deleted the sp/simd-resize-convolve branch on January 21, 2021 at 01:18
@JimBobSquarePants (Member)

For anyone curious, here's what the benchmarks against other libraries look like currently on my SB2.

```
BenchmarkDotNet=v0.12.1.1467-nightly, OS=Windows 10.0.19042
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.102
  [Host]       : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT
  .Net 5.0 CLI : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT

Job=.Net 5.0 CLI  Arguments=/p:DebugType=portable  Toolchain=.NET 5.0
IterationCount=5  LaunchCount=1  WarmupCount=5
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| 'System.Drawing Resize' | 11,789.6 us | 530.93 us | 137.88 us | 1.00 | 0.00 | - | - | - | 136 B |
| 'ImageSharp Resize' | 2,793.8 us | 87.18 us | 22.64 us | 0.24 | 0.00 | - | - | - | 9,152 B |
| 'ImageMagick Resize' | 57,661.5 us | 6,566.96 us | 1,705.42 us | 4.89 | 0.14 | - | - | - | 5,449 B |
| 'FreeImage Resize' | 8,188.4 us | 133.72 us | 34.73 us | 0.69 | 0.01 | 500.0000 | 500.0000 | 500.0000 | 136 B |
| 'MagicScaler Resize' | 797.7 us | 21.21 us | 3.28 us | 0.07 | 0.00 | - | - | - | 1,872 B |
| 'SkiaSharp Canvas Resize' | 2,508.5 us | 331.56 us | 86.11 us | 0.21 | 0.01 | - | - | - | 1,584 B |
| 'SkiaSharp Bitmap Resize' | 2,452.5 us | 377.57 us | 58.43 us | 0.21 | 0.01 | - | - | - | 488 B |
| 'NetVips Resize' | 6,148.8 us | 195.83 us | 30.31 us | 0.52 | 0.01 | - | - | - | 4,104 B |
```
BenchmarkDotNet=v0.12.1.1467-nightly, OS=Windows 10.0.19042
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.102
  [Host]       : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT
  .Net 5.0 CLI : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT

Job=.Net 5.0 CLI  Arguments=/p:DebugType=portable  Toolchain=.NET 5.0
IterationCount=5  LaunchCount=1  WarmupCount=5
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| 'System.Drawing Load, Resize, Save' | 420.2 ms | 37.87 ms | 9.83 ms | 1.00 | 0.00 | - | - | - | 12 KB |
| 'ImageSharp Load, Resize, Save' | 211.6 ms | 7.87 ms | 2.04 ms | 0.50 | 0.01 | 333.3333 | - | - | 2,179 KB |
| 'ImageMagick Load, Resize, Save' | 469.7 ms | 10.39 ms | 2.70 ms | 1.12 | 0.02 | - | - | - | 58 KB |
| 'MagicScaler Load, Resize, Save' | 105.0 ms | 5.45 ms | 1.41 ms | 0.25 | 0.00 | - | - | - | 57 KB |
| 'SkiaSharp Canvas Load, Resize, Save' | 263.6 ms | 4.55 ms | 0.70 ms | 0.63 | 0.02 | - | - | - | 107 KB |
| 'SkiaSharp Bitmap Load, Resize, Save' | 278.6 ms | 101.39 ms | 15.69 ms | 0.66 | 0.05 | - | - | - | 91 KB |
| 'NetVips Load, Resize, Save' | 184.5 ms | 17.62 ms | 4.58 ms | 0.44 | 0.01 | - | - | - | 50 KB |

@saucecontrol (Contributor)

Looking good! I reckon with some more codec work, Vips is within reach. I've just started working on codecs myself and should be able to help out with that this year.

@JimBobSquarePants (Member)

> I've just started working on codecs myself and should be able to help out with that this year.

@Sergio0694 (Member Author)

My takeaways from the benchmarks that James shared:

  • The overall performance for ImageSharp looks pretty good!
  • Definitely agreed that with some improvements to encoding/decoding we can beat NetVips 🚀
  • MagicScaler is just stupid fast, it's ridiculous 🤣
