Speed improvements to resize kernel (w/ SIMD) #1513
Conversation
Assembly for loading in the loop went from:

```asm
vmovss xmm2, [rax]
vbroadcastss xmm2, xmm2
vmovss xmm3, [rax+4]
vbroadcastss xmm3, xmm3
vinsertf128 ymm2, ymm2, xmm3, 1
```

to:

```asm
vmovsd xmm3, [rax]
vbroadcastsd ymm3, xmm3
vpermps ymm3, ymm1, ymm3
```
Looking good on Skylake 👍

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  Job-XNQYOQ : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT
  Job-ALEINB : .NET Core 2.1.24 (CoreCLR 4.6.29518.01, CoreFX 4.6.29518.01), X64 RyuJIT
  Job-LVTEAU : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
```

Before:
After:
@saucecontrol That's awesome, thank you for running the benchmarks on your machine! 🚀 Here are mine; it looks like this PR is actually ever so slightly slower than master.

Master:

PR:
According to the test failures, there is some noticeable (but visually still insignificant) difference between the
```csharp
// fact that most CPUs have two ports to schedule multiply operations for FMA instructions.
result256_0 = Fma.MultiplyAdd(
    Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef),
    Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask),
```
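For reference, the mask `(0, 0, 0, 0, 1, 1, 1, 1)` used here selects source lane 0 for the low four floats and source lane 1 for the high four. A scalar C emulation of the `vpermps` lane-selection semantics (my own sketch, not library code) makes the data movement explicit:

```c
#include <assert.h>

/* Scalar emulation of AVX2 vpermps: dst[i] = src[mask[i]].
   With mask {0,0,0,0,1,1,1,1}, this broadcasts weight 0 across the
   low 4 lanes and weight 1 across the high 4 lanes, so one 256-bit
   register holds both weights duplicated per RGBA channel. */
static void permute8(const float *src, const int *mask, float *dst)
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[mask[i]];
}
```

After the permute, a single FMA can multiply two whole pixels by their two weights at once.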
According to what I learned from @saucecontrol, moving permutes out of an operation (dependency) chain and running them in a separate sequence might help performance.

Thinking about it further: it would make the code more tricky, but maybe we can try to process two Vector256<float> values in the loop body, so we can run 4 permutes in a row, then do 4+4 FMAs.
There was an issue with using locals here (I documented that in the comments here), where the JIT was picking the wrong instruction for the FMA operation and adding extra unnecessary memory copies. Doing this inline instead picked the right one that directly loaded the first argument from memory, which resulted in much better assembly. I'm worried that moving things around will make the codegen worse again there. Also, from what we discussed on Discord, there are usually 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop? I mean, other than the general marginal improvements just due to more unrolling, possibly.
I think @saucecontrol is doing only two ops per iteration as well in his own lib because of this? 🤔
Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster so I dropped it to reduce complexity.
The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with Vector<T>, allowing AVX processing on netfx.
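A hypothetical C sketch of that layout (my own illustration of the technique, not MagicScaler's actual code): each scalar weight is stored once per channel in the kernel map ahead of time, so the convolution loop multiplies a whole pixel by a contiguous weight vector with no shuffle or permute.

```c
#define CHANNELS 4  /* RGBA */

/* Build a "pre-duplicated" kernel map: each scalar weight is repeated
   once per channel, so weights line up 1:1 with interleaved pixel
   components in memory. */
static void duplicate_weights(const float *w, int n, float *out)
{
    for (int i = 0; i < n; i++)
        for (int c = 0; c < CHANNELS; c++)
            out[i * CHANNELS + c] = w[i];
}

/* Inner loop: with the duplicated layout, each iteration is just a
   contiguous pixel read times a contiguous weight read, accumulated
   per channel; this maps directly onto vector FMAs. */
static void convolve(const float *pixels, const float *wdup, int n,
                     float *acc /* CHANNELS floats */)
{
    for (int c = 0; c < CHANNELS; c++)
        acc[c] = 0.0f;
    for (int i = 0; i < n; i++)
        for (int c = 0; c < CHANNELS; c++)
            acc[c] += pixels[i * CHANNELS + c] * wdup[i * CHANNELS + c];
}
```

The trade-off is a larger kernel map (4x the weight storage) in exchange for a shuffle-free inner loop.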
```csharp
float* bufferEnd = bufferStart + (this.Length & ~3);
Vector256<float> result256_0 = Vector256<float>.Zero;
Vector256<float> result256_1 = Vector256<float>.Zero;
var mask = Vector256.Create(0, 0, 0, 0, 1, 1, 1, 1);
```
Should be Vector.Load with a ROS.
The codegen in this very specific case seems to actually be just a vxorps (which is actually... weird), so I'm looking into this. It seems to be ok even though we're not using a ROS, or actually better than that too. Will update in a bit 🙂

EDIT: this might actually be a JIT bug, investigating...

EDIT 2: yeah, it's an inlining bug that only repros on .NET 5 with precisely these arguments 🤣 Opening an issue.
Regarding my #1513 (comment), I think Wikipedia answered my concerns:

If I get it right, the FMA code is actually more accurate.
Yeah, in theory the FMA path should be more accurate. From what @tannergooding said on Discord:
So in theory the difference should mean the FMA is ever so slightly better, as you mentioned.
See Vector256.Create issue: dotnet/runtime#47236
Yes, this also improves precision. Take the following example:

```csharp
var rand = new Random(101);
var vals = new float[128];
for (int i = 0; i < vals.Length; i++)
    vals[i] = (float)rand.NextDouble();

float fa = 0;
for (int i = 0; i < vals.Length; i++)
    fa += vals[i];

double da = 0;
for (int i = 0; i < vals.Length; i++)
    da += vals[i];

float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
for (int i = 0; i < vals.Length / 4; i++)
{
    int j = i * 4;
    a0 += vals[j];
    a1 += vals[j + 1];
    a2 += vals[j + 2];
    a3 += vals[j + 3];
}
float pa = a0 + a1 + a2 + a3;

Console.WriteLine("float accumulator:    " + fa.ToString("f12"));
Console.WriteLine("double accumulator:   " + da.ToString("f12"));
Console.WriteLine("4x float accumulator: " + pa.ToString("f12"));
```

Outputs:
Just as FMA reduces the number of rounding steps, using 4 accumulators rounds 1/4 as many times.
I looked at the Zen+ perf numbers at uops.info, and they're not great. On Intel processors, FMA gives you the add for free after the multiply, saving 4 cycles of latency. On AMD processors, FMA is only 1 cycle faster than MUL+ADD, but it runs on the multiply ports the whole time, whereas ADD can use a different set of ports if they are split. This was the point @tannergooding was making on Discord yesterday: if there were contention on the MUL ports, the FMA version ties them up longer, so it may not be a win. That's particularly true of Zen+, where a 256-bit FMA ties up both ports because it's split into 2 128-bit uops. In this specific case, there is no other work to be done, so it's not a negative, but it's also not much of a win at all.

The permute, however, is particularly bad on Zen+. From the perf numbers, it appears AMD processors emulate this instruction with microcode, and the Zen+ version is extra slow. Since the throughput is so low there, you can't really take advantage of the parallel FMAs. Looks like you're lucky not to be too much slower than the Vector4 code on that machine.

So, on Intel, this is strictly a win. On Zen 2/3, it should be a win but less so (I don't have access to one at the moment). And Zen+ would appear to be a loss, but one worth taking for the gains everywhere else.
Thanks for all the extra info @saucecontrol, and also for the perf analysis; that's super interesting! 😄
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1513      +/-   ##
==========================================
- Coverage   83.53%   83.53%   -0.01%
==========================================
  Files         742      742
  Lines       32732    32772     +40
  Branches     3665     3669      +4
==========================================
+ Hits        27344    27375     +31
- Misses       4672     4680      +8
- Partials      716      717      +1
```
LGTM. Great work (and code comments) @Sergio0694, and thanks @saucecontrol for the additional input.
For anyone curious, here's what the benchmarks against other libraries look like currently on my SB2.
Looking good! I reckon that with some more codec work, Vips is within reach. I've just started working on codecs myself and should be able to help out with that this year.
My takeaways from the benchmarks that James shared:
Speed improvements to resize kernel (w/ SIMD)
Prerequisites
Description
Related to #1476, this PR includes some speed improvements to the resize kernels.
In particular, it introduces a new vectorized path using AVX2/FMA operations to perform the convolutions.
For those interested, here is a sharplab link with the current codegen for the ConvolveCore method when using SIMD operations.

Still running benchmarks; work in progress...