Speed improvements to resize convolution (no vpermps w/ FMA) #1518

Closed

Conversation

@Sergio0694 (Member) commented on Jan 21, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my code matches the existing coding patterns and practices demonstrated in the repository. These follow strict StyleCop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Follow-up to #1513. This PR does a couple of things:

  • Switch the resize kernel processing to float
  • Add an AVX2-vectorized method to normalize the kernel (a sketch of the idea follows this list)
  • Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead
  • Remove the permute8x32 when using FMA, by creating a convolution kernel of 4x the size (see the sketch after the codegen diff below)
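
For context on the normalization bullet, here is a minimal sketch of what an AVX2-vectorized kernel normalization can look like, assuming the goal is to scale the weights so they sum to 1. This is illustrative only; `KernelMath` and `Normalize` are hypothetical names, not the PR's actual code:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class KernelMath
{
    // Scales the kernel weights so they sum to 1.
    public static unsafe void Normalize(Span<float> kernel)
    {
        float sum = 0f;
        int i = 0;

        fixed (float* p = kernel)
        {
            if (Avx.IsSupported && kernel.Length >= 8)
            {
                // Accumulate 8 partial sums per iteration.
                Vector256<float> acc = Vector256<float>.Zero;
                for (; i <= kernel.Length - 8; i += 8)
                {
                    acc = Avx.Add(acc, Avx.LoadVector256(p + i));
                }

                // Horizontal reduction: fold 256 bits into 128, then hadd twice.
                Vector128<float> s = Sse.Add(acc.GetLower(), acc.GetUpper());
                s = Sse3.HorizontalAdd(s, s);
                s = Sse3.HorizontalAdd(s, s);
                sum = s.ToScalar();
            }

            // Scalar tail, which is also the fallback when AVX is unavailable.
            for (; i < kernel.Length; i++)
            {
                sum += p[i];
            }

            // Multiply by the reciprocal instead of dividing per element.
            float inv = 1f / sum;
            for (int j = 0; j < kernel.Length; j++)
            {
                p[j] *= inv;
            }
        }
    }
}
```

The horizontal reduction runs only once per kernel, so the hot loop stays a single load + add per 8 weights.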

Resize convolution codegen diff

Before:

```asm
vmovsd      xmm2, [rax]        ; load two packed float weights (8 bytes)
vpermps     ymm2, ymm1, ymm2   ; replicate each weight across 4 lanes (cross-lane permute)
vfmadd231ps ymm0, ymm2, [r8]   ; acc += weights * pixels
```

After:

```asm
vmovupd     ymm2, [r8]         ; load 8 pre-duplicated weights directly
vfmadd231ps ymm0, ymm2, [rax]  ; acc += weights * pixels, no permute needed
```

@Sergio0694 added this to the 1.1.0 milestone on Jan 21, 2021
@antonfirsov (Member)

@Sergio0694 looks like there are several tests failing in KernelMapTests, some of them are concerning, for example:
https://github.com/SixLabors/ImageSharp/pull/1518/checks?check_run_id=1745872105#step:7:272

```
[xUnit.net 00:01:20.95]     KernelMapContentIsCorrect<WelchResampler>(resampler: SixLabors.ImageSharp.Processing.Processors.Transforms.WelchResampler, srcSize: 300, destSize: 2008) [FAIL]
   0 != 0.33333334 @ (Row:50, Col:0)
```

Would be nice if we could dig through the git history and check how @JimBobSquarePants's original code used doubles before I introduced ResizeKernelMap, to figure out whether the high-accuracy calculations serve real user needs or not.

@Sergio0694 (Member, Author)

@antonfirsov yeah, I noticed that too, just hadn't had the time to work on it yet 😅
That test result seems too far off to be just a rounding issue; I'm thinking I might've messed something up somewhere. Which is weird, because all the actual resize tests are passing just fine, so that's... a bit concerning.

I'm thinking maybe we should split up this PR and only merge the vperms improvement here, then tackle the switch to float and the other vectorizations in a follow-up, to make it easier to review and debug? 🤔

@antonfirsov (Member)

> all the actual resize tests are passing just fine

The resize tests do not cover all the cases. Kernel map creation is tested against all kinds of weird image dimensions + different resampler dimensions. It's not worth running expensive end-to-end resize tests for all those combinations; instead we have unit tests in ResizeKernelMapTests. There is an extended set of those tests; I suggest doing a local run with those when you are done with the rest of the PR.

> I'm thinking maybe we should split up this PR and only merge the vperms improvement here.

Splitting it out would be great, yeah! By "vperms improvement" do you mean the parts implementing #1515?

@Sergio0694 (Member, Author)

> The resize tests do not cover all the cases. Kernel map creation is tested against all kinds of weird image dimensions + different resampler dimensions. It's not worth running expensive end-to-end resize tests for all those combinations; instead we have unit tests in ResizeKernelMapTests. There is an extended set of those tests; I suggest doing a local run with those when you are done with the rest of the PR.

Ooh I see, makes sense, thanks! Will take a look at those extra tests then 😄

> Splitting it out would be great, yeah! By "vperms improvement" do you mean the parts implementing #1515?

Yup, exactly - the bit that expands the factors buffer to 4x the length and then removes the shuffle.
Will revert the other changes in this PR then, and eventually apply them in a different PR later on 👍

@JimBobSquarePants (Member)

Sorry @Sergio0694, the introduction of Git LFS (and the subsequent history rewrite) has broken this. I spent a couple of hours trying to do a merge with the unmatched history, but Git simply won't bend to my will. It'd be simpler to create a new branch from master, copy your changes into it, and open a new PR.

@JimBobSquarePants (Member)

So... I decided to revisit this after far too many years and reimplemented each commit against the v4 codebase.

The differences in kernel map generation have something to do with the PeriodicKernelMap; disabling it leads to all ResizeKernelMap tests passing. It must be something to do with the double-to-float changeover.
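
As a toy illustration of how the double-to-float changeover can perturb individual kernel map entries, consider the failing case above (srcSize: 300, destSize: 2008). The center formula below is an assumed, common form of the source-center mapping, not necessarily ImageSharp's exact code; the point is only that float arithmetic drifts from double, and a weight sitting near a tap boundary can land in a different slot:

```csharp
using System;

internal static class CenterPrecisionDemo
{
    public static void Main()
    {
        // Dimensions from the failing test (srcSize: 300, destSize: 2008).
        double scaleD = 300.0 / 2008.0;
        float scaleF = 300f / 2008f;

        // Compare the computed source centers around the failing row (Row:50).
        for (int x = 48; x <= 52; x++)
        {
            double cd = ((x + 0.5) * scaleD) - 0.5;
            float cf = ((x + 0.5f) * scaleF) - 0.5f;
            double diff = cd - cf;

            // The drift is tiny, but once a center crosses a tap boundary
            // a weight moves into a neighboring kernel slot (0 vs 0.33333334).
            Console.WriteLine($"x={x}: double={cd:R} float={cf:R} diff={diff:E2}");
        }
    }
}
```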

Benchmarks don't really yield any meaningful difference between this and main.

```
BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-VTPPCJ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Runtime=.NET 8.0  Arguments=/p:DebugType=portable
```

Main

| Method | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|
| SystemDrawing | 18.562 ms | 0.2120 ms | 0.1983 ms | 1.00 | 98 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 4.652 ms | 0.0433 ms | 0.0362 ms | 0.25 | 51117 B | 521.60 |

PR

| Method | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|
| SystemDrawing | 18.122 ms | 0.2950 ms | 0.2759 ms | 1.00 | 137 B | 1.00 |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | 4.533 ms | 0.0583 ms | 0.0517 ms | 0.25 | 52091 B | 380.23 |

You can see my changes in this branch:

https://github.com/SixLabors/ImageSharp/tree/js/resize-map-optimizations
