Optimize Convolution #1465

Merged
merged 13 commits into master from js/convolution-experiments on Dec 10, 2020
Conversation

JimBobSquarePants
Member

@JimBobSquarePants JimBobSquarePants commented Dec 7, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my code follows the existing coding patterns and practices demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Refactors all the convolution algorithms to take advantage of bulk operations.

Operationally the way the algorithm works has changed as follows:

Before
Per Interest Row
        |____ Per Interest Column
                |____ Per Kernel Row
                        |____ Per Kernel Column

After
Per Interest Row
        |____ Per Kernel Row
                |____ Per Interest Column
                        |____ Per Kernel Column

This allows us to use our bulk conversion methods per row.
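For illustration, the reordered loop nest could be sketched like this (names, helpers, and the clamp-based edge handling are hypothetical, not the PR's actual code). Hoisting the kernel-row loop above the interest-column loop means each (interest row, kernel row) pair reads exactly one contiguous source row, which is what enables the bulk pixel-to-Vector4 conversion per row:

```csharp
using System.Numerics;

static class ConvolutionSketch
{
    // Accumulates one target row of a 2D convolution. Loop order: per kernel row,
    // then per interest column, then per kernel column.
    public static void ConvolveRow(
        Vector4[][] source, float[,] kernel, Vector4[] targetRow, int y, int radius)
    {
        int kernelHeight = kernel.GetLength(0);
        int kernelWidth = kernel.GetLength(1);

        for (int ky = 0; ky < kernelHeight; ky++)                  // per kernel row
        {
            // One contiguous source row per (y, ky) pair: bulk-convertible.
            Vector4[] sourceRow = source[Clamp(y + ky - radius, 0, source.Length - 1)];

            for (int x = 0; x < targetRow.Length; x++)             // per interest column
            {
                for (int kx = 0; kx < kernelWidth; kx++)           // per kernel column
                {
                    int sx = Clamp(x + kx - radius, 0, sourceRow.Length - 1);
                    targetRow[x] += kernel[ky, kx] * sourceRow[sx];
                }
            }
        }
    }

    static int Clamp(int value, int min, int max)
        => value < min ? min : value > max ? max : value;
}
```

In the old order, the innermost kernel loops jumped between source rows for every single pixel; here each source row is touched once per kernel row.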

Benchmarks are very healthy.

Before
blur-before (benchmark screenshot)

After
blur-after (benchmark screenshot)

@@ -53,8 +53,13 @@ public override Span<T> GetSpan()
{
ThrowObjectDisposedException();
}

#if SUPPORTS_CREATESPAN
Member Author

This came up during profiling. On .NET Core 3.1 we can use a shortcut since we know the length.

#if SUPPORTS_CREATESPAN
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data);
Member

Small note - this is implicitly creating a Span<byte> from the array. When we add .NET 5 support we can optimize this even further by using MemoryMarshal.GetArrayDataReference instead and completely bypass all checks 😊
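A minimal self-contained sketch of what the suggested .NET 5 path could look like (the helper name, `NET5_0_OR_GREATER` guard, and method shape are assumptions for illustration, not the PR's code):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class SpanSketch
{
    // Reinterprets a pooled byte[] as a Span<T> of a known element count.
    // On .NET 5+, GetArrayDataReference skips the implicit Span<byte> creation
    // (and its null/covariance checks) that GetReference(Span<byte>) entails.
    public static Span<T> AsSpanOf<T>(byte[] data, int length) where T : struct
    {
#if NET5_0_OR_GREATER
        ref byte r0 = ref MemoryMarshal.GetArrayDataReference(data);
        return MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref r0), length);
#else
        return MemoryMarshal.Cast<byte, T>(data.AsSpan()).Slice(0, length);
#endif
    }
}
```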

Member Author

Yeah... I was looking at that. I still don't know what to do re .NET 5 though. It'd be our first target framework that isn't LTS.

Member

@Sergio0694 Sergio0694 Dec 7, 2020

Ah, right. The way I see it, we could maybe do:

  • Add a .NET 5 target to start preparing the codebase for .NET 6. When .NET 6 lands, immediately switch to it and drop .NET 5, since support for it would end 3 months after .NET 6 is released anyways.
  • Just hold our breath until .NET 6 lands and stick to .NET Core 3.1 as max until then.

I'd say it might be worth doing an initial investigation for starters, to see how many places in the lib could actually benefit from .NET 5 exclusive APIs that are not on .NET Core 3.1? Then we could decide based on that 🙂

Member Author

ARM intrinsics is a big one.

Member

@Sergio0694 Sergio0694 Dec 7, 2020

Ah, right, right. Yeah that alone might be worth it. Properly supporting ARM64 gives a ton of visibility right now, plus being ready for .NET 6 would be very nice. And if we are to work on ARM64 right now anyway, I don't see the harm in publishing .NET 5 packages already, assuming we'll get a working build soon. After all, companies would likely remain on .NET Core 3.1 until .NET 6 is out, so those wouldn't be affected. And devs on .NET 5 now would be fine with having to jump to .NET 6 immediately anyway; after all, they just did the same with .NET 5. I'd say if you have time you could look into this, and then when you have a branch that shows a nice boost on ARM64, consider shipping if it's really good (and if it's not like mid 2021 already by the time that's done)? 😄

@codecov

codecov bot commented Dec 7, 2020

Codecov Report

Merging #1465 (863bddb) into master (1f351ee) will decrease coverage by 0.06%.
The diff coverage is 72.53%.


@@            Coverage Diff             @@
##           master    #1465      +/-   ##
==========================================
- Coverage   83.56%   83.49%   -0.07%     
==========================================
  Files         737      742       +5     
  Lines       32232    32347     +115     
  Branches     3618     3639      +21     
==========================================
+ Hits        26935    27009      +74     
- Misses       4581     4625      +44     
+ Partials      716      713       -3     
Flag Coverage Δ
unittests 83.49% <72.53%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ImageSharp/Primitives/DenseMatrix{T}.cs 90.00% <ø> (ø)
...ors/Convolution/ConvolutionRowOperation{TPixel}.cs 53.73% <53.73%> (ø)
...s/Convolution/Convolution2DRowOperation{TPixel}.cs 54.54% <54.54%> (ø)
...rocessing/Processors/Convolution/ReadOnlyKernel.cs 66.66% <66.66%> (ø)
...essors/Convolution/ConvolutionProcessor{TPixel}.cs 77.63% <71.92%> (+0.27%) ⬆️
...y/Allocators/ArrayPoolMemoryAllocator.Buffer{T}.cs 77.27% <100.00%> (+1.08%) ⬆️
...sors/Convolution/Convolution2DProcessor{TPixel}.cs 100.00% <100.00%> (+22.03%) ⬆️
...ssing/Processors/Convolution/Convolution2DState.cs 100.00% <100.00%> (ø)
...s/Convolution/Convolution2PassProcessor{TPixel}.cs 100.00% <100.00%> (+20.68%) ⬆️
...cessing/Processors/Convolution/ConvolutionState.cs 100.00% <100.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f351ee...863bddb.

Member

@Sergio0694 Sergio0694 left a comment

Looks great, and love the speed improvements!
Just left a few notes with respect to code style and other minor things. My main concern is more about peak memory usage in this case, so I'm curious to see more info on that front. Will get to work on the bokeh blur version using this new pattern once this is merged 😄

#if SUPPORTS_CREATESPAN
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data);
return MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref r0), this.length);
#else
return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan()).Slice(0, this.length);
Member

In theory here we know that the byte size is an exact multiple of our target length for the T span, so I was thinking we could be able to remove that final Slice call by pre-slicing the Data array. Something like:

return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan(0, this.length * sizeof(T)));

Tried out on sharplab (here) but apparently the codegen is worse for some reason. So yeah I'm not really seeing a way to improve this without having access to the .NET 5 APIs 🤔

Member Author

Yep .NET 5 makes this easier.

[MethodImpl(InliningOptions.ShortMethod)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

Why is it AggressiveInlining here and not the internal ShortMethod? Was this intentional or did you just forget to change this back after your initial experiments?

Member Author

Intentional. ShortMethod was initially designed for certain jpeg profiling but it kinda spread. I don't think we need it anymore with the tooling we have available.

Member

Ah, got it. Well we might want to revert that in another PR then 😄

Comment on lines +67 to +69
// We use a rectangle 3x the interest width to allocate a buffer big enough
// for source and target bulk pixel conversion.
var operationBounds = new Rectangle(interest.X, interest.Y, interest.Width * 3, interest.Height);
Member

I'm a bit confused about the temporary memory usage of this solution, as I remember you being a bit worried about this back when I was first working on the bokeh blur processor. Wouldn't this mean we're now allocating a 3 x [image size] buffer every time? As in, if I'm processing a 1920x1080 image, this would allocate a 5760 x 1080 temporary buffer, correct? Is that ok? For a 4K image that'd be a buffer of 24 million pixels 🤔

Looking at the rest of the code I think I get why you're doing this (as it allows you to access pixels on the Y axis in row major order), but I'd be curious about a diff in max peak memory usage compared to master, if you have made such a benchmark? If nothing else, I think it'd be an interesting bit of info to include 🙂

Member Author

@JimBobSquarePants JimBobSquarePants Dec 10, 2020

We're actually only allocating a single Span<Vector4> of length 3x the interest width per parallel region, and only for 2D convolution. For 1D or 2-pass convolution it's 2x the interest width.

using IMemoryOwner<TBuffer> buffer = this.allocator.Allocate<TBuffer>(this.width);
Span<TBuffer> span = buffer.Memory.Span;
for (int y = yMin; y < yMax; y++)
{
Unsafe.AsRef(this.action).Invoke(y, span);
}
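To make the memory impact concrete, a rough back-of-envelope check (the widths are illustrative examples, not figures from the PR): the scratch buffer scales with row width, not image area.

```csharp
using System;

// Hypothetical sizing check: the 2D convolution scratch buffer is 3x the
// interest width in Vector4 elements, allocated once per parallel region.
int interestWidth = 3840;                      // e.g. a 4K-wide interest region
int vector4Count = 3 * interestWidth;          // 11,520 Vector4 elements
int bytes = vector4Count * 4 * sizeof(float);  // a Vector4 is four floats = 16 bytes
Console.WriteLine($"{bytes} bytes ({bytes / 1024} KiB) per parallel region");
```

So even at 4K width the per-region scratch is on the order of 180 KiB, nothing like the 3 x [image size] buffer the question above worried about.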

Member

@Sergio0694 Sergio0694 Dec 10, 2020

Oooh I see, yeah that makes sense. Perfect then! 🚀

...I even wrote that code 🤣

Comment on lines -45 to 52
/// Gets the horizontal gradient operator.
/// Gets the horizontal convolution kernel.
/// </summary>
public DenseMatrix<float> KernelX { get; }

/// <summary>
/// Gets the vertical gradient operator.
/// Gets the vertical convolution kernel.
/// </summary>
public DenseMatrix<float> KernelY { get; }
Member

Small side note, I'm still convinced that using DenseMatrix<T> for separable 1D kernels makes the code less intuitive (those are 1D vectors, not matrices) and potentially adds overhead from the unnecessary 2D coordinate calculation. I think we could possibly look into having a different DenseVector type in the future that would just map to a 1D vector, to use in cases such as this.

Member Author

It makes my life easier as I can use the same code throughout when doing things like offset mapping.

We also use the same code for the non-separable convolution operation so that can be arbitrary dimensions.

/// A stack only, readonly, kernel matrix that can be indexed without
/// bounds checks when compiled in release mode.
/// </summary>
internal readonly ref struct ReadOnlyKernel
Member

Also to get back to my point above with respect to clarity, imho this should be called ReadOnlyKernel2D.

Member Author

See above, there'll only be one type.

@JimBobSquarePants JimBobSquarePants merged commit ff94d20 into master Dec 10, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/convolution-experiments branch December 10, 2020 19:39
@Sergio0694 Sergio0694 mentioned this pull request Dec 12, 2020
JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021