Optimize Convolution #1465

Merged
merged 13 commits into master from js/convolution-experiments on Dec 10, 2020
Conversation

JimBobSquarePants
Member

@JimBobSquarePants JimBobSquarePants commented Dec 7, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my code follows the existing coding patterns and practices demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Refactors all the convolution algorithms to take advantage of bulk operations.

Operationally the way the algorithm works has changed as follows:

Before
Per Interest Row
        |____ Per Interest Column
                |____ Per Kernel Row
                        |____ Per Kernel Column

After
Per Interest Row
        |____ Per Kernel Row
                |____ Per Interest Column
                        |____ Per Kernel Column

This allows us to use our bulk conversion methods per row.
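For illustration, the reordered loop nest could be sketched like this (names, helpers, and the clamp-based edge handling are hypothetical, not the PR's actual code). Hoisting the kernel-row loop above the interest-column loop means each (interest row, kernel row) pair reads exactly one contiguous source row, which is what enables the bulk pixel-to-Vector4 conversion per row:

```csharp
using System.Numerics;

static class ConvolutionSketch
{
    // Accumulates one target row of a 2D convolution. Loop order: per kernel row,
    // then per interest column, then per kernel column.
    public static void ConvolveRow(
        Vector4[][] source, float[,] kernel, Vector4[] targetRow, int y, int radius)
    {
        int kernelHeight = kernel.GetLength(0);
        int kernelWidth = kernel.GetLength(1);

        for (int ky = 0; ky < kernelHeight; ky++)                  // per kernel row
        {
            // One contiguous source row per (y, ky) pair: bulk-convertible.
            Vector4[] sourceRow = source[Clamp(y + ky - radius, 0, source.Length - 1)];

            for (int x = 0; x < targetRow.Length; x++)             // per interest column
            {
                for (int kx = 0; kx < kernelWidth; kx++)           // per kernel column
                {
                    int sx = Clamp(x + kx - radius, 0, sourceRow.Length - 1);
                    targetRow[x] += kernel[ky, kx] * sourceRow[sx];
                }
            }
        }
    }

    static int Clamp(int value, int min, int max)
        => value < min ? min : value > max ? max : value;
}
```

In the old order, the innermost kernel loops jumped between source rows for every single pixel; here each source row is touched once per kernel row.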

Benchmarks are very healthy.

Before
blur-before (benchmark screenshot)

After
blur-after (benchmark screenshot)

@@ -53,8 +53,13 @@ public override Span<T> GetSpan()
{
ThrowObjectDisposedException();
}

#if SUPPORTS_CREATESPAN
Member Author

This came up during profiling. On .NET Core 3.1 we can use a shortcut since we know the length.

#if SUPPORTS_CREATESPAN
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data);
Member

Small note - this is implicitly creating a Span<byte> from the array. When we add .NET 5 support we can optimize this even further by using MemoryMarshal.GetArrayDataReference instead and completely bypass all checks 😊
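A minimal self-contained sketch of what the suggested .NET 5 path could look like (the helper name, `NET5_0_OR_GREATER` guard, and method shape are assumptions for illustration, not the PR's code):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class SpanSketch
{
    // Reinterprets a pooled byte[] as a Span<T> of a known element count.
    // On .NET 5+, GetArrayDataReference skips the implicit Span<byte> creation
    // (and its null/covariance checks) that GetReference(Span<byte>) entails.
    public static Span<T> AsSpanOf<T>(byte[] data, int length) where T : struct
    {
#if NET5_0_OR_GREATER
        ref byte r0 = ref MemoryMarshal.GetArrayDataReference(data);
        return MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref r0), length);
#else
        return MemoryMarshal.Cast<byte, T>(data.AsSpan()).Slice(0, length);
#endif
    }
}
```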

Member Author

Yeah... I was looking at that. I still don't know what to do re .NET 5 though. It'd be our first target framework that isn't LTS.

Member

@Sergio0694 Sergio0694 Dec 7, 2020

Ah, right. The way I see it, we could maybe do:

  • Add a .NET 5 target to start preparing the codebase for .NET 6. When .NET 6 lands, immediately switch to it and drop .NET 5, since support for it would end 3 months after .NET 6 is released anyways.
  • Just hold our breath until .NET 6 lands and stick to .NET Core 3.1 as max until then.

I'd say it might be worth doing an initial investigation for starters, to see how many places in the lib could actually benefit from .NET 5 exclusive APIs that are not on .NET Core 3.1? Then we could decide based on that 🙂

Member Author

ARM intrinsics is a big one.

Member

@Sergio0694 Sergio0694 Dec 7, 2020

Ah, right, right. Yeah that alone might be worth it. Properly supporting ARM64 gives a ton of visibility right now, plus being ready for .NET 6 would be very nice. And if we are to work on ARM64 right now anyway, I don't see the harm in publishing .NET 5 packages already, assuming we'll get a working build soon. After all, companies would likely remain on .NET Core 3.1 until .NET 6 is out, so those wouldn't be affected. And devs on .NET 5 now would be fine with having to jump to .NET 6 immediately anyway; after all, they just did the same with .NET 5. I'd say if you have time you could look into this, and then when you have a branch that shows a nice boost on ARM64, consider shipping if it's really good (and if it's not like mid 2021 already by the time that's done)? 😄

@codecov

codecov bot commented Dec 7, 2020

Codecov Report

Merging #1465 (863bddb) into master (1f351ee) will decrease coverage by 0.06%.
The diff coverage is 72.53%.


@@            Coverage Diff             @@
##           master    #1465      +/-   ##
==========================================
- Coverage   83.56%   83.49%   -0.07%     
==========================================
  Files         737      742       +5     
  Lines       32232    32347     +115     
  Branches     3618     3639      +21     
==========================================
+ Hits        26935    27009      +74     
- Misses       4581     4625      +44     
+ Partials      716      713       -3     
Flag Coverage Δ
unittests 83.49% <72.53%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ImageSharp/Primitives/DenseMatrix{T}.cs 90.00% <ø> (ø)
...ors/Convolution/ConvolutionRowOperation{TPixel}.cs 53.73% <53.73%> (ø)
...s/Convolution/Convolution2DRowOperation{TPixel}.cs 54.54% <54.54%> (ø)
...rocessing/Processors/Convolution/ReadOnlyKernel.cs 66.66% <66.66%> (ø)
...essors/Convolution/ConvolutionProcessor{TPixel}.cs 77.63% <71.92%> (+0.27%) ⬆️
...y/Allocators/ArrayPoolMemoryAllocator.Buffer{T}.cs 77.27% <100.00%> (+1.08%) ⬆️
...sors/Convolution/Convolution2DProcessor{TPixel}.cs 100.00% <100.00%> (+22.03%) ⬆️
...ssing/Processors/Convolution/Convolution2DState.cs 100.00% <100.00%> (ø)
...s/Convolution/Convolution2PassProcessor{TPixel}.cs 100.00% <100.00%> (+20.68%) ⬆️
...cessing/Processors/Convolution/ConvolutionState.cs 100.00% <100.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f351ee...863bddb.

Member

@Sergio0694 Sergio0694 left a comment

Looks great, and love the speed improvements!
Just left a few notes with respect to code style and other minor things. My main concern is more about peak memory usage in this case, so I'm curious to see more info on that front. Will get to work on the bokeh blur version using this new pattern once this is merged 😄

#if SUPPORTS_CREATESPAN
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data);
return MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref r0), this.length);
#else
return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan()).Slice(0, this.length);
Member

In theory here we know that the byte size is an exact multiple of our target length for the T span, so I was thinking we could be able to remove that final Slice call by pre-slicing the Data array. Something like:

return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan(0, this.length * sizeof(T)));

Tried out on sharplab (here) but apparently the codegen is worse for some reason. So yeah I'm not really seeing a way to improve this without having access to the .NET 5 APIs 🤔

Member Author

Yep .NET 5 makes this easier.

[MethodImpl(InliningOptions.ShortMethod)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

Why is it AggressiveInlining here and not the internal ShortMethod? Was this intentional or did you just forget to change this back after your initial experiments?

Member Author

Intentional. ShortMethod was initially designed for certain jpeg profiling but it kinda spread. I don't think we need it anymore with the tooling we have available.

Member

Ah, got it. Well we might want to revert that in another PR then 😄

Comment on lines +67 to +69
// We use a rectangle 3x the interest width to allocate a buffer big enough
// for source and target bulk pixel conversion.
var operationBounds = new Rectangle(interest.X, interest.Y, interest.Width * 3, interest.Height);
Member

I'm a bit confused about the temporary memory usage of this solution, as I remember you being a bit worried about this back when I was first working on the bokeh blur processor. Wouldn't this mean we're now allocating a 3 x [image size] buffer every time? As in, if I'm processing a 1920x1080 image, this would allocate a 5760 x 1080 temporary buffer, correct? Is that ok? For a 4K image that'd be a buffer of 24 million pixels 🤔

Looking at the rest of the code I think I get why you're doing this (as it allows you to access pixels on the Y axis in row major order), but I'd be curious about a diff in max peak memory usage compared to master, if you have made such a benchmark? If nothing else, I think it'd be an interesting bit of info to include 🙂

Member Author

@JimBobSquarePants JimBobSquarePants Dec 10, 2020

We're actually only allocating a single Span<Vector4> of length 3x the interest width per parallel region, and only for 2D convolution. For 1D or 2-pass convolution it's 2x the interest width.

using IMemoryOwner<TBuffer> buffer = this.allocator.Allocate<TBuffer>(this.width);
Span<TBuffer> span = buffer.Memory.Span;
for (int y = yMin; y < yMax; y++)
{
Unsafe.AsRef(this.action).Invoke(y, span);
}
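To make the memory impact concrete, a rough back-of-envelope check (the widths are illustrative examples, not figures from the PR): the scratch buffer scales with row width, not image area.

```csharp
using System;

// Hypothetical sizing check: the 2D convolution scratch buffer is 3x the
// interest width in Vector4 elements, allocated once per parallel region.
int interestWidth = 3840;                      // e.g. a 4K-wide interest region
int vector4Count = 3 * interestWidth;          // 11,520 Vector4 elements
int bytes = vector4Count * 4 * sizeof(float);  // a Vector4 is four floats = 16 bytes
Console.WriteLine($"{bytes} bytes ({bytes / 1024} KiB) per parallel region");
```

So even at 4K width the per-region scratch is on the order of 180 KiB, nothing like the 3 x [image size] buffer the question above worried about.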

Member

@Sergio0694 Sergio0694 Dec 10, 2020

Oooh I see, yeah that makes sense. Perfect then! 🚀

...I even wrote that code 🤣

Comment on lines -45 to 52
/// Gets the horizontal gradient operator.
/// Gets the horizontal convolution kernel.
/// </summary>
public DenseMatrix<float> KernelX { get; }

/// <summary>
/// Gets the vertical gradient operator.
/// Gets the vertical convolution kernel.
/// </summary>
public DenseMatrix<float> KernelY { get; }
Member

Small side note, I'm still convinced that using DenseMatrix<T> for separable 1D kernels makes the code less intuitive (those are 1D vectors, not matrices) and potentially adds overhead from the unnecessary 2D coordinate calculation. I think we could possibly look into having a different DenseVector type in the future that would just map to a 1D vector, to use in cases such as this.

Member Author

It makes my life easier as I can use the same code throughout when doing things like offset mapping.

We also use the same code for the non-separable convolution operation so that can be arbitrary dimensions.

/// A stack only, readonly, kernel matrix that can be indexed without
/// bounds checks when compiled in release mode.
/// </summary>
internal readonly ref struct ReadOnlyKernel
Member

Also to get back to my point above with respect to clarity, imho this should be called ReadOnlyKernel2D.

Member Author

See above, there'll only be one type.

@JimBobSquarePants JimBobSquarePants merged commit ff94d20 into master Dec 10, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/convolution-experiments branch December 10, 2020 19:39
@Sergio0694 Sergio0694 mentioned this pull request Dec 12, 2020
JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021