-
-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Convolution #1465
Optimize Convolution #1465
Conversation
@@ -53,8 +53,13 @@ public override Span<T> GetSpan() | |||
{ | |||
ThrowObjectDisposedException(); | |||
} | |||
|
|||
#if SUPPORTS_CREATESPAN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came up during profiling. On NET Core 3.1 We can use a shortcut since we know the length.
@@ -53,8 +53,13 @@ public override Span<T> GetSpan() | |||
{ | |||
ThrowObjectDisposedException(); | |||
} | |||
|
|||
#if SUPPORTS_CREATESPAN | |||
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note - this is implicitly creating a Span<byte>
from the array. When we add .NET 5 support we can optimize this even further by using MemoryMarshal.GetArrayDataReference
instead and completely bypass all checks 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I was looking at that. I still don't know what to do re .NET 5 though. It's be our first target framework which isn't LTS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. The way I see it, we could maybe do:
- Add a .NET 5 target to start preparing the codebase for .NET 6. When .NET 6 lands, immediately switch to it and drop .NET 5, since support for it would end 3 months after .NET 6 is released anyways.
- Just hold our breath until .NET 6 lands and stick to .NET Core 3.1 as max until then.
I'd say it might be worth doing an initial investigation for starters, to see how many places in the lib could actually benefit from .NET 5 exclusive APIs that are not on .NET Core 3.1? Then we could decide based on that 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARM intrinsics is a big one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, right. Yeah that alone might be worth it. Properly supporting ARM64 gives a ton of visibility right now, plus being ready for .NET 6 would be very nice. And if we are to work on ARM64 right now anyway, I don't see the harm in publishing .NET 5 packages already, assuming we'll get a working build soon. After all, companies would likely remain on .NET Core 3.1 until .NET 6 is out, so those wouldn't be affected. And devs on .NET 5 now would be fine with having to jump to .NET 6 immediately anyway, after all they jsut did the same with .NET 5 right now. I'd say if you have time you could look into this and then when you have a branch that shows a nice boost on ARM64, consider shipping if it's really good (and if it's not like mid 2021 already by the time that's done)? 😄
Codecov Report
@@ Coverage Diff @@
## master #1465 +/- ##
==========================================
- Coverage 83.56% 83.49% -0.07%
==========================================
Files 737 742 +5
Lines 32232 32347 +115
Branches 3618 3639 +21
==========================================
+ Hits 26935 27009 +74
- Misses 4581 4625 +44
+ Partials 716 713 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, and love the speed improvements!
Just left a few notes with respect to code style and other minor things. My main concern is more about peak memory usage int his case, so curious to see more info on that front. Will get to work on the bokeh blur version using this new pattern once this is merged 😄
#if SUPPORTS_CREATESPAN | ||
ref byte r0 = ref MemoryMarshal.GetReference<byte>(this.Data); | ||
return MemoryMarshal.CreateSpan(ref Unsafe.As<byte, T>(ref r0), this.length); | ||
#else | ||
return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan()).Slice(0, this.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory here we know that the byte
size is an exact multiple of our target length for the T
span, so I was thinking we could be able to remove that final Slice
call by pre-slicing the Data
array. Something like:
return MemoryMarshal.Cast<byte, T>(this.Data.AsSpan(0, this.length * sizeof(T)));
Tried out on sharplab (here) but apparently the codegen is worse for some reason. So yeah I'm not really seeing a way to improve this without having access to the .NET 5 APIs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep .NET 5 makes this easier.
[MethodImpl(InliningOptions.ShortMethod)] | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it AggressiveInlining
here and not the internal ShortMethod
? Was this intentional or did you just forget to change this back after your initial experiments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional. ShortMethod
was initially designed for certain jpeg profiling but it kinda spread. I don't think we need it any more wit the tooling we have available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. Well we might want to revert that in another PR then 😄
src/ImageSharp/Processing/Processors/Convolution/Convolution2DProcessor{TPixel}.cs
Show resolved
Hide resolved
// We use a rectangle 3x the interest width to allocate a buffer big enough | ||
// for source and target bulk pixel conversion. | ||
var operationBounds = new Rectangle(interest.X, interest.Y, interest.Width * 3, interest.Height); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about the temporary memory usage of this solution, as I remember you being a bit worried about this back when I was first working on the bokeh blur processor. Wouldn't this mean we're now allocating a 3 x [image size]
buffer every time now? As in, if I'm processing a 1920x1080
image, this would allocate a 5760 x 1080
temporary buffer, correct? Is that ok? At 4K image that'd be a buffer of 24 million pixels 🤔
Looking at the rest of the code I think I get why you're doing this (as it allows you to access pixels on the Y access in row major order), but I'd be curious about a diff in max peak memory usage compared to master
, if you have made such a benchmark? If nothing else, I think it'd be an interesting bit of into to include 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're actually only allocating a single of Span<Vector4>
of length 3x interest width
per parallel region and only for 2D convolution. For 1D or 2Pass it's 2x interest width
.
ImageSharp/src/ImageSharp/Advanced/ParallelRowIterator.Wrappers.cs
Lines 101 to 108 in 1f351ee
using IMemoryOwner<TBuffer> buffer = this.allocator.Allocate<TBuffer>(this.width); | |
Span<TBuffer> span = buffer.Memory.Span; | |
for (int y = yMin; y < yMax; y++) | |
{ | |
Unsafe.AsRef(this.action).Invoke(y, span); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh I see, yeah that makes sense. Perfect then! 🚀
...I even wrote that code 🤣
/// Gets the horizontal gradient operator. | ||
/// Gets the horizontal convolution kernel. | ||
/// </summary> | ||
public DenseMatrix<float> KernelX { get; } | ||
|
||
/// <summary> | ||
/// Gets the vertical gradient operator. | ||
/// Gets the vertical convolution kernel. | ||
/// </summary> | ||
public DenseMatrix<float> KernelY { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small side note, I'm still convinced that using DenseMatrix<T>
for separable 1D kernels makes the code less intuitive (those are 1D vectors, not matrices) and potentially more overhead for the unnecessary 2D coordinate calculation. I think we could possibly look into having a different DenseVector type in the future that would just map to a 1D vector, to use in cases such as this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes my life easier as I can use the same code throughout when doing things like offset mapping.
We also use the same code for the non-separable convolution operation so that can be arbitrary dimensions.
/// A stack only, readonly, kernel matrix that can be indexed without | ||
/// bounds checks when compiled in release mode. | ||
/// </summary> | ||
internal readonly ref struct ReadOnlyKernel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to get back to my point above with respect to clarity, imho this should be called ReadOnlyKernel2D
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above, there's only be one type.
Optimize Convolution
Prerequisites
Description
Refactors all the convolution algorithms to take advantage of bulk operations.
Operationally the way the algorithm works has changed as follows:
This allows us to use our bulk conversion methods per row.
Benchmarks are very healthy.
Before
After