Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substantial Jpeg performance degradation on master branch as of 09.12.2021 #1884

Closed
br3aker opened this issue Dec 9, 2021 · 14 comments · Fixed by #1901
Closed

Substantial Jpeg performance degradation on master branch as of 09.12.2021 #1884

br3aker opened this issue Dec 9, 2021 · 14 comments · Fixed by #1901
Assignees

Comments

@br3aker
Copy link
Contributor

br3aker commented Dec 9, 2021

Description

Looks like it was caused by Memory Allocator PR but I'm not sure, just spotted it @antonfirsov @JimBobSquarePants

Master before allocator PR

Method - Decode Mean Error StdDev
'Baseline 4:4:4 Interleaved' 11.504 ms 0.0719 ms 0.0673 ms
'Baseline 4:2:0 Interleaved' 8.605 ms 0.0372 ms 0.0348 ms
'Baseline 4:0:0 (grayscale)' 1.575 ms 0.0054 ms 0.0042 ms
'Progressive 4:2:0 Non-Interleaved' 13.207 ms 0.0472 ms 0.0369 ms

Current master

Method - Decode Mean Error StdDev
'Baseline 4:4:4 Interleaved' 13.532 ms 0.1669 ms 0.1561 ms
'Baseline 4:2:0 Interleaved' 9.081 ms 0.0397 ms 0.0332 ms
'Baseline 4:0:0 (grayscale)' 1.850 ms 0.0109 ms 0.0085 ms
'Progressive 4:2:0 Non-Interleaved' 13.581 ms 0.1397 ms 0.1238 ms

What is strange is that progressive code wasn't really affected as baseline decoder.


Master before allocator PR

Method - Encode Quality Mean Error StdDev Ratio
'System.Drawing Jpeg 4:2:0' 75 30.04 ms 0.540 ms 0.479 ms 1.00
'ImageSharp Jpeg 4:2:0' 75 19.32 ms 0.290 ms 0.257 ms 0.64
'ImageSharp Jpeg 4:4:4' 75 26.76 ms 0.332 ms 0.294 ms 0.89
'System.Drawing Jpeg 4:2:0' 90 32.82 ms 0.184 ms 0.163 ms 1.00
'ImageSharp Jpeg 4:2:0' 90 25.00 ms 0.408 ms 0.361 ms 0.76
'ImageSharp Jpeg 4:4:4' 90 31.83 ms 0.636 ms 0.595 ms 0.97
'System.Drawing Jpeg 4:2:0' 100 39.30 ms 0.359 ms 0.318 ms 1.00
'ImageSharp Jpeg 4:2:0' 100 34.49 ms 0.265 ms 0.235 ms 0.88
'ImageSharp Jpeg 4:4:4' 100 56.40 ms 0.565 ms 0.501 ms 1.44

Current master

Method - Encode Quality Mean Error StdDev Ratio
'System.Drawing Jpeg 4:2:0' 75 28.01 ms 0.541 ms 0.506 ms 1.00
'ImageSharp Jpeg 4:2:0' 75 26.24 ms 0.489 ms 0.481 ms 0.94
'ImageSharp Jpeg 4:4:4' 75 25.04 ms 0.498 ms 0.859 ms 0.87
'System.Drawing Jpeg 4:2:0' 90 30.73 ms 0.082 ms 0.069 ms 1.00
'ImageSharp Jpeg 4:2:0' 90 30.95 ms 0.275 ms 0.215 ms 1.01
'ImageSharp Jpeg 4:4:4' 90 29.80 ms 0.541 ms 0.948 ms 0.99
'System.Drawing Jpeg 4:2:0' 100 37.27 ms 0.174 ms 0.163 ms 1.00
'ImageSharp Jpeg 4:2:0' 100 40.44 ms 0.159 ms 0.124 ms 1.09
'ImageSharp Jpeg 4:4:4' 100 50.05 ms 0.749 ms 0.664 ms 1.34

  • ImageSharp version: master branch as of 09.12.2021
@antonfirsov
Copy link
Member

Thanks for spotting! This is something we should definitely look into. I won't have time before next week, if you have an easy way to obtain perf profile before/after, don't hesitate to share, but no worries if not.

@br3aker
Copy link
Contributor Author

br3aker commented Dec 9, 2021

if you have an easy way to obtain perf profile before/after, don't hesitate to share, but no worries if not

Not sure I understand what you meant by that. Some sort of a flame graph or full capture from PerfView (don't have a Rider)?

@brianpopow
Copy link
Collaborator

I have done some profiling with dotTrace before and after #1730 merged with RunDecodeJpegProfilingTests from the sandbox project. The new DangerousGetRowSpan seems to be noticable slower then GetRowSpan before.

master

NewMemoryAllocator

before:

BeforeNewMemoryAllocator

Profiling results as a zip:
MemoryAllocatorProfileResults.zip

@br3aker
Copy link
Contributor Author

br3aker commented Dec 9, 2021

@brianpopow thanks!

Looks like all codecs are affected by this, should I rename the issue?

@brianpopow
Copy link
Collaborator

Lets wait with renaming until @antonfirsov confirms that DangerousGetRowSpan is really the culprit here.

@antonfirsov
Copy link
Member

Thanks @brianpopow! This all makes sense, in the new DangerousGetRowSpan I removed the single Memory<T> fast path, assuming it will be less common case. That assumption might be wrong. We should find out how to fix this regression.

@JimBobSquarePants
Copy link
Member

I removed the single Memory fast path, assuming it will be less common case.

I assumed it just wasn't required anymore when reviewing. Whoops!

@br3aker
Copy link
Contributor Author

br3aker commented Dec 10, 2021

Makes sense why progressive jpeg wasn't affected by this as it does not get span that often.

@br3aker
Copy link
Contributor Author

br3aker commented Dec 10, 2021

There's a lot of work going on in internal API of the memory code. Consider this somewhat of a small draft proposal.

We should separate internal API and public API, i.e. provide unsafe check-free api for internal code with debug checks just for the sake of fool-proofness and public API with full amount of checks.

For example, this method has 6 if checks which are basically useless for the Jpeg codec because higher level code ensures all them to be true.

internal static Memory<T> GetBoundedSlice<T>(this IMemoryGroup<T> group, long start, int length)
where T : struct
{
Guard.NotNull(group, nameof(group));
Guard.IsTrue(group.IsValid, nameof(group), "Group must be valid!");
Guard.MustBeGreaterThanOrEqualTo(length, 0, nameof(length));
Guard.MustBeLessThan(start, group.TotalLength, nameof(start));
int bufferIdx = (int)(start / group.BufferLength);
// if (bufferIdx < 0 || bufferIdx >= group.Count)
if ((uint)bufferIdx >= group.Count)
{
throw new ArgumentOutOfRangeException(nameof(start));
}
int bufferStart = (int)(start % group.BufferLength);
int bufferEnd = bufferStart + length;
Memory<T> memory = group[bufferIdx];
if (bufferEnd > memory.Length)
{
throw new ArgumentOutOfRangeException(nameof(length));
}
return memory.Slice(bufferStart, length);
}

Yes, branch predictor most likely will eliminate miss overhead but it's still a check and it still may screw up prediction history - we can't rely on hardware implementation and/or predictor history buffer.


A lot less important, uniform group buffer has redundant operations during span fetching in a Buffer2D 'get span at y index' scenario:

First, it converts y to an offset for Buffer2D logic:

private Memory<T> GetRowMemoryCore(int y) => this.FastMemoryGroup.GetBoundedSlice(y * (long)this.Width, this.Width);

Then it converts this offset to the y coordinate again:

int bufferIdx = (int)(start / group.BufferLength);

While all of this might not be that performance heavy, thing is, screws are so tight for jpeg right now that even such small optimizations can gain percents of performance.

@antonfirsov
Copy link
Member

antonfirsov commented Dec 10, 2021

Cant look at code right now, but if I remember correctly y and bufferIdx are not the same. I agree that (non-debug) bounds checks shall be eliminated entirely whenever possible, but we should carefully think it over, for public entry poimts there should be a bounds check somewhere. Additionally, we can get rid of the entire Memory/IMemoryOwner virtual dispatch for Span methods by caching pointers and arrays for known IMemoryOwner implementations.

The hard part is to find a consistent design. I'm thinking of adding separate virtual methods to MemoryGroup for Span and Memory dispatch, and create different subclasses for unmanaged and array memory. (One virtual dispatch remains, but the rest is as fast as possible) Edit: Swap logic does not allow subclassing in the current design.

@br3aker
Copy link
Contributor Author

br3aker commented Dec 11, 2021

I'll write down some notes with fresh thoughts so you you'll have a full picture when you'll have time to look at this.

if I remember correctly y and bufferIdx are not the same
Yes and no at the same time. From a quick look there's two modes of 2D memory allocation: continuous and non-continuous buffers.

Jpeg spectral data buffers (which are the main bottnecks in this conversation about slow GetSpan(y)) are allocated with this code:

public void AllocateSpectral(bool fullScan)
{
if (this.SpectralBlocks != null)
{
// this method will be called each scan marker so we need to allocate only once
return;
}
int spectralAllocWidth = this.SizeInBlocks.Width;
int spectralAllocHeight = fullScan ? this.SizeInBlocks.Height : this.VerticalSamplingFactor;
this.SpectralBlocks = this.memoryAllocator.Allocate2D<Block8x8>(spectralAllocWidth, spectralAllocHeight, AllocationOptions.Clean);
}

Which allocates 2D buffer with preferContiguosImageBuffers set to false:

public static Buffer2D<T> Allocate2D<T>(
this MemoryAllocator memoryAllocator,
int width,
int height,
AllocationOptions options = AllocationOptions.None)
where T : struct =>
Allocate2D<T>(memoryAllocator, width, height, false, options);

So we can safely use y coordinate without width recalculations. Problem is that we would need to do a separate unsafe get method which must be used only when code is sure that underlying buffer is jagged. It's too complicated and as I've said - too little gain for this much complexity. Maybe one day I'll create a somewhat PR/discussion about this but right now DangerousGetRowSpan and redundant if checks are more important.

@antonfirsov antonfirsov added this to the 2.0.0 milestone Dec 12, 2021
@antonfirsov antonfirsov self-assigned this Dec 12, 2021
@br3aker
Copy link
Contributor Author

br3aker commented Dec 19, 2021

@antonfirsov sorry for being intrusive but any chance for an ETA? :)

@antonfirsov
Copy link
Member

You are probably reading minds, because I just started looking at this :) A PR will hopefully land next week.

@br3aker
Copy link
Contributor Author

br3aker commented Dec 19, 2021

Nice, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants