
Restore ability to use null cache #194

Merged · 12 commits · Dec 20, 2021

Conversation

kroymann
Contributor

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that my change matches the existing coding patterns and practices as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Per the discussion in #193, this PR restores the ability to effectively disable the caching logic through the use of a "NullCache". The most important part of this change is restoring the ability to stream the response directly from the processed image output, which is how it worked before v1.0.3. In v1.0.3, the WriterWorkers pattern was introduced, and a side effect of that change was that the response was always streamed from the cache, which broke the ability to use a NullCache.

To address this, I removed the ReadWorkers/WriterWorkers logic and replaced it with a more standard reader/writer locking pattern, using a synchronization library ported from my company's codebase. This allows the processed image stream to remain available beyond the section of code protected by the writer lock, and thus be available for use in generating the response.
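To make the idea concrete, a "null" cache is simply an IImageCache that always misses on reads and discards writes. The sketch below approximates the ImageSharp.Web v1.x cache abstraction; treat the exact signatures as assumptions rather than the shipped API:

```csharp
using System.IO;
using System.Threading.Tasks;
using SixLabors.ImageSharp.Web.Caching;
using SixLabors.ImageSharp.Web.Resolvers;

// Sketch only: a cache that never stores anything, forcing the middleware
// to stream the response directly from the processed image output.
public sealed class NullCache : IImageCache
{
    // Always a cache miss: a null resolver signals "no cached entry".
    public Task<IImageCacheResolver> GetAsync(string key)
        => Task.FromResult<IImageCacheResolver>(null);

    // Discard the write; the processed image is served, never persisted.
    public Task SetAsync(string key, Stream stream, ImageCacheMetadata metadata)
        => Task.CompletedTask;
}
```

Such a cache would then be registered through the usual DI extensions (e.g. something along the lines of services.AddImageSharp().SetCache&lt;NullCache&gt;(), assuming the standard builder API).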

A key sticking point here (obviously) is benchmarking this change to see whether performance improved or degraded.
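The reader/writer flow described above can be sketched as follows. AsyncKeyReaderWriterLock is one of the new types in this PR, but the member names and helper methods here are illustrative assumptions, not the actual middleware code:

```csharp
using System.Threading.Tasks;

public class CachedImageService
{
    // Assumed API shape: the per-key reader/writer lock added by this PR.
    private readonly SixLabors.ImageSharp.Web.Synchronization.AsyncKeyReaderWriterLock<string> cacheLock = new();

    public async Task<byte[]> GetOrProcessAsync(string key)
    {
        // Many requests for the same key may read concurrently.
        using (await this.cacheLock.ReaderLockAsync(key))
        {
            byte[] cached = this.TryReadFromCache(key); // hypothetical helper
            if (cached != null)
            {
                return cached;
            }
        }

        byte[] processed;

        // Cache miss: only one request per key may process and write.
        using (await this.cacheLock.WriterLockAsync(key))
        {
            processed = this.ProcessImage(key);  // hypothetical helper
            this.WriteToCache(key, processed);   // a no-op for a NullCache
        }

        // The processed output outlives the writer lock, so the response can
        // be generated from it directly instead of re-reading the cache.
        return processed;
    }

    private byte[] TryReadFromCache(string key) => null;    // stub
    private byte[] ProcessImage(string key) => new byte[0]; // stub
    private void WriteToCache(string key, byte[] data) { }  // stub
}
```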

@CLAassistant

CLAassistant commented Dec 10, 2021

CLA assistant check
All committers have signed the CLA.

@codecov

codecov bot commented Dec 10, 2021

Codecov Report

Merging #194 (0a9a0fb) into master (5cb589c) will decrease coverage by 0.20%.
The diff coverage is 89.86%.


@@            Coverage Diff             @@
##           master     #194      +/-   ##
==========================================
- Coverage   84.80%   84.60%   -0.21%     
==========================================
  Files          50       55       +5     
  Lines        1448     1539      +91     
  Branches      199      228      +29     
==========================================
+ Hits         1228     1302      +74     
- Misses        165      181      +16     
- Partials       55       56       +1     
Flag Coverage Δ
unittests 84.60% <89.86%> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
.../Synchronization/RefCountedConcurrentDictionary.cs 76.00% <76.00%> (ø)
src/ImageSharp.Web/Synchronization/AsyncKeyLock.cs 83.33% <83.33%> (ø)
.../ImageSharp.Web/Middleware/ImageSharpMiddleware.cs 84.13% <89.36%> (-2.63%) ⬇️
src/ImageSharp.Web/Synchronization/AsyncLock.cs 95.23% <95.23%> (ø)
...Sharp.Web/Synchronization/AsyncReaderWriterLock.cs 98.48% <98.48%> (ø)
...DependencyInjection/ServiceCollectionExtensions.cs 100.00% <100.00%> (ø)
...rp.Web/Synchronization/AsyncKeyReaderWriterLock.cs 100.00% <100.00%> (ø)
...p.Web/Middleware/ConcurrentDictionaryExtensions.cs 0.00% <0.00%> (-50.00%) ⬇️
...ching/LruCache/ConcurrentTLruCache{TKey,TValue}.cs 43.75% <0.00%> (+1.78%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb589c...0a9a0fb.

@JimBobSquarePants
Member

@kroymann I'm struggling to find the time to benchmark this so if you have the opportunity please do.

@deanmarcussen
Collaborator

I had a quick look this morning.

The results are both very promising and suspiciously close.

It either indicates that the LRU cache is blocking any meaningful test of this piece of functionality (I don't think so, but I need to have a longer look at the code to say for sure), or that both locking arrangements have no meaningful impact on performance and we're at the limit of what we can serve (i.e. limited by disk speed, or the framework itself).

I'm leaning towards the second, as the most meaningful change I could make to impact RPS was to set the samples project logging to "Warning" instead of "Debug" (basically dropping all the expensive logging).

Need to find some time to look at the code changes, and do another run with the samples project updated to .NET 6.

@JimBobSquarePants
Member

Thanks @deanmarcussen

I think it might be an idea to try both this and master with a much reduced timespan for the LRU cache. It's sitting at 5 minutes just now, which will be interfering with results. It would be amazing if you could document the setup you use to run the benchmark, as I had a go at reading the Crank docs again and felt a bit overwhelmed.

A .NET 6 run would definitely be a good idea.
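For documenting the setup: a typical crank run boils down to installing the controller and pointing it at a config file. The config, scenario, and profile names below are placeholders for illustration, not the actual files used for these runs:

```shell
# Install the crank controller as a global dotnet tool (one-time setup).
dotnet tool install -g Microsoft.Crank.Controller --version "0.2.0-*"

# Run a scenario against a benchmark agent; the config/scenario/profile
# names here are placeholders, not the real ImageSharp.Web benchmark files.
crank --config imagesharp.benchmarks.yml \
      --scenario resize-load \
      --profile aspnet-perf-lin \
      --application.framework net6.0
```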

this.RefCount = refCount;
}

public bool Equals(
Contributor

I am getting error CS8767: Nullability of reference types in type of parameter 'other' of 'bool RefCountedValue.Equals(RefCountedValue other)' doesn't match implicitly implemented member 'bool IEquatable<RefCountedValue>.Equals(RefCountedValue other)' (possibly because of nullability attributes) when running the benchmarks

Contributor Author

Weird. This compiles cleanly in both netcoreapp2.1 and 3.1 for me, and I'm able to run the benchmarks in both TFMs as well…? @sebastienros Is there anything special about how you're trying to run the benchmarks?

Contributor

To be more precise: I'm using crank to run a web load benchmark, independent from BDN. It works fine on master though. Full output:

Command:
dotnet publish ImageSharp.Web.Sample.csproj -c Release -o /tmp/benchmarks-agent/benchmarks-server-1/2ns2okc1.ov0/ImageSharp.Web/samples/ImageSharp.Web.Sample/published /p:MicrosoftNETCoreAppPackageVersion=3.1.21 /p:MicrosoftAspNetCoreAppPackageVersion=3.1.21 /p:MicrosoftNETCoreApp31PackageVersion=3.1.21 /p:MicrosoftNETPlatformLibrary=Microsoft.NETCore.App /p:RestoreNoCache=true --framework netcoreapp3.1 --self-contained -r linux-x64 
Microsoft (R) Build Engine version 16.7.2+b60ddb6f4 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  Determining projects to restore...
  Restored /tmp/benchmarks-agent/benchmarks-server-1/2ns2okc1.ov0/ImageSharp.Web/src/ImageSharp.Web/ImageSharp.Web.csproj (in 297 ms).
  Restored /tmp/benchmarks-agent/benchmarks-server-1/2ns2okc1.ov0/ImageSharp.Web/samples/ImageSharp.Web.Sample/ImageSharp.Web.Sample.csproj (in 297 ms).
Synchronization/RefCountedConcurrentDictionary.cs(229,25): error CS8767: Nullability of reference types in type of parameter 'other' of 'bool RefCountedValue.Equals(RefCountedValue other)' doesn't match implicitly implemented member 'bool IEquatable<RefCountedValue>.Equals(RefCountedValue other)' (possibly because of nullability attributes). [/tmp/benchmarks-agent/benchmarks-server-1/2ns2okc1.ov0/ImageSharp.Web/src/ImageSharp.Web/ImageSharp.Web.csproj]
Exit code: 1

Contributor Author

I just pushed a small change to this code that adjusts the nullability attributes when compiling with net5.0 or higher. This preemptively gets ahead of any update to this codebase to target net6.0, and maybe it will address whatever issue you hit running the benchmarks?
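The fix is essentially conditional nullability on the Equals parameter, along these lines (a simplified sketch of the pattern, not the exact diff — the real RefCountedValue also carries the value itself):

```csharp
#nullable enable

using System;

internal sealed class RefCountedValue : IEquatable<RefCountedValue>
{
    public RefCountedValue(int refCount) => this.RefCount = refCount;

    public int RefCount { get; }

#if NET5_0_OR_GREATER
    // On net5.0+ the BCL annotates IEquatable<T>.Equals(T? other), so the
    // implementing parameter must be nullable too, or CS8767 is reported.
    public bool Equals(RefCountedValue? other)
#else
    public bool Equals(RefCountedValue other)
#endif
        => other != null && this.RefCount == other.RefCount;
}
```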

Contributor

With your changes it builds on net6.0. At least it's unblocking me.

/// <summary>
/// Simple immutable tuple that combines a <typeparamref name="TValue"/> instance with a ref count integer.
/// </summary>
private class RefCountedValue : IEquatable<RefCountedValue>
Contributor

Can you use struct records here instead?

Contributor Author

I benchmarked this using both class and struct for this type and determined that using a class executes more quickly and allocates less memory. I believe this happens because ConcurrentDictionary can use an optimized code path that leverages atomic writes when TValue is a class, but has to fall back on a less efficient path that allocates when TValue is a struct.

|                          Method |      Mean |    Error |   StdDev |  Gen 0 | Allocated |
|-------------------------------- |----------:|---------:|---------:|-------:|----------:|
|       Class_GetAndReleaseNewKey | 129.92 ns | 0.554 ns | 0.518 ns | 0.0095 |      80 B |
|      Struct_GetAndReleaseNewKey | 147.59 ns | 0.687 ns | 0.573 ns | 0.0067 |      56 B |
|  Class_GetAndReleaseExistingKey | 142.32 ns | 1.159 ns | 1.027 ns | 0.0076 |      64 B |
| Struct_GetAndReleaseExistingKey | 177.69 ns | 0.813 ns | 0.721 ns | 0.0134 |     112 B |
|            Class_GetExistingKey |  69.75 ns | 0.301 ns | 0.267 ns | 0.0038 |      32 B |
|           Struct_GetExistingKey |  89.81 ns | 0.682 ns | 0.638 ns | 0.0067 |      56 B |

I have not yet experimented with using record types (in part because this codebase is still targeting netcoreapp2.1 and 3.1).
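The ConcurrentDictionary detail mentioned above is that reference-typed values can be published with a single atomic reference write, and TryUpdate can compare-and-swap one immutable instance for another. A hedged sketch of the increment pattern this enables (type and method names here are illustrative, not the PR's actual code):

```csharp
using System.Collections.Concurrent;

// Immutable reference type: the dictionary can swap instances atomically.
internal sealed class RefCounted<T>
{
    public RefCounted(T value, int refCount)
    {
        this.Value = value;
        this.RefCount = refCount;
    }

    public T Value { get; }

    public int RefCount { get; }
}

internal static class RefCountingExample
{
    // Lock-free ref-count increment: retry until our TryUpdate wins the race.
    public static bool TryAddRef<TKey, T>(ConcurrentDictionary<TKey, RefCounted<T>> map, TKey key)
    {
        while (map.TryGetValue(key, out RefCounted<T> current))
        {
            // TryUpdate only swaps if the stored value still equals
            // 'current', i.e. no other thread raced us in between.
            if (map.TryUpdate(key, new RefCounted<T>(current.Value, current.RefCount + 1), current))
            {
                return true;
            }
        }

        return false; // key was never present (or was removed)
    }
}
```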

Member

Let's leave changing target frameworks to V2. I don't want your hard work delayed by our upstream work.

Contributor

If not a struct record, tuples would still be good while keeping the code simpler. But I don't know about the TFM requirements either, so you'll decide.

@sebastienros
Contributor

With 10 concurrent clients, a 15s warmup, and a 15s measurement, on a resized URL but without any cache headers (etag, ...), so all requests return the file.

| application                             | master      | enable_null_cache |          |
| --------------------------------------- | ----------- | ----------------- | -------- |
| CPU Usage (%)                           |          84 |                86 |   +2.38% |
| Cores usage (%)                         |       1,011 |             1,027 |   +1.58% |
| Working Set (MB)                        |         212 |               215 |   +1.42% |
| Private Memory (MB)                     |         719 |               718 |   -0.14% |
| Build Time (ms)                         |      10,384 |             4,225 |  -59.31% |
| Start Time (ms)                         |         221 |               235 |   +6.33% |
| Published Size (KB)                     |      93,043 |            93,067 |   +0.03% |
| .NET Core SDK Version                   |     6.0.101 |           6.0.101 |          |
| Max CPU Usage (%)                       |          84 |                85 |   +1.19% |
| Max Working Set (MB)                    |         221 |               224 |   +1.36% |
| Max GC Heap Size (MB)                   |         133 |               131 |   -1.50% |
| Size of committed memory by the GC (MB) |         149 |               150 |   +0.67% |
| Max Number of Gen 0 GCs / sec           |        3.00 |              3.00 |    0.00% |
| Max Number of Gen 1 GCs / sec           |        1.00 |              1.00 |    0.00% |
| Max Number of Gen 2 GCs / sec           |        0.00 |              0.00 |          |
| Max Time in GC (%)                      |        0.00 |              0.00 |          |
| Max Gen 0 Size (B)                      |  26,989,312 |        18,437,472 |  -31.69% |
| Max Gen 1 Size (B)                      |  16,276,192 |        17,172,176 |   +5.50% |
| Max Gen 2 Size (B)                      |   1,767,440 |         1,761,712 |   -0.32% |
| Max LOH Size (B)                        |   2,577,912 |         2,577,912 |    0.00% |
| Max Allocation Rate (B/sec)             | 273,655,584 |       264,959,240 |   -3.18% |
| Max GC Heap Fragmentation               |          42 |                29 |  -31.43% |
| # of Assemblies Loaded                  |          97 |                97 |    0.00% |
| Max Exceptions (#/s)                    |           0 |                 0 |          |
| Max Lock Contention (#/s)               |          20 |                78 | +290.00% |
| Max ThreadPool Threads Count            |          23 |                22 |   -4.35% |
| Max ThreadPool Queue Length             |           0 |                 1 |      +∞% |
| Max ThreadPool Items (#/s)              |     150,078 |           145,980 |   -2.73% |
| Max Active Timers                       |           1 |                 1 |    0.00% |
| IL Jitted (B)                           |     302,807 |           312,448 |   +3.18% |
| Methods Jitted                          |       3,634 |             3,715 |   +2.23% |


| load                | master  | enable_null_cache |         |
| ------------------- | ------- | ----------------- | ------- |
| CPU Usage (%)       |      16 |                16 |   0.00% |
| Cores usage (%)     |     187 |               190 |  +1.60% |
| Working Set (MB)    |      41 |                41 |   0.00% |
| Private Memory (MB) |     110 |               110 |   0.00% |
| Start Time (ms)     |     112 |               110 |  -1.79% |
| First Request (ms)  |     388 |               327 | -15.72% |
| Requests            | 382,623 |           377,625 |  -1.31% |
| Bad responses       |       0 |                 0 |         |
| Mean latency (us)   |     388 |               393 |  +1.33% |
| Max latency (us)    |   9,234 |             7,155 | -22.51% |
| Requests/sec        |  25,511 |            25,176 |  -1.31% |
| Requests/sec (max)  |  28,741 |            29,711 |  +3.38% |

@JimBobSquarePants
Member

@sebastienros Thanks for the numbers. I can see a bit of give and take, but otherwise it's fairly even. For the additional feature, I think the change is worth it.

Member

@JimBobSquarePants left a comment

Let's get this merged in. Perf is comparable and we can always iterate further.

@JimBobSquarePants JimBobSquarePants merged commit fea0207 into SixLabors:master Dec 20, 2021
@kroymann kroymann deleted the enable_null_cache branch February 7, 2022 22:50