Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some regressions in ASP.NET benchmarks #46120

Merged
merged 1 commit into from
Dec 16, 2020
Merged

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Dec 16, 2020

  • ConcurrentQueueSegment allows spinning threads to sleep. #44265 seems to have caused large regressions on Windows and Linux-arm64 on machines with a larger processor count. On Windows, Sleep(1) is closer to Sleep(15) so it may be even worse, but in both cases the change has the effect of removing thread pool worker threads from the system for relatively long periods of time, and in some cases the fewer remaining threads are not enough to get the expected throughput. During that change we had tested adding the Sleep(1) to some ConcurrentQueue operations in contending cases, and not spin-waiting at all in forward-progressing cases, compared with the prior default of spin-waiting without Sleep(1). Not spin-waiting at all where possible in contending cases seemed to be better or equal for the most part compared with the prior default, so I have removed spin-waiting in forward-progressing cases in ConcurrentQueue.
  • There were some smaller regressions from the portable thread pool on Windows. I have moved/tweaked a slight delay that I had added early on, after changes thereafter it lost its intention, with these changes it goes back to the original intention and seems to resolve some of the gap, but maybe not all of it in some tests. We'll check the graphs after this change and see if there is more to investigate. There are also other things to improve on Windows, and many of those may be separate from the portable thread pool but some may be relevant to the changes in perf characteristics.

Fixes #45716

- dotnet#44265 seems to have caused large regressions on Windows and Linux-arm64. During that change we had tested adding the `Sleep(1)` to some `ConcurrentQueue` operations in contending cases, and not spin-waiting at all in forward-progressing cases. Not spin-waiting at all where possible in contending cases seemed to be better or equal for the most part (compared with spin-waiting without `Sleep(1)`), so I have removed spin-waiting in forward-progressing cases in `ConcurrentQueue`.
- There were some regressions from the portable thread pool on Windows. I have moved/tweaked a slight delay that I had added early on, after changes thereafter it lost its intention, with the changes it goes back to the original intention and seems to resolve some of the gap, but maybe not all of it in some tests. We'll check the graphs after this change and see if there is more to investigate. There are also other things to improve on Windows, and many of those may be separate from the portable thread pool but some may be relevant to the changes in perf characteristics.
@kouvel kouvel added this to the 6.0.0 milestone Dec 16, 2020
@kouvel kouvel self-assigned this Dec 16, 2020
@kouvel
Copy link
Member Author

kouvel commented Dec 16, 2020

RPS numbers for some ASP.NET perf tests:

. . 5.0.1 Before Diff before-5 After Diff after-5 Diff after-before
citrine-win PlaintextPlatform 9176549.3 6702841.8 -27.0% 9064924.3 -1.2% 35.2%
citrine-win JsonPlatform 757732.6 594388.2 -21.6% 759449.8 0.2% 27.8%
citrine-win FortunesPlatform 336643.7 280116.8 -16.8% 328945.8 -2.3% 17.4%
citrine-win Plaintext 6027880.2 5938829.0 -1.5% 5985553.0 -0.7% 0.8%
citrine-win Json 708022.2 617337.3 -12.8% 698703.7 -1.3% 13.2%
citrine-win Fortunes 290025.7 279021.6 -3.8% 290303.8 0.1% 4.0%
citrine-arm PlaintextPlatform 7900866.8 3075682.5 -61.1% 8065507.0 2.1% 162.2%
citrine-arm JsonPlatform 630127.4 392300.3 -37.7% 646741.0 2.6% 64.9%
citrine-arm FortunesPlatform 72910.5 63670.3 -12.7% 77050.3 5.7% 21.0%
citrine-arm Plaintext 2980570.8 2998581.8 0.6% 3019476.0 1.3% 0.7%
citrine-arm Json 410387.0 380828.4 -7.2% 418505.7 2.0% 9.9%
citrine-arm Fortunes 58481.8 61590.2 5.3% 59983.5 2.6% -2.6%
citrine-amd PlaintextPlatform 13610332.8 12421484.9 -8.7% 13618838.5 0.1% 9.6%
citrine-amd JsonPlatform 1236138.0 1278186.0 3.4% 1364998.7 10.4% 6.8%
citrine-amd FortunesPlatform 449874.2 446096.6 -0.8% 465650.0 3.5% 4.4%
citrine-amd Plaintext 8241441.3 7864252.3 -4.6% 8222975.3 -0.2% 4.6%
citrine-amd Json 705725.7 739578.3 4.8% 949315.8 34.5% 28.4%
citrine-amd Fortunes 385240.7 353941.3 -8.1% 399461.0 3.7% 12.9%
citrine-lin PlaintextPlatform 13690889.4 12787144.2 -6.6% 13734629.5 0.3% 7.4%
citrine-lin JsonPlatform 1259474.3 1253970.7 -0.4% 1276371.5 1.3% 1.8%
citrine-lin FortunesPlatform 439379.7 433790.7 -1.3% 451493.0 2.8% 4.1%
citrine-lin Plaintext 5122837.0 5373168.8 4.9% 5363460.8 4.7% -0.2%
citrine-lin Json 983669.8 970836.0 -1.3% 988557.7 0.5% 1.8%

I have omitted citrine-lin/Fortunes because it wasn't yielding appropriate results when compared with historical graphs, doesn't seem to be working as expected at the moment.

citrine-win/PlaintextPlatform and citrine-win/FortunesPlatform may still be slight regressions compared to .NET 5, not sure for what reason, possibly the portable thread pool. The difference in numbers is somewhat consistent, but the error margin across runs in those benchmarks is fairly high, so I can't really tell. There may be something to investigate further there, will leave that for a separate investigation once it's more clear that there is a regression.

@kouvel
Copy link
Member Author

kouvel commented Dec 16, 2020

CC @alexcovington

@sebastienros
Copy link
Member

I have a hard time reading this table ;) but I assume this PR has the best numbers.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@sebastienros
Copy link
Member

Looks great, even on Linux, net6 being faster than net5 on Fortunes for instance.

@kouvel
Copy link
Member Author

kouvel commented Dec 16, 2020

I have a hard time reading this table ;) but I assume this PR has the best numbers.

The "Diff before-5" column indicates the diff from before this change (in .NET 6) to 5.0.1, showing the regressions including all changes in-between. The "Diff after-5" column indicates the diff from after this change to 5.0.1, to me the most relevant column, though not directly tied to this change, shows if cumulative regressions since have been recovered. And "Diff after-before" indicates the diff from after this change to before this change, with no other changes involved.

@kouvel kouvel merged commit 8e6415d into dotnet:master Dec 16, 2020
@kouvel kouvel deleted the TpWinFix branch December 16, 2020 23:27
@adamsitnik
Copy link
Member

The numbers for the JsonPlatform benchmark for the AMD machine (around 1.2kk RPS) look much better than what is reported in the Power BI dashboard :

obraz

and what I am getting when I run them using crank:

 crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-amd --application.framework net6.0

@kouvel were you using any custom settings like DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS: 1 or Disabled Hill Climbing when running the PlaintextPlatform and JsonPlatform benchmarks? If so, I wonder how our default settings (no env var not set) perform before and after your changes

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

Ah yea I forgot to mention, I disabled hill climbing. I did a few runs here and there with hill climbing and the error margin was so high between runs that I didn't bother anymore, in normal mode it would take a lot of runs of each test to weed out the error and make any reasonable determination of improvement or regression. The hill climbing issue is known and needs to be fixed, I don't think it's worth the time to compare with hill climbing enabled at the moment, as after it is fixed it would behave closer to hill climbing disabled during the benchmark (though maybe not fully, will have to see, but at least the intention of the fix is to reduce the extra error to unnoticeable levels). There is a work item for fixing it in .NET 6, not sure yet about prioritization.

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

We should be able to see from new graphs what the range is after the change compared to before, hopefully that will show if anything has not been fully fixed. I'm expecting that two tests still have regressions, but there could be more.

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

I don't think all of the high error margin is due to hill climbing, for instance even with it disabled I saw ~80K RPS differences in citrine-win/PlaintextPlatform at different times, consistently for some time. I'm not sure where that would be coming from. Anyway hill climbing definitely adds a lot of error, I had pretty stable results without it and high error with it, when running in quick succession with somewhat repeatable results.

@adamsitnik
Copy link
Member

I disabled hill climbing.

What about DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS? Was it set to 1?

For some reason it's now the default setting for PlaintextPlatform and JsonPlatform benchmarks (I believe that it's wrong, we then have 1 epoll thread per core, not 1 per 30 cores which is the default setting that 99.9999% of our customers are using in production)).

It can have a big impact on the benchmark results when you are changing somethinng related to ThreadPool. I would just feel safer about any changes if they were benchmarked with the default settings (with the exception of hill climbing on machines with many cores).

There is a work item for fixing it in .NET 6

I am really happy to hear that! Is there a GH issue that I could follow or for now it's just somewhere in our internal docs?

@adamsitnik
Copy link
Member

I have omitted citrine-lin/Fortunes because it wasn't yielding appropriate results when compared with historical graphs, doesn't seem to be working as expected at the moment.

@sebastienros are we aware of this issue?

cc @roji

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

What about DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS? Was it set to 1?
For some reason it's now the default setting for PlaintextPlatform and JsonPlatform benchmarks

I didn't set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS. If that's the default for PlaintextPlatform and JsonPlatform should I explicitly set it to 0 to override? Is it just those two tests that are affected?

I am really happy to hear that! Is there a GH issue that I could follow or for now it's just somewhere in our internal docs?

Just in our internal docs at the moment, I haven't gotten around to filing GH issues for the work items yet, will do that soon.

@adamsitnik
Copy link
Member

I didn't set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS. If that's the default for PlaintextPlatform and JsonPlatform should I explicitly set it to 0 to override?

Then it depends on which config file you have used. I can see that the docs mention https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml sets it to 1 in an explicit way:

scenarios:
  plaintext:
    application:
      job: platformbenchmarks
      environmentVariables:
        DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS: 1
    load:

but the aspnet/benchmarks repo has one more config for platform benchmarks that does not set it: https://github.com/aspnet/Benchmarks/blob/master/src/BenchmarksApps/Kestrel/PlatformBenchmarks/platform.benchmarks.yml

@kouvel you can just remove the env var or set it to 0

Is it just those two tests that are affected?

Yes, exactly.

@stephentoub
Copy link
Member

What about DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS? Was it set to 1?
For some reason it's now the default setting for PlaintextPlatform and JsonPlatform benchmarks

I would like to understand when that changed and what the gains were/are from setting it.

@sebastienros
Copy link
Member

For some reason it's now the default setting for PlaintextPlatform and JsonPlatform benchmarks (I believe that it's wrong, we then have 1 epoll thread per core, not 1 per 30 cores which is the default setting that 99.9999% of our customers are using in production)).

99.9999% of our customer don't use Platform variants either ;) We made it the default for these 2 benchmarks as it makes them faster, so this is what we are setting on TE and what we are tracking. The middleware variants don't have this flag, so do not the other Platform ones.

I still want to track this one since it's the one we'll submit to TE if it's faster, but if you prefer them to be without this ENV by default I can add the special ones as a new scenario name also in the charts.

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

Then it depends on which config file you have used.

Ah ok I see that now. I used the config file from src/BenchmarksApps/Kestrel/PlatformBenchmarks/platform.benchmarks.yml that does not set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS.

@adamsitnik
Copy link
Member

that does not set

@kouvel great, thanks for checking!

@adamsitnik
Copy link
Member

@sebastienros my main concern with this setting is that we can potentially miss a regression similar to the one that I've introduced with my first epoll PR (#33669) where all benchmarks have improved except JsonPlatform.

JsonPlatform is very specific as it produces the biggest workload for ThreadPool. Plaintext has a higher RPS, but to get the number of socket reads (which are scheduled to TP, writes are not as they always succeed with the first try) we have to divide the RPS by 16 (due to pipelining). So for the current TE network setup, it's 437k (7kk/16) vs 1.2kk socket reads. By monitoring the JsonPlatform with default settings we ensure that the architectural changes that we have introduced in .NET 5 (1 pool thread with parallel TP work items scheduling) are not regressing in extreme scenarios.

@adamsitnik
Copy link
Member

Another thing about DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 is that my understanding was that we try to not cheat by setting custom settings and using non-released builds. If we changed our view on that I can provide few more extra settings (like disabling hill climbing and lowering the number of threads in Thread Pool) that would give us another 50k RPS for JSON.

I would like to understand when that changed and what the gains were/are from setting it

I am not sure when exactly, but my understanding is that has given us an extra 20-30k RPS for JsonPlatform.

@sebastienros
Copy link
Member

we try to not cheat by setting custom settings and using non-released builds

We don't cheat. Platform is the set of benchmarks where we apply all the tricks we can to get the fastest benchmarks. The middleware ones are the scenarios where we try to follow what users would do. And we have always used release or go-live versions. This setting is available, the same way we used to enable tiered compilation when it was just a flag.

If there are other knobs we can change to make the Platform benchmarks faster on these machines, we should do it.

@kouvel
Copy link
Member Author

kouvel commented Dec 17, 2020

At least for regular tracking purposes it would be nice to maybe use a different config file that does not set DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS, as @adamsitnik points out JsonPlatform so far has been very useful to find regressions in the thread pool but with that env var set the thread pool would be used much less and we may miss regressions. Some thread pool changes don't show up as large enough changes in Json as they do in JsonPlatform.

@sebastienros
Copy link
Member

I will make 0 the default then, and track an additional scenario for TE purposes.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PortableThreadPool] ASP.NET TE benchmarks on Windows are slower
5 participants