-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WebSocket Refactoring #48749
WebSocket Refactoring #48749
Conversation
… complexity and allow adding in deflate support later.
…e will not allocate even if the outgoing message is very big. Also this paves the road for deflate support.
Tagging subscribers to this area: @dotnet/ncl Issue DetailsThis PR contains part of the work done in #48470 and is prerequisite for it. We need lightweight abstractions when receiving / sending, so we can easily introduce other stuff on top of it (deflate). I also added several tests to improve the code coverage for System.Net.WebSockets. I removed from some of the tests the need for System.Net.Sockets and instead introduced WebSocketTestStream which can be used in duplex situations and adds the ability to inspect what is being sent / received without relying on raw sockets and NetworkStream. I use this extensively in the deflate PR. Here are some benchmarks comparing the current WebSocket with the version in this PR: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
Job-AKRFKA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT Sending to
Receiving from stream with data available, no async paths:
/cc @CarnaViire @karelz
|
Uploaded the benchmarks used to https://github.com/zlatanov/websocket-benchmarks in case anyone wants to run them. Here is another more useful benchmark. Using the current version of AspNetCore, in memory echo server (https://github.com/zlatanov/websocket-benchmarks/blob/main/src/Benchmarks/EchoBenchmark.cs): BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
DefaultJob : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
The last benchmark is done with DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS. //cc @stephentoub My question is whether it's worthed to spend some time and try to avoid async ValueTask methods as much as possible, or to assume DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS is going to be default in .NET 6 and we can simplify the code in some places where async ValueTask is trying to be avoided by introducing a second method, which is called when async path needs to be taken? |
This is very unlikely. |
Thanks for sharing the numbers. I'm concerned by some of the benchmarks. Most of them show 5-20% regressions. The only one that improves is for a 1MB buffer, and I assume that's because something was previously allocating and now isn't, but that should be fixable in the current implementation as well (I've not yet actually looked at the changes in the PR). |
@zlatanov That looks like a non-trivial perf cost we would pay for a refactoring... Are there options to get the perf back? Or would it be better to choose less ideal solution which does not hurt perf that much? |
Here is updated benchmark for sending to BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
Job-DANIKK : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Toolchain=CoreRun
Client side send is now consistently faster than the original implementation. Server side sending is still slower and so far I couldn't figure out why. PerfView insists that ArrayPool.Shared Rent / Return take almost 25% of the execution time, which I find impossible. Also I don't understand what is this coreclr!JIT_GetSharedGCThreadStaticBaseDynamicClass and why perfview thinks it takes so much of the execution time. |
…d with a very lightweight SendLock,
Results for the send after the last commit: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
Current : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
PR : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
//cc @CarnaViire @stephentoub |
…payload wasn't consumed fully.
…ete synchronously.
OK, I think I am ready. Here are the latest results: Send to Stream.Null BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
Receive, no async paths
Echo in process, with AspNetCore, sending and receiving 10 000 messages
Obviously these tests doesn't represent very real world examples, and if anyone has a suggestion for a benchmark, I will happily include it. @CarnaViire, @stephentoub any feedback is welcome. |
I would hate to see websocket performance regress. |
The shown benchmarks for sync completion get a bit faster but the echo example gets slower. Does that imply that async completions get significantly slower? I've not yet had a chance to look at the code to see what the changes are doing. What explains the improvements and regressions? |
@stephentoub The echo benchmark fluctuates by about 1-2% in either direction every time I run it. Here are results from a run I just did: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
Job-JOGWPO : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Job-KYUCJS : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Profiling it shows me most of the time is spent by the threadpool and i/o by aspnet core server. Maybe I haven't written the benchmarks with aspnetcore well? @davidfowl if you have time, can you take a look at https://github.com/zlatanov/websocket-benchmarks/blob/main/src/Benchmarks/EchoBenchmark.cs and see if it looks ok? |
6d9bf04
to
7c69845
Compare
The only concern is a regression. If the before after for the same ASP.NET Core code looks the same then its probably OK. |
Whats the allocation profile before and after (you can use the allocation profiler in VS hit alt+f2) |
Echo benchmark for a single connection (server, client), 1_000_000 messages of 100 bytes each, the total allocations:
|
…eived message is Text or Binary.
With the last commit, now ReceiveAsync for WebSocket has no allocations even in the async code path, but only for Text and Binary messages. Control messages (ping, pong, close) still have async code paths. Here is what the echo benchmark looks: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
|
@stephentoub When profiling some of the benchmark to see where the CPU spends most of its time, I found something suspicious. 30% of CPU time is spent here. Is this normal, or is there something amiss? |
|
It's normal in highly bursty test cases. The worker threads must be running out of work very quickly, if not for the spin-waiting that or more CPU time would otherwise be taken by context switching. There is a related known issue in the thread pool, fixing of which can improve it a little, but it would still show up. |
/// Lightweight async lock that allows single owner at any given time. | ||
/// </summary> | ||
[StructLayout(LayoutKind.Auto)] | ||
private struct SendLock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned with the addition of this custom lock... are you sure there aren't any existing locks you could use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent of SemaphoreSlim(1, 1), but optimized for the case where there is no contention for the lock. I did extensive profiling and the semaphore that was used before used about 10-15% of the CPU in some of the cases.
Because the WebSocket explicitly doesn't support concurrent SendAsync, the only time where there might be contending for this lock is the case when the socket has received a control message and needs to reply and in the mean time there might be in flight send operation.
I am not aware of any lock primitive other than SemaphoreSlim, that allows acquiring in one thread and releasing in another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @stephentoub
Thanks @kouvel, I didn't know that. I just also found about Here are the results for the echo benchmark with BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
[Host] : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
|
In WebSocketCreateTest.cs there are 3 tests that rely on external servers to test core WebSocket functionality. Is it really necessary to use external servers for this, or should I refactor them to use the new WebSocketTestStream which supports duplex communication? This would resolve #28957. The mentioned issue suggest using loopback server, but it isn't necessary at all. runtime/src/libraries/System.Net.WebSockets/tests/WebSocketCreateTest.cs Lines 245 to 283 in 79ae74f
//cc @stephentoub @CarnaViire |
We need at least some tests that are communicating with a different WebSocket protocol implementation; otherwise, we're only testing against ourselves, and we could easily violate the protocol and not realize it. Chances of that happening when talking to a different implementation are significantly reduced. That said, it doesn't matter to me which tests are communicating with a different implementation, just that there are some (or alternatively that we incorporate a testing suite like https://github.com/crossbario/autobahn-testsuite to help validate the implementation... we tested against that when we initially wrote ManagedWebSocket, but haven't since to my knowledge). |
@stephentoub we run this test suite. So maybe we can run this change on ASP.NET Core. |
🙈 We don't run this suite currently. It hasn't been updated to run after we changed our infrastructure from travis/appveyor I believe. |
@davidfowl @BrennanConroy Can we still somehow leverage existing SignalR benchmarks or something of this sort, please? 😊 We are really interested in benchmarking a real-world scenario. |
@CarnaViire take a look at this command: https://github.com/aspnet/Benchmarks/tree/master/scenarios#sample-8 You will need to add And this point explains how to use locally built dlls: https://github.com/aspnet/Benchmarks/tree/master/scenarios#how-to-upload-custom-files This is using the |
Closing this one as it introduced too much unnecessary changes to the underlying WebSocket without sufficient gains other than less memory allocations, but with the cost of more complicated code. |
Less allocations might be a worthwhile endeavor but it depends on how complicated it makes the code. |
This PR contains part of the work done in #48470 and is prerequisite for it. We need lightweight abstractions when receiving / sending, so we can easily introduce other stuff on top of it (deflate).
I also added several tests to improve the code coverage for System.Net.WebSockets. I removed from some of the tests the need for System.Net.Sockets and instead introduced WebSocketTestStream which can be used in duplex situations and adds the ability to inspect what is being sent / received without relying on raw sockets and NetworkStream. I use this extensively in the deflate PR.
Here are some benchmarks comparing the current WebSocket with the version in this PR:
Sending to
Stream.Null
Receiving from stream with data available, no async paths:
/cc @CarnaViire @karelz