WebSocket Refactoring #48749

Closed
wants to merge 24 commits into from

zlatanov
Contributor

This PR contains part of the work done in #48470 and is a prerequisite for it. We need lightweight abstractions for receiving and sending so that other features (such as deflate) can easily be layered on top.

I also added several tests to improve code coverage for System.Net.WebSockets. In some of the tests I removed the dependency on System.Net.Sockets and instead introduced WebSocketTestStream, which can be used in duplex scenarios and makes it possible to inspect what is being sent and received without relying on raw sockets and NetworkStream. I use this extensively in the deflate PR.

Here are some benchmarks comparing the current WebSocket with the version in this PR:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  Job-AKRFKA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Sending to Stream.Null

| Send | Server | Size | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PR | False | 0 | 209.63 ns | 0.822 ns | 0.686 ns | - | - | - | - |
| net5.0 | False | 0 | 200.54 ns | 0.389 ns | 0.364 ns | - | - | - | - |
| PR | False | 64 | 242.33 ns | 0.949 ns | 0.793 ns | - | - | - | - |
| net5.0 | False | 64 | 219.61 ns | 1.671 ns | 1.563 ns | - | - | - | - |
| PR | False | 128 | 235.83 ns | 1.159 ns | 1.084 ns | - | - | - | - |
| net5.0 | False | 128 | 209.63 ns | 0.505 ns | 0.472 ns | - | - | - | - |
| PR | False | 5012 | 390.27 ns | 1.706 ns | 1.596 ns | - | - | - | - |
| net5.0 | False | 5012 | 362.77 ns | 0.774 ns | 0.724 ns | - | - | - | - |
| PR | False | 1048576 | 61,535.57 ns | 155.147 ns | 145.125 ns | - | - | - | - |
| net5.0 | False | 1048576 | 310,599.77 ns | 4,642.522 ns | 4,559.577 ns | 93.2617 | 93.2617 | 93.2617 | 1048717 B |
| PR | True | 0 | 129.81 ns | 0.636 ns | 0.564 ns | - | - | - | - |
| net5.0 | True | 0 | 98.02 ns | 0.223 ns | 0.198 ns | - | - | - | - |
| PR | True | 64 | 135.48 ns | 0.871 ns | 0.815 ns | - | - | - | - |
| net5.0 | True | 64 | 103.17 ns | 0.306 ns | 0.286 ns | - | - | - | - |
| PR | True | 128 | 135.00 ns | 0.385 ns | 0.360 ns | - | - | - | - |
| net5.0 | True | 128 | 110.24 ns | 0.654 ns | 0.611 ns | - | - | - | - |
| PR | True | 5012 | 215.31 ns | 0.857 ns | 0.802 ns | - | - | - | - |
| net5.0 | True | 5012 | 179.02 ns | 0.238 ns | 0.186 ns | - | - | - | - |
| PR | True | 1048576 | 35,018.27 ns | 482.770 ns | 451.583 ns | - | - | - | - |
| net5.0 | True | 1048576 | 332,306.25 ns | 5,451.668 ns | 4,832.761 ns | 64.9414 | 64.9414 | 64.9414 | 1048686 B |
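
For readers who want a feel for what "sending to Stream.Null" measures, here is a minimal sketch. It is not the benchmark code behind the table above; `SendToNullSketch` and `MeasureSendAsync` are hypothetical names, and a plain `Stopwatch` loop is only a rough stand-in for BenchmarkDotNet. It relies on `WebSocket.CreateFromStream`, with `Stream.Null` as the transport so that only the framing (and, on the client side, masking) work is timed.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not taken from the PR or its benchmark repository.
public static class SendToNullSketch
{
    // Returns the approximate time per SendAsync call in nanoseconds.
    public static async Task<double> MeasureSendAsync(bool isServer, int size, int iterations)
    {
        // Stream.Null discards every write, so no real I/O is involved.
        // An infinite keep-alive interval disables keep-alive pings.
        using WebSocket socket = WebSocket.CreateFromStream(
            Stream.Null, isServer, null, Timeout.InfiniteTimeSpan);

        var payload = new ReadOnlyMemory<byte>(new byte[size]);

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // One complete binary message per call.
            await socket.SendAsync(payload, WebSocketMessageType.Binary, true, CancellationToken.None);
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds * 1_000_000 / iterations;
    }
}
```

Calling `MeasureSendAsync(isServer: false, size: 64, iterations: 1_000_000)` exercises the client path (Server = False rows); passing `true` exercises the server path, which does not mask the payload.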

Receiving from stream with data available, no async paths:

| Receive | Server | Size | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PR | False | 0 | 188.3 ns | 0.80 ns | 0.75 ns | - | - | - | - |
| net5.0 | False | 0 | 155.6 ns | 0.72 ns | 0.64 ns | - | - | - | - |
| PR | False | 64 | 196.2 ns | 0.50 ns | 0.42 ns | - | - | - | - |
| net5.0 | False | 64 | 163.3 ns | 0.57 ns | 0.51 ns | - | - | - | - |
| PR | False | 128 | 202.8 ns | 0.59 ns | 0.53 ns | - | - | - | - |
| net5.0 | False | 128 | 218.5 ns | 0.69 ns | 0.64 ns | - | - | - | - |
| PR | False | 5012 | 335.1 ns | 0.95 ns | 0.84 ns | - | - | - | - |
| net5.0 | False | 5012 | 302.0 ns | 0.73 ns | 0.61 ns | - | - | - | - |
| PR | False | 1048576 | 34,233.5 ns | 546.18 ns | 456.09 ns | - | - | - | - |
| net5.0 | False | 1048576 | 33,838.8 ns | 114.11 ns | 101.15 ns | - | - | - | - |
| PR | True | 0 | 203.1 ns | 4.09 ns | 7.59 ns | - | - | - | - |
| net5.0 | True | 0 | 188.1 ns | 1.50 ns | 1.26 ns | - | - | - | - |
| PR | True | 64 | 212.0 ns | 1.34 ns | 1.19 ns | - | - | - | - |
| net5.0 | True | 64 | 204.7 ns | 0.93 ns | 0.87 ns | - | - | - | - |
| PR | True | 128 | 214.8 ns | 0.65 ns | 0.58 ns | - | - | - | - |
| net5.0 | True | 128 | 236.3 ns | 1.37 ns | 1.21 ns | - | - | - | - |
| PR | True | 5012 | 428.5 ns | 2.23 ns | 1.86 ns | - | - | - | - |
| net5.0 | True | 5012 | 372.9 ns | 2.38 ns | 2.22 ns | - | - | - | - |
| PR | True | 1048576 | 62,565.7 ns | 846.33 ns | 660.76 ns | - | - | - | - |
| net5.0 | True | 1048576 | 63,085.1 ns | 842.69 ns | 747.02 ns | - | - | - | - |

/cc @CarnaViire @karelz

@ghost

ghost commented Feb 25, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: zlatanov
Assignees: -
Labels:

area-System.Net

Milestone: -

@zlatanov
Contributor Author

Uploaded the benchmarks used to https://github.com/zlatanov/websocket-benchmarks in case anyone wants to run them. Here is another, more useful benchmark: an in-memory echo server using the current version of AspNetCore (https://github.com/zlatanov/websocket-benchmarks/blob/main/src/Benchmarks/EchoBenchmark.cs):

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  DefaultJob : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
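
| Method | MessageSize | MessageCount | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Existing | 100 | 10000 | 562.7 ms | 7.39 ms | 6.55 ms | 3000.0000 | - | - | 14.5 MB |
| PR | 100 | 10000 | 572.1 ms | 11.37 ms | 19.00 ms | 4000.0000 | - | - | 19.38 MB |
| PR With ValueTask Pooling | 100 | 10000 | 584.3 ms | 9.85 ms | 9.21 ms | - | - | - | 1.68 MB |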

The last benchmark was run with DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS enabled. //cc @stephentoub
I have to say, though, that if I try to run the test for the existing implementation with async task pooling enabled, I get runtime errors in ManagedWebSocket where a ValueTask is invalid.

My question is whether it's worth spending some time to try to avoid async ValueTask methods as much as possible, or to assume DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS is going to be default in .NET 6, in which case we could simplify the code in the places where async ValueTask is currently avoided by introducing a second method that is called only when the async path needs to be taken.
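
To make the trade-off concrete, here is a minimal sketch of the "second method" pattern referred to above. It is not code from ManagedWebSocket or from this PR; `CountingReader` and `AwaitReadAsync` are hypothetical names. The public method is not `async`, so when the underlying read has already completed no state machine is boxed; only the slow path pays for `async ValueTask`.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example type, not part of System.Net.WebSockets.
public sealed class CountingReader
{
    private readonly Stream _stream;

    public CountingReader(Stream stream) => _stream = stream;

    // Running total of bytes returned by ReadAsync.
    public long TotalBytes { get; private set; }

    // Non-async entry point: handles synchronous completion inline.
    public ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        ValueTask<int> read = _stream.ReadAsync(buffer, cancellationToken);

        if (read.IsCompletedSuccessfully)
        {
            // Fast path: the result is already available, no allocation needed.
            int bytesRead = read.Result;
            TotalBytes += bytesRead;
            return new ValueTask<int>(bytesRead);
        }

        // Slow path: hand the pending operation to the async helper.
        return AwaitReadAsync(read);
    }

    // Second method: only reached when the read did not complete synchronously.
    private async ValueTask<int> AwaitReadAsync(ValueTask<int> pending)
    {
        int bytesRead = await pending.ConfigureAwait(false);
        TotalBytes += bytesRead;
        return bytesRead;
    }
}
```

The cost of this split is that the bookkeeping (here, updating `TotalBytes`) is duplicated in both methods, which is the extra code complexity the question above is weighing against pooled async `ValueTask` builders.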

@stephentoub
Member

or to assume DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS is going to be default in .NET 6

This is very unlikely.

@stephentoub
Member

stephentoub commented Feb 25, 2021

Here are some benchmarks comparing the current WebSocket with the version in this PR:

Thanks for sharing the numbers. I'm concerned by some of the benchmarks. Most of them show 5-20% regressions. The only one that improves is for a 1MB buffer, and I assume that's because something was previously allocating and now isn't, but that should be fixable in the current implementation as well (I've not yet actually looked at the changes in the PR).

@CarnaViire
Member

@zlatanov That looks like a non-trivial perf cost to pay for a refactoring... Are there options to get the perf back? Or would it be better to choose a less ideal solution that does not hurt perf as much?

@zlatanov
Contributor Author

Here is an updated benchmark for sending to Stream.Null:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  Job-DANIKK : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Toolchain=CoreRun  
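| Method | Server | Size | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Send | False | 0 | 145.0 ns | 0.46 ns | 0.41 ns | - | - | - | - |
| Send | False | 64 | 166.2 ns | 0.50 ns | 0.44 ns | - | - | - | - |
| Send | False | 128 | 169.2 ns | 0.41 ns | 0.38 ns | - | - | - | - |
| Send | False | 5012 | 316.0 ns | 5.33 ns | 4.99 ns | - | - | - | - |
| Send | False | 1048576 | 61,473.5 ns | 265.17 ns | 235.07 ns | - | - | - | - |
| Send | True | 0 | 125.1 ns | 0.23 ns | 0.20 ns | - | - | - | - |
| Send | True | 64 | 129.1 ns | 0.18 ns | 0.15 ns | - | - | - | - |
| Send | True | 128 | 132.5 ns | 0.54 ns | 0.48 ns | - | - | - | - |
| Send | True | 5012 | 204.8 ns | 0.66 ns | 0.62 ns | - | - | - | - |
| Send | True | 1048576 | 35,213.9 ns | 572.60 ns | 535.61 ns | - | - | - | - |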

Client-side send is now consistently faster than the original implementation. Server-side send is still slower, and so far I haven't been able to figure out why. PerfView insists that ArrayPool.Shared Rent / Return take almost 25% of the execution time, which I find impossible.

[Image: PerfView flame graph]

I also don't understand what coreclr!JIT_GetSharedGCThreadStaticBaseDynamicClass is or why PerfView thinks it takes so much of the execution time.

@zlatanov
Contributor Author

Results for the send after the last commit:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  Current : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  PR : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

| Toolchain | Server | Size | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | False | 0 | 192.8 ns | 1.75 ns | 1.46 ns | 192.6 ns | 1.57 | 0.02 | - | - | - | - |
| PR | False | 0 | 122.4 ns | 1.40 ns | 1.24 ns | 121.7 ns | 1.00 | 0.00 | - | - | - | - |
| Current | False | 64 | 210.0 ns | 0.34 ns | 0.28 ns | 209.8 ns | 1.47 | 0.00 | - | - | - | - |
| PR | False | 64 | 142.6 ns | 0.44 ns | 0.39 ns | 142.7 ns | 1.00 | 0.00 | - | - | - | - |
| Current | False | 128 | 214.9 ns | 2.79 ns | 2.61 ns | 213.2 ns | 1.44 | 0.02 | - | - | - | - |
| PR | False | 128 | 149.1 ns | 0.41 ns | 0.36 ns | 149.0 ns | 1.00 | 0.00 | - | - | - | - |
| Current | False | 5012 | 370.4 ns | 2.90 ns | 2.57 ns | 369.1 ns | 1.23 | 0.01 | - | - | - | - |
| PR | False | 5012 | 300.4 ns | 0.63 ns | 0.53 ns | 300.4 ns | 1.00 | 0.00 | - | - | - | - |
| Current | False | 1048576 | 141,064.7 ns | 2,790.04 ns | 3,321.34 ns | 140,003.2 ns | 2.24 | 0.09 | 197.5098 | 197.5098 | 197.5098 | 1048608 B |
| PR | False | 1048576 | 63,118.2 ns | 1,221.92 ns | 1,500.63 ns | 64,083.5 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 0 | 100.3 ns | 0.32 ns | 0.26 ns | 100.3 ns | 0.93 | 0.01 | - | - | - | - |
| PR | True | 0 | 107.8 ns | 0.64 ns | 0.86 ns | 107.7 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 64 | 105.1 ns | 1.98 ns | 1.75 ns | 104.3 ns | 1.02 | 0.02 | - | - | - | - |
| PR | True | 64 | 103.4 ns | 0.53 ns | 0.50 ns | 103.5 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 128 | 106.7 ns | 1.09 ns | 1.02 ns | 106.6 ns | 0.99 | 0.02 | - | - | - | - |
| PR | True | 128 | 107.6 ns | 2.16 ns | 2.31 ns | 106.6 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 5012 | 186.1 ns | 1.08 ns | 1.01 ns | 185.8 ns | 1.04 | 0.01 | - | - | - | - |
| PR | True | 5012 | 178.6 ns | 2.19 ns | 1.94 ns | 177.7 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 1048576 | 108,895.6 ns | 2,093.10 ns | 2,326.47 ns | 108,227.3 ns | 3.02 | 0.10 | 173.4619 | 173.4619 | 173.4619 | 1048608 B |
| PR | True | 1048576 | 36,168.6 ns | 713.12 ns | 927.26 ns | 35,854.3 ns | 1.00 | 0.00 | - | - | - | - |

//cc @CarnaViire @stephentoub

@zlatanov
Contributor Author

OK, I think I am ready. Here are the latest results:

Send to Stream.Null

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT

| Toolchain | Server | Size | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | False | 0 | 199.93 ns | 0.430 ns | 0.335 ns | 1.00 | - | - | - | - |
| PR | False | 0 | 122.20 ns | 0.151 ns | 0.126 ns | 0.61 | - | - | - | - |
| Current | False | 64 | 212.00 ns | 1.978 ns | 1.544 ns | 1.00 | - | - | - | - |
| PR | False | 64 | 141.45 ns | 0.184 ns | 0.163 ns | 0.67 | - | - | - | - |
| Current | False | 128 | 210.42 ns | 0.638 ns | 0.533 ns | 1.00 | - | - | - | - |
| PR | False | 128 | 146.44 ns | 2.871 ns | 2.819 ns | 0.70 | - | - | - | - |
| Current | False | 4096 | 349.55 ns | 1.067 ns | 0.946 ns | 1.00 | - | - | - | - |
| PR | False | 4096 | 267.75 ns | 0.653 ns | 0.579 ns | 0.77 | - | - | - | - |
| Current | False | 16384 | 755.87 ns | 3.062 ns | 2.865 ns | 1.00 | - | - | - | - |
| PR | False | 16384 | 658.21 ns | 2.336 ns | 2.070 ns | 0.87 | - | - | - | - |
| Current | False | 1048576 | 137,646.14 ns | 2,617.555 ns | 2,800.755 ns | 1.00 | 187.2559 | 187.2559 | 187.2559 | 1048608 B |
| PR | False | 1048576 | 61,587.04 ns | 696.172 ns | 617.138 ns | 0.45 | - | - | - | - |
| Current | True | 0 | 103.09 ns | 1.585 ns | 1.405 ns | 1.00 | - | - | - | - |
| PR | True | 0 | 96.53 ns | 0.182 ns | 0.170 ns | 0.94 | - | - | - | - |
| Current | True | 64 | 104.97 ns | 0.220 ns | 0.183 ns | 1.00 | - | - | - | - |
| PR | True | 64 | 102.70 ns | 1.417 ns | 1.326 ns | 0.98 | - | - | - | - |
| Current | True | 128 | 108.45 ns | 1.343 ns | 1.257 ns | 1.00 | - | - | - | - |
| PR | True | 128 | 105.47 ns | 0.235 ns | 0.220 ns | 0.97 | - | - | - | - |
| Current | True | 4096 | 171.06 ns | 0.111 ns | 0.093 ns | 1.00 | - | - | - | - |
| PR | True | 4096 | 184.62 ns | 0.169 ns | 0.141 ns | 1.08 | - | - | - | - |
| Current | True | 16384 | 424.67 ns | 1.250 ns | 1.169 ns | 1.00 | - | - | - | - |
| PR | True | 16384 | 415.21 ns | 1.217 ns | 0.950 ns | 0.98 | - | - | - | - |
| Current | True | 1048576 | 101,729.16 ns | 2,022.408 ns | 2,557.699 ns | 1.00 | 175.2930 | 175.2930 | 175.2930 | 1048608 B |
| PR | True | 1048576 | 36,614.73 ns | 711.404 ns | 1,148.785 ns | 0.36 | - | - | - | - |

Receive, no async paths

| Toolchain | Server | Size | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | False | 0 | 153.8 ns | 1.41 ns | 1.25 ns | 153.3 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 0 | 126.9 ns | 1.67 ns | 1.48 ns | 126.8 ns | 0.82 | 0.01 | - | - | - | - |
| Current | False | 64 | 156.7 ns | 0.93 ns | 0.83 ns | 156.5 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 64 | 163.0 ns | 1.82 ns | 1.62 ns | 162.4 ns | 1.04 | 0.01 | - | - | - | - |
| Current | False | 128 | 207.9 ns | 0.90 ns | 0.79 ns | 207.6 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 128 | 181.9 ns | 0.25 ns | 0.23 ns | 181.9 ns | 0.87 | 0.00 | - | - | - | - |
| Current | False | 4096 | 295.1 ns | 5.48 ns | 5.13 ns | 293.5 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 4096 | 253.4 ns | 0.33 ns | 0.27 ns | 253.5 ns | 0.86 | 0.02 | - | - | - | - |
| Current | False | 16384 | 580.7 ns | 1.02 ns | 1.37 ns | 580.2 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 16384 | 468.1 ns | 6.43 ns | 6.02 ns | 464.4 ns | 0.81 | 0.01 | - | - | - | - |
| Current | False | 1048576 | 35,508.5 ns | 696.06 ns | 1,020.27 ns | 35,098.1 ns | 1.00 | 0.00 | - | - | - | - |
| PR | False | 1048576 | 33,798.8 ns | 90.36 ns | 75.46 ns | 33,777.2 ns | 0.94 | 0.02 | - | - | - | - |
| Current | True | 0 | 172.2 ns | 0.58 ns | 0.48 ns | 172.3 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 0 | 127.8 ns | 0.13 ns | 0.10 ns | 127.7 ns | 0.74 | 0.00 | - | - | - | - |
| Current | True | 64 | 192.9 ns | 0.59 ns | 0.55 ns | 192.6 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 64 | 169.3 ns | 0.28 ns | 0.25 ns | 169.2 ns | 0.88 | 0.00 | - | - | - | - |
| Current | True | 128 | 232.8 ns | 2.82 ns | 2.50 ns | 231.5 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 128 | 192.7 ns | 0.40 ns | 0.37 ns | 192.7 ns | 0.83 | 0.01 | - | - | - | - |
| Current | True | 4096 | 340.7 ns | 0.69 ns | 0.62 ns | 340.5 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 4096 | 305.3 ns | 4.39 ns | 4.11 ns | 305.3 ns | 0.90 | 0.01 | - | - | - | - |
| Current | True | 16384 | 778.5 ns | 1.79 ns | 1.59 ns | 777.5 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 16384 | 776.6 ns | 0.72 ns | 0.57 ns | 776.8 ns | 1.00 | 0.00 | - | - | - | - |
| Current | True | 1048576 | 61,929.4 ns | 1,210.45 ns | 1,774.27 ns | 63,263.2 ns | 1.00 | 0.00 | - | - | - | - |
| PR | True | 1048576 | 59,379.7 ns | 175.80 ns | 137.26 ns | 59,349.9 ns | 0.95 | 0.03 | - | - | - | - |

Echo in process, with AspNetCore, sending and receiving 10 000 messages

| Toolchain | Size | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | 50 | 10000 | 450.0 ms | 8.64 ms | 8.49 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 50 | 10000 | 468.5 ms | 7.77 ms | 7.27 ms | 1.04 | 0.03 | 2000.0000 | - | - | 11.44 MB |
| Current | 100 | 10000 | 452.0 ms | 8.93 ms | 16.33 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 100 | 10000 | 460.6 ms | 5.76 ms | 5.39 ms | 1.01 | 0.04 | 2000.0000 | - | - | 11.45 MB |
| Current | 1000 | 10000 | 461.0 ms | 7.50 ms | 6.65 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 1000 | 10000 | 471.3 ms | 6.29 ms | 5.88 ms | 1.02 | 0.02 | 2000.0000 | - | - | 11.45 MB |

Obviously these tests don't represent real-world scenarios very well, and if anyone has a suggestion for a benchmark, I will happily include it.

@CarnaViire, @stephentoub any feedback is welcome.

@davidfowl
Member

I would hate to see websocket performance regress.
@BrennanConroy @sebastienros can we run the SignalR benchmarks with this PR?

@stephentoub
Member

The shown benchmarks for sync completion get a bit faster but the echo example gets slower. Does that imply that async completions get significantly slower?

I've not yet had a chance to look at the code to see what the changes are doing. What explains the improvements and regressions?

@zlatanov
Contributor Author

@stephentoub The echo benchmark fluctuates by about 1-2% in either direction every time I run it. Here are results from a run I just did:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  Job-JOGWPO : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-KYUCJS : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

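| Toolchain | Size | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | 50 | 10000 | 564.1 ms | 8.44 ms | 7.05 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 50 | 10000 | 558.5 ms | 11.12 ms | 12.80 ms | 0.99 | 0.03 | 2000.0000 | - | - | 11.29 MB |
| Current | 100 | 10000 | 556.1 ms | 9.41 ms | 7.86 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 100 | 10000 | 592.4 ms | 11.69 ms | 23.07 ms | 1.02 | 0.05 | 2000.0000 | - | - | 11.28 MB |
| Current | 1000 | 10000 | 602.0 ms | 9.09 ms | 8.51 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.49 MB |
| PR | 1000 | 10000 | 595.2 ms | 9.49 ms | 8.87 ms | 0.99 | 0.03 | 2000.0000 | - | - | 11.29 MB |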

Profiling it shows me most of the time is spent by the threadpool and i/o by aspnet core server. Maybe I haven't written the benchmarks with aspnetcore well? @davidfowl if you have time, can you take a look at https://github.com/zlatanov/websocket-benchmarks/blob/main/src/Benchmarks/EchoBenchmark.cs and see if it looks ok?

@davidfowl
Member

Profiling it shows me most of the time is spent by the threadpool and i/o by aspnet core server. Maybe I haven't written the benchmarks with aspnetcore well? @davidfowl if you have time, can you take a look at https://github.com/zlatanov/websocket-benchmarks/blob/main/src/Benchmarks/EchoBenchmark.cs and see if it looks ok?

The only concern is a regression. If the before and after for the same ASP.NET Core code look the same, then it's probably OK.

@davidfowl
Member

What's the allocation profile before and after? (You can use the allocation profiler in VS; hit Alt+F2.)

@zlatanov
Contributor Author

What's the allocation profile before and after? (You can use the allocation profiler in VS; hit Alt+F2.)

Echo benchmark for a single connection (server, client), 1_000_000 messages of 100 bytes each. Total allocations:

  • Current 29_234 bytes
  • PR 29_886 bytes

Base automatically changed from master to main March 1, 2021 09:08
@zlatanov
Contributor Author

zlatanov commented Mar 1, 2021

With the last commit, ReceiveAsync for WebSocket now has no allocations even in the async code path, but only for Text and Binary messages. Control messages (ping, pong, close) still have async code paths. Here is what the echo benchmark looks like:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
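
| Toolchain | Size | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | 50 | 10000 | 562.9 ms | 11.00 ms | 17.13 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.49 MB |
| PR | 50 | 10000 | 554.2 ms | 11.03 ms | 13.96 ms | 0.99 | 0.02 | 1000.0000 | - | - | 7.63 MB |
| Current | 100 | 10000 | 590.6 ms | 11.59 ms | 26.39 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.49 MB |
| PR | 100 | 10000 | 563.0 ms | 11.21 ms | 11.01 ms | 0.96 | 0.04 | 1000.0000 | - | - | 7.63 MB |
| Current | 1000 | 10000 | 582.1 ms | 11.38 ms | 17.38 ms | 1.00 | 0.00 | 3000.0000 | - | - | 14.5 MB |
| PR | 1000 | 10000 | 565.9 ms | 4.28 ms | 3.57 ms | 0.96 | 0.04 | 1000.0000 | - | - | 7.63 MB |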

@zlatanov
Contributor Author

zlatanov commented Mar 1, 2021

@stephentoub When profiling some of the benchmarks to see where the CPU spends most of its time, I found something suspicious.

image

30% of CPU time is spent here.

image

Is this normal, or is there something amiss?

@stephentoub
Member

30% of CPU time is spent here. Is this normal, or is there something amiss?

@kouvel

@kouvel
Member

kouvel commented Mar 1, 2021

It's normal in highly bursty test cases. The worker threads must be running out of work very quickly; if not for the spin-waiting, that much or more CPU time would otherwise be taken by context switching. There is a related known issue in the thread pool; fixing it could improve this a little, but it would still show up.

/// Lightweight async lock that allows single owner at any given time.
/// </summary>
[StructLayout(LayoutKind.Auto)]
private struct SendLock
Member

I am concerned with the addition of this custom lock... are you sure there aren't any existing locks you could use?

Contributor Author

This is the equivalent of SemaphoreSlim(1, 1), but optimized for the case where there is no contention for the lock. I did extensive profiling, and the semaphore that was used before accounted for about 10-15% of the CPU in some cases.

Because the WebSocket explicitly doesn't support concurrent SendAsync calls, the only time there might be contention for this lock is when the socket has received a control message and needs to reply while a send operation is still in flight.

I am not aware of any lock primitive other than SemaphoreSlim that allows acquiring on one thread and releasing on another.
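
As an illustration of the idea, and not the `SendLock` struct from this PR, here is a minimal sketch of an async lock whose uncontended path is a single interlocked operation. `UncontendedAsyncLock` is a hypothetical name, and the design (a counter in front of a semaphore that starts at zero, sometimes called a benaphore) is an assumption chosen for brevity. It does not support cancellation.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; the PR's actual SendLock is not reproduced here.
public sealed class UncontendedAsyncLock
{
    // Number of callers that currently hold or are waiting for the lock.
    private int _count;

    // Starts at zero: it is only used to park and wake contended waiters.
    private readonly SemaphoreSlim _waiters = new SemaphoreSlim(0);

    public Task AcquireAsync()
    {
        // Fast path: the caller that raises the count from 0 to 1 owns the lock.
        if (Interlocked.Increment(ref _count) == 1)
        {
            return Task.CompletedTask;
        }

        // Contended path: wait until the current owner hands the lock over.
        return _waiters.WaitAsync();
    }

    // May be called from any thread, not only the one that acquired the lock.
    public void Release()
    {
        // If other callers are still counted, wake exactly one of them.
        if (Interlocked.Decrement(ref _count) > 0)
        {
            _waiters.Release();
        }
    }
}
```

When there is no contention, `AcquireAsync` and `Release` never touch the semaphore at all, which is the property the comment above is after; under contention the behavior falls back to that of `SemaphoreSlim`.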

@zlatanov
Contributor Author

zlatanov commented Mar 1, 2021

Thanks @kouvel, I didn't know that. I also just found out about UnsafePreferInlineScheduling in SocketTransportOptions for AspNetCore. After setting it to true, the LowLevelSpinWaiter was no longer a hot spot.

Here are the results for the echo benchmark with UnsafePreferInlineScheduling = true:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
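
| Toolchain | Size | Count | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | 50 | 10K | 515.4 ms | 10.08 ms | 19.18 ms | 516.0 ms | 1.00 | 0.00 | 3000 | - | - | 14.5 MB |
| PR | 50 | 10K | 488.7 ms | 9.47 ms | 14.47 ms | 488.2 ms | 0.95 | 0.05 | 1000 | - | - | 7.63 MB |
| Current | 100 | 10K | 514.3 ms | 10.20 ms | 26.52 ms | 513.4 ms | 1.00 | 0.00 | 3000 | - | - | 14.49 MB |
| PR | 100 | 10K | 484.4 ms | 9.53 ms | 20.31 ms | 476.2 ms | 0.94 | 0.06 | 1000 | - | - | 7.63 MB |
| Current | 1000 | 10K | 525.9 ms | 10.39 ms | 15.55 ms | 526.6 ms | 1.00 | 0.00 | 3000 | - | - | 14.49 MB |
| PR | 1000 | 10K | 493.6 ms | 9.75 ms | 15.75 ms | 497.7 ms | 0.94 | 0.05 | 1000 | - | - | 7.63 MB |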
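
For anyone reproducing the run above, this is roughly how the option can be switched on in an ASP.NET Core host. It is a configuration sketch only, not the setup from the benchmark repository; the `Program` class and the bare `UseWebSockets()` pipeline are placeholders. As the "Unsafe" prefix suggests, the option makes continuations run inline on the transport's I/O threads, so it should be treated as a benchmarking knob rather than a recommendation.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

// Placeholder host; the real echo endpoint is omitted.
public static class Program
{
    public static void Main(string[] args)
    {
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(web =>
            {
                // Configure Kestrel's socket transport (SocketTransportOptions).
                web.UseSockets(options => options.UnsafePreferInlineScheduling = true);

                // Minimal pipeline with the WebSocket middleware enabled.
                web.Configure(app => app.UseWebSockets());
            })
            .Build()
            .Run();
    }
}
```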

@zlatanov
Contributor Author

zlatanov commented Mar 2, 2021

In WebSocketCreateTest.cs there are 3 tests that rely on external servers to test core WebSocket functionality. Is it really necessary to use external servers for this, or should I refactor them to use the new WebSocketTestStream, which supports duplex communication? This would resolve #28957. The mentioned issue suggests using a loopback server, but that isn't necessary at all.

[ActiveIssue("https://github.com/dotnet/runtime/issues/28957")]
[OuterLoop("Uses external servers")]
[Theory]
[MemberData(nameof(EchoServersAndBoolean))]
[PlatformSpecific(~TestPlatforms.Browser)] // System.Net.Sockets is not supported on this platform.
public async Task WebSocketProtocol_CreateFromConnectedStream_CloseAsyncAfterCloseReceivedClosesStream(Uri echoUri, bool useCloseOutputAsync)
{
    using (var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
    {
        bool secure = echoUri.Scheme == "wss";
        client.Connect(echoUri.Host, secure ? 443 : 80);
        using (Stream stream = await CreateWebSocketStream(echoUri, client, secure))
        using (WebSocket socket = CreateFromStream(stream, false, null, TimeSpan.FromSeconds(10)))
        {
            Assert.NotNull(socket);
            Assert.Equal(WebSocketState.Open, socket.State);

            // Ask server to send us a close
            await socket.SendAsync(new ArraySegment<byte>(Encoding.UTF8.GetBytes(".close")), WebSocketMessageType.Text, true, default);

            // Verify received server-initiated close message.
            WebSocketReceiveResult recvResult = await socket.ReceiveAsync(new ArraySegment<byte>(new byte[256]), default);
            Assert.Equal(WebSocketCloseStatus.NormalClosure, recvResult.CloseStatus);
            Assert.Equal(WebSocketCloseStatus.NormalClosure, socket.CloseStatus);
            Assert.Equal(WebSocketState.CloseReceived, socket.State);

            Assert.True(stream.CanRead);
            Assert.True(stream.CanWrite);

            await (useCloseOutputAsync ?
                socket.CloseOutputAsync(WebSocketCloseStatus.NormalClosure, "", CancellationToken.None) :
                socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "", CancellationToken.None));

            Assert.False(stream.CanRead);
            Assert.False(stream.CanWrite);
        }
    }
}

//cc @stephentoub @CarnaViire

@stephentoub
Member

stephentoub commented Mar 2, 2021

there are 3 tests that rely on external servers to test core WebSocket functionality

We need at least some tests that are communicating with a different WebSocket protocol implementation; otherwise, we're only testing against ourselves, and we could easily violate the protocol and not realize it. Chances of that happening when talking to a different implementation are significantly reduced.

That said, it doesn't matter to me which tests are communicating with a different implementation, just that there are some (or alternatively that we incorporate a testing suite like https://github.com/crossbario/autobahn-testsuite to help validate the implementation... we tested against that when we initially wrote ManagedWebSocket, but haven't since to my knowledge).

@davidfowl
Member

davidfowl commented Mar 3, 2021

@stephentoub we run this test suite. So maybe we can run this change on ASP.NET Core.

cc @BrennanConroy

@BrennanConroy
Member

🙈 We don't run this suite currently. It hasn't been updated to run since we changed our infrastructure from travis/appveyor, I believe.

@CarnaViire
Member

@davidfowl @BrennanConroy Can we still somehow leverage existing SignalR benchmarks or something of this sort, please? 😊 We are really interested in benchmarking a real-world scenario.

@sebastienros
Member

@CarnaViire take a look at this command: https://github.com/aspnet/Benchmarks/tree/master/scenarios#sample-8

You will need to add --application.channel latest --application.framework net6.0 to run against the latest runtime builds instead of 5.0.

And this point explains how to use locally built dlls: https://github.com/aspnet/Benchmarks/tree/master/scenarios#how-to-upload-custom-files

This is using the crank dotnet tool and services. It's explained at the top, you can read about it. And ping me if you have any questions.

@zlatanov
Contributor Author

zlatanov commented Mar 8, 2021

Closing this one as it introduced too many unnecessary changes to the underlying WebSocket without sufficient gains other than fewer memory allocations, and at the cost of more complicated code.

@zlatanov zlatanov closed this Mar 8, 2021
@davidfowl
Member

Fewer allocations might be a worthwhile endeavor, but it depends on how complicated it makes the code.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 7, 2021
@zlatanov zlatanov deleted the websocket-refactor branch April 28, 2021 21:42
@karelz karelz added this to the 6.0.0 milestone May 20, 2021