Fixed throughput issues by fixing lack of guarantee on SCTP -> DTLS packet ordering #513
Conversation
…okio::spawn. The ordering mixup triggered SCTP's congestion control, severely limiting throughput in practice.
Thanks for fixing the longstanding SCTP throughput issue.
Could you fix the fmt/clippy errors?
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master     #513      +/-   ##
==========================================
- Coverage   61.57%   61.55%   -0.03%
==========================================
  Files         529      529
  Lines       48865    48865
  Branches    12363    12383      +20
==========================================
- Hits        30088    30077      -11
- Misses       9579     9581       +2
- Partials     9198     9207       +9
Nice! It's very interesting that this issue only arose when actually communicating over the internet. We only tested this locally when I worked on this, probably because the congestion window is basically infinite if you stay on your local machine.

I also think I did not try this exact version; I had a few variants where I moved the whole loop over the vector of raw packets into a spawn_blocking and then processed the whole marshalled vector over a channel in another loop in the write task. I think the key here is the frequent calls to tokio::spawn.
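For illustration, a rough sketch of that variant: marshal the whole batch on the blocking pool, then feed the results in order to a single write task. The Packet type, its marshal method, and the DTLS writer here are hypothetical stand-ins, not the code that was actually tried.

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio::sync::mpsc;

// Hypothetical stand-in for the real SCTP packet type and its
// CPU-heavy marshalling step.
struct Packet(Vec<u8>);

impl Packet {
    fn marshal(&self) -> Vec<u8> {
        self.0.clone()
    }
}

async fn send_batch(packets: Vec<Packet>, tx: mpsc::Sender<Vec<u8>>) {
    // Marshal the whole batch on the blocking pool so the async executor
    // is not stalled by the CPU-heavy work...
    let marshalled = tokio::task::spawn_blocking(move || {
        packets.iter().map(Packet::marshal).collect::<Vec<_>>()
    })
    .await
    .expect("marshalling task panicked");

    // ...then hand the marshalled buffers to the write task in order.
    for raw in marshalled {
        if tx.send(raw).await.is_err() {
            break; // write task has shut down
        }
    }
}

// A single write task drains the channel, so the packets reach DTLS
// in the order they were queued.
async fn write_loop<W: AsyncWrite + Unpin>(mut rx: mpsc::Receiver<Vec<u8>>, mut dtls: W) {
    while let Some(raw) = rx.recv().await {
        if dtls.write_all(&raw).await.is_err() {
            break;
        }
    }
}
```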
Thanks for merging!
Before this PR:

After this PR:

Of course, for a real non-local test the performance gains are still immense due to the original congestion control issue. But due to the overhead added by spawn_blocking, local throughput takes a hit.

I tested keeping the original code but setting the semaphore limit to 1, but that had some significant issues I didn't have time to diagnose. I might circle back to this later when I have time, but let me know if you guys have any ideas for alternatives to try.
@onnoowl at one point I was thinking about using two separate tokio runtimes to separate the read and write loops and guarantee parallelism. That seemed like a very intrusive change, which is why I didn't do it. I don't have things set up to test this quickly, but the performance of this would be an interesting data point.

Another thing that is still in the pipeline is performance upgrades in the crc implementation, which is IIRC one of the last bottlenecks in the marshalling of the packets. But this depends on the maintainers of the crc crate making a new release, which they are currently working on.
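This isn't a concrete proposal, just a minimal sketch of what a dedicated second runtime for the write loop could look like; all names here are made up, and in the library the secondary runtime would have to be supplied by the user.

```rust
use std::thread;
use tokio::runtime::Builder;
use tokio::sync::mpsc;

fn main() {
    // Primary runtime drives the read side (and everything else).
    let primary = Builder::new_multi_thread().enable_all().build().unwrap();

    // A secondary single-threaded runtime on its own OS thread only runs the
    // write loop, so it can never be starved by work on the primary runtime.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(1024);
    let writer = thread::spawn(move || {
        let write_rt = Builder::new_current_thread().enable_all().build().unwrap();
        write_rt.block_on(async move {
            while let Some(buf) = rx.recv().await {
                // In the real library this would be the DTLS write.
                println!("writing {} bytes", buf.len());
            }
        });
    });

    primary.block_on(async move {
        // Stand-in for the read/marshal side: queue a few packets in order.
        for i in 0..3u8 {
            tx.send(vec![i; 8]).await.unwrap();
        }
    });
    // tx is dropped when block_on returns, which ends the write loop.
    writer.join().unwrap();
}
```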
I forgot that we actually made the local throughput measurement into an example! I went ahead and collected some data. Sadly I don't have convenient access to a server right now to test this over the internet, which is an important factor as this PR showed. I think the results are nonetheless interesting. Using the code before this PR as a baseline, performance looks like this (on an M1 Mac):
It seems like just setting the limit to 1 delivers the best performance in a local scenario on macOS, but @onnoowl said that approach had significant issues.
I think I remember that the limit was chosen as 8 because on Linux limiting this to 1 tanked performance, or rather made it very jittery, while on macOS it doesn't, but I am not totally sure. I think we also never found out why performance on Linux behaved so weirdly. Is that what you meant by "significant issues"?

As I expected, using separate runtimes to force parallelism brings back some of the performance, but it comes with a fair bit of extra complexity because we would have to ask users to provide a secondary runtime to this library.

Before this PR with limit set to the default 8
Before this PR but with limit set to 1
After this PR
With forced parallelism
Another data point to consider is the effect of eliminating the crc bottleneck (by using the 3.1.0-beta.1 version of the crate and enabling the "slice16-mem-limit" feature):

Speedup for current master: ~x1.3
Speedup for forced parallelism: ~x1.9
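For context, every SCTP packet carries a CRC32c checksum, which is what the crc crate computes in the marshalling hot path. A minimal sketch with the crate's stable 3.x API follows; the beta release and the "slice16-mem-limit" feature mentioned above are selected via Cargo features and, as far as I know, change the lookup-table strategy rather than this interface.

```rust
use crc::{Crc, CRC_32_ISCSI};

// CRC-32/ISCSI is the Castagnoli polynomial that SCTP uses for its checksum.
const CRC32C: Crc<u32> = Crc::<u32>::new(&CRC_32_ISCSI);

fn main() {
    // Standard check value for CRC-32C.
    assert_eq!(CRC32C.checksum(b"123456789"), 0xE306_9283);

    // Every outgoing packet gets checksummed during marshalling, which is why
    // a faster table implementation translates directly into throughput.
    let packet = vec![0u8; 1200];
    println!("checksum: {:#010x}", CRC32C.checksum(&packet));
}
```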
Hi all,
I ran into some severe throughput issues with webrtc-rs, but I've managed to track down the problem. I was using webrtc through the Matchbox crate, and found my total throughput was limited to about 3 Mbps across several different machines and networks.
TLDR
The use of tokio::task::spawn when SCTP sends its packets to the underlying DTLS layer causes packets to be sent out of order. These out-of-order packets trigger SCTP's congestion control mechanisms, limiting total throughput to around 3 Mbps on the machines I tested. Changing the code to guarantee packet ordering at the time of sending resolves the issue.

Bug Details
Here's the source of the problem:
https://github.com/webrtc-rs/webrtc/pull/363/files#diff-26a945150f33caddc941f57968f74224cba89af5a0ac4ab4eac4d803f50a6a70R524 (a simplified sketch of the pattern is included below)
@KillingSpark made some lovely performance improvements in Issue #360, PR #363 to remove mutex contention, but it looks like the tokio::task::spawn that was introduced can cause out-of-order packet delivery, which has severe consequences for SCTP's congestion control and the resulting throughput.

While I don't see throughput issues when testing locally, the moment I did any testing over the internet I was seeing 3 Mbps of throughput, sometimes up to 10, but never any higher.
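The linked diff isn't reproduced here, but heavily simplified, and with hypothetical names standing in for the actual webrtc-rs internals, the problematic shape is a spawn-per-packet write path like this:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Hypothetical stand-in for the CPU-heavy packet marshalling.
fn marshal(packet: &[u8]) -> Vec<u8> {
    packet.to_vec()
}

async fn send_all(packets: Vec<Vec<u8>>, dtls_tx: mpsc::Sender<Vec<u8>>) {
    // Bound the number of in-flight marshalling tasks (8 in the original change).
    let limit = Arc::new(Semaphore::new(8));

    for packet in packets {
        let permit = limit.clone().acquire_owned().await.unwrap();
        let tx = dtls_tx.clone();
        // Each packet is marshalled and forwarded in its own task. The tasks can
        // run and complete in any order, so packet N+1 may reach DTLS before
        // packet N -- exactly the reordering that trips SCTP's congestion control.
        tokio::spawn(async move {
            let raw = marshal(&packet);
            let _ = tx.send(raw).await;
            drop(permit);
        });
    }
}
```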
To test, I sent 10 megabytes of data all at once through a single stream and timed the difference between delivery of the first and last packets. Turning on logging, I found that there would often be 70 to 100 fast re-transmission events in the SCTP logic during that 10-megabyte delivery. These fast re-transmissions kept the congestion control window very low, causing the poor throughput.
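The measurement itself can be as simple as timestamping the first and last received message. A rough sketch, with a plain channel receiver standing in for the actual data channel used in the test:

```rust
use std::time::Instant;
use tokio::sync::mpsc;

// `rx` stands in for whatever yields the received data-channel messages.
async fn measure_throughput(mut rx: mpsc::Receiver<Vec<u8>>, expected_bytes: usize) {
    let mut received = 0usize;
    let mut start: Option<Instant> = None;

    while let Some(msg) = rx.recv().await {
        // Start the clock on the first message.
        start.get_or_insert_with(Instant::now);
        received += msg.len();
        if received >= expected_bytes {
            break;
        }
    }

    let elapsed = start.expect("no data received").elapsed();
    let mbps = (received as f64 * 8.0) / elapsed.as_secs_f64() / 1_000_000.0;
    println!("received {received} bytes in {elapsed:?} ({mbps:.1} Mbps)");
}
```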
Using Wireshark I was able to confirm that no packets were being re-ordered or lost over the internet in my tests; the packets were already out of order at the time of sending. Specifically, they were out of order between the SCTP and DTLS layers. The DTLS packets were being sent with strictly increasing sequence numbers, as you would expect, but the SCTP sequence numbers were all over the place at the time of sending. Packets were frequently displaced by up to 8 positions. These frequent misorderings were causing the frequent re-transmissions.
The Solution
While simply removing the tokio::task::spawn completely resolved the issue (giving me hundreds of Mbps of throughput), I wanted to address the comment left behind by @KillingSpark.

Simply setting the semaphore limit to 1 should be a good solution. However, given the issues @KillingSpark found, maybe the best thing to do would be to put the packet marshaling in a tokio::task::spawn_blocking? This would help ensure that the CPU-heavy work never interferes with performance in general. I've written a potential implementation with spawn_blocking in this PR for you to take a look at.
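This is not the actual diff in this PR, just a minimal sketch of the idea, assuming a hypothetical marshal function and a DTLS handle that implements AsyncWrite: the marshalling runs on the blocking pool, but the writes are awaited one at a time on a single task, so the ordering guarantee is preserved.

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Hypothetical stand-in for the real SCTP packet marshalling.
fn marshal(packet: Vec<u8>) -> Vec<u8> {
    packet
}

async fn write_ordered<W: AsyncWrite + Unpin>(
    packets: Vec<Vec<u8>>,
    dtls: &mut W,
) -> std::io::Result<()> {
    for packet in packets {
        // CPU-heavy marshalling runs on the blocking pool...
        let raw = tokio::task::spawn_blocking(move || marshal(packet))
            .await
            .expect("marshal task panicked");
        // ...but the write itself is awaited here, on this one task, so the
        // packets reach DTLS in exactly the order they were produced.
        dtls.write_all(&raw).await?;
    }
    Ok(())
}
```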