Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve data channel (SCTP) performance #62

Closed
enobufs opened this issue Sep 25, 2019 · 31 comments
Closed

Improve data channel (SCTP) performance #62

enobufs opened this issue Sep 25, 2019 · 31 comments

Comments

@enobufs
Copy link
Member

enobufs commented Sep 25, 2019

Motivation

SCTP (datachannel) performance is perceived very low particularly over a real network with latency with limited bandwidth. No one appears to have properly measured performance. We should identify underlying problems causing the slowness with correct measurement, then tackle those to improve it.

  1. Benchmark
  2. Improve performance (we may create separate tickets, and this issue will be used as an umprella ticket for those)
@enobufs
Copy link
Member Author

enobufs commented Oct 1, 2019

Preparation in progress. I am going to use this tool. My focus is to measure the performance of SCTP protocol (not really about the bufferedAmount control).

@enobufs
Copy link
Member Author

enobufs commented Oct 6, 2019

The first pass of a benchmark on pion/[email protected] in comparison with TCP on various RTT (using Network Link Conditioner).

Conditions:

  • SCTP recv buf size: 64KB
  • SCTP message size: 32KB
  • TCP recv buf size: 208KB (cat /proc/sys/net/core/rmem_default)

pion_sctp@v1 6 13 and TCP

Obviously:

  • pion/sctp's congestion window is not growing for some reason
  • TCP's perceived receive buffer size is MUCH LARGER than the default 208KB. (it must be dynamically increasing!)

@enobufs
Copy link
Member Author

enobufs commented Oct 7, 2019

Further investigation revealed:

  • TCP dynamically increases the receiver buffer size
  • Pion/sctp's congestion window (cwnd) does not grow fast enough
  • Retransmission receovery is very very slow when massive packet loss happens

I attempted to increase the receive buffer size from 64KB (current) up to 1MB, but throughput did not improve beyond the size of 128KB.

Pion/sctp uses delayed Ack to reduce the number of Acks. The downside of it is since Ack is delayed, perceived RTT becomes large, and it would affect the overall throughput. To mitigate that, it uses "Immdidate Ack" to request the receiver to generate Ack immediately. But I found that there was a bug that prevented the immediate-ack from being sent frequently enough. By fixing this (fix-1), it improved the speed of cwnd growth a lot.

My test environment involves WiFi network, and I am seeing packet loss constantly, and sometimes a large number of packets are lost (large enought the fast-recovery algorithm could not recover). When T3-Rtx timer fires, Pion/sctp tries to retransmit unack'ed packets but it was very very slow due to a bug. By fixing it (fix-2), it was able to recover from the massive loss as fast as what TCP seems to be doing.

By having both fix-1 and fix-2, Pion/sctp was able to yield close to the theoretical throughput:

SCTP Receive Buf Size and Throughput

RTT is set to 163 [msec] in the above measurement

Now I am comfortable with increasing the SCTP's receive buffer to 1MB.

enobufs added a commit that referenced this issue Oct 7, 2019
enobufs added a commit that referenced this issue Oct 7, 2019
Num of sacks has been slighly increased
Relates to #62
enobufs added a commit that referenced this issue Oct 7, 2019
enobufs added a commit that referenced this issue Oct 7, 2019
Num of sacks has been slighly increased
Relates to #62
@enobufs
Copy link
Member Author

enobufs commented Oct 7, 2019

Those fix-1 and fix-2 have landed in pion/sctp v1.7.0.
Next:

  • Benchmark throughput responses to packet loss
  • CPU / Mem profiles

@Sean-Der Sean-Der mentioned this issue Oct 8, 2019
56 tasks
@chrisprobst
Copy link

@Sean-Der @enobufs
First, thanks for the great work! We merged and tested pion 2.1.6, which includes your fix.
We would like to share our feedback and hopes for the future with you :-)

Context:

  • We use pion for P2P-CDN live streaming
  • Bandwidth-intensive use case, we always use negotiated and ordered data channels
  • We usually send binary messages (we fragment messages into 15 * 1024 sized packets) every 4 seconds (each message is roughly 50-150KB) to every connected peer

Our problem before 2.1.6 was the following:

  • Pion -> Pion was Ok-ish in terms of bandwidth (not best, but it worked)
  • Pion -> Browser was also Ok-ish
  • Browser -> Pion has always been a big problem, messages were received with high delays
    (We therefore deactivated matching for Pion / Browser )
  • Browser -> Browser works flawlessly

We hoped that this release might improve this situation, however, it turns out that the Browser -> Pion issue remains.

I could start a new issue but I somehow believe that it might be related to your performance improvements and maybe we just have to tune some "knobs" to get also the browsers on board.

If you need more information about our use case, I would be very happy to tell more about it.

Kind Regards,
Chris

@enobufs
Copy link
Member Author

enobufs commented Oct 10, 2019

Thanks @chrisprobst so much for the feedback! I personally have not looked closely at the interaction between pion and browser, just yet. I will definitely look into it as soon as I can.

Let me ask you a few questions...

  • Which browser did you use?
  • "Browser->Pion issue remains" - how about other pairs? Was 2.1.6 better or worse than older versions?
  • Did you use any partial-reliability options?
  • "every 4 seconds (each message is roughly 50-150KB)" - Are sending the message at a constant rate?
  • What is your criteria to judge the performance? (delay, jitter, and/or throughput?)

@chrisprobst
Copy link

chrisprobst commented Oct 10, 2019

Thanks @enobufs for your quick response!

  • We use many different browsers, but for testing out this behavior we always use the latest Chrome browser. I am testing on macOS and Linux during development.

  • All other pairs are good enough for our use case, so we do not send like 100MB per second or so. Just the Browser->Pion direction is really slow. I was just able to reproduce this problem with a pion example itself:

  • You mean like removing "ordered"? Yes we tried unordered, however, the result was the same.
    [EDIT]

  • "every 4 seconds (each message is roughly 50-150KB)" - Are sending the message at a constant rate?: Pretty much. We operate on live streaming like HLS or DASH, so every 4-6 seconds a new video segment is created which we split in chunks and then transfer over to connected peers. The chunks themselves are not very large as I said. Usually 50 - 150 KB (more like 60KB on average)

  • What is your criteria to judge the performance? (delay, jitter, and/or throughput?): If a peer sends over a chunk of video content, it has to be delivered fast enough (I mean 50KB should not take like 2 seconds or so over the internet if you have a relatively good connection - even for mobile users)

Thanks for your time and investigating!

Best,
Chris

@enobufs
Copy link
Member Author

enobufs commented Oct 10, 2019

Browser->Pion direction is really slow.

Ahhh I see! I have some idea about what is going on. Pion's delayed ack might have a problem. Great information! Thank you!

@chrisprobst
Copy link

Browser->Pion direction is really slow.

Ahhh I see! I have some idea about what is going on. Pion's delayed ack might have a problem. Great information! Thank you!

This gives us hope! It would be extremely helpful for us to rely on this direction (Browser->Pion).

If you need help / further information, please ask! :-)

@chrisprobst
Copy link

Dear @enobufs, do you have an idea, how much time & effort this would take? We are really interested in this bugfix and would like to help. Is there a good starting point or is it just a tiny issue better solved by a core developer?

@enobufs
Copy link
Member Author

enobufs commented Oct 14, 2019

@chrisprobst I got some time today and will let you know my findings at the end of the day. (sorry, I was sick for the last few days... now I am feeling fine!)

@chrisprobst
Copy link

@enobufs Oh I wish you the very best! Your help is really appreciated!

@enobufs
Copy link
Member Author

enobufs commented Oct 16, 2019

@chrisprobst I was distracted by another bug (unrelated) to fix, but I got my data channel between pion and chrome working began writing test cases. My time on weekdays is very limited, but I'm working on it, hoping to get something for you by this weekend.

@enobufs enobufs closed this as completed Oct 16, 2019
@enobufs enobufs reopened this Oct 16, 2019
@enobufs
Copy link
Member Author

enobufs commented Oct 16, 2019

oops - was a wrong button.

@enobufs
Copy link
Member Author

enobufs commented Oct 20, 2019

@chrisprobst
Sorry, it took me some time to stabilize my browser <-> pion test tool, but the good news is now I am seeing the problem with the browser -> pion throughput! I will fix that!

@enobufs
Copy link
Member Author

enobufs commented Oct 20, 2019

My findings:

  • Chrome sends SACK always immediately (no delayed-ack! as far as I can see)
  • Chrome DATA chunk never set "immediate ACK" bit
  • By reducing the ack delay from 200ms to 20ms, the throughput drastically improved.
  • Pion does not follow what RFC says "an acknowledgement SHOULD be generated for at least every second packet". Pion only waits for 200ms of ack timeout or the immediate ack bit set on the DATA chunk sent by the peer. Because Chrome does not use immediate Ack, pion always delays the ack and ends up the very bad throughput for incoming traffic.

RFC 4960 Sec. 6.2:

   The guidelines on delayed acknowledgement algorithm specified in
   Section 4.2 of [RFC2581] SHOULD be followed.  Specifically, an
   acknowledgement SHOULD be generated for at least every second packet
   (not every second DATA chunk) received, and SHOULD be generated
   within 200 ms of the arrival of any unacknowledged DATA chunk.  In
   some situations, it may be beneficial for an SCTP transmitter to be
   more conservative than the algorithms detailed in this document
   allow.  However, an SCTP transmitter MUST NOT be more aggressive than
   the following algorithms allow.

TODO:

  1. Modify the code to send SACK at every second packet.
  2. Consider stop using the immediate ACK.
  3. Benchmark (including a test with Firefox also)

@enobufs
Copy link
Member Author

enobufs commented Oct 20, 2019

Current throughput...
Throughput in each direction

@enobufs
Copy link
Member Author

enobufs commented Oct 20, 2019

Now I am sending the Ack at every second packet that hold DATA chunk.
Chrome to Pion throughput is very close to theoretical throughput!

Chrome to Pion Throughput Improvement

@Sean-Der
Copy link
Member

😱 that is an amazing graph, that is so exciting!!!

Amazing @enobufs I can't wait to see what people think

@enobufs
Copy link
Member Author

enobufs commented Oct 21, 2019

I have removed the use of the immediate SACK bit.
Measured the throughput with Chrome and Firefox. It all looks good to me.

Throughput with Browsers (with fix) (1)

enobufs added a commit that referenced this issue Oct 21, 2019
@enobufs
Copy link
Member Author

enobufs commented Oct 21, 2019

@chrisprobst
I'm pretty sure PR #79 would fix your case also. Please try if you have a chance!

enobufs added a commit that referenced this issue Oct 21, 2019
@chrisprobst
Copy link

chrisprobst commented Oct 21, 2019

@enobufs What a great work, I will test it out for sure! Short question, because I am not too familiar with the release management of pion. I was using the latest 2.1.6. I saw that there is now a 2.1.8, can I use it, will it contain the patch?

And one more time @enobufs, what a great work! We appreciate your work so much. Of course, we appreciate the work of all committers ;-).

@Sean-Der
Copy link
Member

@chrisprobst just tagged v2.1.9! This contains the patch.

We just run the whole test suite with updated libs (to make sure nothing regresses) so takes a little time to go across :) excited to hear how the testing goes!

@chrisprobst
Copy link

Very nice, thanks for the release.

@AeroNotix
Copy link
Contributor

Great job @enobufs !

@chrisprobst
Copy link

@enobufs @Sean-Der We did a lot of testing, we are happy with the results. No more delay, this solves a huge pain for us, really! We used the work-around to only match pion devices, because pion-pion always worked superb. This restriction can now be removed and our initial production tests show that it works as expected. Thanks guys!

@Sean-Der
Copy link
Member

That is amazing news, @enobufs is really a rockstar :D

@chrisprobst if you ever have ANYTHING that can be better I would love to know. I am sure people are hitting issues, but just not telling us. I am ready to jump on anything I can :)

@enobufs
Copy link
Member Author

enobufs commented Oct 29, 2019

All Pion need is love!

@chrisprobst
Copy link

@enobufs @Sean-Der Thank you very much, we appreciate this offer a lot. For us, Pion is an enabling technology, because Go is so easy to cross compile on mobile, there are simply no other options for WebRTC cross-device development. Fyi: Pion will soon be used for a very large radio station in South America, we expect a lot of devices using Pion to hear radio every day. We will keep you posted. Thanks again for your hard work and good product. Of, course this goes to all other committers as well.

@AeroNotix
Copy link
Contributor

@chrisprobst could you share any results of using the latest Pion versions in your use-case? I have a window to update the versions I use in production and would love to know how much better things perform.

@enobufs
Copy link
Member Author

enobufs commented Jun 18, 2020

I think this is done. Closing.

@enobufs enobufs closed this as completed Jun 18, 2020
enobufs added a commit that referenced this issue Jul 6, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit that referenced this issue Jul 6, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit that referenced this issue Jul 13, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit that referenced this issue Jul 15, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit that referenced this issue Jul 15, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants