
[SCTP] limit the bytes in the PendingQueue by using a semaphore #367

Merged
merged 23 commits into webrtc-rs:master on Jan 2, 2023

Conversation

KillingSpark
Contributor

@KillingSpark KillingSpark commented Dec 15, 2022

As discussed in #360, the pending queue can grow indefinitely if the sender writes packets faster than the association is able to transmit them.

This PR solves this by enforcing a limit on the pending queue, which blocks the sender until enough space is free.
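For illustration, here is a minimal sketch of the idea (a hypothetical `BoundedPendingQueue`, not the code merged in this PR): queued bytes are counted as semaphore permits, `push` waits until the budget has room, and the transmit path returns the permits when it pops a chunk.

```rust
use std::collections::VecDeque;
use tokio::sync::{Mutex, Semaphore};

// Illustrative cap, not the crate's actual constant.
const QUEUE_BYTES_LIMIT: usize = 128 * 1024;

struct BoundedPendingQueue {
    chunks: Mutex<VecDeque<Vec<u8>>>,
    budget: Semaphore, // one permit == one queued byte
}

impl BoundedPendingQueue {
    fn new() -> Self {
        Self {
            chunks: Mutex::new(VecDeque::new()),
            budget: Semaphore::new(QUEUE_BYTES_LIMIT),
        }
    }

    /// Waits (asynchronously) until the queue has room for `chunk`, then enqueues it.
    async fn push(&self, chunk: Vec<u8>) {
        // `forget` keeps the permits consumed after the guard is dropped;
        // they only come back via `add_permits` in `pop`.
        // Caveat: a chunk larger than QUEUE_BYTES_LIMIT would wait forever here --
        // exactly the oversized-message issue discussed further down in this thread.
        self.budget
            .acquire_many(chunk.len() as u32)
            .await
            .expect("semaphore is never closed")
            .forget();
        self.chunks.lock().await.push_back(chunk);
    }

    /// Called by the transmit path; returning the permits unblocks waiting writers.
    async fn pop(&self) -> Option<Vec<u8>> {
        let chunk = self.chunks.lock().await.pop_front()?;
        self.budget.add_permits(chunk.len());
        Some(chunk)
    }
}
```

The crate's pending queue keeps separate ordered/unordered queues and sync accessors, so the merged code looks different, but the backpressure mechanism is the same: writers park on a byte budget instead of letting the queue grow without bound.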

@codecov

codecov bot commented Dec 15, 2022

Codecov Report

Base: 59.97% // Head: 59.87% // Decreases project coverage by 0.09% ⚠️

Coverage data is based on head (ee22bc5) compared to base (225cec0).
Patch coverage: 56.68% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #367      +/-   ##
==========================================
- Coverage   59.97%   59.87%   -0.10%     
==========================================
  Files         504      504              
  Lines       48070    48000      -70     
  Branches    12519    12516       -3     
==========================================
- Hits        28828    28739      -89     
- Misses      10021    10025       +4     
- Partials     9221     9236      +15     
Impacted Files Coverage Δ
sctp/examples/ping.rs 0.00% <0.00%> (ø)
sctp/examples/pong.rs 0.00% <0.00%> (ø)
sctp/examples/throughput.rs 0.00% <0.00%> (ø)
sctp/src/error.rs 41.66% <ø> (ø)
sctp/src/queue/queue_test.rs 65.12% <14.28%> (-3.59%) ⬇️
.../association_internal/association_internal_test.rs 60.67% <25.00%> (-0.42%) ⬇️
sctp/src/stream/stream_test.rs 49.60% <40.00%> (+0.80%) ⬆️
sctp/src/queue/pending_queue.rs 33.33% <60.00%> (-55.24%) ⬇️
sctp/src/association/association_test.rs 59.50% <72.22%> (+0.55%) ⬆️
... and 9 more


@KillingSpark KillingSpark marked this pull request as draft December 15, 2022 09:44
@KillingSpark
Contributor Author

KillingSpark commented Dec 15, 2022

Converted to draft because this has some weird behaviour on my branch where I have all the performance PRs merged.

It causes the sender to no longer print its Send xxxx pkts lines but there is still throughput even long after it stops printing. I need to investigate what is happening there.

Apart from that this works well. Speeds are good (maybe even a little improved, probably because of less pressure on the allocator?). And the backpressure does its thing: the number of packets sent is close to the number of packets received.

Throughput: 108919170 Bytes/s, 1662 pkts, 1662 loops
Send 1733 pkts
Throughput: 80870190 Bytes/s, 1234 pkts, 1234 loops
Send 1666 pkts
Throughput: 137295825 Bytes/s, 2095 pkts, 2095 loops
Send 2097 pkts
Throughput: 109967730 Bytes/s, 1678 pkts, 1678 loops
Send 1680 pkts
Throughput: 134871030 Bytes/s, 2058 pkts, 2058 loops
Send 1672 pkts
Throughput: 137951175 Bytes/s, 2105 pkts, 2105 loops
Send 2126 pkts
Throughput: 54721725 Bytes/s, 835 pkts, 835 loops
Throughput: 1048560 Bytes/s, 16 pkts, 16 loops
Throughput: 137689035 Bytes/s, 2101 pkts, 2101 loops
Throughput: 139720620 Bytes/s, 2132 pkts, 2132 loops
Throughput: 138606525 Bytes/s, 2115 pkts, 2115 loops
Throughput: 52296930 Bytes/s, 798 pkts, 798 loops
Throughput: 139982760 Bytes/s, 2136 pkts, 2136 loops
Throughput: 113047875 Bytes/s, 1725 pkts, 1725 loops
Throughput: 137099220 Bytes/s, 2092 pkts, 2092 loops
Throughput: 110229870 Bytes/s, 1682 pkts, 1682 loops
Throughput: 138016710 Bytes/s, 2106 pkts, 2106 loops
Throughput: 139524015 Bytes/s, 2129 pkts, 2129 loops
Throughput: 137885640 Bytes/s, 2104 pkts, 2104 loops
Throughput: 139851690 Bytes/s, 2134 pkts, 2134 loops
Throughput: 111737175 Bytes/s, 1705 pkts, 1705 loops
Throughput: 124778640 Bytes/s, 1904 pkts, 1904 loops
Throughput: 138278850 Bytes/s, 2110 pkts, 2110 loops
Throughput: 138082245 Bytes/s, 2107 pkts, 2107 loops
Throughput: 109508985 Bytes/s, 1671 pkts, 1671 loops
Throughput: 136116195 Bytes/s, 2077 pkts, 2077 loops
Throughput: 49675530 Bytes/s, 758 pkts, 758 loops
Throughput: 137361360 Bytes/s, 2096 pkts, 2096 loops
Throughput: 111606105 Bytes/s, 1703 pkts, 1703 loops
Throughput: 137033685 Bytes/s, 2091 pkts, 2091 loops
Throughput: 108591495 Bytes/s, 1657 pkts, 1657 loops
Throughput: 106101165 Bytes/s, 1619 pkts, 1619 loops

Edit: this is even stranger: the receiver too stops printing at some point (way later), but netstat still shows traffic being delivered and the process is still using the same amount of CPU... wth

sctp/src/queue/pending_queue.rs (outdated review thread, resolved)

```rust
impl PushLimitSemaphore {
    /// blocks until the credits where sucessfully taken
    fn aquire(&self, credits: u64) {
```
Member

This is ultimately called from an async context, right? It's not good to block in a way that doesn't yield to the tokio runtime. In fact, if we manage to block all threads it's possible to deadlock the whole runtime, since no forward progress can be made to release any permits.

Contributor Author

As far as I can tell the stream presents a sync interface?

Contributor Author
@KillingSpark KillingSpark Dec 15, 2022

Hmmm, nope, you are right. I was fooled because write, write_sctp, and append are sync code. They are called a lot in async contexts.

So these should just become async and use the tokio semaphore?

Edit: The tokio semaphore does not really provide the interface we need; it only allows taking/releasing one permit at a time.

Edit 2: TIL tokio does not have a CondVar, so building the same thing with async primitives isn't easy.

Member
@k0nserv k0nserv Dec 15, 2022

> The tokio semaphore does not really provide the interface we need; it only allows taking/releasing one permit at a time.

It does? add_permits and acquire_many. acquire_many presumably blocks until there are enough permits, so it can act as a condvar.

Contributor Author

.... never mind. I was looking at outdated docs. I'll just update to this.

But it will still be a breaking change to the public API, is that OK?

Contributor Author

Actually, no, it doesn't have what we need. It releases the permits as soon as the guard is dropped.

We need to acquire capacity in one place and grant capacity in another place. I don't think that is doable with the API the tokio Semaphore presents.

Member

Hmm, you can move the permit guard around, but perhaps that's a fool's errand. Maybe ask in the tokio Discord if they have thoughts; otherwise your lock-based approach with a tokio::sync::Notify is appropriate.

Contributor Author
@KillingSpark KillingSpark Dec 15, 2022

I'll ask on their Discord if anyone has a better idea, but yeah, the Notify-based impl would satisfy me too.

I did not see the strange missing prints with the async version, so maybe this was actually caused by blocking something... I'll keep it running for a while and see if it occurs again.

In any case, what are your thoughts on breaking the API? Do you see a way around it?

Member

I don't care much about breaking because we are still pre-1.0.0 and are still figuring out how these things should work to be optimal.

It is interesting that this change would undo work by @melekes in #344, which was quite recent. That's not necessarily wrong, but it would be good to have @melekes's thoughts too.

Contributor

I don't mind bringing back async if it's needed. I changed it to sync because there was no reason for it to be async.

@KillingSpark KillingSpark marked this pull request as ready for review December 18, 2022 11:51
@KillingSpark
Contributor Author

KillingSpark commented Dec 20, 2022

Edit 3: Thought it might be nice to not make everyone read my ramblings when they don't have to. I misunderstood the AsyncWrite trait's doc language and got confused. You may disregard this comment entirely...

Edit:

Gotta go back on this. Looking at how tokio implements this for their TCP socket, they don't seem to care and flush is just a no-op: https://docs.rs/tokio/1.23.0/src/tokio/net/tcp/split.rs.html#374-405

This does NOT seem to match the language in the AsyncWrite trait: "Attempts to flush the object, ensuring that any buffered data reach their destination."

Edit 2: On second thought, I might have misinterpreted the trait's language: flush apparently doesn't need to wait until all data has been received, it just has to make sure it will be received at some point. So as long as the data has been put into the pending queue and the association lives long enough to send all the data, we should be fine...?

-- original --

I think the AsyncWrite implementation was and is broken with respect to the poll_flush and poll_shutdown functions.

  • Flush should wait until there are no more bytes buffered or in flight (which isn't easy to do in the current implementation outside of just busy looping).
  • shutdown should
    1. stream.shutdown(Shutdown::Write) and await the result, so new writes get rejected
    2. self.poll_flush() and await the result, so nothing is buffered anymore
    3. only now return Poll::Ready(Ok(()))

I'll need your opinion on how to proceed on this. I think it's clear that this is an entirely new can of worms that I'd rather not open in this PR. Secondly, I don't think this PR makes the situation significantly worse than before: poll_flush is now essentially a no-op, but it didn't do the right thing before this PR anyway.

Honestly, I'd recommend just deleting (or disabling) the AsyncWrite impl on the Stream. At least poll_flush() does not do what the trait says it must do. (I think the same goes for the data channel implementation.) I'm sorry if this sounds harsh, because it looks like a lot of work went into these, but the current stream just doesn't support a correct implementation of poll_flush.

I thought about how to fix this, but it definitely isn't easy, because you need to tie the amount of buffered / in-flight bytes to some kind of signalling that needs to be reset if new bytes come in before anyone has seen the signal.

To summarize the options in order of my preference:

  1. Delete/Disable the AsyncWrite impl for Stream and Datachannel until Stream is able to support a flush and a shutdown
  2. Keep them as is, and maybe put a warning somewhere?
  3. Fix them in this PR

Sorry that this PR is such a mess with regards to the AsyncWrite impl, I am learning about this stuff as I go.
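For reference, a hedged sketch of the drain-based signalling described above (a hypothetical `Outstanding` helper, not the crate's Stream API): publish the number of buffered / in-flight bytes through a tokio watch channel so a flush can wait for it to hit zero without busy looping and without missing an update that lands while it is checking.

```rust
use tokio::sync::watch;

/// Tracks buffered / in-flight bytes; the write path calls `add`,
/// the transmit/ack path calls `ack`.
struct Outstanding {
    tx: watch::Sender<usize>,
}

impl Outstanding {
    fn new() -> (Self, watch::Receiver<usize>) {
        let (tx, rx) = watch::channel(0usize);
        (Self { tx }, rx)
    }

    fn add(&self, bytes: usize) {
        self.tx.send_modify(|n| *n += bytes);
    }

    fn ack(&self, bytes: usize) {
        self.tx.send_modify(|n| *n -= bytes);
    }
}

/// What a spec-conforming flush could await: no buffered or in-flight bytes left.
async fn wait_drained(rx: &mut watch::Receiver<usize>) {
    // `borrow_and_update` marks the current value as seen, so `changed` cannot
    // miss a transition to zero between the check and the await.
    while *rx.borrow_and_update() > 0 {
        if rx.changed().await.is_err() {
            break; // sender side is gone; nothing more will drain
        }
    }
}
```

poll_flush would then poll a future like this, and poll_shutdown would first reject new writes and only return Ready once the same condition holds.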

@melekes
Contributor

melekes commented Dec 21, 2022

> This limit should probably be configurable by the user, right?

I'd recommend being critical about adding more config options, as they tend to grow indefinitely and most users don't change the defaults anyway.

In this case, if we know the upper bound of the UDP socket buffer, then can't we deduce the max queue length from it?

@KillingSpark
Contributor Author

> I'd recommend being critical about adding more config options, as they tend to grow indefinitely and most users don't change the defaults anyway.
>
> In this case, if we know the upper bound of the UDP socket buffer, then can't we deduce the max queue length from it?

Agreed. I think finding out the upper limit for the UDP sockets is platform-dependent. One can get the current sendbuf size, but I don't know if that is a good value; it might be too little for optimal throughput. Just playing around with the limit indicates that as little as 128 kB as the upper limit seems to be enough to sustain maximum throughput. macOS seems to use this value as the TCP send buffer size, according to sysctl -n net.inet.tcp.sendspace.
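A sketch of what "deduce it from the socket" could look like, using the socket2 crate (an assumption for illustration; it is not part of this PR): read SO_SNDBUF from the kernel and never go below a floor around the 128 kB that sustained full throughput in the experiment above.

```rust
use socket2::Socket;
use std::net::UdpSocket;

/// Derive a pending-queue byte limit from the kernel's UDP send buffer,
/// clamped to a floor that keeps throughput up.
fn queue_limit_from_socket(sock: UdpSocket) -> std::io::Result<usize> {
    const FLOOR: usize = 128 * 1024; // ~128 kB was enough in the tests above

    let raw = Socket::from(sock);
    // SO_SNDBUF as reported by the OS (Linux reports the doubled value).
    let sndbuf = raw.send_buffer_size()?;
    Ok(sndbuf.max(FLOOR))
}
```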

@KillingSpark
Contributor Author

KillingSpark commented Dec 21, 2022

Just realized this has a bug with messages that are bigger than the limit: they will never be able to acquire enough permits.

But the append function is used with this comment: // NOTE: append is used here instead of push in order to prevent chunks interlacing. Is this actually important? If not, I'd change this to append the chunks in bundles of size <= limit / 2 so smooth operation is possible even in the presence of big packets.

Edit:
https://www.rfc-editor.org/rfc/rfc4960#section-3.3

   When a user message is fragmented into multiple chunks, the TSNs are
   used by the receiver to reassemble the message.  This means that the
   TSNs for each fragment of a fragmented user message MUST be strictly
   sequential.

So, this is important.

@melekes
Contributor

melekes commented Dec 21, 2022

> Is this actually important?

I think if chunks interlace, the reader won't be able to reconstruct the message and will skip it entirely.

@KillingSpark
Contributor Author

Yeah, the RFC states that the fragments of one message must have sequential TSNs.

Fixed it by serializing the access to the semaphore. This whole thing is kinda wonky now, but I hope my comments in the code explain what's going on.
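Roughly, the fix amounts to something like the following sketch (hypothetical names, not the merged code): a mutex serializes access to the byte budget so the fragments of one message are appended back-to-back (keeping their TSNs sequential), and the permits requested for one message are capped at the limit so an oversized message occupies the whole budget for a while instead of waiting forever.

```rust
use tokio::sync::{Mutex, Semaphore};

const QUEUE_BYTES_LIMIT: u32 = 128 * 1024; // illustrative

struct AppendGate {
    serialize: Mutex<()>, // only one writer may interact with the budget at a time
    budget: Semaphore,    // permits == bytes the queue may still accept
}

impl AppendGate {
    async fn append_message(&self, fragments: &[Vec<u8>]) {
        // Hold the mutex across acquisition *and* the append so another writer
        // cannot interleave its chunks between this message's fragments
        // (RFC 4960: fragment TSNs must be strictly sequential).
        let _guard = self.serialize.lock().await;

        let bytes: u32 = fragments.iter().map(|f| f.len() as u32).sum();
        // Never request more than the budget can ever hold; a message bigger
        // than the limit simply takes the whole budget until it drains.
        let want = bytes.min(QUEUE_BYTES_LIMIT);
        self.budget
            .acquire_many(want)
            .await
            .expect("semaphore is never closed")
            .forget();

        // ... push all fragments onto the pending queue here, still under `_guard` ...
    }
}
```

The sem_lock / permits.unwrap().forget() hunks quoted in the review comments below appear to be the shape this took in the PR.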

Contributor
@melekes melekes left a comment

Thanks for your work 👍 While I do agree with the general direction, this PR brings more complexity, which we should try to avoid. A possible solution is to make the inner pending queues non-blocking, which may require some architecture changes. However, I believe we should still try to seek a solution that is easier to maintain and understand for future contributors.

Outdated review threads (resolved): sctp/CHANGELOG.md (×2), sctp/src/stream/mod.rs
@KillingSpark
Contributor Author

KillingSpark commented Dec 21, 2022

> A possible solution is to make the inner pending queues non-blocking, which may require some architecture changes.

I don't agree. The point of this PR is that the Association might be sending more slowly than the application is pushing new data into it.

If this situation persists, the data will accumulate in the pending queue. This leads to more and more memory being allocated, and in the end the OOM killer will kick in.

Edit: to make my point clearer: at some point the association needs to apply backpressure to the application. Otherwise, somewhere in the system, the data will accumulate as described above. There might be other ways / better places to apply this concept, but it is necessary IMO.

Edit 2: one example of these situations is the throughput example that started all of these PRs. If this PR is not merged, the throughput example just sends packets as fast as possible, which overloads my 16 GB of RAM in seconds, causing the OOM killer to kick in.

KillingSpark and others added 2 commits December 21, 2022 16:33
Co-authored-by: Anton <[email protected]>
@melekes
Contributor

melekes commented Dec 22, 2022

> Edit: to make my point clearer: at some point the association needs to apply backpressure to the application. Otherwise, somewhere in the system, the data will accumulate as described above. There might be other ways / better places to apply this concept, but it is necessary IMO.

I agree ☝️

To make my point more clear 😄: I'd rather see simple bounded channels / queues than the current version with locks & semaphores. But I guess we can refactor later.

sctp/src/stream/mod.rs (outdated review thread, resolved)
@KillingSpark
Contributor Author

> I'd rather see simple bounded channels / queues than the current version with locks & semaphores. But I guess we can refactor later.

I don't see how that would work, but if it does it would for sure be better than this weird mutex-semaphore combo.
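For what it's worth, a tiny sketch of the backpressure property a bounded channel gives (capacity counted in chunks here; byte accounting would need extra bookkeeping on top):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded: `send` only resolves once the consumer has made room.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(512);

    let producer = tokio::spawn(async move {
        for i in 0u8..=255 {
            // Suspends when the queue is full -> backpressure on the writer.
            tx.send(vec![i; 1200]).await.expect("receiver alive");
        }
        // `tx` is dropped here, which lets the receive loop below finish.
    });

    while let Some(chunk) = rx.recv().await {
        // Stand-in for the transmit path: popping frees a slot and wakes a sender.
        let _ = chunk.len();
    }
    producer.await.unwrap();
}
```

Whether the pending queue can actually be expressed as a plain FIFO like this is the open question above; the sketch only shows where the blocking would move to.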

```rust
        let mut ordered_queue = self.ordered_queue.write();
        ordered_queue.push_back(c);
    }
    drop(sem_lock);
```
Contributor

Are you sure we need to drop the lock manually? Why isn't it dropped automatically?

Contributor Author

Nope, it's not required, but for functions where the lock is not needed the whole time I like to do it like this. I know it would be dropped and unlocked at the end of the block anyway, but I think it communicates intent better.

If you don't like it, it can be removed.

Contributor Author

Especially since the construct with the sem_lock isn't required by any language-level checks to actually push to the queue, I think it is good to make sure the lock lives exactly as long as needed, but no longer.

```rust
    // unwrap ok because we never close the semaphore unless we have dropped self
    permits.unwrap().forget();
    self.append_unlimited(chunks, total_user_data_len);
    drop(sem_lock);
```
Contributor

ditto

```rust
        self.queue_len.fetch_add(1, Ordering::SeqCst);
    }

    drop(sem_lock);
```
Contributor

ditto

Outdated review threads (resolved): sctp/src/stream/mod.rs (×2)
@melekes melekes enabled auto-merge (squash) January 2, 2023 09:19
@melekes melekes merged commit daaf05d into webrtc-rs:master Jan 2, 2023
melekes added a commit that referenced this pull request Jan 5, 2023
* add comments for QUEUE_BYTES_LIMIT and QUEUE_APPEND_LARGE

* remove unnecessary drop(sem_lock)

lock will be dropped automatically

* use PADDING_MULTIPLE instead of 16

contract: padding_needed <= PADDING_MULTIPLE

Refs #364 and #367