Data channel fails to send packets at high frequency when working as an RTC server connected by a browser. #360
Comments
There is issue #101 which might be what you are running up against?
@KillingSpark, I tried the example in the issue, and it works well with 1024 bytes per send call. My question here is what exactly blocks the send.
I can only guess, I don't have deep knowledge of this library (I am just a user with a few small contributions :)). But the default for data channels is probably reliable transport, so if the bandwidth is used to the maximum, it will start buffering data, just like TCP would. Just out of interest, does this change if you open a data channel with an unreliable transport? Should be a setting on the JavaScript API somewhere.
I've got no idea but I'll try it out later.
It seems something is wrong within the SCTP implementation itself. I just wrote a simple POC to do a throughput test:
use clap::{App, AppSettings, Arg};
use std::io::Write;
use std::sync::Arc;
use tokio::net::UdpSocket;
use util::{conn::conn_disconnected_packet::DisconnectedPacketConn, Conn};
use webrtc_sctp::association::*;
use webrtc_sctp::chunk::chunk_payload_data::PayloadProtocolIdentifier;
use webrtc_sctp::stream::*;
use webrtc_sctp::Error;
fn main() -> Result<(), Error> {
env_logger::Builder::new()
.format(|buf, record| {
writeln!(
buf,
"{}:{} [{}] {} - {}",
record.file().unwrap_or("unknown"),
record.line().unwrap_or(0),
record.level(),
chrono::Local::now().format("%H:%M:%S.%6f"),
record.args()
)
})
.filter(None, log::LevelFilter::Warn)
.init();
let mut app = App::new("SCTP Throughput")
.version("0.1.0")
.about("An example of SCTP Server")
.setting(AppSettings::DeriveDisplayOrder)
.setting(AppSettings::SubcommandsNegateReqs)
.arg(
Arg::with_name("FULLHELP")
.help("Prints more detailed help information")
.long("fullhelp"),
)
.arg(
Arg::with_name("port")
.required_unless("FULLHELP")
.takes_value(true)
.long("port")
.help("use port ."),
);
let matches = app.clone().get_matches();
if matches.is_present("FULLHELP") {
app.print_long_help().unwrap();
std::process::exit(0);
}
let port1 = matches.value_of("port").unwrap().to_owned();
let port2 = port1.clone();
std::thread::spawn(|| {
tokio::runtime::Runtime::new()
.unwrap()
.block_on(async move {
let conn = DisconnectedPacketConn::new(Arc::new(
UdpSocket::bind(format!("127.0.0.1:{}", port1))
.await
.unwrap(),
));
println!("listening {}...", conn.local_addr().unwrap());
let config = Config {
net_conn: Arc::new(conn),
max_receive_buffer_size: 0,
max_message_size: 0,
name: "server".to_owned(),
};
let a = Association::server(config).await?;
println!("created a server");
let stream = a.accept_stream().await.unwrap();
println!("accepted a stream");
// set unordered = true and 10ms treshold for dropping packets
stream.set_reliability_params(true, ReliabilityType::Timed, 10);
let mut buff = vec![0u8; 65535];
let mut recv = 0;
let mut pkt_num = 0;
let mut loop_num = 0;
let mut now = tokio::time::Instant::now();
while let Ok(n) = stream.read(&mut buff).await {
recv += n;
if n != 0 {
pkt_num += 1;
}
loop_num += 1;
if now.elapsed().as_secs() == 1 {
println!(
"Throughput: {} Bytes/s, {} pkts, {} loops",
recv, pkt_num, loop_num
);
now = tokio::time::Instant::now();
recv = 0;
loop_num = 0;
pkt_num = 0;
}
}
Result::<(), Error>::Ok(())
})
});
std::thread::spawn(|| {
tokio::runtime::Runtime::new()
.unwrap()
.block_on(async move {
let conn = Arc::new(UdpSocket::bind("0.0.0.0:0").await.unwrap());
conn.connect(format!("127.0.0.1:{}", port2)).await.unwrap();
println!("connecting {}..", format!("127.0.0.1:{}", port2));
let config = Config {
net_conn: conn,
max_receive_buffer_size: 0,
max_message_size: 0,
name: "client".to_owned(),
};
let a = Association::client(config).await.unwrap();
println!("created a client");
let stream = a
.open_stream(0, PayloadProtocolIdentifier::Binary)
.await
.unwrap();
println!("opened a stream");
// Use a zero-initialised buffer; reading memory exposed via `set_len` without initialisation is UB.
let buf = vec![0u8; 65535];
let mut now = tokio::time::Instant::now();
let mut pkt_num = 0;
while stream.write(&buf.clone().into()).await.is_ok() {
pkt_num += 1;
if now.elapsed().as_secs() == 1 {
println!("Send {} pkts", pkt_num);
now = tokio::time::Instant::now();
pkt_num = 0;
}
}
Result::<(), Error>::Ok(())
})
});
loop {}
}
And got logs like the following:
whatever the … But if I remove the …
My UDP snippets were like this (using the same imports as the SCTP example above):
fn main() -> Result<(), Error> {
env_logger::Builder::new()
.format(|buf, record| {
writeln!(
buf,
"{}:{} [{}] {} - {}",
record.file().unwrap_or("unknown"),
record.line().unwrap_or(0),
record.level(),
chrono::Local::now().format("%H:%M:%S.%6f"),
record.args()
)
})
.filter(None, log::LevelFilter::Warn)
.init();
let mut app = App::new("SCTP Throughput")
.version("0.1.0")
.about("An example of SCTP Server")
.setting(AppSettings::DeriveDisplayOrder)
.setting(AppSettings::SubcommandsNegateReqs)
.arg(
Arg::with_name("FULLHELP")
.help("Prints more detailed help information")
.long("fullhelp"),
)
.arg(
Arg::with_name("port")
.required_unless("FULLHELP")
.takes_value(true)
.long("port")
.help("use port ."),
);
let matches = app.clone().get_matches();
if matches.is_present("FULLHELP") {
app.print_long_help().unwrap();
std::process::exit(0);
}
let port1 = matches.value_of("port").unwrap().to_owned();
let port2 = port1.clone();
std::thread::spawn(|| {
tokio::runtime::Runtime::new()
.unwrap()
.block_on(async move {
let conn = UdpSocket::bind(format!("127.0.0.1:{}", port1))
.await
.unwrap();
println!("listening {}", format!("127.0.0.1:{}", port1));
let mut buff = vec![0u8; 65535];
let mut recv = 0;
let mut pkt_num = 0;
let mut now = tokio::time::Instant::now();
while let Ok(n) = conn.recv(&mut buff).await {
recv += n;
if n != 0 {
pkt_num += 1;
}
if now.elapsed().as_secs() == 1 {
println!("Throughput: {} Bytes/s, {} pkts", recv, pkt_num);
now = tokio::time::Instant::now();
recv = 0;
pkt_num = 0;
}
}
Result::<(), Error>::Ok(())
})
});
std::thread::spawn(|| {
tokio::runtime::Runtime::new()
.unwrap()
.block_on(async move {
let conn = UdpSocket::bind("0.0.0.0:0").await.unwrap();
println!("Connect to {}", format!("127.0.0.1:{}", port2));
conn.connect(format!("127.0.0.1:{}", port2)).await.unwrap();
println!("Connected");
// Use a zero-initialised buffer; reading memory exposed via `set_len` without initialisation is UB.
let buf = vec![0u8; 16384];
let mut now = tokio::time::Instant::now();
let mut pkt_num = 0;
while conn.send(&buf).await.is_ok() {
pkt_num += 1;
if now.elapsed().as_secs() == 1 {
println!("Send {} pkts", pkt_num);
now = tokio::time::Instant::now();
pkt_num = 0;
}
}
Result::<(), Error>::Ok(())
})
});
loop {}
}
With this I think we can rule out the possibility that tokio is the problem. BRs.
Ok so I did some more digging and need to take back what I wrote earlier. It is NOT the receive queue of the socket filling up. Two things I found:
So there are two separate problems at play here:
More info: it seems like the ACK mechanism is borked. I just did dumb println! debugging in …
Note that Send/Handle/ACK seem to match pretty well while ACKRCV is lagging behind significantly. So either …
Further investigation shows: it is very likely option number 2. Now to find out why...
Yup, it's lock contention. Specifically on the Mutex around the AssociationInternal struct. "Occupied" time in the readloop measures the time used to process a SACK, and in the writeloop it measures gathering packets to send. "Taking" time just measures how long it took to actually lock the mutex before doing the above operation. It seems pretty clear to me that the SACK processing is stalled by the lock contention. The solution to this is probably non-trivial. Maybe instead of making the writeloop gather packets to send, the packets should be put into a channel, either when they are queued and can be sent immediately within the rwnd, or when a SACK arrives that increases the rwnd?
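As a purely illustrative sketch of that idea (this is not how the crate is structured, and the names here are made up): already-marshalled packets could be pushed into a bounded channel as soon as the rwnd allows them, and a separate task that owns the socket drains that channel, so the actual sending never holds the association mutex:

use std::sync::Arc;
use bytes::Bytes;
use tokio::net::UdpSocket;
use tokio::sync::mpsc;

// Hypothetical decoupled sender: the association pushes ready-to-send,
// already-marshalled packets into the channel (when they are queued within
// the rwnd, or when a SACK opens the window) and this task drains them
// without ever touching the association mutex.
async fn sender_task(socket: Arc<UdpSocket>, mut rx: mpsc::Receiver<Bytes>) {
    while let Some(packet) = rx.recv().await {
        if let Err(err) = socket.send(packet.as_ref()).await {
            eprintln!("send failed: {err}");
            break;
        }
    }
}

fn spawn_sender(socket: Arc<UdpSocket>) -> mpsc::Sender<Bytes> {
    // Bounded channel: if the socket can't keep up, pushing blocks instead of
    // buffering without limit.
    let (tx, rx) = mpsc::channel(1024);
    tokio::spawn(sender_task(socket, rx));
    tx
}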
Modified code for better understanding of the measurements above:
async fn read_loop(
name: String,
bytes_received: Arc<AtomicUsize>,
net_conn: Arc<dyn Conn + Send + Sync>,
mut close_loop_ch: broadcast::Receiver<()>,
association_internal: Arc<Mutex<AssociationInternal>>,
) {
log::debug!("[{}] read_loop entered", name);
let mut buffer = vec![0u8; RECEIVE_MTU];
let mut done = false;
let mut n;
while !done {
tokio::select! {
_ = close_loop_ch.recv() => break,
result = net_conn.recv(&mut buffer) => {
match result {
Ok(m) => {
n=m;
}
Err(err) => {
log::warn!("[{}] failed to read packets on net_conn: {}", name, err);
break;
}
}
}
};
// Make a buffer sized to what we read, then copy the data we
// read from the underlying transport. We do this because the
// user data is passed to the reassembly queue without
// copying.
log::debug!("[{}] recving {} bytes", name, n);
let inbound = Bytes::from(buffer[..n].to_vec());
bytes_received.fetch_add(n, Ordering::SeqCst);
{
let x = std::time::Instant::now();
let mut ai = association_internal.lock().await;
if inbound.len() < 1200 {
eprintln!("Taking readloop lock: {}us", x.elapsed().as_micros());
}
let x = std::time::Instant::now();
if let Err(err) = ai.handle_inbound(&inbound).await {
log::warn!("[{}] failed to handle_inbound: {:?}", name, err);
done = true;
}
if inbound.len() < 1200 {
eprintln!("readloop lock occupied: {}us", x.elapsed().as_micros());
}
}
}
{
let mut ai = association_internal.lock().await;
if let Err(err) = ai.close().await {
log::warn!("[{}] failed to close association: {:?}", name, err);
}
}
log::debug!("[{}] read_loop exited", name);
}
async fn write_loop(
name: String,
bytes_sent: Arc<AtomicUsize>,
net_conn: Arc<dyn Conn + Send + Sync>,
mut close_loop_ch: broadcast::Receiver<()>,
association_internal: Arc<Mutex<AssociationInternal>>,
mut awake_write_loop_ch: mpsc::Receiver<()>,
) {
log::debug!("[{}] write_loop entered", name);
let mut done = false;
while !done {
//log::debug!("[{}] gather_outbound begin", name);
let (raw_packets, mut ok) = {
let x = std::time::Instant::now();
let mut ai = association_internal.lock().await;
eprintln!("Taking Writeloop lock: {}us", x.elapsed().as_micros());
let x = std::time::Instant::now();
let r = ai.gather_outbound().await;
eprintln!("Writeloop lock occupied: {}us", x.elapsed().as_micros());
r
};
//log::debug!("[{}] gather_outbound done with {}", name, raw_packets.len());
for raw in &raw_packets {
log::debug!("[{}] sending {} bytes", name, raw.len());
if let Err(err) = net_conn.send(raw).await {
log::warn!("[{}] failed to write packets on net_conn: {}", name, err);
ok = false;
break;
} else {
bytes_sent.fetch_add(raw.len(), Ordering::SeqCst);
}
//log::debug!("[{}] sending {} bytes done", name, raw.len());
}
if !ok {
break;
}
if raw_packets.is_empty() {
//log::debug!("[{}] wait awake_write_loop_ch", name);
tokio::select! {
_ = awake_write_loop_ch.recv() =>{}
_ = close_loop_ch.recv() => {
done = true;
}
};
}
//log::debug!("[{}] wait awake_write_loop_ch done", name);
}
{
let mut ai = association_internal.lock().await;
if let Err(err) = ai.close().await {
log::warn!("[{}] failed to close association: {:?}", name, err);
}
}
log::debug!("[{}] write_loop exited", name);
}
Edit: My current suspicion is that the marshalling of the packets is the culprit, because it is done while holding the lock. I think it is easy enough to fix this, and maybe performance will be good enough without bigger changes to the architecture.
Ok so my suspicion was right. Pulling the marshalling out from under the lock drastically reduces the total time the mutex is locked... but... Even if we trick tokio into not running the read_loop and write_loop on the same thread, we see something unfortunate:
537836 Bytes in 7789 microseconds is about 69,050,712 Bytes/s, and since the write_loop is not idling, send performance is entirely bottlenecked by: …
I can put together a PR that provides the behaviour above, which would allow optimizations on Packet::marshal to result in immediate throughput benefits. Doing a few low-hanging optimizations on the marshalling code gets me to this throughput:
95 MByte/s is still not great, but it's something. Changes to the write_loop:
async fn write_loop(
name: String,
bytes_sent: Arc<AtomicUsize>,
net_conn: Arc<dyn Conn + Send + Sync>,
mut close_loop_ch: broadcast::Receiver<()>,
association_internal: Arc<Mutex<AssociationInternal>>,
mut awake_write_loop_ch: mpsc::Receiver<()>,
) {
log::debug!("[{}] write_loop entered", name);
let mut done = false;
while !done {
//log::debug!("[{}] gather_outbound begin", name);
let (raw_packets, ok) = {
let x = std::time::Instant::now();
eprintln!("{name} Try taking writeloop lock");
let mut ai = association_internal.lock().await;
eprintln!(
"{name} Taking Writeloop lock: {}us",
x.elapsed().as_micros()
);
let x = std::time::Instant::now();
let r = ai.gather_outbound().await;
drop(ai);
eprintln!(
"{name} Writeloop lock occupied: {}us",
x.elapsed().as_micros()
);
r
};
let x = std::time::Instant::now();
eprintln!("{name} Run write body on {:?}", std::thread::current().id());
let name2 = name.clone();
let net_conn = Arc::clone(&net_conn);
let bytes_sent = Arc::clone(&bytes_sent);
// THIS IS IMPORTANT
// This task::spawn makes tokio spawn this on another thread, allowing the read_loop
// to make progress while we send out this batch of packets
// Return both the byte count and the success flag from the spawned task;
// mutating the captured copy of `ok` inside the `async move` block would
// not be visible to the outer loop, so a send error would never stop it.
let (bytes_sent, ok) = tokio::task::spawn(async move {
let mut ok = ok;
let mut b = 0;
for raw in raw_packets {
let raw = raw.marshal().unwrap();
if let Err(err) = net_conn.send(raw.as_ref()).await {
log::warn!("[{}] failed to write packets on net_conn: {}", name2, err);
ok = false;
break;
} else {
b += raw.len();
bytes_sent.fetch_add(raw.len(), Ordering::SeqCst);
}
//log::debug!("[{}] sending {} bytes done", name, raw.len());
}
(b, ok)
//log::debug!("[{}] gather_outbound done with {}", name, raw_packets.len());
})
.await
.unwrap();
if !ok {
break;
}
eprintln!(
"{name} Writeloop body took: {}us for {bytes_sent}Bytes",
x.elapsed().as_micros()
);
//log::debug!("[{}] wait awake_write_loop_ch", name);
let x = std::time::Instant::now();
tokio::select! {
_ = awake_write_loop_ch.recv() =>{}
_ = close_loop_ch.recv() => {
done = true;
}
};
eprintln!("{name} Writeloop slept for: {}us", x.elapsed().as_micros());
//log::debug!("[{}] wait awake_write_loop_ch done", name);
}
{
let mut ai = association_internal.lock().await;
if let Err(err) = ai.close().await {
log::warn!("[{}] failed to close association: {:?}", name, err);
}
}
log::debug!("[{}] write_loop exited", name);
}
TLDR for this whole issue:
Separately I noticed that the pending queue does not apply backpressure, so if the sender continuously sends faster than the connection is allowed to transmit, this queue will grow indefinitely and in the long run cause an OOM.
Way forward: @rainliu @k0nserv First of all, sorry for hijacking (and kinda spamming :D) an issue on your repo, I hope that's ok. I got a bit carried away. Do you have any objections to the changes I did in the write_loop (see comment directly above)? Obviously the … If not, I'd prepare two PRs: one for the changes in the write_loop and one for the optimizations to the marshal code. And probably a new issue for the pending queue not applying backpressure.
Awesome work! BRs
Even before the changes I could get ~70 MB/s in release mode, but the changes I did to get to 95 were pretty small; I'd guess there is even more potential for optimization. Still a nice improvement :)
Excellent research! Sounds like your changes are promising @KillingSpark. Please roll them up into PRs for review.
The three PRs combined have this effect for me:
Current master:
All three PRs combined:
Still not at pion levels, but the next big bottleneck now seems to be the CRC32 implementation. @HsuJv If you want to try this out: I pushed a branch with the current state of the three PRs merged here: https://github.com/KillingSpark/webrtc/tree/merged
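For context, SCTP's checksum is CRC32c (RFC 4960), and the usual way to make it cheap is a hardware-accelerated implementation. A tiny sketch of what that could look like, assuming the crc32c crate (this is an illustration, not what the code here currently uses):

use crc32c::crc32c;

// Sketch: compute the SCTP checksum over a marshalled packet whose 4-byte
// checksum field (offset 8..12 of the common header) has already been zeroed.
// The crc32c crate uses SSE4.2 / ARM CRC instructions when available and
// falls back to a software table otherwise.
fn sctp_checksum(packet_with_zeroed_checksum: &[u8]) -> u32 {
    crc32c(packet_with_zeroed_checksum)
}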
@KillingSpark is there a specific order of merging these that would be simpler than the others?
Also, can you please add changelog entries in your PRs?
Will do! The order of merging shouldn't matter; they should be independent. But hold off on merging, I have one more idea about the write_loop behaviour that I want to try.
Ok, the changelogs have been done and my idea didn't work out, so you can merge if you want to.
I tested with the release build in my WSL2 Arch (another PC), and the SCTP throughput increased from 140-200 MB/s to 400-500 MB/s, awesome!
Definitely should do that. I have some ideas for that as well (and prototyping shows it increases throughput even more), but I wanted to get these PRs merged before starting another one :)
Edit: never mind, I found a good solution and couldn't wait.
@melekes would you mind helping with some reviews? You have more subject-matter expertise in this area than I do.
I am a bit lost here to be honest. If I use the optimized SCTP marshalling, the performance just tanks on my Linux PC and I cannot figure out why. On the Mac this improves performance a lot. But it isn't really about the scheduling either as far as I can tell. tokio-console also doesn't reveal any glaring issues. Does anyone have ideas? HsuJv mentioned that in WSL this brought big improvements too, so Windows and Mac seem to see an improvement, but it seems very weird that on Linux this would cause such a drastic performance decrease... In the end it is a pretty straightforward optimization... If anyone wants to try: I am just running the throughput example on my "merged" branch https://github.com/KillingSpark/webrtc/tree/merged where all the PRs of this issue have been merged together. Trial and error has shown that it is the optimized marshalling that causes the performance drop.
On my Ubuntu 20.04 server (i7-7700, 16 GB RAM):
This is somewhat consistent with what I see on my machine, though my spikes are much rarer. I am on a Ryzen 3800. Does tokio employ different scheduling depending on the OS?
Fixed it. @HsuJv could you try the current state of the merged branch one more time just to confirm this?
Yes, it is smooth now at around 130 MB/s.
(My server always runs some heavy tasks, so it is expected that the performance is lower than on my WSL.)
@HsuJv @KillingSpark thanks for looking into this 👍 fantastic work
As discussed in #360 the lock on the internal association is very contended in high send-bandwidth situations. This PR achieves two things:
1. Pull the marshalling of packets outside of the critical section, thus reducing the time the lock is taken by the write loop
2. Schedule the marshalling and sending as a new task via tokio::task::spawn, which makes tokio schedule this on another thread, allowing the read loop to make progress in parallel while the write loop is working on that
This in itself does not really increase the bandwidth, but with improvements to the marshalling code itself, gains can be had now that were previously (at least partially) blocked by the lock contention. It should also improve situations where both sides send a lot of data, because then the write and read loops would both be very busy and fight for the lock even more.
As discussed in #360 the marshalling code has become a bottleneck in high-bandwidth sending situations. I found two places that had a big effect on the performance; the hot path for this situation is marshalling packets with exactly one data chunk in them. After this PR the marshalling is largely dominated by the CRC32 calculation, which is... not easy to speed up.
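To illustrate the general kind of optimization (the names and layout below are simplified assumptions, not the PR's actual code): for the one-data-chunk hot path, serializing the common header and the chunk into a single pre-sized buffer avoids intermediate allocations:

use bytes::{BufMut, Bytes, BytesMut};

// Hypothetical, simplified marshalling of "common header + one chunk":
// one up-front allocation sized exactly for the output, no per-chunk buffers.
fn marshal_single_chunk(src_port: u16, dst_port: u16, vtag: u32, chunk: &[u8]) -> Bytes {
    let mut buf = BytesMut::with_capacity(12 + chunk.len());
    buf.put_u16(src_port);
    buf.put_u16(dst_port);
    buf.put_u32(vtag);
    buf.put_u32(0); // checksum placeholder, filled in once CRC32c is computed
    buf.put_slice(chunk);
    buf.freeze()
}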
As discussed in #360, gathering packets to send is a big chunk of the work Association::write_loop does while in a critical section. This PR improves this by making the payload queue cheaper to push to. Previously a push did a full mergesort (O(n log n)) over all the in-flight TSNs; now it does a binary search (O(log n)).
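For illustration only (not the PR's code), pushing a TSN into an already-sorted queue via a binary search instead of re-sorting looks roughly like this; note that real TSNs use wrapping serial-number arithmetic, which the plain u32 comparison here ignores:

// Hypothetical sketch: keep the queue sorted by inserting at the position
// found via binary search (O(log n) to find it, plus the O(n) element shift),
// instead of sorting the whole queue on every push.
fn push_sorted(queue: &mut Vec<u32>, tsn: u32) {
    if let Err(pos) = queue.binary_search(&tsn) {
        queue.insert(pos, tsn);
    } // Ok(_) means the TSN is already queued; nothing to do.
}

fn main() {
    let mut in_flight = vec![10, 11, 13];
    push_sorted(&mut in_flight, 12);
    assert_eq!(in_flight, vec![10, 11, 12, 13]);
}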
As discussed in #360 the pending queue can grow indefinitely if the sender writes packets faster than the association is able to transmit them. This PR solves this by enforcing a limit on the pending queue. This blocks the sender until enough space is free.
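A minimal sketch of that kind of backpressure, assuming a byte budget guarded by a tokio semaphore (illustrative only, not the PR's actual implementation):

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical bounded pending queue: `push` blocks the writer until enough
// of the byte budget has been freed by previously transmitted chunks.
struct BoundedPending {
    budget: Arc<Semaphore>,
}

impl BoundedPending {
    fn new(limit_bytes: usize) -> Self {
        Self { budget: Arc::new(Semaphore::new(limit_bytes)) }
    }

    // Called on the user-facing write path; waits while the queue is full.
    async fn push(&self, chunk: Vec<u8>) {
        let needed = chunk.len() as u32;
        // Take permits out of the budget and keep them (forget) until the
        // chunk has actually been handed to the transport.
        self.budget.acquire_many(needed).await.unwrap().forget();
        // ... enqueue `chunk` for the write loop here ...
    }

    // Called once a chunk of `len` bytes has left the pending queue.
    fn mark_transmitted(&self, len: usize) {
        self.budget.add_permits(len);
    }
}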
Btw, worth keeping in mind how this work intersects with #136 and whether this potential move to a sans-IO version of SCTP helps improve the performance problems. In any case, excellent work on this 👍🏼
I think the only PR that really interferes with a sans-IO implementation would be the changes in #367; those would probably need to be lifted into the sctp-async part somehow. The other PRs should still be applicable to a sans-IO refactored version. Is this refactor still active? I would be interested in a pure-Rust sans-IO SCTP implementation to integrate into our own project. We currently use usrsctp via FFI, which works but is not ideal.
We were discussing it on Discord the other day and would like to move forward with switching out the implementation of SCTP with it. However, no one is committed to doing that work at the moment, so it would only happen when someone finds the time for it.
Hi all,
I'm trying to test the bandwidth of the data channel based on your example.
I didn't change much code except the data channel callbacks.
I tried to send as much as possible to the client when a data channel was created (shown in the code later).
And with a controllable delay, I tried send frequencies of 10 Hz, 20 Hz, 25 Hz, 40 Hz, 50 Hz, etc.
The problem is, when it goes over 40 Hz, the data received in the browser doesn't match the data I'm trying to send.
With the send frequency set to 40 Hz (65535 bytes per send call, i.e. about 2.6 MB/s), the throughput jitters sharply and the browser receives fewer packets than I sent.
The gap increases as the send frequency gets larger.
(Note: in the picture, 192.168.1.2 is my WebRTC server, which runs the code and sends the data.)
With the send frequency set to 25 Hz (65535 bytes per send call as well), the throughput is smooth and the browser receives all the data I sent.
I've got no idea what is going wrong; can anyone here help?
P.S.
Below are my code snippets, in case they are helpful.
I rewrote this code
webrtc/examples/examples/data-channels/data-channels.rs
Lines 117 to 155 in 20b59b7
to the following
And my JavaScript code is quite simple, as shown below.