"Task has lost its waker" and freeze when sending large packets #92

Closed
r58Playz opened this issue Jan 25, 2025 · 12 comments · Fixed by #96

@r58Playz

When sending large packets very quickly, there is a chance that the future completely locks up and loses its waker, according to tokio-console and console-subscriber.

I am using rustc 1.86.0-nightly (1e9b0177d 2025-01-24)
Reproduction code:

Cargo.toml:

[package]
name = "tokio-ws-reproduce"
version = "0.1.0"
edition = "2021"

[dependencies]
bytes = "1.9.0"
console-subscriber = "0.4.1"
futures = { version = "0.3.31", default-features = false, features = ["std", "async-await"] }
tokio = { version = "1.43.0", features = ["full"] }
tokio-websockets = { version = "0.11.0", features = ["client", "simd", "sha1_smol", "rand", "native-tls"] }

main.rs:

use bytes::Bytes;
use futures::{SinkExt, StreamExt, future::join};
use tokio_websockets::Message;

#[tokio::main]
async fn main() {
    console_subscriber::init();

    let (mut tx, mut rx) = tokio_websockets::ClientBuilder::new()
        .uri("ws://127.0.0.1:3000")
        .unwrap()
        .connect()
        .await
        .unwrap()
        .0
        .split();

    let len = std::env::args().nth(1).unwrap().parse().unwrap();
    let a = tokio::spawn(async move {
        let x = Bytes::from(vec![0; len]);
        let mut cnt = 0usize;
        loop {
            let x = Message::binary(x.clone());
            tx.send(x).await.unwrap();
            println!("sent {cnt}");
            cnt += 1;
        }
    });
    let b = tokio::spawn(async move {
        loop {
            rx.next().await;
        }
    });

    join(a, b).await.0.unwrap();
}

Run the echo-server example from this repo (commit b8b3418): cargo r -r --example echo_server -F server,sha1_smol

Then run the code: RUSTFLAGS='--cfg tokio_unstable' cargo r -r -- 51205

On my machine, it locks up after around 100-150 messages with a size of 51205. It also locks up with smaller message sizes, but not as quickly or reliably.

Once it locks up, check tokio-console:
[Image: tokio-console main view]
[Image: tokio-console detailed view]
The task location matches up with the tokio::spawn for the writer task.

@vilgotf
Collaborator

vilgotf commented Jan 25, 2025

I don't think this is a tws issue, as we don't do anything with wakers but pass them along. It instead seems to be this old issue about locking inside StreamExt::split: rust-lang/futures-rs#2500


I've been thinking about adding a true duplex API (created from two separate streams), but I don't know if that's something anyone wants, considering you'd be unable to use TLS¹; see rustls/rustls#288 about its challenges.

Footnotes

  1. Unless you're rocking ktls, which I'm not familiar with

@r58Playz
Author

r58Playz commented Jan 25, 2025

I tried without using StreamExt::split or any tasks at all, and it still locks up after 170-230 packets:

use bytes::Bytes;
use futures::SinkExt;
use tokio_websockets::Message;

#[tokio::main]
async fn main() {
    console_subscriber::init();

    let mut ws = tokio_websockets::ClientBuilder::new()
        .uri("ws://127.0.0.1:3000")
        .unwrap()
        .connect()
        .await
        .unwrap()
        .0;

    let len = std::env::args().nth(1).unwrap().parse().unwrap();

    let x = Bytes::from(vec![0; len]);
    let mut cnt = 0usize;
    loop {
        let x = Message::binary(x.clone());
        ws.send(x).await.unwrap();
        println!("sent {cnt}");
        cnt += 1;
    }
}

EDIT:
It works completely fine if I add ws.next().await.unwrap().unwrap(); after the ws.send() call. Not sure why.
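
For reference, a minimal sketch of that workaround (assuming the echo server answers every frame; ws.next() also requires StreamExt to be imported):

use futures::{SinkExt, StreamExt};

// Same send loop as above, but draining the echoed reply after every send
// so the peer's responses never pile up:
loop {
    let msg = Message::binary(x.clone());
    ws.send(msg).await.unwrap();
    // Read the echo back before sending the next frame.
    ws.next().await.unwrap().unwrap();
    println!("sent {cnt}");
    cnt += 1;
}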

@r58Playz
Author

Also, SplitStream and SplitSink use the two different halves of the BiLock, so I think that issue is unrelated as it's about sharing one half of the BiLock between two tasks.
https://github.com/rust-lang/futures-rs/blob/cc670fdf41703282c1c124fdaa4083637d460096/futures-util/src/stream/stream/split.rs#L119-L124

@Gelbpunkt
Owner

You cannot just keep sending data to a remote server that sends messages back without ever reading them. If I run your reproducer and check ss -tni, I see:

State               Recv-Q                Send-Q                                                       Local Address:Port                                                           Peer Address:Port                
ESTAB               5958437               2566774                                                          127.0.0.1:3000                                                              127.0.0.1:46514
	 cubic wscale:7,7 rto:206 backoff:10 rtt:5.153/10.262 ato:80 mss:47424 pmtu:65535 rcvmss:65483 advmss:65483 cwnd:10 bytes_sent:127818 bytes_acked:127818 bytes_received:8673032 segs_out:58 segs_in:174 data_segs_out:5 data_segs_in:151 send 736254609bps lastsnd:227166 lastrcv:226950 lastack:11894 pacing_rate 1472473496bps delivery_rate 25292800000bps delivered:6 app_limited busy:227478ms rwnd_limited:227478ms(100.0%) rcv_rtt:0.247 rcv_space:172237 rcv_ssthresh:5816165 notsent:2566774 minrtt:0.015 rcv_ooopack:1

Note the rwnd_limited:227478ms(100.0%). The connection has been entirely receiver-window-limited, meaning no further data can be sent because the receiver's buffer is full. You should read out the responses to fix this. Your problem is due to abusing the TCP stack, not to tokio-websockets.

@r58Playz
Author

I think the issue might be a combination of that and how BiLock is not fair, so nothing ever gets read as the write task keeps locking the whole stream. I'll close the issue now as it's not related to tokio-websockets.

@Gelbpunkt
Owner

Gelbpunkt commented Jan 26, 2025

and how BiLock is not fair, so nothing ever gets read as the write task keeps locking the whole stream

Ah, that makes a ton of sense, actually. And it would be a strong reason for proper full-duplex streams in the future, once the TLS libraries support it. As a workaround, try wrapping the call in a timeout and retrying later to give the receiver task some time?
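
A rough sketch of that timeout-and-retry idea (untested; note that a send cancelled by the timeout may leave the frame buffered in the sink, to be flushed on a later attempt):

use std::time::Duration;
use tokio::time::timeout;

loop {
    let msg = Message::binary(x.clone());
    // Abandon the send after 100 ms so the lock is released and the reader
    // task gets a chance to run, then back off briefly and retry.
    match timeout(Duration::from_millis(100), tx.send(msg)).await {
        Ok(res) => res.unwrap(),
        Err(_elapsed) => {
            tokio::time::sleep(Duration::from_millis(10)).await;
            continue;
        }
    }
    println!("sent {cnt}");
    cnt += 1;
}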

@r58Playz
Author

I seem to have mixed up two different issues, sorry.
I am still losing wakers, even with a fair mutex from parking_lot.

The rwnd_limited is at 0.6% according to ss -tni.


ESTAB 0      0                   127.0.0.1:44568              127.0.0.1:3000  users:(("tokio-ws-reprod",pid=420975,fd=9)) 
         cubic wscale:7,7 rto:203.333 rtt:0.074/0.11 ato:40 mss:65483 pmtu:65535 rcvmss:65483 advmss:65483 cwnd:22 ssthresh:18 bytes_sent:2740356570 bytes_acked:2740356571 bytes_received:2740142510 segs_out:53789 segs_in:53681 data_segs_out:53406 data_segs_in:53509 send 155743351351bps lastsnd:1150 lastrcv:1107 lastack:1107 pacing_rate 185949708904bps delivery_rate 174621333328bps delivered:53407 busy:569ms rwnd_limited:3ms(0.6%) rcv_rtt:1 rcv_space:5018482 rcv_ssthresh:6169963 minrtt:0.001 snd_wnd:6141952 rcv_wnd:6169984
ESTAB 0      0                   127.0.0.1:3000               127.0.0.1:44568 users:(("echo_server",pid=392729,fd=7))     
         cubic wscale:7,7 rto:206.666 rtt:4.428/8.794 ato:40 mss:65483 pmtu:65535 rcvmss:65483 advmss:65483 cwnd:19 bytes_sent:2740142510 bytes_acked:2740142510 bytes_received:2740356570 segs_out:53680 segs_in:53789 data_segs_out:53509 data_segs_in:53406 send 2247835592bps lastsnd:1104 lastrcv:1147 lastack:1104 pacing_rate 4495417376bps delivery_rate 7275888888bps delivered:53510 app_limited busy:939ms rcv_rtt:0.07 rcv_space:4762809 rcv_ssthresh:6141857 minrtt:0.001 snd_wnd:6169984 rcv_wnd:6141952

Here's the new reproducer:

Cargo.toml:

[package]
name = "tokio-ws-reproduce"
version = "0.1.0"
edition = "2021"

[dependencies]
bytes = "1.9.0"
console-subscriber = "0.4.1"
futures = { version = "0.3.31", default-features = false, features = ["std", "async-await"] }
parking_lot = "0.12.3"
tokio = { version = "1.43.0", features = ["full"] }
tokio-websockets = { version = "0.11.0", features = ["client", "simd", "sha1_smol", "rand", "native-tls"] }

main.rs:

use std::{pin::Pin, sync::Arc, task::Poll};

use bytes::Bytes;
use futures::{future::join, Sink, SinkExt, Stream, StreamExt};
use parking_lot::FairMutex;
use tokio::net::TcpStream;
use tokio_websockets::{MaybeTlsStream, Message, WebSocketStream};

pub struct SplitRead(Arc<FairMutex<WebSocketStream<MaybeTlsStream<TcpStream>>>>);
impl Stream for SplitRead {
    type Item = Result<Message, tokio_websockets::Error>;

    fn poll_next(
        self: Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
    ) -> Poll<Option<Self::Item>> {
        self.0.lock().poll_next_unpin(cx)
    }
}
pub struct SplitWrite(Arc<FairMutex<WebSocketStream<MaybeTlsStream<TcpStream>>>>);
impl Sink<Message> for SplitWrite {
    type Error = tokio_websockets::Error;

    fn poll_ready(
        self: Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
    ) -> Poll<Result<(), Self::Error>> {
        self.0.lock().poll_ready_unpin(cx)
    }

    fn start_send(self: Pin<&mut Self>, item: Message) -> Result<(), Self::Error> {
        self.0.lock().start_send_unpin(item)
    }

    fn poll_flush(
        self: Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
    ) -> Poll<Result<(), Self::Error>> {
        self.0.lock().poll_flush_unpin(cx)
    }

    fn poll_close(
        self: Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
    ) -> Poll<Result<(), Self::Error>> {
        self.0.lock().poll_close_unpin(cx)
    }
}

#[tokio::main]
async fn main() {
    console_subscriber::init();

    let inner = Arc::new(FairMutex::new(
        tokio_websockets::ClientBuilder::new()
            .uri("ws://127.0.0.1:3000")
            .unwrap()
            .connect()
            .await
            .unwrap()
            .0,
    ));

    let mut rx = SplitRead(inner.clone());
    let mut tx = SplitWrite(inner);

    let len = std::env::args().nth(1).unwrap().parse().unwrap();
    let a = tokio::spawn(async move {
        let x = Bytes::from(vec![0; len]);
        let mut cnt = 0usize;
        loop {
            let x = Message::binary(x.clone());
            tx.send(x).await.unwrap();
            println!("sent {cnt}");
            cnt += 1;
        }
    });
    let b = tokio::spawn(async move {
        loop {
            rx.next().await;
        }
    });

    join(a, b).await.0.unwrap();
}

@r58Playz reopened this Jan 26, 2025
@r58Playz
Author

r58Playz commented Jan 26, 2025

Commenting out this code seems to fix it; the reproducer hasn't locked up after a couple of minutes of constant writing:

// If there are pending items, try to flush the sink
if !self.frame_queue.is_empty() {
    _ = self.as_mut().poll_flush(cx)?;
}

The reader task calling writer functions on the inner stream while the writer task was mid-write caused the inner stream to replace and drop the writer task's waker. I'm not sure what the proper solution for this is, though.
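
To illustrate the hazard in isolation (a hand-rolled sketch, not tokio-websockets' actual internals): anything that stores a single Waker silently forgets the previously registered task when a second task polls it.

use std::task::{Context, Poll, Waker};

// Illustrative only: a write-readiness helper that remembers exactly one waker.
struct WriteInterest {
    waker: Option<Waker>,
}

impl WriteInterest {
    fn poll_writable(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        // If the writer task registered first and the reader task polls next
        // (e.g. via the flush-on-read path quoted above), the writer's waker
        // is overwritten here and that task is never woken again: the
        // "task has lost its waker" warning tokio-console reports.
        self.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}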

@vilgotf
Collaborator

vilgotf commented Jan 26, 2025

The reader task calling writer functions on the inner stream while the writer task was mid-write caused the inner stream to replace and drop the writer task's waker. I'm not sure what the proper solution for this is, though.

Right... Because WebSocketStream is Sync, we cannot assume that its Stream and Sink methods are called from the same task. That is unfortunate, as this assumption has some benefits. We could perhaps delegate the current implementation to a new API, with a warning that it registers for write notifications and should not be called concurrently while shared with other tasks.

This quote and discussion from tokio_rustls may be relevant too:

Why don't we write during poll_read?

We did this in the early days of tokio-rustls, but it caused some bugs.
We can solve these bugs with some workarounds, but this will cause performance degradation (reverse false wakeups).

And reverse writes will also prevent us from implementing full duplex in the future.

see tokio-rs/tls#40

@Gelbpunkt
Owner

Can you check whether e8d517e fixes the problem for you?

@r58Playz
Author

Yes, that commit fixes the problem.

@Gelbpunkt
Owner

Release version 0.11.1 includes the fix, thanks for reporting!
