Huge timeout in L1 QNS test #2088

Closed
larseggert opened this issue Sep 5, 2024 · 3 comments · Fixed by #2114 · May be fixed by #1998

larseggert (Collaborator) commented Sep 5, 2024

We're failing this L1 QNS test against the kwik server because the test takes longer than 300 s.

That seems to be because we set huge timeouts of ~90 seconds during the connection-close process, and so don't move on to the next connection in a reasonable time.

client  | 93s 96ms DEBUG Setting timeout of 19.971247ms
client  | 93s 96ms DEBUG stream 0 complete
client  | 93s 96ms DEBUG stream BiDi creatable
client  | 93s 96ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 93s 96ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 93s 96ms DEBUG TX blocked, profile=SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 93s 96ms DEBUG [Client 2cc1e081e1fe5325] delay duration 19.923328ms
client  | 93s 96ms DEBUG Setting timeout of 19.923328ms
client  | 93s 96ms DEBUG [Client 2cc1e081e1fe5325] State change from Confirmed -> Closing { error: Application(0), timeout: Instant { tv_sec: 667, tv_nsec: 656536104 } }
client  | 93s 96ms WARN Unhandled event StateChange(Closing { error: Application(0), timeout: Instant { tv_sec: 667, tv_nsec: 656536104 } })
client  | 93s 96ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 93s 96ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 93s 96ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 573, tv_nsec: 125867796 }
client  | 93s 96ms DEBUG [Client 2cc1e081e1fe5325] pn=3 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 42
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 13, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 93s 96ms DEBUG Setting timeout of 94.530598598s
sim     | Forwarding 82 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 48/149 (32.2%)
sim     | Forwarding 1352 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 48/150 (32.0%)
sim     | Forwarding 31 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 48/151 (31.8%)
sim     | Dropping 42 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 49/152 (32.2%)
sim     | Forwarding 24 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 34/135 (25.2%)
client  | 93s146ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 93s146ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 93s146ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 573, tv_nsec: 176033952 }
client  | 93s146ms DEBUG [Client 2cc1e081e1fe5325] pn=4 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 43
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 6283, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 93s147ms DEBUG Setting timeout of 94.48035613s
client  | 93s147ms DEBUG Setting timeout of 94.480342525s
sim     | Forwarding 43 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 49/153 (32.0%)
sim     | Forwarding 29 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 34/136 (25.0%)
client  | 93s178ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 93s178ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 93s178ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 573, tv_nsec: 207809755 }
client  | 93s178ms DEBUG [Client 2cc1e081e1fe5325] pn=5 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 43
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 10255, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 93s178ms DEBUG Setting timeout of 94.448622416s
client  | 93s178ms DEBUG Setting timeout of 94.44861391s
sim     | Dropping 43 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 50/154 (32.5%)
sim     | Forwarding 21 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 34/137 (24.8%)
client  | 108s111ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 108s111ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 108s111ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 588, tv_nsec: 140480140 }
client  | 108s111ms DEBUG [Client 2cc1e081e1fe5325] pn=6 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 45
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 1876839, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 108s111ms DEBUG Setting timeout of 79.515901617s
client  | 108s111ms DEBUG Setting timeout of 79.515885948s
sim     | Dropping 21 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 35/138 (25.4%)
sim     | Dropping 45 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 51/155 (32.9%)
sim     | Forwarding 204 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 35/139 (25.2%)
client  | 108s773ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 108s773ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 108s773ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 588, tv_nsec: 803084953 }
client  | 108s773ms DEBUG [Client 2cc1e081e1fe5325] pn=7 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 45
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 1959665, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 108s774ms DEBUG Setting timeout of 78.853327351s
client  | 108s774ms DEBUG Setting timeout of 78.853304208s
sim     | Forwarding 204 bytes 193.167.100.100:443 -> 193.167.0.100:42465, dropped 35/140 (25.0%)
client  | 108s774ms DEBUG [Client 2cc1e081e1fe5325] output_path send_profile SendProfile { limit: 1152, pto: None, probe: (), paced: false }
client  | 108s774ms DEBUG Building Short dcid Some(CID [8]: f734cc7d2e99a52a)
client  | 108s774ms DEBUG [Recvd-ap] immediate_ack at Instant { tv_sec: 588, tv_nsec: 804030443 }
client  | 108s774ms DEBUG [Client 2cc1e081e1fe5325] pn=8 type=Short pri-path:f734cc7d2e99a52a 0.0.0.0:42465->193.167.100.100:443 IpTos(Cs0, NotEct) len 45
client  |   TX -> Ack { largest_acknowledged: 2, ack_delay: 1959783, first_ack_range: 2, ack_ranges: [], ecn_count: None }
client  |   TX -> ConnectionClose { error_code: Application(0), frame_type: 0, reason_phrase: "kthxbye!" }
client  | 108s774ms DEBUG Setting timeout of 78.852422597s
client  | 108s774ms DEBUG Setting timeout of 78.852411897s
sim     | Forwarding 45 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 51/156 (32.7%)
sim     | Forwarding 45 bytes 193.167.0.100:42465 -> 193.167.100.100:443, dropped 51/157 (32.5%)
server  | QLog: done with 9840ca8b42f950a3.qlog
server  | QLog: done with d58f290ebd9bcfa9d5.qlog
server  | QLog: done with 2cc1e081e1fe5325.qlog
client  | 187s628ms DEBUG [Client 2cc1e081e1fe5325] State change from Closing { error: Application(0), timeout: Instant { tv_sec: 667, tv_nsec: 656536104 } } -> Closed(Application(0))
client  | 187s628ms INFO Closing timer expired
client  | 187s628ms DEBUG Output::None
client  | 187s628ms WARN Unhandled event StateChange(Closed(Application(0)))
client  | 187s628ms DEBUG Timer fired while closed
client  | 187s628ms DEBUG Output::None
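
For context, RFC 9000 (Section 10.2) recommends that an endpoint remain in the closing state for at least three times the current probe timeout (PTO), so an inflated RTT estimate translates directly into a ~94 s closing timer like the one in the log. A minimal sketch of that relationship, with hypothetical names rather than neqo's actual API:

```rust
use std::time::Duration;

// Sketch only (not neqo's API): RFC 9000 §10.2 recommends staying in the
// closing state for at least three times the current PTO. A PTO inflated
// to ~31.5 s by a bad RTT estimate yields the ~94.5 s timer seen above.
fn closing_period(pto: Duration) -> Duration {
    3 * pto
}

fn main() {
    let inflated_pto = Duration::from_secs_f64(31.51);
    println!("{:?}", closing_period(inflated_pto)); // ≈94.53 s
}
```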
larseggert (Collaborator, Author)

I wonder if the stack has always done this, and whether the old demo client simply did not wait until a previous connection was completely gone before starting the next one. Any idea, @mxinden?

larseggert (Collaborator, Author)

Similar issue here, PTO of ~17 seconds:

client  | 206s733ms DEBUG [Client 710a2ea0c5c12d7ba3d2c3c5e0ab45] output_path send_profile SendProfile { limit: 1252, pto: None, probe: (), paced: false }
client  | 206s733ms DEBUG Building Short dcid Some(CID [20]: 04afedde3e680e1abe8bf7ef01165b6e88757819)
client  | 206s733ms DEBUG TX blocked, profile=SendProfile { limit: 1252, pto: None, probe: (), paced: false }
client  | 206s733ms DEBUG [Client 710a2ea0c5c12d7ba3d2c3c5e0ab45] delay duration 17.26252894s
client  | 206s733ms DEBUG Setting timeout of 17.26252894s
sim     | Dropping 1352 bytes 193.167.0.100:38464 -> 193.167.100.100:443, dropped 238/762 (31.2%)
client  | 223s998ms DEBUG [LossRecovery] PTO timer fired for ap
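
For reference, RFC 9002 (Section 6.2.1) derives the PTO as smoothed_rtt + max(4 · rttvar, granularity) + max_ack_delay, doubled on each consecutive PTO, so an RTT estimate seeded from an oversized guess quickly produces timers in the tens of seconds. A hedged sketch of that formula (illustrative names, not neqo's internals):

```rust
use std::cmp::max;
use std::time::Duration;

// Illustrative only: the RFC 9002 §6.2.1 probe timeout with exponential
// backoff. Not neqo's actual implementation.
fn pto(smoothed_rtt: Duration, rttvar: Duration, max_ack_delay: Duration, backoff: u32) -> Duration {
    const GRANULARITY: Duration = Duration::from_millis(1);
    (smoothed_rtt + max(4 * rttvar, GRANULARITY) + max_ack_delay) * 2u32.pow(backoff)
}

fn main() {
    // A ~2 s smoothed RTT with two backoffs already lands in this range.
    let p = pto(Duration::from_secs(2), Duration::from_secs(1), Duration::from_millis(25), 2);
    println!("{p:?}"); // ≈24.1 s
}
```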

mxinden (Collaborator) commented Sep 6, 2024

> but whether the old demo client would not wait until a previous connection is completely gone before starting the next one.

I believe it did.

// The old demo client's loop returned only once the connection had
// reached Closed, i.e., it did wait for the close to complete:
if let Http3State::Closed(..) = client.state() {
    return Ok(client.state());
}


To make sure I understand the above correctly, it is not that we should start a new connection while another is still closing, but that Neqo should be more aggressive in considering a connection lost, instead of trying to close it for 90+ seconds, right?

larseggert added a commit to larseggert/neqo that referenced this issue Sep 16, 2024
fix: Don't encode large RTT guesses in tickets

Because under lossy conditions (e.g., QNS `handshakeloss` test), the
guess can be multiple times the actual RTT, which when encoded in the
resumption ticket will cause an extremely slow second handshake, often
causing the test to time out.

Broken out of mozilla#1998
Fixes mozilla#2088
github-merge-queue bot pushed a commit that referenced this issue Sep 17, 2024
* fix: Don't encode large RTT guesses in tickets

Because under lossy conditions (e.g., QNS `handshakeloss` test), the
guess can be multiple times the actual RTT, which when encoded in the
resumption ticket will cause an extremely slow second handshake, often
causing the test to time out.

Broken out of #1998
Fixes #2088

* Fixes & tests

* Suggestion from @mxinden

* Fix
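
The gist of the merged fix, per the commit message above, is to stop carrying an unconfirmed RTT guess into the resumption ticket. A rough sketch of that idea, with hypothetical names rather than the actual neqo code:

```rust
use std::time::Duration;

// Sketch of the idea only: if the RTT estimate is still a guess (no real
// samples, e.g. after a lossy handshake), omit it from the resumption
// ticket so the next connection starts from the default initial RTT.
fn rtt_for_ticket(estimate: Duration, is_guessed: bool) -> Option<Duration> {
    if is_guessed {
        None
    } else {
        Some(estimate)
    }
}
```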