Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datachannel hangs with packet loss #1270

Closed
chrbsg opened this issue Jun 22, 2020 · 19 comments
Closed

datachannel hangs with packet loss #1270

chrbsg opened this issue Jun 22, 2020 · 19 comments
Assignees

Comments

@chrbsg
Copy link

chrbsg commented Jun 22, 2020

Your environment.

  • Version: 2.2.17
  • Browser: Linux Chrome 83 and iOS Safari 12
  • Other Information -

What did you do?

This is a pion program that receives microphone RTP audio from a web browser and sends back packets on a datachannel. (I tried to reproduce this with a simple example but it does not occur, something is different code attached below).
I'm trying to simulate a datachannel issue seen on poor wifi. On Linux, simulate data loss with:

    tc qdisc add dev lo root handle 1:0 netem loss 2.5%

(alternatively, same happens with iPad on wifi)

At some point,usually in first 20 seconds, web client stops calling data channel packet receive handler. Pion log shows:

sctp DEBUG: 18:06:07.296561 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:09.296765 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:13.297110 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=3 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:21.297200 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=4 cwnd=1228 ssthresh=4912
(these messages repeat...)

There is no indication that the datachannel is hung - the ICE state is connected, the RTP microphone audio is still being transmitted by the browser and received by pion, and the datachannel state has not changed.

What did you expect?

Datachannel should be stable on poor wifi

What happened?

Datachannel hangs and never recovers

Full log PION_LOG_DEBUG=all

rx offer from client-1eddf0cb-6120-4976-af11-c2076cee14c2
pc INFO: 2020/06/22 18:06:03 signaling state changed to have-remote-offer
pc INFO: 2020/06/22 18:06:03 signaling state changed to stable
ice DEBUG: 18:06:03.615143 agent.go:595: Started agent: isControlling? false, remoteUfrag: "lv+g", remotePwd: "aQnIP1LuW/bbwIOg1Ng5WVKt"
ice INFO: 2020/06/22 18:06:03 Setting new connection state: Checking
pc INFO: 2020/06/22 18:06:03 ICE connection state changed: checking
ICE state changed: checking
ice INFO: 2020/06/22 18:06:03 Setting new connection state: Connected
pc INFO: 2020/06/22 18:06:03 ICE connection state changed: connected
ICE state changed: connected
pc INFO: 2020/06/22 18:06:03 peer connection state changed: connected
sctp DEBUG: 18:06:03.722797 association.go:778: [0xc00011e340] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 18:06:03.722812 association.go:323: [0xc00011e340] sending INIT
sctp DEBUG: 18:06:03.722831 association.go:414: [0xc00011e340] readLoop entered
sctp DEBUG: 18:06:03.722851 association.go:438: [0xc00011e340] writeLoop entered
sctp DEBUG: 18:06:03.723045 association.go:813: [0xc00011e340] chunkInit received in state 'CookieWait'
sctp DEBUG: 18:06:03.723122 association.go:875: [0xc00011e340] chunkInitAck received in state 'CookieWait'
sctp DEBUG: 18:06:03.723136 association.go:897: [0xc00011e340] initial rwnd=131072
sctp DEBUG: 18:06:03.723146 association.go:349: [0xc00011e340] sending COOKIE-ECHO
sctp DEBUG: 18:06:03.723156 association.go:778: [0xc00011e340] state change: 'CookieWait' => 'CookieEchoed'
sctp DEBUG: 18:06:03.723350 association.go:965: [0xc00011e340] COOKIE-ECHO received in state 'CookieEchoed'
sctp DEBUG: 18:06:03.723360 association.go:778: [0xc00011e340] state change: 'CookieEchoed' => 'Established'
sctp DEBUG: 18:06:03.723373 association.go:1146: [0xc00011e340] accepted a new stream (streamIdentifier: 1)
sctp DEBUG: 18:06:03.723383 stream.go:118: [1:0xc00011e340] reassemblyQueue readable=true
sctp DEBUG: 18:06:03.723389 stream.go:120: [1:0xc00011e340] readNotifier.signal()
sctp DEBUG: 18:06:03.723394 stream.go:122: [1:0xc00011e340] readNotifier.signal() done
ortc WARNING: 2020/06/22 18:06:03 webrtc.DetachDataChannels() enabled but didn't Detach, call Detach from OnOpen
sctp DEBUG: 18:06:03.723564 association.go:995: [0xc00011e340] COOKIE-ACK received in state 'Established'
pc DEBUG: 18:06:03.748571 peerconnection.go:280: got new track: &{mu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} id:ecc9ba16-9b43-4dce-8816-cb7c60a935b8 payloadType:111 kind:1 label:- ssrc:3263351828 codec:0xc000179a40 packetizer:<nil> receiver:0xc000136c00 activeSenders:[] totalSenderCount:0}
onTrack type=111 codec=opus 
sctp DEBUG: 18:06:03.923549 association.go:650: [0xc00011e340] sending SACK: SACK cumTsnAck=2488767936 arwnd=1048576 dupTsn=[]

sctp DEBUG: 18:06:07.296561 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:09.296765 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:13.297110 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=3 cwnd=1228 ssthresh=4912
sctp DEBUG: 18:06:21.297200 association.go:2113: [0xc00011e340] T3-rtx timed out: nRtos=4 cwnd=1228 ssthresh=4912

Another log

ice INFO: 2020/06/22 16:50:57 Setting new connection state: Connected
pc INFO: 2020/06/22 16:50:57 ICE connection state changed: connected
ICE state changed: connected
pc INFO: 2020/06/22 16:50:57 peer connection state changed: connected
sctp DEBUG: 16:50:57.326479 association.go:778: [0xc0008ec1a0] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 16:50:57.326563 association.go:323: [0xc0008ec1a0] sending INIT
sctp DEBUG: 16:50:57.326504 association.go:414: [0xc0008ec1a0] readLoop entered
sctp DEBUG: 16:50:57.326627 association.go:438: [0xc0008ec1a0] writeLoop entered
sctp DEBUG: 16:50:57.327773 association.go:875: [0xc0008ec1a0] chunkInitAck received in state 'CookieWait'
sctp DEBUG: 16:50:57.327814 association.go:897: [0xc0008ec1a0] initial rwnd=131072
sctp DEBUG: 16:50:57.327845 association.go:349: [0xc0008ec1a0] sending COOKIE-ECHO
sctp DEBUG: 16:50:57.327870 association.go:778: [0xc0008ec1a0] state change: 'CookieWait' => 'CookieEchoed'
sctp DEBUG: 16:50:57.328894 association.go:995: [0xc0008ec1a0] COOKIE-ACK received in state 'CookieEchoed'
sctp DEBUG: 16:50:57.328939 association.go:778: [0xc0008ec1a0] state change: 'CookieEchoed' => 'Established'
sctp DEBUG: 16:50:57.328979 association.go:1146: [0xc0008ec1a0] accepted a new stream (streamIdentifier: 1)
sctp DEBUG: 16:50:57.329014 stream.go:118: [1:0xc0008ec1a0] reassemblyQueue readable=true
sctp DEBUG: 16:50:57.329067 stream.go:120: [1:0xc0008ec1a0] readNotifier.signal()
sctp DEBUG: 16:50:57.329097 stream.go:122: [1:0xc0008ec1a0] readNotifier.signal() done
ortc WARNING: 2020/06/22 16:50:57 webrtc.DetachDataChannels() enabled but didn't Detach, call Detach from OnOpen
pc DEBUG: 16:50:57.342435 peerconnection.go:280: got new track: &{mu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} id:435a165f-5faf-4c8d-a802-3be6d4e38e5c payloadType:111 kind:1 label:- ssrc:2047485769 codec:0xc000178930 packetizer:<nil> receiver:0xc0001366c0 activeSenders:[] totalSenderCount:0}
onTrack type=111 codec=opus 
sctp DEBUG: 16:50:57.529259 association.go:650: [0xc0008ec1a0] sending SACK: SACK cumTsnAck=2488767936 arwnd=1048576 dupTsn=[]
sctp DEBUG: 16:51:01.820344 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:51:03.820602 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:51:07.820821 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=3 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:51:15.821210 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=4 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:51:31.821456 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:52:03.821745 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=6 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:54:03.822332 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=8 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:54:03.822332 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=8 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:55:03.822567 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=9 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:56:03.822903 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=10 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:57:03.823265 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=11 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:58:03.823586 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=12 cwnd=1228 ssthresh=4912
sctp DEBUG: 16:59:03.823803 association.go:2113: [0xc0008ec1a0] T3-rtx timed out: nRtos=13 cwnd=1228 ssthresh=4912
@chrbsg
Copy link
Author

chrbsg commented Jun 22, 2020

I managed to reproduce this with a test case. It appears related to datachannel options used in browser:

pc.createDataChannel('foo', {ordered: false, maxRetransmits: 0})

Those options along with 20% packet loss help to reproduce.

Code attached: pionbug.zip for server and both go and javascript client.

Logs for go client and server follow.

server

+ export PION_LOG_DEBUG=all
+ PION_LOG_DEBUG=all
+ ./server
offer?
eyJ0eXBlIjoib2ZmZXIiLCJzZHAiOiJ2PTBcclxubz0tIDQwNTM1MDcyMiAxNTkyODQ5MDMxIElOIElQNCAwLjAuMC4wXHJcbnM9LVxyXG50PTAgMFxyXG5hPWZpbmdlcnByaW50OnNoYS0yNTYgM0E6ODE6QTY6Qjg6OTY6MzM6ODE6MUE6M0I6QUM6QUI6RUI6RTA6ODg6MUM6RTM6NjE6NzQ6RjA6RDY6REY6NTc6REI6RTM6Q0Q6MTQ6MjI6QTI6QzQ6NDg6NUM6NThcclxuYT1ncm91cDpCVU5ETEUgMFxyXG5tPWFwcGxpY2F0aW9uIDkgRFRMUy9TQ1RQIDUwMDBcclxuYz1JTiBJUDQgMC4wLjAuMFxyXG5hPXNldHVwOmFjdHBhc3NcclxuYT1taWQ6MFxyXG5hPXNlbmRyZWN2XHJcbmE9c2N0cG1hcDo1MDAwIHdlYnJ0Yy1kYXRhY2hhbm5lbCAxMDI0XHJcbmE9aWNlLXVmcmFnOlVOaHNZZXdwT2hJcVRFU3FcclxuYT1pY2UtcHdkOklGYUdiRlNDUEtUZkdQb2NFQXRub3lKaUNneVVLTnJFXHJcbmE9Y2FuZGlkYXRlOmZvdW5kYXRpb24gMSB1ZHAgMjEzMDcwNjQzMSAxOTIuMTY4LjEuMTYgMzU2OTYgdHlwIGhvc3QgZ2VuZXJhdGlvbiAwXHJcbmE9Y2FuZGlkYXRlOmZvdW5kYXRpb24gMiB1ZHAgMjEzMDcwNjQzMSAxOTIuMTY4LjEuMTYgMzU2OTYgdHlwIGhvc3QgZ2VuZXJhdGlvbiAwXHJcbmE9ZW5kLW9mLWNhbmRpZGF0ZXNcclxuIn0=
pc INFO: 2020/06/22 19:03:54 signaling state changed to have-remote-offer
ice DEBUG: 19:03:54.179547 agent.go:595: Started agent: isControlling? false, remoteUfrag: "UNhsYewpOhIqTESq", remotePwd: "IFaGbFSCPKTfGPocEAtnoyJiCgyUKNrE"
ice INFO: 2020/06/22 19:03:54 Setting new connection state: Checking
pc INFO: 2020/06/22 19:03:54 ICE connection state changed: checking
server: ICE state changed: checking
pc INFO: 2020/06/22 19:03:54 signaling state changed to stable
answer = eyJ0eXBlIjoiYW5zd2VyIiwic2RwIjoidj0wXHJcbm89LSAxMDY4NDM4MDM1IDE1OTI4NDkwMzQgSU4gSVA0IDAuMC4wLjBcclxucz0tXHJcbnQ9MCAwXHJcbmE9ZmluZ2VycHJpbnQ6c2hhLTI1NiAyNDo3NDoyQjpGOTpGQTozRDo1ODo0QjozMTo4QToxNzowNzo4RjowQTpCRTo5NDoxQzpGQjo5Mzo1QzpBMTo0MzowNzpGQjoyNjpFQjoyQzpBQjo1NDoyRTo1NzpBNlxyXG5hPWdyb3VwOkJVTkRMRSAwXHJcbm09YXBwbGljYXRpb24gOSBEVExTL1NDVFAgNTAwMFxyXG5jPUlOIElQNCAwLjAuMC4wXHJcbmE9c2V0dXA6YWN0aXZlXHJcbmE9bWlkOjBcclxuYT1zZW5kcmVjdlxyXG5hPXNjdHBtYXA6NTAwMCB3ZWJydGMtZGF0YWNoYW5uZWwgMTAyNFxyXG5hPWljZS11ZnJhZzpLSUNvc1JyU3BRWHlHT1VhXHJcbmE9aWNlLXB3ZDphQnBLdXZqYmlVcklNVmluelNSUFh6RHFXTVF5WVNCeFxyXG5hPWNhbmRpZGF0ZTpmb3VuZGF0aW9uIDEgdWRwIDIxMzA3MDY0MzEgMTkyLjE2OC4xLjE2IDU2NDcyIHR5cCBob3N0IGdlbmVyYXRpb24gMFxyXG5hPWNhbmRpZGF0ZTpmb3VuZGF0aW9uIDIgdWRwIDIxMzA3MDY0MzEgMTkyLjE2OC4xLjE2IDU2NDcyIHR5cCBob3N0IGdlbmVyYXRpb24gMFxyXG5hPWVuZC1vZi1jYW5kaWRhdGVzXHJcbiJ9
ice INFO: 2020/06/22 19:04:03 Setting new connection state: Connected
pc INFO: 2020/06/22 19:04:03 ICE connection state changed: connected
server: ICE state changed: connected
pc INFO: 2020/06/22 19:04:06 peer connection state changed: connected
sctp DEBUG: 19:04:06.407062 association.go:778: [0xc000310000] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 19:04:06.407068 association.go:414: [0xc000310000] readLoop entered
sctp DEBUG: 19:04:06.407120 association.go:323: [0xc000310000] sending INIT
sctp DEBUG: 19:04:06.407331 association.go:438: [0xc000310000] writeLoop entered
sctp DEBUG: 19:04:06.408052 association.go:875: [0xc000310000] chunkInitAck received in state 'CookieWait'
sctp DEBUG: 19:04:06.408097 association.go:897: [0xc000310000] initial rwnd=1048576
sctp DEBUG: 19:04:06.408125 association.go:349: [0xc000310000] sending COOKIE-ECHO
sctp DEBUG: 19:04:06.408151 association.go:778: [0xc000310000] state change: 'CookieWait' => 'CookieEchoed'
sctp DEBUG: 19:04:06.408842 association.go:995: [0xc000310000] COOKIE-ACK received in state 'CookieEchoed'
sctp DEBUG: 19:04:06.408888 association.go:778: [0xc000310000] state change: 'CookieEchoed' => 'Established'
sctp DEBUG: 19:04:06.409080 association.go:1146: [0xc000310000] accepted a new stream (streamIdentifier: 1)
sctp DEBUG: 19:04:06.409157 stream.go:118: [1:0xc000310000] reassemblyQueue readable=true
sctp DEBUG: 19:04:06.409184 stream.go:120: [1:0xc000310000] readNotifier.signal()
sctp DEBUG: 19:04:06.409205 stream.go:122: [1:0xc000310000] readNotifier.signal() done
ondatachannel
tx 0
sctp DEBUG: 19:04:06.451350 association.go:2011: [0xc000310000] Error chunk, with following errors: (Unrecognized Chunk Type)
sctp DEBUG: 19:04:06.609588 association.go:650: [0xc000310000] sending SACK: SACK cumTsnAck=1617359952 arwnd=1048576 dupTsn=[]
tx 50
sctp DEBUG: 19:04:07.430961 association.go:2113: [0xc000310000] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=4912
tx 100
tx 150
sctp DEBUG: 19:04:09.431121 association.go:2113: [0xc000310000] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
sctp DEBUG: 19:04:09.431440 association.go:2011: [0xc000310000] Error chunk, with following errors: (Unrecognized Chunk Type)
tx 200
tx 250
tx 300
tx 350
sctp DEBUG: 19:04:13.431384 association.go:2113: [0xc000310000] T3-rtx timed out: nRtos=3 cwnd=1228 ssthresh=4912
sctp DEBUG: 19:04:13.432126 association.go:2011: [0xc000310000] Error chunk, with following errors: (Unrecognized Chunk Type)
tx 400
tx 450
tx 500
tx 550
tx 600
tx 650
tx 700
tx 750
sctp DEBUG: 19:04:21.431644 association.go:2113: [0xc000310000] T3-rtx timed out: nRtos=4 cwnd=1228 ssthresh=4912
sctp DEBUG: 19:04:21.432415 association.go:2011: [0xc000310000] Error chunk, with following errors: (Unrecognized Chunk Type)
tx 800
tx 850
tx 900
tx 950
tx 1000
tx 1050
tx 1100
tx 1150
tx 1200
tx 1250
tx 1300
tx 1350
tx 1400
tx 1450
tx 1500
tx 1550
sctp DEBUG: 19:04:37.431856 association.go:2113: [0xc000310000] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912
sctp DEBUG: 19:04:37.432551 association.go:2011: [0xc000310000] Error chunk, with following errors: (Unrecognized Chunk Type)
tx 1600
tx 1650
tx 1700
tx 1750
tx 1800
tx 1850
tx 1900

client

+ export PION_LOG_DEBUG=all
+ PION_LOG_DEBUG=all
+ ./client
pc INFO: 2020/06/22 19:03:51 signaling state changed to have-local-offer
offer = eyJ0eXBlIjoib2ZmZXIiLCJzZHAiOiJ2PTBcclxubz0tIDQwNTM1MDcyMiAxNTkyODQ5MDMxIElOIElQNCAwLjAuMC4wXHJcbnM9LVxyXG50PTAgMFxyXG5hPWZpbmdlcnByaW50OnNoYS0yNTYgM0E6ODE6QTY6Qjg6OTY6MzM6ODE6MUE6M0I6QUM6QUI6RUI6RTA6ODg6MUM6RTM6NjE6NzQ6RjA6RDY6REY6NTc6REI6RTM6Q0Q6MTQ6MjI6QTI6QzQ6NDg6NUM6NThcclxuYT1ncm91cDpCVU5ETEUgMFxyXG5tPWFwcGxpY2F0aW9uIDkgRFRMUy9TQ1RQIDUwMDBcclxuYz1JTiBJUDQgMC4wLjAuMFxyXG5hPXNldHVwOmFjdHBhc3NcclxuYT1taWQ6MFxyXG5hPXNlbmRyZWN2XHJcbmE9c2N0cG1hcDo1MDAwIHdlYnJ0Yy1kYXRhY2hhbm5lbCAxMDI0XHJcbmE9aWNlLXVmcmFnOlVOaHNZZXdwT2hJcVRFU3FcclxuYT1pY2UtcHdkOklGYUdiRlNDUEtUZkdQb2NFQXRub3lKaUNneVVLTnJFXHJcbmE9Y2FuZGlkYXRlOmZvdW5kYXRpb24gMSB1ZHAgMjEzMDcwNjQzMSAxOTIuMTY4LjEuMTYgMzU2OTYgdHlwIGhvc3QgZ2VuZXJhdGlvbiAwXHJcbmE9Y2FuZGlkYXRlOmZvdW5kYXRpb24gMiB1ZHAgMjEzMDcwNjQzMSAxOTIuMTY4LjEuMTYgMzU2OTYgdHlwIGhvc3QgZ2VuZXJhdGlvbiAwXHJcbmE9ZW5kLW9mLWNhbmRpZGF0ZXNcclxuIn0=
answer?
ice DEBUG: 19:03:54.179841 agent.go:1078: inbound isControlled && a.isControlling == false
ice DEBUG: 19:03:56.180178 agent.go:1078: inbound isControlled && a.isControlling == false
eyJ0eXBlIjoiYW5zd2VyIiwic2RwIjoidj0wXHJcbm89LSAxMDY4NDM4MDM1IDE1OTI4NDkwMzQgSU4gSVA0IDAuMC4wLjBcclxucz0tXHJcbnQ9MCAwXHJcbmE9ZmluZ2VycHJpbnQ6c2hhLTI1NiAyNDo3NDoyQjpGOTpGQTozRDo1ODo0QjozMTo4QToxNzowNzo4RjowQTpCRTo5NDoxQzpGQjo5Mzo1QzpBMTo0MzowNzpGQjoyNjpFQjoyQzpBQjo1NDoyRTo1NzpBNlxyXG5hPWdyb3VwOkJVTkRMRSAwXHJcbm09YXBwbGljYXRpb24gOSBEVExTL1NDVFAgNTAwMFxyXG5jPUlOIElQNCAwLjAuMC4wXHJcbmE9c2V0dXA6YWN0aXZlXHJcbmE9bWlkOjBcclxuYT1zZW5kcmVjdlxyXG5hPXNjdHBtYXA6NTAwMCB3ZWJydGMtZGF0YWNoYW5uZWwgMTAyNFxyXG5hPWljZS11ZnJhZzpLSUNvc1JyU3BRWHlHT1VhXHJcbmE9aWNlLXB3ZDphQnBLdXZqYmlVcklNVmluelNSUFh6RHFXTVF5WVNCeFxyXG5hPWNhbmRpZGF0ZTpmb3VuZGF0aW9uIDEgdWRwIDIxMzA3MDY0MzEgMTkyLjE2OC4xLjE2IDU2NDcyIHR5cCBob3N0IGdlbmVyYXRpb24gMFxyXG5hPWNhbmRpZGF0ZTpmb3VuZGF0aW9uIDIgdWRwIDIxMzA3MDY0MzEgMTkyLjE2OC4xLjE2IDU2NDcyIHR5cCBob3N0IGdlbmVyYXRpb24gMFxyXG5hPWVuZC1vZi1jYW5kaWRhdGVzXHJcbiJ9
pc INFO: 2020/06/22 19:03:59 signaling state changed to stable
ice DEBUG: 19:03:59.397637 agent.go:595: Started agent: isControlling? true, remoteUfrag: "KICosRrSpQXyGOUa", remotePwd: "aBpKuvjbiUrIMVinzSRPXzDqWMQyYSBx"
ice INFO: 2020/06/22 19:03:59 Setting new connection state: Checking
pc INFO: 2020/06/22 19:03:59 ICE connection state changed: checking
client: ICE state changed: checking
ice INFO: 2020/06/22 19:04:05 Setting new connection state: Connected
pc INFO: 2020/06/22 19:04:05 ICE connection state changed: connected
client: ICE state changed: connected
dtls DEBUG: 19:04:06.404560 conn.go:722: CipherSuite not initialized, queuing packet
dtls DEBUG: 19:04:06.404618 conn.go:644: received packet of next epoch, queuing packet
pc INFO: 2020/06/22 19:04:06 peer connection state changed: connected
sctp DEBUG: 19:04:06.406608 association.go:778: [0xc0002f0000] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 19:04:06.406586 association.go:414: [0xc0002f0000] readLoop entered
sctp DEBUG: 19:04:06.406651 association.go:323: [0xc0002f0000] sending INIT
sctp DEBUG: 19:04:06.406665 association.go:438: [0xc0002f0000] writeLoop entered
sctp DEBUG: 19:04:06.407671 association.go:813: [0xc0002f0000] chunkInit received in state 'CookieWait'
sctp DEBUG: 19:04:06.408482 association.go:965: [0xc0002f0000] COOKIE-ECHO received in state 'CookieWait'
sctp DEBUG: 19:04:06.408536 association.go:778: [0xc0002f0000] state change: 'CookieWait' => 'Established'
dc.onopen
sctp DEBUG: 19:04:06.409831 stream.go:118: [1:0xc0002f0000] reassemblyQueue readable=true
sctp DEBUG: 19:04:06.409866 stream.go:120: [1:0xc0002f0000] readNotifier.signal()
sctp DEBUG: 19:04:06.409903 stream.go:122: [1:0xc0002f0000] readNotifier.signal() done
sctp DEBUG: 19:04:06.430316 association.go:650: [0xc0002f0000] sending SACK: SACK cumTsnAck=1671915135 arwnd=1047376 dupTsn=[]
sctp DEBUG: 19:04:06.450274 stream.go:118: [1:0xc0002f0000] reassemblyQueue readable=true
sctp DEBUG: 19:04:06.450315 stream.go:120: [1:0xc0002f0000] readNotifier.signal()
sctp DEBUG: 19:04:06.450337 stream.go:122: [1:0xc0002f0000] readNotifier.signal() done
rx 0
sctp DEBUG: 19:04:06.450429 association.go:650: [0xc0002f0000] sending SACK: SACK cumTsnAck=1671915135 arwnd=1047376 dupTsn=[]
 gap ack: 2 - 3
sctp DEBUG: 19:04:06.470609 association.go:650: [0xc0002f0000] sending SACK: SACK cumTsnAck=1671915135 arwnd=1046576 dupTsn=[]
 gap ack: 2 - 3
 gap ack: 5 - 5
sctp DEBUG: 19:04:06.490287 stream.go:118: [1:0xc0002f0000] reassemblyQueue readable=true
sctp DEBUG: 19:04:06.490343 stream.go:120: [1:0xc0002f0000] readNotifier.signal()
sctp DEBUG: 19:04:06.490360 stream.go:122: [1:0xc0002f0000] readNotifier.signal() done
rx 1
sctp DEBUG: 19:04:06.490427 association.go:650: [0xc0002f0000] sending SACK: SACK cumTsnAck=1671915135 arwnd=1046576 dupTsn=[]
 gap ack: 2 - 3
 gap ack: 5 - 7
sctp DEBUG: 19:04:06.510233 stream.go:118: [1:0xc0002f0000] reassemblyQueue readable=true
sctp DEBUG: 19:04:06.510271 stream.go:120: [1:0xc0002f0000] readNotifier.signal()
sctp DEBUG: 19:04:06.510288 stream.go:122: [1:0xc0002f0000] readNotifier.signal() done
rx 2
sctp DEBUG: 19:04:06.510405 association.go:650: [0xc0002f0000] sending SACK: SACK cumTsnAck=1671915135 arwnd=1046576 dupTsn=[]
 gap ack: 2 - 3
 gap ack: 5 - 9
(seems to hang here)

@enobufs enobufs self-assigned this Jun 22, 2020
@enobufs
Copy link
Member

enobufs commented Jun 22, 2020

Looks like partial-reliability negotiation failed
I confirmed partial reliability options from Chrome correctly reach Pion. (using v2.2.17). I haven't seen the hang yet. I see some hiccups due to retransmission of FWD-TSN for a moment, but it does not persist. Will keep trying different ways...

@enobufs
Copy link
Member

enobufs commented Jun 27, 2020

Using pion/webrtc (v2.2.17) on both ends with 20% packetloss (with tylertreat/comcast), during the association handshake, I noticed this trace message on one end (client side):

sctp TRACE: 15:46:22.901000 association.go:1533: [0xc000095a00] building fwdtsn: newCumulativeTSN=339618363 cumTSN=339618362 - (si=1 ssn=0)
sctp DEBUG: 15:46:23.103423 association.go:2014: [0xc000095a00] Error chunk, with following errors: (Unrecognized Chunk Type)

On the server end, I see:

sctp TRACE: 15:46:23.002748 association.go:1577: [0xc00029e000] FwdTSN: New Cumulative TSN: 339618363
 - si=1, ssn=0

These logs occurred after COOKIE-ACK was received on the client end and the client began DCEP handshake. The problem is the client sending FWD-TSN (as maxRetransmits is set to 0), before the DCEP handshake completes. The server-side sees unexpected FWD-TSN, it responds with an Error Chunk.

@enobufs
Copy link
Member

enobufs commented Jun 28, 2020

I fixed the above locally. Now DCEP handshake stably completes despite the large packet loss.

Now, I am seeing a stall. When that happens, I see this error from ICE:

ice TRACE: 01:46:50.806088 selection.go:208: checking keepalive
ice TRACE: 01:46:50.806143 agent.go:988: ping STUN from host 10.0.0.135:59705 to host 10.0.0.135:51605
ice TRACE: 01:46:50.806153 agent.go:1035: Discarded 1 binding requests because they expired
 :
ice TRACE: 01:47:00.802204 agent.go:655: Set selected candidate pair: <nil>
ice INFO: 2020/06/28 01:47:00 Setting new connection state: Disconnected
 :
ice TRACE: 01:47:02.579074 agent.go:1132: inbound STUN (Request) from 10.0.0.135:51605 to host 10.0.0.135:59705
ice TRACE: 01:47:02.579156 agent.go:988: ping STUN from host 10.0.0.135:59705 to host 10.0.0.135:51605
ice TRACE: 01:47:02.579164 agent.go:1035: Discarded 1 binding requests because they expired
 :
ice TRACE: 01:47:02.781949 selection.go:261: inbound STUN (SuccessResponse) from host 10.0.0.135:51605 to host 10.0.0.135:59705
ice TRACE: 01:47:02.781988 selection.go:271: Found valid candidate pair: prio 9151314440652587007 (local, prio 2130706431) host 10.0.0.135:59705 <-> host 10.0.0.135:51605 (remote, prio 2130706431)
ice TRACE: 01:47:02.801787 selection.go:213: check timeout reached and no valid candidate pair found, marking connection as failed
ice INFO: 2020/06/28 01:47:02 Setting new connection state: Failed

@Sean-Der I am using Comcast to cause packet loss, now 20% loss ratio. Do you think 20% loss is too much for ice to work correctly?

@enobufs
Copy link
Member

enobufs commented Jun 29, 2020

I have found another bug, this time in pion/sctp. During the association handshake, if INIT-ACK was not received due to a packet loss, the node that did not receive INIT-ACK wouldn't have a.useForwardTSN set to true. This could happen because WebRTC does always the simultaneous open, and it is legit that node in the COOKIE-WAIT state receives COOKIE-ECHO and transitions straight to ESTABLISH estate. The problem is, current pion/sctp does not evaluate the extension parameters ("forward tsn supported" parameter) in the INIT chunk and always relying on the parameter on INIT-ACK only. When INIT-ACK is lost and it transitions to ESTABLISH state without receiving retransmitted INIT-ACK, the it won't be able to enable Forward TSN mode.

By fixing this in my local environment, I no longer the situation.

I will create two PRs, one for pion/datachannel and the other for pion/sctp shortly.

@enobufs
Copy link
Member

enobufs commented Jun 29, 2020

@tuexen Sorry to bother you but I could not find an official document that describes expected simultaneous-open in SCTP. The state transition chart below was derived from a compilation of what I saw in my search a while ago, but I have not come across the official spec and I wonder if you could refer me to an official document you know of?

Screen Shot 2020-06-28 at 6 24 46 PM

@enobufs
Copy link
Member

enobufs commented Jun 29, 2020

@tuexen never mind. I realized the text I was looking for was actually in RFC 4960 Sec 5.2. (I thought "simultaneous open" was an extension.

enobufs added a commit to pion/sctp that referenced this issue Jun 29, 2020
enobufs added a commit to pion/sctp that referenced this issue Jun 29, 2020
enobufs added a commit to pion/datachannel that referenced this issue Jun 29, 2020
enobufs added a commit to pion/sctp that referenced this issue Jun 29, 2020
enobufs added a commit to pion/datachannel that referenced this issue Jun 29, 2020
enobufs added a commit to pion/sctp that referenced this issue Jun 29, 2020
enobufs added a commit to pion/datachannel that referenced this issue Jun 29, 2020
@tuexen
Copy link

tuexen commented Jun 29, 2020

@tuexen never mind. I realized the text I was looking for was actually in RFC 4960 Sec 5.2. (I thought "simultaneous open" was an extension.

Glad you found it.

Sean-Der pushed a commit to pion/datachannel that referenced this issue Jun 29, 2020
Sean-Der pushed a commit to pion/sctp that referenced this issue Jun 29, 2020
@enobufs
Copy link
Member

enobufs commented Jul 5, 2020

@chrbsg told me that with the two fixes (pion/datachannel#64 and pion/sctp#127) made the situation much better, but he was still seeing the hang.

As I was digging more, I found another bug:

sctp TRACE: 15:39:04.534699 association.go:282: [0xc000314000] updated cwnd=4380 ssthresh=0 inflight=0 (INI)
sctp DEBUG: 15:39:04.534729 association.go:780: [0xc000314000] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 15:39:04.534740 association.go:323: [0xc000314000] sending INIT
sctp DEBUG: 15:39:04.534740 association.go:414: [0xc000314000] readLoop entered
sctp DEBUG: 15:39:04.534768 association.go:440: [0xc000314000] writeLoop entered
sctp TRACE: 15:39:04.636587 association.go:282: [0xc0003141a0] updated cwnd=4380 ssthresh=0 inflight=0 (INI)
sctp DEBUG: 15:39:04.636609 association.go:780: [0xc0003141a0] state change: 'Closed' => 'CookieWait'
sctp DEBUG: 15:39:04.636620 association.go:323: [0xc0003141a0] sending INIT
sctp DEBUG: 15:39:04.636679 association.go:440: [0xc0003141a0] writeLoop entered
sctp DEBUG: 15:39:04.636677 association.go:414: [0xc0003141a0] readLoop entered
sctp DEBUG: 15:39:04.637400 association.go:815: [0xc0003141a0] chunkInit received in state 'CookieWait'
sctp DEBUG: 15:39:04.637419 association.go:848: [0xc0003141a0] use ForwardTSN (on init)
sctp DEBUG: 15:39:04.738093 association.go:815: [0xc000314000] chunkInit received in state 'CookieWait'
sctp DEBUG: 15:39:04.738109 association.go:848: [0xc000314000] use ForwardTSN (on init)
sctp DEBUG: 15:39:04.739320 association.go:887: [0xc000314000] chunkInitAck received in state 'CookieWait'
sctp DEBUG: 15:39:04.739343 association.go:909: [0xc000314000] initial rwnd=1048576
sctp TRACE: 15:39:04.739349 association.go:916: [0xc000314000] updated cwnd=4380 ssthresh=1048576 inflight=0 (INI)
sctp DEBUG: 15:39:04.739360 association.go:930: [0xc000314000] use ForwardTSN (on initAck)
sctp DEBUG: 15:39:04.739364 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:39:04.739371 association.go:780: [0xc000314000] state change: 'CookieWait' => 'CookieEchoed'
sctp DEBUG: 15:39:04.840245 association.go:887: [0xc0003141a0] chunkInitAck received in state 'CookieWait'
sctp DEBUG: 15:39:04.840262 association.go:909: [0xc0003141a0] initial rwnd=1048576
sctp TRACE: 15:39:04.840268 association.go:916: [0xc0003141a0] updated cwnd=4380 ssthresh=1048576 inflight=0 (INI)
sctp DEBUG: 15:39:04.840276 association.go:930: [0xc0003141a0] use ForwardTSN (on initAck)
sctp DEBUG: 15:39:04.840281 association.go:349: [0xc0003141a0] sending COOKIE-ECHO
sctp DEBUG: 15:39:04.840288 association.go:780: [0xc0003141a0] state change: 'CookieWait' => 'CookieEchoed'
sctp DEBUG: 15:39:07.739404 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:39:07.840123 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'CookieEchoed'
sctp DEBUG: 15:39:07.840144 association.go:780: [0xc0003141a0] state change: 'CookieEchoed' => 'Established'
sctp DEBUG: 15:39:13.740308 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:39:13.841671 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'Established'
sctp DEBUG: 15:39:25.740444 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:39:49.744651 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:39:49.845593 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'Established'
sctp DEBUG: 15:40:37.743661 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:41:37.746298 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:41:37.848089 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'Established'
sctp DEBUG: 15:42:37.748492 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:42:37.850424 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'Established'
sctp DEBUG: 15:43:37.749497 association.go:349: [0xc000314000] sending COOKIE-ECHO
sctp DEBUG: 15:43:37.749819 association.go:981: [0xc0003141a0] COOKIE-ECHO received in state 'Established'
sctp ERROR: 2020/07/05 15:44:37 [0xc000314000] retransmission failure: T1-cookie

The peer (association ptr being 0xc000314000) keeps retransmitting COOKIE-ECHO, then finally T1-cookie timed out. The problem is the other end (association 0xc0003141a0) was receiving the COOKIE-ECHO but it never sends back COOKIE-ACK once it entered into Established state. (Bug)

enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
@tuexen
Copy link

tuexen commented Jul 6, 2020

Can you provide a .pcapng file and the debug output of usrsctp assuming that it is your peer?

@enobufs
Copy link
Member

enobufs commented Jul 6, 2020

Thanks @tuexen - but this is happening between pion nodes. :) (haven't thoroughly tested with browsers yet but will do)

enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 6, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
@tuexen
Copy link

tuexen commented Jul 6, 2020

Thanks @tuexen - but this is happening between pion nodes. :) (haven't thoroughly tested with browsers yet but will do)

Sorry, I wasn't realising that this is a bug report against pion, I assumed that it is a bug report against usrsctp. Sorry for the noise...

enobufs added a commit to pion/sctp that referenced this issue Jul 13, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 13, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 13, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
This is related to #62 in which we removed the use of immediate ack (I)
bit. This test became unstable as we no longer use immediate ack.
Relates to pion/webrtc#1270
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
enobufs added a commit to pion/sctp that referenced this issue Jul 15, 2020
@enobufs
Copy link
Member

enobufs commented Jul 16, 2020

@Sean-Der @at-wat With the current pion/sctp and pion/datachannel master heads, data channel transmission over lossy connection is very stable to me and I no longer see any hang even with 20% of (extremely bad) packet loss. So far I have done the following combinations:

1:1 connection

  • pion server vs desktop Chrome client
  • pion server vs desktop Firefox client
  • pion server vs android Chrome client

1:2 connections

  • pion server vs desktop Chrome and desktop Firefox clients
  • pion server vs desktop Chrome and desktop Safari clients
  • pion server vs desktop Chrome and android Chrome clients

As to reliability options I have done both of these settings:

  • { ordered: true, maxRetransmits: null, maxPacketLifeTime: null }
  • { ordered: fase, maxRetransmits: 0, maxPacketLifeTime: null } // the reported case

With the fixes (pion/sctp#127, pion/datachannel#64, pion/sctp#130) which are all landed on master, data channel got very robust and stable. Please let me tag the latest pion/sctp and pion/datachannel.

@enobufs
Copy link
Member

enobufs commented Jul 16, 2020

Tagged:

@Sean-Der
Copy link
Member

That is amazing @enobufs! Sorry I haven't been more involved so much (great!) stuff going on with Pion I am losing track :(

@at-wat
Copy link
Member

at-wat commented Jul 16, 2020

@enobufs pion/[email protected] still refers pion/[email protected].
I'm going to merge pion/datachannel#68 and tag pion/[email protected] then old reference to pion/[email protected] will be gone.

@enobufs
Copy link
Member

enobufs commented Jul 16, 2020

Oops... I forgot about pion/datachannel still depends on pion/sctp for testing... Thanks @at-wat so much for your help!

I guess we could remove datachannel's dependency on pion/sctp by using interface... well, someday!

@enobufs
Copy link
Member

enobufs commented Jul 16, 2020

@at-wat I updated sctp and datachannel versions to the latest on v2 branch. FYI.

@enobufs
Copy link
Member

enobufs commented Jul 16, 2020

Now all the fixes on this issue have been landed on pion/webrtc/[email protected]. (many thanks to @at-wat)
I have chatted with @chrbsg also. Let me close this issue.

@enobufs enobufs closed this as completed Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants