File descriptor (socket) leak #126

Closed
Tracked by #1
qbx2 opened this issue Oct 26, 2021 · 14 comments

Comments

@qbx2
Contributor

qbx2 commented Oct 26, 2021

Sockets opened in ICEGatherer appear to be leaked: they are never closed, even when the PeerConnections are closed. Eventually my server becomes unavailable with a "too many open files" error.

I've looked into webrtc's source code and found that CandidateBase.close() is a no-op. That is because tokio's UdpSocket does not provide a close() method; the socket is only closed when it is dropped. Therefore, I suspect the socket is never dropped.

RTCPeerConnection holds an Arc<PeerConnectionInternal>, and there are other holders of that Arc too. In v0.2.1 the RTCPeerConnection itself is dropped without problems, but the other holders do not seem to drop the internal. I don't know exactly where all of them are, but internal_rtcp_writer (peer_connection.rs, line 170) or pci (peer_connection.rs, lines 1163, 1345, 1393) may have created a reference cycle.

If possible, those references should be replaced with std::sync::Weak to break the cycle. Pion and other WebRTC libraries written in garbage-collected languages may not suffer from this issue because the GC can detect and free circular references. Since Rust has no GC, we should use weak references to avoid this problem. This would fix other memory leaks too.
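
A minimal, self-contained sketch (generic Parent/Child types, not webrtc-rs's actual ones) of how replacing one Arc in a cycle with std::sync::Weak lets Drop run and resources get freed:

    use std::sync::{Arc, Mutex, Weak};

    struct Parent {
        child: Mutex<Option<Arc<Child>>>,
    }

    struct Child {
        // A Weak back-reference instead of an Arc breaks the cycle,
        // so the Parent (and whatever resources it owns) can actually be dropped.
        parent: Weak<Parent>,
    }

    impl Drop for Parent {
        fn drop(&mut self) {
            println!("Parent dropped");
        }
    }

    fn main() {
        let parent = Arc::new(Parent { child: Mutex::new(None) });
        let child = Arc::new(Child { parent: Arc::downgrade(&parent) });
        *parent.child.lock().unwrap() = Some(child);

        // The child can still reach the parent while both are alive...
        assert!(parent.child.lock().unwrap().as_ref().unwrap().parent.upgrade().is_some());

        // ...and because the back-reference is Weak, dropping the last Arc<Parent>
        // actually runs Drop here (prints "Parent dropped") instead of leaking.
        drop(parent);
    }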

@rainliu
Member

rainliu commented Oct 26, 2021

Thanks for reporting this issue.

Based on your information, I ran Valgrind on the data-channels-flow-control example, and it detected at least two leaks, including:
webrtc_sctp::association::Association::new
webrtc::api::API::new_ice_gatherer

I will fix these leaks ASAP.

@wdv4758h
Contributor

I was thinking of checking the Arc usage too. In another project of mine (using GStreamer for WebRTC), Arc downgrade/upgrade solved the cycle problem.

@rainliu
Member

rainliu commented Oct 31, 2021

I think I have made some progress on fixing the reference-cycle-caused memory leaks.

reflect and data-channels-flow-control in the examples v0.3.0 branch should be free of memory leaks now.

I checked with a nightly Rust toolchain and the following build:
RUSTFLAGS="-Z sanitizer=leak" cargo build --example reflect

It reports no memory leaks.

I will keep checking and fixing the other examples to make sure there are no reference-cycle-caused memory leaks.

@rainliu rainliu mentioned this issue Oct 31, 2021
@qbx2
Contributor Author

qbx2 commented Oct 31, 2021

Unfortunately, I don't think the file descriptor leak is fixed yet (on the master branch). I am going to look further into it on Monday.

@rainliu
Member

rainliu commented Oct 31, 2021

Unfortunately, I don't think the file descriptor leak is fixed yet (on the master branch). I am going to look further into it on Monday.

All examples in the v0.3.0 branch of the examples repo now pass the rc-cycle check. Please note that this requires some changes to the callback functions:

For example, here we have to use a Weak pointer to peer_connection in its on_track callback; otherwise, it will cause a reference-cycle leak.

let pc = Arc::downgrade(&peer_connection);
peer_connection
        .on_track(Box::new(
            move |track: Option<Arc<TrackRemote>>, _receiver: Option<Arc<RTCRtpReceiver>>| {
                // clone and upgrade the Weak inside the callback instead of capturing the Arc
                let pc = pc.clone();
                Box::pin(async move { let _pc = pc.upgrade(); /* ... */ })
            },
        ))
        .await;

Could you check whether your code has such callbacks?

@qbx2
Contributor Author

qbx2 commented Nov 1, 2021

RTCPeerConnection is dropped just fine, and the file descriptor is closed in normal cases. The leak seems to happen only when the connection fails.

Steps to reproduce:

  1. I added the following code to CandidateBase.close() (line 255)
        if let Some(conn) = &self.conn {
            let _ = conn.close().await;

            use std::sync::Weak;
            use std::time::Duration;
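            // debug: print the remaining reference counts every second until the conn is fully dropped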
            let conn_weak = Arc::downgrade(conn);
            tokio::spawn(async move {
                let mut int = tokio::time::interval(Duration::from_secs(1));
                loop {
                    int.tick().await;
                    dbg!(Weak::weak_count(&conn_weak));
                    if dbg!(Weak::strong_count(&conn_weak)) == 0 {
                        break;
                    }
                }
            });
        }
  2. Signal (offer) from the other peer and forcefully kill the offer-side process
  3. My program spams this log for a while:
    [2021-11-01T04:17:16Z TRACE webrtc_ice::agent::agent_internal] [controlled]: pinging all candidates
  4. ICE connection state changes to failed
  5. conn_weak still has a strong_count of 1:
[/Users/qbx2/ice/src/candidate/candidate_base.rs:263] Weak::weak_count(&conn_weak) = 1
[/Users/qbx2/ice/src/candidate/candidate_base.rs:264] Weak::strong_count(&conn_weak) = 1

I reproduced using ~50 clients.

@rainliu
Member

rainliu commented Nov 13, 2021

RTCPeerConnection is dropped just fine, and the file descriptor is closed in normal cases. The leak seems to happen only when the connection fails. [...]

@qbx2, I tried to reproduce this issue in the ice crate with the ping_pong example. Based on the new ping_pong example, the ice crate does not look like the root cause of this leak.

Therefore, I think the leak may come from the webrtc crate itself, so I fixed one possible issue in 0944adf, where the endpoint forgot to call the underlying conn.close().

Could you try again?

In addition, to make this issue easier to reproduce, is there any example in https://github.com/webrtc-rs/examples that could be hacked to reproduce it? With the same codebase, it is easier to debug.

Thank you.

@qbx2
Contributor Author

qbx2 commented Nov 14, 2021

I tried webrtc on the master branch, but had no luck. :(
Also, I am sorry about the codebase. My code is tied to company work, which is why I cannot share the details. I am going to put together a reproduction example to share with you. (I guess it is difficult to reproduce with the existing examples because my code uses a different signaling protocol?)

@qbx2
Contributor Author

qbx2 commented Nov 14, 2021

@rainliu Please check this out: qbx2/webrtc-rs-examples@5f5c3a5

@rainliu
Member

rainliu commented Nov 14, 2021

@rainliu Please check this out: qbx2/examples@5f5c3a5

@qbx2, thanks for this example; I can reproduce the issue.

@rainliu
Member

rainliu commented Nov 15, 2021

@qbx2, I think I found the root cause of the bug. e7d72d2 should fix this issue.

The root cause is that the cancel_tx signal must be stored in internal.cancel_tx before calling agent.dial or agent.accept; otherwise, calling agent.close() is a no-op with respect to cancelling dial/accept.

            let (cancel_tx, cancel_rx) = mpsc::channel(1);
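            // store cancel_tx in internal *before* calling agent.dial/agent.accept,
            // so that agent.close() can actually cancel a pending dial/accept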
            {
                let mut internal = self.internal.lock().await;
                internal.role = role;
                internal.cancel_tx = Some(cancel_tx);
            }

            let conn: Arc<dyn Conn + Send + Sync> = match role {
                RTCIceRole::Controlling => {
                    agent
                        .dial(
                            cancel_rx,
                            params.username_fragment.clone(),
                            params.password.clone(),
                        )
                        .await?
                }
                RTCIceRole::Controlled => {
                    agent
                        .accept(
                            cancel_rx,
                            params.username_fragment.clone(),
                            params.password.clone(),
                        )
                        .await?
                }
                _ => return Err(Error::ErrICERoleUnknown),
            };

In your example, you still need to send the done signal so that peer_connection.close().await gets called; otherwise it will never call self.internal.ice_transport.stop().await, which is what cancels agent.dial/accept.

peer_connection
        .on_peer_connection_state_change(Box::new(move |s: RTCPeerConnectionState| {
            println!("Peer Connection State has changed: {}", s);
            if s == RTCPeerConnectionState::Failed {
                ...
                println!("Peer Connection has gone to failed exiting");
                // let _ = done_tx.try_send(());
            }
...
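
For completeness, a minimal sketch of that wiring, assuming a tokio mpsc channel like the done_tx/done_rx in the example above (error handling and the rest of main elided):

    let (done_tx, mut done_rx) = tokio::sync::mpsc::channel::<()>(1);

    peer_connection
        .on_peer_connection_state_change(Box::new(move |s: RTCPeerConnectionState| {
            println!("Peer Connection State has changed: {}", s);
            if s == RTCPeerConnectionState::Failed {
                // signal the main task instead of closing inside the callback
                let _ = done_tx.try_send(());
            }
            Box::pin(async {})
        }))
        .await;

    // Wait for the failure signal, then close the peer connection:
    // close() is what reaches ice_transport.stop() and cancels agent.dial/accept.
    let _ = done_rx.recv().await;
    peer_connection.close().await?;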

@rainliu
Member

rainliu commented Nov 15, 2021

By the way, in your 50-client case, does each client have its own peer_connection? If so, once one of the following callbacks reports Failed, you may need to close the peer_connection for that client.

peer_connection.on_ice_connection_state_change
or
peer_connection.on_peer_connection_state_change

@qbx2
Contributor Author

qbx2 commented Nov 15, 2021

In your example, you still need to send the done signal so that peer_connection.close().await gets called; otherwise it will never call self.internal.ice_transport.stop().await, which is what cancels agent.dial/accept.

You're right. The reason I commented that line out was to keep the process alive so that the leak could be confirmed. Thanks a lot for the advice; I am taking care to close the peer_connection now.

By the way, in your 50-client case, does each client have its own peer_connection? If so, once one of the following callbacks reports Failed, you may need to close the peer_connection for that client.

Yes, they do. However, the peer_connections are designed to be dropped when they fail to connect (I use Weak), so that should be fine, right?
Also, I've confirmed that e7d72d2 fixes this issue. Thank you so much.

@rainliu
Member

rainliu commented Nov 15, 2021

Thanks for confirming. I just released v0.3.1 to include this fix.

@rainliu rainliu closed this as completed Nov 15, 2021