Memory leaks #107

Closed
codomania opened this issue Mar 3, 2022 · 9 comments · Fixed by #152

Comments

@codomania

codomania commented Mar 3, 2022

Hi,

I noticed a serious memory leak while using the latest gvisor-tap-vsock master. My host ran out of memory after 2 minutes of iperf runs. I used pprof to root-cause the leak and found two issues:

  1. The UDP forwarder does not close its goroutines after the connection is closed. E.g., run iperf in a loop (with UDP) and the number of goroutines keeps growing.

Steps to reproduce
On Host:
iperf3 -s
On VM:
while true; do iperf3 -c 192.168.127.254 -t 1 -u; done

Open the pprof page and watch the number of goroutines; it keeps growing (see the pprof commands below).

  2. The memory used for the packet slice is leaked.

Steps to reproduce
On Host:
iperf3 -s
On VM: iperf3 -c 192.168.127.254 -t 180

On another host terminal, run "htop" and monitor the memory usage.
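For reference, both checks can also be done from the command line with go tool pprof, assuming gvproxy exposes a net/http/pprof endpoint on 127.0.0.1:6060 (the address is an assumption; adjust it to however pprof was enabled in your build):

go tool pprof http://127.0.0.1:6060/debug/pprof/goroutine
go tool pprof -inuse_space http://127.0.0.1:6060/debug/pprof/heap

The first command shows the growing goroutine count; the second captures an inuse_space heap profile like the one quoted below.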

The output from pprof indicates that the leak is coming from https://github.com/containers/gvisor-tap-vsock/blob/main/pkg/tap/switch.go#L264.

Stat from profile:

File: gvproxy
Type: inuse_space
Time: Mar 2, 2022 at 8:44pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 459.35MB, 99.65% of 460.94MB total
Dropped 15 nodes (cum <= 2.30MB)
Showing top 10 nodes out of 38
      flat  flat%   sum%        cum   cum%
  299.94MB 65.07% 65.07%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rxStream
  152.13MB 33.00% 98.07%   152.13MB 33.00%  gvisor.dev/gvisor/pkg/tcpip/stack.glob..func1
    4.78MB  1.04% 99.11%     4.78MB  1.04%  gvisor.dev/gvisor/pkg/refsvfs2.Register
    2.50MB  0.54% 99.65%     2.50MB  0.54%  runtime.allocm
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).Accept
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rx
         0     0% 99.65%   109.09MB 23.67%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rxBuf
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/virtualnetwork.(*VirtualNetwork).Mux.func4
         0     0% 99.65%        5MB  1.09%  github.com/google/tcpproxy.proxyCopy
         0     0% 99.65%        5MB  1.09%  gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.(*TCPConn).Read

Any idea how to fix the leak? I have not looked at gvisor in depth yet; any pointers are appreciated.

@guillaumerose
Contributor

  1. The UDP forwarder uses connection tracking. It keeps the forwarding open for 90s. If you run your loop and then wait 90s, you should see the goroutine count come back to normal (this is what I see on my computer at least).

  2. Yes, there is something here. Maybe we can try to use a pool of buffers like they do in stack.NewPacketBuffer (see the sketch below).
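A minimal sketch of that buffer-pool idea using sync.Pool (generic Go for illustration only; the names are made up and this is not the actual gvisor pool):

package tapsketch

import "sync"

// bufPool hands out reusable packet buffers instead of allocating a fresh
// slice for every frame read from the stream.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 65536) },
}

// handleFrame borrows a buffer, lets the caller fill and consume it, then
// returns it to the pool so the next frame can reuse the same memory.
func handleFrame(process func([]byte)) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	process(buf)
}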

@codomania
Author

codomania commented Mar 4, 2022

  1. Ah, I didn't realize there was a timeout. You are right that the goroutines disappear after 90s.
  2. I have tried this patch, which basically decrements the reference count after queuing the packet. I see that in the gvisor code they decrement the ref after allocating a new packet, so I tried the same. With this I am no longer seeing the leak. Does this make sense? I can submit a patch.
diff --git a/pkg/tap/switch.go b/pkg/tap/switch.go
index 39821ac..dbcd040 100644
--- a/pkg/tap/switch.go
+++ b/pkg/tap/switch.go
@@ -294,14 +294,16 @@ func (e *Switch) rxBuf(ctx context.Context, id int, buf []byte) {
        }
        if eth.DestinationAddress() == e.gateway.LinkAddress() || eth.DestinationAddress() == header.EthernetBroadcastAddress {
                vv.TrimFront(header.EthernetMinimumSize)
+               pkt := stack.NewPacketBuffer(stack.PacketBufferOptions{
+                       Data: vv,
+               })
                e.gateway.DeliverNetworkPacket(
                        eth.SourceAddress(),
                        eth.DestinationAddress(),
                        eth.Type(),
-                       stack.NewPacketBuffer(stack.PacketBufferOptions{
-                               Data: vv,
-                       }),
+                       pkt,
                )
+               pkt.DecRef()
        }

        atomic.AddUint64(&e.Received, uint64(len(buf)))

@dimalinux

> The UDP forwarder uses connection tracking. It keeps the forwarding open for 90s. If you run your loop and then wait 90s, you should see the goroutine count come back to normal (this is what I see on my computer at least).

Is the connection tracker protocol-aware? It is common, for example, for UDP syslog messages to be sent at a constant, high rate. If the code is waiting for responses (which never come with syslog), it will create new goroutines faster than the old ones can time out.

@guillaumerose
Contributor

@codomania it looks good, thanks!

@dimalinux I took the same code as dockerd; I bet it is OK. You can take a look at the code: https://github.com/containers/gvisor-tap-vsock/blob/main/pkg/services/forwarder/udp_proxy.go (taken from https://github.com/moby/moby/blob/master/cmd/docker-proxy/udp_proxy.go). The general shape of the pattern is sketched below.
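For anyone curious, the connection-tracking pattern looks roughly like this (an illustrative sketch with made-up names, not the actual udp_proxy.go code): one goroutine per tracked client address, torn down after the idle timeout so the goroutine count shrinks again.

package forwardersketch

import (
	"net"
	"sync"
	"time"
)

const connTrackTimeout = 90 * time.Second // the 90s idle timeout mentioned above

type udpProxy struct {
	mu    sync.Mutex
	conns map[string]*net.UDPConn // backend connections keyed by client address
}

// replyLoop relays backend replies to one tracked client until the backend
// connection has been idle for connTrackTimeout, then tears the entry down.
func (p *udpProxy) replyLoop(backend *net.UDPConn, client net.Addr, listener net.PacketConn) {
	defer func() {
		p.mu.Lock()
		delete(p.conns, client.String())
		p.mu.Unlock()
		backend.Close()
	}()
	buf := make([]byte, 65535)
	for {
		// Every reply refreshes the deadline; 90s of silence ends the loop,
		// so idle flows do not keep goroutines alive forever.
		if err := backend.SetReadDeadline(time.Now().Add(connTrackTimeout)); err != nil {
			return
		}
		n, err := backend.Read(buf)
		if err != nil {
			return
		}
		if _, err := listener.WriteTo(buf[:n], client); err != nil {
			return
		}
	}
}

In this pattern, goroutines are created per client address rather than per packet, so one-way traffic like syslog from a fixed set of senders should not grow the count without bound.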

@MarcBridner

This fix appears to be working and takes the VM from several gigabytes of memory to less than 20 MB. Is there anything that prevents this fix from being applied, so that people don't have to patch the source manually?

stevenmhood pushed a commit to stevenmhood/gvisor-tap-vsock that referenced this issue Oct 24, 2022
Decrement reference count after queueing the PacketBuffer.  Since
PacketBuffer objects come from a pool, this allows them to be returned
to the pool instead of staying in memory and forcing new allocations.

Fixes containers#107
stevenmhood pushed a commit to stevenmhood/gvisor-tap-vsock that referenced this issue Oct 25, 2022
Decrement reference count after queueing the PacketBuffer.  Since
PacketBuffer objects come from a pool, this allows them to be returned
to the pool instead of staying in memory and forcing new allocations.

Fixes containers#107

Signed-off-by: stevenmhood <[email protected]>
@sc68cal

sc68cal commented Oct 25, 2022

I just had to kill off gvproxy; it was consuming 30 GB of RAM on my M1 Mac.

cfergeau pushed a commit that referenced this issue Nov 3, 2022
Decrement reference count after queueing the PacketBuffer.  Since
PacketBuffer objects come from a pool, this allows them to be returned
to the pool instead of staying in memory and forcing new allocations.

Fixes #107

Signed-off-by: stevenmhood <[email protected]>
urish added a commit to wokwi/wokwigw that referenced this issue Nov 4, 2022
@gam-phon

@sc68cal Me too: on my M1 Ultra it consumes 38 GB of real memory and around 50 GB of memory overall.

@gam-phon

I think the memory leak still exists.

@cfergeau
Collaborator

@gam-phon can you open a new issue with more details about your usage of gvisor-tap-vsock?
