
[internal/k8stest] Leaking goroutine when creating and deleting objects #31047

Closed

crobert-1 opened this issue Feb 5, 2024 · 9 comments

Component(s)

internal/k8stest

Describe the issue you're reporting

Bug
I've attempted to add goleak checks to multiple k8s components, but so far they all fail on tests that use the CreateObject and DeleteObject functionality in internal/k8stest.
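For reference, the goleak checks being added are roughly of the following shape: a package-level TestMain that fails the run if any goroutines are still alive after the tests finish. This is a minimal sketch, not the exact wiring from the PRs linked below.

```go
package k8stest_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain runs the package's tests and then has goleak snapshot the
// remaining goroutines, failing the run if any unexpected ones are still alive.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```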

Relevant PRs:
#30842
#30898

Investigation

  1. I've included the stack trace of the leaks in this comment; I won't paste it here since it's a lot of text. As shown there, all leaking goroutines were started in the k8stest.CreateObject method.
  2. I've attempted to move all deletions to the Foreground propagation policy (see the sketch after this list).
  3. I've attempted to add a grace period of 0 seconds so that deletions happen instantly.
  4. I've attempted to confirm that all deleted resources are no longer found.

None of these attempts resolved the leaking goroutines.
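For clarity, items 2 and 3 above correspond to a delete call along these lines. This is a minimal hypothetical sketch using the dynamic client directly; the helper name and parameters are placeholders, not the actual k8stest code.

```go
package k8stest

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// deleteObject (hypothetical) deletes an object with foreground propagation
// and a zero-second grace period, the two options tried above.
func deleteObject(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) error {
	policy := metav1.DeletePropagationForeground // remove dependents before the owner is deleted
	grace := int64(0)                            // don't wait out a termination grace period
	return client.Resource(gvr).Namespace(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy:  &policy,
		GracePeriodSeconds: &grace,
	})
}
```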

crobert-1 added the bug, needs triage, and internal/k8stest labels on Feb 5, 2024

github-actions bot commented Feb 5, 2024

Pinging code owners for internal/k8stest: @crobert-1. See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented

@dmitryax @jinja2: I see you've both also worked with these tests, so I'm adding you for awareness and in case you have any insights.

crobert-1 commented

More investigation:

  1. Used a single context in all k8s test functions, and added a cancel call in shutdown.
  2. Attempted to order deletions according to Helm's UninstallOrder.

Neither of these worked. I believe I've found the issue, though: this appears to be another occurrence of kubernetes/kubernetes#109289. Implementing the workaround from that issue solves the leak.
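I haven't confirmed the exact shape of the workaround in the code, but it generally amounts to building the test clients on top of an explicit *http.Client so that shutdown can drop its idle connections. A minimal sketch under that assumption (the helper name is made up):

```go
package k8stest

import (
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// newDynamicClient (hypothetical) builds the dynamic client on an explicit
// *http.Client so teardown can call CloseIdleConnections and let the
// transport's readLoop/writeLoop goroutines exit.
func newDynamicClient(cfg *rest.Config) (dynamic.Interface, func(), error) {
	httpClient, err := rest.HTTPClientFor(cfg)
	if err != nil {
		return nil, nil, err
	}
	client, err := dynamic.NewForConfigAndClient(cfg, httpClient)
	if err != nil {
		return nil, nil, err
	}
	shutdown := func() { httpClient.CloseIdleConnections() }
	return client, shutdown, nil
}
```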

crobert-1 commented Mar 26, 2024

Looks like a slightly different leak is occurring now, but only in GitHub CI. I was able to reproduce the original goleak failure locally, whereas the current leak only shows up in CI.

goleak: Errors on successful test run: found unexpected goroutines:
[Goroutine 12 in state IO wait, 1 minutes, with internal/poll.runtime_pollWait on top of the stack:
internal/poll.runtime_pollWait(0x7f9d292b0c68, 0x72)
	/opt/hostedtoolcache/go/1.21.8/x64/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc0005e4880?, 0xc000575000?, 0x0)
	/opt/hostedtoolcache/go/1.21.8/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/opt/hostedtoolcache/go/1.21.8/x64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0005e4880, {0xc000575000, 0x1000, 0x1000})
	/opt/hostedtoolcache/go/1.21.8/x64/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0005e4880, {0xc000575000?, 0x79d7e0?, 0xc0002fe640?})
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc00007a3c8, {0xc000575000?, 0x0?, 0xc000417860?})
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/net.go:179 +0x45
net/http.(*persistConn).Read(0xc0005e3560, {0xc000575000?, 0xc00045dd40?, 0xc000092d38?})
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/http/transport.go:1954 +0x4a
bufio.(*Reader).fill(0xc00007f500)
	/opt/hostedtoolcache/go/1.21.8/x64/src/bufio/bufio.go:113 +0x103
bufio.(*Reader).Peek(0xc00007f500, 0x1)
	/opt/hostedtoolcache/go/1.21.8/x64/src/bufio/bufio.go:151 +0x53
net/http.(*persistConn).readLoop(0xc0005e3560)
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/http/transport.go:2118 +0x1b9
created by net/http.(*Transport).dialConn in goroutine 11
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/http/transport.go:1776 +0x169f
 Goroutine 13 in state select, 1 minutes, with net/http.(*persistConn).writeLoop on top of the stack:
net/http.(*persistConn).writeLoop(0xc0005e3560)
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/http/transport.go:2421 +0xe5
created by net/http.(*Transport).dialConn in goroutine 11
	/opt/hostedtoolcache/go/1.21.8/x64/src/net/http/transport.go:1777 +0x16f1
]

The only call stacks reaching this code come from internal/k8stest: one through create's call to the RESTMapping functionality, the other through delete's resource.Delete call.

This is still failing even though the context being passed in has a cancel func that's called in shutdown. From what I can tell, though, the context's cancel only applies to the delete call stack, not the RESTMapping call. I'm testing whether removing the delete call makes CI pass, to confirm the leak is coming from the RESTMapping functionality.

Update: Removing Delete doesn't fix the leak, so I believe the leak is caused by the RESTMapping call.
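For reference, a REST-mapping lookup usually has the following shape (a sketch, not necessarily the exact code in k8stest): the discovery client lists the server's API group resources over HTTP, which is where the persistConn read/write goroutines in the trace above are spawned, and the mapper then resolves a kind to a resource.

```go
package k8stest

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

// restMappingFor (illustrative) resolves a GroupVersionKind to its REST
// mapping. GetAPIGroupResources issues discovery requests against the API
// server, so this is the step that opens the HTTP connections.
func restMappingFor(dc discovery.DiscoveryInterface, gvk schema.GroupVersionKind) (*meta.RESTMapping, error) {
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return nil, err
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	return mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
}
```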

crobert-1 commented

I realized the discovery client used by the internal k8s client also needed to use the workaround, along with closing idle connections in shutdown, but this didn't resolve the issue either.

It looks like the REST mapping call goes through the discovery client, and through its HTTP client when making the request, but closing idle connections does not resolve the leak. From what I can tell, the leak is somewhere in that path.
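A sketch of what the attempt above roughly looks like (assumed helper, not the actual patch): the discovery client is built on the same explicit *http.Client, and shutdown closes that client's idle connections.

```go
package k8stest

import (
	"net/http"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// newDiscoveryClient (hypothetical) reuses the shared *http.Client so a single
// CloseIdleConnections call in shutdown covers discovery requests as well.
func newDiscoveryClient(cfg *rest.Config, httpClient *http.Client) (*discovery.DiscoveryClient, func(), error) {
	dc, err := discovery.NewDiscoveryClientForConfigAndClient(cfg, httpClient)
	if err != nil {
		return nil, nil, err
	}
	shutdown := func() { httpClient.CloseIdleConnections() }
	return dc, shutdown, nil
}
```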

crobert-1 commented

Also, the change that passes in a context with a cancel hasn't changed the resulting goleak output, so I don't believe it's necessary.

github-actions bot commented

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on May 27, 2024
crobert-1 removed the Stale label on May 28, 2024

github-actions bot commented

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 27, 2024