compute/metadata: metadata.OnGCE() leaks 2 http goroutines #5430
Thanks for the report. I will take a look and see if I can reproduce.
Looking at the panic trace you provided, I believe this is intentional behavior in the HTTP transport layer of Go. I see. I am fine making this change for this check, but the default client will remain for other metadata operations, and this type of issue could still occur.
Initialize a resolver and http.Client locally in the function to force the GC to clean up background resources such as cached connections, which can otherwise falsely report as leaked goroutines. This kind of issue will still persist for calling methods like ProjectID, but we are intentionally caching a client and its connections for those occasions today. If this ends up causing users issues in the future we can reevaluate at that time. Fixes: googleapis#5430
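For illustration, here is a minimal sketch of that approach, assuming a one-shot probe where the resolver and client are scoped to the function so nothing outlives the call; the package name, function name, endpoints, and timeouts are placeholders rather than the library's actual identifiers:

```go
package metadataprobe

import (
	"context"
	"net"
	"net/http"
	"time"
)

// isOnGCE is a one-shot probe. The resolver and client are declared inside
// the function, so once it returns nothing references them and the GC can
// reclaim any cached connections.
func isOnGCE() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resolver := &net.Resolver{}
	client := &http.Client{Timeout: 2 * time.Second}

	// DNS probe for the metadata hostname.
	if addrs, err := resolver.LookupHost(ctx, "metadata.google.internal"); err == nil && len(addrs) > 0 {
		return true
	}

	// HTTP probe against the well-known metadata IP.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://169.254.169.254", nil)
	if err != nil {
		return false
	}
	req.Header.Set("Metadata-Flavor", "Google")
	res, err := client.Do(req)
	if err != nil {
		return false
	}
	defer res.Body.Close()
	return res.Header.Get("Metadata-Flavor") == "Google"
}
```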
Thanks for the explanation. But it seems the fix didn't work.
We already marked these goroutines as expected in our leakcheck package. We will be OK if this is an intended behavior. Another question is, why is this flaky?
Hmm, yeah, this is definitely from HTTP internals. We do launch two goroutines in this method, but we do so with a deferred cancel context, so those are not the issue. The flakiness might be because the GC has not run before leakcheck runs, so the HTTP conns are still open? This is just a guess; someone who knows the HTTP package better than I do would need to chime in. But I have seen something like this in the past and this was the cause. Thanks for the report either way, sorry the change did not solve the issue for y'all.
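To make that pattern concrete, here is a hedged sketch of racing two probe goroutines under a shared context whose cancel is deferred; the function and parameter names (onGCEProbe, tryDNS, tryHTTP) are illustrative, not the package's actual identifiers:

```go
package metadataprobe

import "context"

// onGCEProbe races a DNS probe against an HTTP probe. The cancel is deferred,
// so whichever probe has not finished when the function returns is signalled
// to stop via its context rather than being abandoned.
func onGCEProbe(tryDNS, tryHTTP func(context.Context) bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	resc := make(chan bool, 2) // buffered so neither goroutine blocks forever
	go func() { resc <- tryDNS(ctx) }()
	go func() { resc <- tryHTTP(ctx) }()

	// First positive answer wins; otherwise wait for both to report.
	for i := 0; i < 2; i++ {
		if <-resc {
			return true
		}
	}
	return false
}
```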
IIUC, GC will never clean up your client, because it is a global variable, so it is always referenced. The change referenced still causes a global client to be created at init time. I think there are three options:
I would not have guessed the global client is the culprit after the change was made, as it is not referenced in this method. But maybe the http package launches some goroutines when the client is created, so the global client is still leaking goroutines?! I think, since this is not a real leak, we will go with #3 for now unless this becomes more of an issue.
Hmm, no, I think you're right. I saw "dial" written so many times in its initialization that I assumed it was doing something, but it looks like nothing is actually dialing a connection due to the global's initialization. In this case, I'm not sure. If nothing is using the global client for some reason, then you may be right that it's the local clients leaking instead. Our leak checker waits 10 seconds before failing, and I'm not sure whether there's a periodic GC sweep that should come along by then and clean things up, or if…
Update on this:
I will try to look into this issue some more. |
I still believe this to be related to keepalives. I did some more testing today, and removing keepalives altogether makes it so leakcheck is not tripped. I don't see a bug in our code though, which makes me believe there is something else possibly at play here in the internals. Still going to investigate some more to make sure there is nothing we can do.
Note: the only time this seems to happen is when the DNS lookup fails first and then the HTTP request also fails due to a context cancel. It seems that when code execution gives that goroutine enough time to process the cancelation, this error occurs and resources are leaked. I locally added some logging for when either the DNS lookup or the HTTP request fails. All passing test cases showed the DNS lookup failing and execution ending. All failing cases showed the DNS lookup failing and the subsequent HTTP request failing due to a canceled context. It seems like the backing connection is leaked in these cases. If I had to guess, it is because another request on that transport is never made to clean up the underlying transport connection, so the keepalive timeout is not enforced for the living connection.
To avoid an issue with the Google Go SDK leaking goroutines we need to ensure it has a proper credential file, which avoids it looking up metadata from the GCE metadata endpoints. See: googleapis/google-cloud-go#5430
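For context, a minimal sketch of that workaround, assuming the Cloud Storage client as an example and a placeholder key-file path; supplying explicit credentials keeps Application Default Credentials from needing to probe the metadata server, though whether a given client still touches the metadata endpoints for other reasons depends on the client:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Alternatively: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
	client, err := storage.NewClient(ctx,
		option.WithCredentialsFile("/path/to/service-account.json"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```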
Any update on this? Can we just have OnGCE be properly canceled/closed so as to avoid a leak? It could reuse the same context that the client has and use a waitgroup/channel to wait for the check to stop. In the worst case, an env var or option could be passed to skip this check, given there's a…
@codyoss retest this
This helps clean up idle connections held by the keep-alive. The default transport sets this to 90 seconds, but since we declare our own, the default zero value means they are never closed out if they are under the threshold for connections to keep alive (2). Fixes: googleapis#5430
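In other words (a sketch, not the library's actual constructor): set IdleConnTimeout explicitly on the hand-rolled Transport, matching the 90-second value used by http.DefaultTransport, so idle keep-alive connections are eventually reaped even when they stay under the default MaxIdleConnsPerHost of 2.

```go
package metadataprobe

import (
	"net/http"
	"time"
)

// newClient builds a client whose custom Transport sets IdleConnTimeout
// explicitly. http.DefaultTransport uses 90 seconds; a zero value on a
// hand-rolled Transport means idle keep-alive connections are never closed
// as long as they stay under DefaultMaxIdleConnsPerHost (2).
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			Proxy:           http.ProxyFromEnvironment,
			IdleConnTimeout: 90 * time.Second,
		},
	}
}
```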
@codyoss that doesn't feel like a sufficient solution for leaks when it comes to goroutine leak testing. How do you anticipate we get around this?
The "leak" was an idle conn from my testing. So the fix does make sure that eventually gets cleaned up. The original issue opened here I believe had a two minute wait for the leak check which this does get cleaned up before. If you have a tighter timeout than the underlying connection cleanup maybe it is worth skipping this case in your test. I believe there is another open requesting the envvar you were asking for, but it has not gotten much traction since opening: #4920 |
Our wait time is less than a second and there's no case to skip for us. This GCE code is run no matter what environment you're on unfortunately. |
Client
N/A
Environment
This leak was detected in our GitHub Actions CI. I only managed to reproduce it on GitHub Actions; it never leaked on my local machine.
Go Environment
$ go version
go version go1.17.6 linux/amd64
$ go env
Code
https://github.com/menghanl/cloud-compute-metadata-leak
Additional context
This was detected in our GitHub Actions CI: grpc/grpc-go#5171
At first we thought it was caused by other dependencies of gRPC, but after many attempts we finally managed to reproduce it with just compute/metadata. It's flaky (1 failure in ~100 runs).
One failed run: https://github.com/menghanl/cloud-compute-metadata-leak/runs/5043476328?check_suite_focus=true