compute/metadata: metadata.OnGCE() leaks 2 http goroutines #5430
Thanks for the report. I will take a look and see if I can reproduce.
Looking at the panic trace you provided, I believe this is intentional behavior in the HTTP transport layer of Go. I see. I am fine making this change for this check, but the default client will remain for other metadata operations, and this type of issue could still occur.
Initialize a resolver and http.Client locally in the function to force the GC to clean up background resources such as cached connections, which can otherwise falsely report as leaked goroutines. This kind of issue will still persist for calling methods like ProjectID, but we are intentionally caching a client and its connections for those occasions today. If this ends up causing users issues in the future we can reevaluate at that time. Fixes: googleapis#5430
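For illustration, here is a minimal sketch of that approach, assuming a one-shot probe where the resolver and client are scoped to the function so nothing outlives the call; the package name, function name, endpoints, and timeouts are placeholders rather than the library's actual identifiers:

```go
package metadataprobe

import (
	"context"
	"net"
	"net/http"
	"time"
)

// isOnGCE is a one-shot probe. The resolver and client are declared inside
// the function, so once it returns nothing references them and the GC can
// reclaim any cached connections.
func isOnGCE() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resolver := &net.Resolver{}
	client := &http.Client{Timeout: 2 * time.Second}

	// DNS probe for the metadata hostname.
	if addrs, err := resolver.LookupHost(ctx, "metadata.google.internal"); err == nil && len(addrs) > 0 {
		return true
	}

	// HTTP probe against the well-known metadata IP.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://169.254.169.254", nil)
	if err != nil {
		return false
	}
	req.Header.Set("Metadata-Flavor", "Google")
	res, err := client.Do(req)
	if err != nil {
		return false
	}
	defer res.Body.Close()
	return res.Header.Get("Metadata-Flavor") == "Google"
}
```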
Thanks for the explanation. But it seems the fix didn't work.
We already marked these goroutines as expected in our leakcheck package. We will be OK if this is an intended behavior. Another question is, why is this flaky?
Hmm, yeah, this is definitely from HTTP internals. We do launch two goroutines in this method, but we do so with a deferred cancel context, so those are not the issue. The flakiness might be because the GC has not run before leakcheck runs, so the HTTP conns are still open? This is just a guess; someone who knows the HTTP package better than I do would need to chime in. But I have seen something like this in the past and this was the cause. Thanks for the report either way, sorry the change did not solve the issue for y'all.
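To make that pattern concrete, here is a hedged sketch of racing two probe goroutines under a shared context whose cancel is deferred; the function and parameter names (onGCEProbe, tryDNS, tryHTTP) are illustrative, not the package's actual identifiers:

```go
package metadataprobe

import "context"

// onGCEProbe races a DNS probe against an HTTP probe. The cancel is deferred,
// so whichever probe has not finished when the function returns is signalled
// to stop via its context rather than being abandoned.
func onGCEProbe(tryDNS, tryHTTP func(context.Context) bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	resc := make(chan bool, 2) // buffered so neither goroutine blocks forever
	go func() { resc <- tryDNS(ctx) }()
	go func() { resc <- tryHTTP(ctx) }()

	// First positive answer wins; otherwise wait for both to report.
	for i := 0; i < 2; i++ {
		if <-resc {
			return true
		}
	}
	return false
}
```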
IIUC, GC will never clean up your client, because it is a global variable, so it is always referenced. The change referenced still causes a global client to be created at init time. I think there are three options:
I would not have guessed the global client is the culprit after the change was made, as it is not referenced in this method. But maybe the http package launches some goroutines when the client is created, so the global client is still leaking goroutines?! I think, since this is not a real leak, we will go with #3 for now unless this becomes more of an issue.
Hmm, no, I think you're right. I saw "dial" written so many times in its initialization that I assumed it was doing something, but it looks like nothing is actually dialing a connection due to the global's initialization. In this case, I'm not sure. If nothing is using the global client for some reason, then you may be right that it's the local clients leaking instead. Our leak checker waits 10 seconds before failing, and I'm not sure whether there's a periodic GC sweep that should come along by then and clean things up, or if…
Update on this:
I will try to look into this issue some more. |
I still believe this to be related to keepalives. I did some more testing today, and removing keepalives altogether makes it so leakcheck is not tripped. I don't see a bug in our code though, which makes me believe there is something else possibly at play here in the internals. Still going to investigate some more to make sure there is nothing we can do.
Note: the only time this seems to happen is when the DNS lookup fails first and then the HTTP request also fails due to a context cancel. It seems that when code execution gives that goroutine enough time to process the cancelation, this error occurs and resources are leaked. I locally added some logging for when either the DNS lookup or the HTTP request fails. All passing test cases showed the DNS lookup failing and execution ending. All failing cases showed the DNS lookup failing and the subsequent HTTP request failing due to a canceled context. It seems like the backing connection is leaked in these cases. If I had to guess, it is because another request on that transport is never made to clean up the underlying transport connection, so the keepalive timeout is not enforced for the living connection.
To avoid an issue with the Google Go SDK leaking goroutines we need to ensure it has a proper credential file, which avoids it looking up metadata from the GCE metadata endpoints. See: googleapis/google-cloud-go#5430
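For context, a minimal sketch of that workaround, assuming the Cloud Storage client as an example and a placeholder key-file path; supplying explicit credentials keeps Application Default Credentials from needing to probe the metadata server, though whether a given client still touches the metadata endpoints for other reasons depends on the client:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Alternatively: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
	client, err := storage.NewClient(ctx,
		option.WithCredentialsFile("/path/to/service-account.json"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```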
Any update on this? Can we just have OnGCE be properly canceled/closed so as to avoid a leak? It could reuse the same context that the client has and use a waitgroup/channel to wait for the check to stop. In the worst case, an env var or option could be passed to skip this check, given there's a…
@codyoss retest this
This helps clean up idle connections held by the keep-alive. The default transport sets this to 90 seconds, but since we declare our own, the default zero value means they are never closed out if they are under the threshold for connections to keep alive (2). Fixes: googleapis#5430
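In other words (a sketch, not the library's actual constructor): set IdleConnTimeout explicitly on the hand-rolled Transport, matching the 90-second value used by http.DefaultTransport, so idle keep-alive connections are eventually reaped even when they stay under the default MaxIdleConnsPerHost of 2.

```go
package metadataprobe

import (
	"net/http"
	"time"
)

// newClient builds a client whose custom Transport sets IdleConnTimeout
// explicitly. http.DefaultTransport uses 90 seconds; a zero value on a
// hand-rolled Transport means idle keep-alive connections are never closed
// as long as they stay under DefaultMaxIdleConnsPerHost (2).
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			Proxy:           http.ProxyFromEnvironment,
			IdleConnTimeout: 90 * time.Second,
		},
	}
}
```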
@codyoss that doesn't feel like a sufficient solution for leaks when it comes to goroutine leak testing. How do you anticipate we get around this?
The "leak" was an idle conn from my testing. So the fix does make sure that eventually gets cleaned up. The original issue opened here I believe had a two minute wait for the leak check which this does get cleaned up before. If you have a tighter timeout than the underlying connection cleanup maybe it is worth skipping this case in your test. I believe there is another open requesting the envvar you were asking for, but it has not gotten much traction since opening: #4920 |
Our wait time is less than a second and there's no case to skip for us. This GCE code is run no matter what environment you're on unfortunately. |
Client
N/A
Environment
This leak was detected in our GitHub Actions CI. I only managed to reproduce it on GitHub Actions; it never leaked on my local machine.
Go Environment
$ go version
go version go1.17.6 linux/amd64
$ go env
Code
https://github.com/menghanl/cloud-compute-metadata-leak
Additional context
This was detected in our GitHub Actions CI: grpc/grpc-go#5171
At first we thought it was caused by other dependencies of gRPC, but after many attempts we finally managed to reproduce it with just compute/metadata. It's flaky (1 failure in ~100 runs).
One failed run: https://github.com/menghanl/cloud-compute-metadata-leak/runs/5043476328?check_suite_focus=true