x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely #33425
Comments
/cc @bradfitz @tombergan
Any other information I can provide to help speed up any investigation? I've seen some outages in production caused by this issue, and want to help get a fix out.
Is there anything we can contribute to help with this? Since a single bad backend can deadlock the whole transport, the impact is quite large for us.
Is this a dup of #32388?
It looks similar to #32388. I took the in-review change and applied it to my repro, and it didn't seem to help; presumably more changes are required.
The deadlock is caused by the stuck Write holding *ClientConn.wmu. And getting the wmu then blocks roundTrip while it holds *ClientConn.mu. So the blocked write ends up backing up through mu to the connection pool lock.
The mentioned change doesn't fix anything because it's incomplete. All of these problems are the same with no good solution, because eventually the write lock becomes the bottleneck that backs up to the mu, given the relationship that exists between them.
Yeah, I dug through the code and was unable to find any easy way to fix the issue. I think the main issue is that checking a connection's idleState() requires *ClientConn.mu, which can be held indefinitely behind a stuck write. In this scenario, it'd be nice to know if the lock was held, since we can skip any connection with the lock held (probably safe to assume it's not usable right now). To prototype this idea, I used a huge hack that lets me check whether a lock is held, and return a non-idle state for that connection.
With this change, my repro above passes every time. A less hacky version of my approach can be done using a channel with size 1 (a blocking write to the channel to lock, and a non-blocking write to try the lock).
@prashantv While your solution fixes idleState, it breaks everything else that relies on the read mutex.
Can you expand on what's broken by this? The only thing that's affected is calls to idleState(). In fact, I've run the HTTP2 tests multiple times with my change without issues.
Pushed a branch with my change if you want to test it. My change is still a hack, so it doesn't fully solve the problem.
Here is a simple issue that you have:
My mistake, I meant to set locked = 1 while holding the lock. Once that's flipped, apart from not fixing the issue 100%, are there other things that are broken? You mentioned "everything that relies on read mutex", and I'm not exactly sure what you meant by "read mutex". I want to understand whether the problem being solved here breaks any other guarantees. This uses a channel to implement a lock that supports a non-blocking try-lock:
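A minimal sketch of such a channel-backed lock, assuming the try-lock style API hinted at above (the names and usage below are illustrative, not taken from the actual patch):

```go
package main

import "fmt"

// chanMutex is a hedged reconstruction of the channel-based lock described
// above: a buffered channel of size 1 acts as the mutex, and a select with a
// default case provides a non-blocking TryLock.
type chanMutex struct {
	ch chan struct{}
}

func newChanMutex() *chanMutex {
	return &chanMutex{ch: make(chan struct{}, 1)}
}

// Lock blocks until the lock is acquired (the send succeeds only when the
// buffer is empty).
func (m *chanMutex) Lock() { m.ch <- struct{}{} }

// Unlock releases the lock by draining the buffer.
func (m *chanMutex) Unlock() { <-m.ch }

// TryLock reports whether the lock was acquired without blocking.
func (m *chanMutex) TryLock() bool {
	select {
	case m.ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	m := newChanMutex()
	fmt.Println(m.TryLock()) // true: lock acquired
	fmt.Println(m.TryLock()) // false: already held
	m.Unlock()
}
```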
And eventually the channel can be blocked. You can't just short-circuit idleStateLocked; it actually computes the answer. Returning an incorrect value will break other guarantees.
I assume the guarantee you mean is that we'll end up with more connections than required, since my change skips any connection whose lock is held. Looking at the change linked to the other issue, I'm not sure if there's a clear path to fixing the underlying locking issues. I can try to investigate that approach, but in the meantime, we're stuck with deadlocks. Is there a short-term workaround? E.g., write deadlines? A workaround (even if it triggers more connections than necessary) would be preferable to a deadlock in production.
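A sketch of what the write-deadline workaround could look like, assuming a wrapper connection installed through the transport's DialTLS hook; deadlineConn, newClientWithWriteDeadlines, and the 10-second value are illustrative, and this omits checks (such as ALPN verification) that the default dialer performs:

```go
package sketch

import (
	"crypto/tls"
	"net"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// deadlineConn wraps the connection used by the HTTP/2 transport so every
// Write carries a deadline, turning a peer that stops reading into a write
// error instead of an indefinite block.
type deadlineConn struct {
	net.Conn
	writeTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}

// newClientWithWriteDeadlines wires the wrapper into an http2.Transport via
// its DialTLS hook.
func newClientWithWriteDeadlines() *http.Client {
	t := &http2.Transport{
		DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
			conn, err := tls.Dial(network, addr, cfg)
			if err != nil {
				return nil, err
			}
			return &deadlineConn{Conn: conn, writeTimeout: 10 * time.Second}, nil
		},
	}
	return &http.Client{Transport: t}
}
```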
The only solution I can think of right now is to build your own client pool which tracks outstanding requests per client. Don't share the transport.
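A minimal sketch of that suggestion, simplified to one client per backend host; it does not track outstanding requests as the comment suggests, and names like perHostClients are illustrative:

```go
package sketch

import (
	"net/http"
	"sync"
)

// perHostClients keeps one *http.Client (and therefore one transport and
// connection pool) per backend host, so a stuck backend can only wedge its
// own pool rather than a shared one.
type perHostClients struct {
	mu      sync.Mutex
	clients map[string]*http.Client
}

func newPerHostClients() *perHostClients {
	return &perHostClients{clients: make(map[string]*http.Client)}
}

// client returns the client dedicated to host, creating it on first use.
func (p *perHostClients) client(host string) *http.Client {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.clients[host]; ok {
		return c
	}
	// Each backend gets its own transport, so its connection pool (and the
	// locks inside it) are never shared with other backends.
	c := &http.Client{Transport: &http.Transport{ForceAttemptHTTP2: true}}
	p.clients[host] = c
	return c
}
```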
Alternatively, how do we like the idea of a per-backend/address lock, instead of a shared lock for the entire client connection pool? A proof-of-concept: https://github.com/jaricftw/go-http2-stuck-transport/compare/client-conn-pool?expand=1, which ensures a single stuck backend won't block requests to other backends.
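In isolation, the per-address locking idea might look like the sketch below; this is a simplification for illustration, not the code in the linked proof-of-concept:

```go
package sketch

import "sync"

// perAddrLocks holds the pool-wide mutex only long enough to look up the
// per-address mutex, so a stuck backend only blocks callers for its own
// address.
type perAddrLocks struct {
	mu    sync.Mutex             // guards the map itself; held only briefly
	locks map[string]*sync.Mutex // one lock per backend address
}

func (p *perAddrLocks) forAddr(addr string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.locks == nil {
		p.locks = make(map[string]*sync.Mutex)
	}
	m, ok := p.locks[addr]
	if !ok {
		m = &sync.Mutex{}
		p.locks[addr] = m
	}
	return m
}
```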
Due to lots of issues with x/net/http2, as well as the bundled h2_bundle.go in the Go runtime, HTTP/2 should be avoided for now: golang/go#23559 golang/go#42534 golang/go#43989 golang/go#33425 golang/go#29246. With this collection of issues present, it makes sense to remove HTTP2 support for now.
I believe the problem of new requests being blocked by stuck network writes in old ones should have been fixed by CL 353390 and CL 349594. (And maybe a couple other CLs in that general range.)
A bug in ping checking could result in a permanently write-blocked connection never being flagged as dead. CL 354389 should have fixed that.
There may be remaining cases not covered by those CLs. It might also be nice to be able to detect a write-blocked connection independently of a ping timeout; that's #48830.
I think the cases described in this issue are now all either fixed or covered by more specific issues, so I'm going to close this one. Let me know if there's anything I'm missing.
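For reference, a hedged sketch of turning on the ping-based health checking referred to above, using the ReadIdleTimeout and PingTimeout fields on x/net/http2's Transport; the durations here are illustrative:

```go
package sketch

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingCheckedClient configures HTTP/2 health-check pings so a connection
// that stops responding is eventually marked dead and dropped from the pool.
func newPingCheckedClient() (*http.Client, error) {
	base := &http.Transport{ForceAttemptHTTP2: true}
	t2, err := http2.ConfigureTransports(base)
	if err != nil {
		return nil, err
	}
	// Send a ping if no frames have been received on the connection for this
	// long...
	t2.ReadIdleTimeout = 15 * time.Second
	// ...and close the connection if the ping is not answered within this
	// window.
	t2.PingTimeout = 5 * time.Second
	return &http.Client{Transport: base}, nil
}
```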
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Used the HTTP/2 library to make calls as a client. One of the backends stalled and stopped reading from the TCP connection (although it was still running and the TCP connection was active).
This caused the TCP write buffer on the client to fill up.
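A minimal sketch of such a stalled backend, assuming a plain TCP listener that accepts connections and then never reads from them; this is not the actual backend from the outage:

```go
package main

import (
	"log"
	"net"
)

// stuckBackend accepts TCP connections and never reads from them, so once
// the kernel buffers fill up, the client's Write blocks indefinitely while
// the TCP connection itself stays established.
func stuckBackend() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			// Intentionally never read from conn; just keep it open so data
			// backs up on the sender.
			_ = conn
		}
	}()
	return ln.Addr().String(), nil
}

func main() {
	addr, err := stuckBackend()
	if err != nil {
		log.Fatal(err)
	}
	log.Println("stuck backend listening on", addr)
	select {} // keep the process (and the accepted connections) alive
}
```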
What did you expect to see?
I expected that all future calls to the blocked server would eventually time out, and that calls to any other server would not be affected.
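For context, this is the kind of per-call deadline that was expected to kick in (standard net/http usage; the function name, URL, and timeout are placeholders). As described below, it never fires because the goroutine ends up blocked on a mutex rather than on I/O:

```go
package sketch

import (
	"bytes"
	"context"
	"net/http"
	"time"
)

// callWithTimeout issues a request bounded by a 5-second context deadline.
// In the deadlocked state described in this issue, the request never reaches
// a point where this deadline is honored.
func callWithTimeout(client *http.Client, url string, body []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}
```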
What did you see instead?
All calls on the transport are blocked forever, including calls to working backends. Stack traces show the calls are blocked on a mutex, so any timeouts are ignored.
Repro
Repro: https://github.com/prashantv/go-http2-stuck-repro/
The repro sets up 2 "backends":
- stuckURL: Accepts a TCP connection and does not read from the connection, to simulate a stuck backend.
- normal1URL: Echoes the request body back to the caller.
The sequence of events is:
1. A call is made to stuckURL with a payload that never completes. This causes the TCP write buffer to fill up, and the Write to block, while holding the *ClientConn.wmu (write lock on the connection). We sleep a second to make sure the TCP buffer fills up, and the write is blocked. (Stack trace)
2. Another call to stuckURL is made, which is blocked in roundTrip trying to get the *ClientConn.wmu (since (1) holds this lock), and with the *ClientConn.mu lock held. (Stack trace)
3. Another call to stuckURL grabs the connection pool lock, *clientConnPool.mu, then iterates over all connections to that address to check the idleState(), which tries to grab *ClientConn.mu. This blocks since (2) is already holding this lock. (Stack trace)
4. All other calls using the same *clientConnPool (e.g., any call through the same client/transport) block on trying to get *clientConnPool.mu. (Stack trace)
I've reduced the size of TCP read/write buffers to trigger the issue more quickly.
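One plausible way to shrink those buffers, assuming the repro does something along these lines (it may differ): SetWriteBuffer and SetReadBuffer on the dialed *net.TCPConn map to SO_SNDBUF/SO_RCVBUF.

```go
package sketch

import (
	"fmt"
	"net"
)

// dialWithSmallBuffers dials addr and shrinks the kernel TCP buffers so a
// non-reading peer causes Write to block quickly.
func dialWithSmallBuffers(addr string) (net.Conn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	if tcp, ok := conn.(*net.TCPConn); ok {
		if err := tcp.SetWriteBuffer(4 * 1024); err != nil {
			return nil, fmt.Errorf("set write buffer: %w", err)
		}
		if err := tcp.SetReadBuffer(4 * 1024); err != nil {
			return nil, fmt.Errorf("set read buffer: %w", err)
		}
	}
	return conn, nil
}
```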
The repro includes a flag for controlling steps 2/3. Setting --num-stuck-calls to 2 or higher will reproduce the issue (the echo calls will fail), while 0 or 1 will cause the test to pass.