-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
internal/poll: deadlock in Read on arm64 when an FD is closed [1.17 backport] #50611
Comments
I'd feel better about a backport if we had more data showing that the crashes really have stopped. |
Unfortunately the crashes only reproduced on a very small number of builders, and even then not in a way we could reproduce easily outside of the builder environment, which can be arbitrarily different from even a Perhaps we should revisit a backport after the tree reopens for 1.19 development, so that we can have a larger body of test runs to compare? |
Approved. This is a serious issue with no workaround. |
https://go.dev/wiki/MinorRelease states:
Both this issue and #50610 are now labeled CherryPickApproved. @rsc You're the original author of CL 378234 that fixed #45211, so please feel free to mail cherry-pick CLs when you're ready to proceed here. Thanks. |
This backport is at risk of missing the next minor release. |
Change https://go.dev/cl/392714 mentions this issue: |
I'm not sure whether disabling So if falling back to older atomic instructions is a valid way to run, then it looks to me like that commit really does stop the crashes (at least by a factor of 50, from ~3% failure rate to 0 out of 2000). Update: I ran the reproducer 45k times, with no netpoll-related deadlocks. (The single failure I saw was reported as a normal application-layer TCP timeout.) |
Closed by merging 4e69fdd to release-branch.go1.17. |
The netpoll code was written long ago, when the only multiprocessors that Go ran on were x86. It assumed that an atomic store would trigger a full memory barrier and then used that barrier to order otherwise racy access to a handful of fields, including pollDesc.closing. On ARM64, this code has finally failed, because the atomic store is on a value completely unrelated to any of the racily-accessed fields, and the ARMv8 hardware, unlike x86, is clever enough not to do a full memory barrier for a simple atomic store. We are seeing a constant background rate of trybot failures where the net/http tests deadlock - a netpollblock has clearly happened after the pollDesc has begun to close. The code that does the racy reads is netpollcheckerr, which needs to be able to run without acquiring a lock. This CL fixes the race, without introducing unnecessary inefficiency or deadlock, by arranging for every updater of the relevant fields to publish a summary as a single atomic uint32, and then having netpollcheckerr use a single atomic load to fetch the relevant bits and then proceed as before. For #45211 Fixes #50611 Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a Reviewed-on: https://go-review.googlesource.com/c/go/+/378234 Trust: Russ Cox <[email protected]> Run-TryBot: Russ Cox <[email protected]> Trust: Bryan Mills <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> (cherry picked from commit 17b2fb1) Reviewed-on: https://go-review.googlesource.com/c/go/+/392714 Trust: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Emmanuel Odeke <[email protected]>
@bcmills requested issue #45211 to be considered for backport to the next 1.17 minor release.
The text was updated successfully, but these errors were encountered: