-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: forEachP not done and stopTheWorld failing on aix/ppc64 #30189
Comments
I've added an atomic.Load on |
Change https://golang.org/cl/163624 mentions this issue: |
For some reason I don't always see these notifications so I'm sorry I missed this. We have not seen these particular errors on Linux. Carlos told me the memory model should be the same between what's used in AIX and Linux. However, there could be differences in the kernels that might possibly affect this. Carlos will ask around and see what he can find out. |
This error reminds me of some of the errors I've seen when the value in r2 is wrong. In particular, the fix found here, which I see was later #ifdef'ed out on AIX https://go-review.googlesource.com/c/go/+/117515/. In this case the addressing for sched depends on the value of r2, so that would lead to accessing the wrong storage if r2 was bad. I don't know why adding an atomic.Load would affect the results, unless it affects how goroutine switches happen. Not sure how likely but just thought I would mention it. The way r2 is managed on AIX is different than Linux. |
Here's one in https://build.golang.org/log/9db739f8bcaac0d30280c8a9c438850ec1915c46.
|
Thanks @bcmills. I haven't updated this issue for a while as a lot of the conversation occurs in the https://golang.org/cl/163624. The main problem is that notesleep/notewakeup don't always form an happens-before edge on AIX. This means that even if a value was updated in thread A and notewakeup called afterwards, this value might not be up to date in thread B... There are some possible workarounds, mainly by performing atomic operations on thread B: it will synchronize its memory. Note: acquirep failures (https://build.golang.org/log/b823c8f306e3544b16e2aa231912b71d5cbfb966) are also linked with this issue. |
This CL changes ppc64 atomic compare-and-swap (cas). Before this CL, if the cas failed--if the value in memory was not the value expected by the cas call--the atomic function would not synchronize memory. In the note code in runtime/lock_sema.go, used on BSD systems, notesleep and notetsleep first try a cas on the key. If that cas fails, something has already called notewakeup, and the sleep completes. However, because the cas did not synchronize memory on failure, this meant that notesleep/notetsleep could return to a core that was unable to see the memory changes that the notewakeup was reporting. Fixes #30189 Fixes #63384 Change-Id: I9b921de5c1c09b10a37df6b3206b9003c3f32986 Reviewed-on: https://go-review.googlesource.com/c/go/+/533118 Run-TryBot: Ian Lance Taylor <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Reviewed-by: Paul Murphy <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Lynn Boger <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>
This CL changes ppc64 atomic compare-and-swap (cas). Before this CL, if the cas failed--if the value in memory was not the value expected by the cas call--the atomic function would not synchronize memory. In the note code in runtime/lock_sema.go, used on BSD systems, notesleep and notetsleep first try a cas on the key. If that cas fails, something has already called notewakeup, and the sleep completes. However, because the cas did not synchronize memory on failure, this meant that notesleep/notetsleep could return to a core that was unable to see the memory changes that the notewakeup was reporting. Fixes golang#30189 Fixes golang#63384 Change-Id: I9b921de5c1c09b10a37df6b3206b9003c3f32986 Reviewed-on: https://go-review.googlesource.com/c/go/+/533118 Run-TryBot: Ian Lance Taylor <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Reviewed-by: Paul Murphy <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Lynn Boger <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>
What version of Go are you using (
go version
)?What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
I'm trying to fix some issues within the aix/ppc64 runtime. forEachP and stopTheWorldWithSema seem to crash randomly on aix/ppc64, as you can see in the following logs:
https://build.golang.org/log/4ca5efa18918e49cd52403582986875f1fc0bde3
https://build.golang.org/log/4680c735ea77419991751ae77791e5fafff706a8
https://build.golang.org/log/0ce7939d9567eca343cb680bb447b6a4ebae5131
...
I didn't managed to reproduce these crashes with a simple test... Therefore, I must launch the full
./all.bash
everytime hopping it will crash. Locally, it happens 1 every 100/200./all.bash
but the builder seems to crash more often.I've added some traces locally in order to understand what can be wrong. But the output are quite suppressing.
First, for forEachP, I've added traces before throw("forEachP: not done") and can get these kind of traces:
As you can see, everything seems alright, even
sched.safePointWait
isn't nil.In order to get these traces, I've added a lock to
sched.lock
right after theif sched.safePointWait != 0
.Is that possible that without this lock,
sched.safePointWait
has an old value or is still being updated by another routine ?Is it even possible to access safely
sched
values without a lock ?The same seems to occur with stopTheWorld, because if I add
schedtrace(true)
(which has a lock inside) beforethrow(bad)
, I'm getting:Once again,
stopwait
is nil in the traces but it throws because it's not.I still don't know why these bugs only occur on aix/ppc64.
Another guess is that the 100us is too short on AIX and the time needed to print the traces is enough to update all remaining Ps. But I don't think AIX is that slow.
These bugs might also be related to another bug with
acquirep()
(cf https://build.golang.org/log/60b0cd90bf7560bc4924bfa70e679be9ace58bbd). I haven't found anything relevant on this bug, except that_g_.m.nextp.ptr()
instopm()
isn't nil when it crashes..I'm still trying to get more traces.
But if anyone has any ideas about what's wrong with these bugs, you're welcome !
@aclements
The text was updated successfully, but these errors were encountered: