runtime: modified timer results in extreme cpu load #51654
Comments
cc @golang/runtime
CC @mknyszek
Seems to me that if
I have now added a link to instructions on how to reproduce the issue above. I understand that it might not be that helpful due to its complexity, and I will continue the efforts to trigger the problem using a testcase instead.
@bjosv
Thanks for the detailed report. I still need to think more deeply, but @ianlancetaylor's suggestion sounds reasonable.
Just a note, we run a GC at least every 2 minutes. So if this workload isn't otherwise triggering GC, it could be the forced 2min GC fixing this (since it adds work to multiple Ps).
@klauspost Ah, yes, you are right. I see now that the correction was included already in go1.17rc2. I'll run it to make sure. https://github.com/bjosv/golang-devenv/tree/master/issue/timer-cpu-spike/reproduction I have only been "lucky" enough to reproduce it on Linux, have not yet got it on a Mac M1, and it seems to be more frequent when running on a VM with more cores.
@klauspost I have now verified that the issue exists in go1.17 and go1.17.1 by running the reproduction binaries with some printouts added in proc.go. I have updated the top text.
A similar situation was observed at a large customer deployment that we have been battling with for the last 3 days: go1.17.8 compiled and deployed, extremely high CPU load.
Millions of them on all nodes.
Just to be on the safe side: this issue is currently targeting the Go 1.19 milestone, but would it also qualify for backports (specifically 1.18.*)? Or would that depend on the complexity of the eventual fix?
@costela Yes, for a case like this it would depend on the complexity of the fix.
We have two applications in production that got hit by this bug. Here are some notes, maybe it's somehow helpful.
As a workaround we downgraded specific services suffering from this issue to
@mknyszek What is the current status here? This issue is currently in the 1.19 milestone. Should it move to 1.20? To Backlog? Thanks.
We think we're hitting this too: tailscale/tailscale#4760 (comment)
@bjosv, that's a great repro. Thanks for making that! I can confirm it works to repro the issue on an 8-core linux/amd64 VM ...
Moving to 1.20. This shouldn't be on the backlog; in fact it might warrant a backport.
@ianlancetaylor wrote:
Looking at a profile, stealWork and checkTimers is definitely there. I guess the issue is that it only makes 4 tries?

```go
func stealWork(now int64) (gp *g, inheritTime bool, rnow, pollUntil int64, newWork bool) {
	...
	const stealTries = 4
	for i := 0; i < stealTries; i++ {
```

If so, the workaround for this would be to set GOMAXPROCS to 4 or less, perhaps? Or is the fix as easy as making stealTries unbounded, up to the number of Ps?
/cc @golang/runtime
No, each of the 4 tries attempts to steal from all Ps (stealOrder will iterate over all Ps), though only the last try attempts to steal timers. I have the repro working and instrumented (thanks @bjosv!) with extra debuglog, but haven't had a chance to fully investigate yet. Likely either
I think I have a general understanding of the problem, though I still need to confirm. Paraphrasing:
Event flow:
The quick fix here may be to call
Change https://go.dev/cl/417434 mentions this issue: |
1.16.7 added https://go.dev/cl/336432. Prior to that CL,
@gopherbot Please backport to 1.17 and 1.18. This is a serious issue (busy loop in scheduler for up to 2 minutes) with no reasonable workaround. |
Backport issue(s) opened: #53846 (for 1.17), #53847 (for 1.18). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/417475 mentions this issue: |
Change https://go.dev/cl/417476 mentions this issue: |
…t timer is deleted

timerModifiedEarliest contains the lowest possible expiration for a modified earlier timer, which may be earlier than timer0When because we haven't yet updated the heap. Note "may", as the modified earlier timer that set timerModifiedEarliest may have since been modified later or deleted.

We can clear timerModifiedEarliest when the last timer is deleted because by definition there must not be any modified earlier timers.

Why does this matter? checkTimersNoP claims that there is work to do if timerModifiedEarliest has passed, causing findRunnable to loop back around to checkTimers. But the code to clean up timerModifiedEarliest in checkTimers (i.e., the call to adjusttimers) is conditional behind a check that len(pp.timers) > 0. Without clearing timerModifiedEarliest, a spinning M that would otherwise go to sleep will busy loop in findRunnable until some other work is available.

Note that changing the condition on the call to adjusttimers would also be a valid fix. I took this approach because it feels a bit cleaner to clean up timerModifiedEarliest as soon as it is known to be irrelevant.

For golang#51654.
Fixes golang#53847.

Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/417434
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
(cherry picked from commit c006b7a)
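The clearing described in this commit message can be sketched as a toy model. This is a simplified illustration, not the actual runtime code: the real runtime keeps the timers as a heap on the p struct and updates these fields atomically under per-P locking.

```go
package main

import "fmt"

// p is a toy model of the scheduler's per-P timer state.
type p struct {
	timers                []int64 // timer expirations (a heap in the real code)
	timerModifiedEarliest int64   // 0 means "no modified-earlier timers"
}

// deltimer removes the timer at index i. The essence of the fix: once the
// last timer is gone there can be no modified-earlier timers either, so the
// stale marker is cleared instead of being left behind for checkTimersNoP
// to keep reporting as pending work.
func (pp *p) deltimer(i int) {
	pp.timers = append(pp.timers[:i], pp.timers[i+1:]...)
	if len(pp.timers) == 0 {
		pp.timerModifiedEarliest = 0
	}
}

func main() {
	pp := &p{timers: []int64{100}, timerModifiedEarliest: 50}
	pp.deltimer(0)
	fmt.Println(pp.timerModifiedEarliest) // cleared together with the last timer
}
```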
timerModifiedEarliest contains the lowest possible expiration for a modified earlier timer, which may be earlier than timer0When because we haven't yet updated the heap. Note "may", as the modified earlier timer that set timerModifiedEarliest may have since been modified later or deleted.

We can clear timerModifiedEarliest when the last timer is deleted because by definition there must not be any modified earlier timers.

Why does this matter? checkTimersNoP claims that there is work to do if timerModifiedEarliest has passed, causing findRunnable to loop back around to checkTimers. But the code to clean up timerModifiedEarliest in checkTimers (i.e., the call to adjusttimers) is conditional behind a check that len(pp.timers) > 0. Without clearing timerModifiedEarliest, a spinning M that would otherwise go to sleep will busy loop in findRunnable until some other work is available.

Note that changing the condition on the call to adjusttimers would also be a valid fix. I took this approach because it feels a bit cleaner to clean up timerModifiedEarliest as soon as it is known to be irrelevant.

Fixes golang#51654.

Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/417434
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
What version of Go are you using (go version)?

Tracked down the issue to the following change: https://go-review.googlesource.com/c/go/+/336432
and issue: #47329
(including: https://go-review.googlesource.com/c/go/+/337309)

The issue exists from 1.16.7 and up, 1.17 and up, 1.18, and on master.
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
A K8s/Prometheus metrics exporter process started to sporadically give a high CPU load when updating the Go version. The process gives 100% CPU usage for up to ~2 minutes according to pidstat, right after a TLS client connection has been set up and closed. The process is an HTTP server: when a client connects, a new TLS client connection is set up towards a 3rd entity to fetch metrics information.

The problem has been nailed down to a TLS write deadline timer that is modified by a P that does not own the timer. This later results in a "busy loop" around netpoll(delay) where delay=0 is used.

This is the scenario:
1. A client connects to the HTTP server.
2. A new TLS client connection is set up towards a 3rd entity with a write deadline configured. This sets the timer wt, which previously is not on any P's heap. In this example the setup runs on p.id=0, which now puts timer wt on the heap of P=0.
3. The TLS client connection is closed via close() after the communication is done. In the scenario of this issue the close is performed on another P than was used during connect, in this example P=2. The issue is not seen when the close is done by the same P that was used when opening the connection.
4. During a close in TLS there is a call to closeNotify(), which sets and reuses the timer wt. Since this timer is still on the heap of P=0 it can't be modified from the current P=2. Instead modtimer() calls updateTimerModifiedEarliest() to set timerModifiedEarliest for P=0 to 5s in the future.
5. The next step in closeNotify() is to reset the timer to now(); this will not trigger any change of the previously set timerModifiedEarliest. timerModifiedEarliest now contains the timeout value for a timer that is no longer used.
6. The connection is now closed and unregistered via netpollclose().
7. Then adjusttimers() for P=0 gets to run, but after the corrections for issue #47329 ("time: Timer reset broken under heavy use since go1.16 timer optimizations added") there are no longer any adjustments done, and the function just returns since timerModifiedEarliest has not occurred yet.
8. A couple of timeSleepUntil() calls can be seen finding P=0 and the timerModifiedEarliest time set in step 4.
9. A call to checkTimersNoP() finds the timerModifiedEarliest value set in step 4.
10. findrunnable() uses the result from step 9 to calculate delay=0, since the timeout value has already expired. The call to netpoll() is therefore not blocking/waiting.
11. Steps 9 and 10 now repeat for ~2 minutes, which gives the high CPU load.
The "loop" seems to be interruptible by triggering a new connection via the HTTP server, or it ends by itself after ~1m54s for a currently unknown reason.

Since I'm not that familiar with the runtime internals I have no correction proposal yet. Maybe it has been triggered by a stealWork() that hasn't taken over the timers? timerModifiedEarliest is only set when the new value is earlier than the current one, so changing the timer in step 4 will not trigger a reset of timerModifiedEarliest even if the timer has changed.

Currently we can only reproduce the issue when running in containers via K8s/minikube, and it takes from 15 minutes to hours for it to show. We have attempted to reproduce it via a testcase but have not yet succeeded, probably due to timing/number-of-threads/other. I will update the issue with links on how I set up a minikube deployment for reproduction, but I'm also happy to provide more details about my runs or fetch more information.

Any advice is appreciated.
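The monotonic-minimum behavior noted above can be sketched as follows. This is a simplified model of the runtime helper of the same name, ignoring atomics and the timer status machine:

```go
package main

import "fmt"

// updateTimerModifiedEarliest models the runtime helper: it only ever
// lowers the stored value (0 meaning "unset"). Nothing on the modify path
// raises or clears it again, which is why the stale entry from step 4
// survives until some other work arrives.
func updateTimerModifiedEarliest(cur *int64, nextwhen int64) {
	if *cur == 0 || nextwhen < *cur {
		*cur = nextwhen
	}
}

func main() {
	var timerModifiedEarliest int64
	updateTimerModifiedEarliest(&timerModifiedEarliest, 5000) // step 4: deadline 5s ahead
	updateTimerModifiedEarliest(&timerModifiedEarliest, 7000) // a later value is ignored
	fmt.Println(timerModifiedEarliest)
}
```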
Here is a link to the steps taken when reproducing the issue:
https://github.com/bjosv/golang-devenv/tree/master/issue/timer-cpu-spike
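Alongside the full minikube-based reproduction above, the user-level timer pattern from the scenario can be sketched minimally. This is a hypothetical illustration of the pattern only, not a reliable reproducer: net.Pipe stands in for the real TLS connection (its deadlines are also backed by runtime timers), and actually triggering the bug additionally requires the close to run on a different P.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// demo arms a write-deadline timer on an in-memory connection, then resets
// the deadline and closes the connection from another goroutine, mirroring
// the connect-on-one-P / close-on-another-P pattern in the report.
func demo() error {
	client, server := net.Pipe()
	defer server.Close()

	// Corresponds to the TLS write deadline: arms a runtime timer (wt in
	// the report) on the heap of whichever P the goroutine runs on.
	if err := client.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() {
		// Closing from another goroutine may run on a different P, the
		// situation that leaves a stale timerModifiedEarliest behind.
		client.SetWriteDeadline(time.Now()) // reset, as closeNotify does
		done <- client.Close()
	}()
	return <-done
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connection closed")
}
```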
What did you expect to see?
I expected the CPU load not to be affected by timers.
What did you see instead?

100% CPU usage according to pidstat -p 1437245 1 when fetching metrics every 5s.

Running perf on the process during the CPU load, via pprof and perf_data_converter.