
runtime: modified timer results in extreme cpu load #51654

Closed
bjosv opened this issue Mar 14, 2022 · 26 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done)

@bjosv

bjosv commented Mar 14, 2022

What version of Go are you using (go version)?

I tracked the issue down to the following change: https://go-review.googlesource.com/c/go/+/336432
and issue #47329
(including https://go-review.googlesource.com/c/go/+/337309).

The issue exists in 1.16.7 and later, 1.17 and later, 1.18, and on master.

$ go version
go version devel go1.19-3c2e73c8c3 Sun Mar 13 21:31:06 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.19-3c2e73c8c3 Sun Mar 13 21:31:06 2022 +0000"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3684654207=/tmp/go-build -gno-record-gcc-switches"

What did you do?

A K8s/Prometheus metrics exporter process started to sporadically show high CPU load after we updated the Go version.
According to pidstat, the process runs at 100% CPU for up to ~2 minutes right after a TLS client connection has been set up and closed. The process is an HTTP server; when a client connects, a new TLS client connection is set up towards a third party to fetch metrics information.

The problem has been nailed down to a TLS write deadline timer that is modified by a P that does not own the timer. This later results in a "busy loop" around netpoll(delay) where delay=0 is used.

This is the scenario:

  1. A client connects to the HTTP server.

  2. A new TLS client connection is set up towards a third party with a write deadline configured.
    This sets the timer wt, which previously was not on any P's heap.
    In this example the setup runs on p.id=0, which now puts timer wt on the heap of P=0.

  3. The TLS client connection is closed via close() after the communication is done.
    In the scenario of this issue the close is performed on a different P than the one used during connect, in this example P=2.
    The issue is not seen when the close is done by the same P that was used when opening the connection.

  4. During a close in TLS there is a call to closeNotify(), which sets and reuses the timer wt.
    Since this timer is still on the heap of P=0 it cannot be modified directly by the current P=2.
    Instead modtimer() calls updateTimerModifiedEarliest() to set timerModifiedEarliest for P=0 to 5s in the future.

  5. The next step in closeNotify() is to reset the timer to now(); this does not trigger any change of the previously set timerModifiedEarliest.
    timerModifiedEarliest now holds the timeout value of a timer that is no longer in use.

  6. The connection is now closed and unregistered via netpollclose().

  7. Then I see that adjusttimers() for P=0 gets to run,
    but after the corrections for issue #47329 (time: Timer reset broken under heavy use since go1.16 timer optimizations added) there are no longer any adjustments done, and the function just returns since timerModifiedEarliest has not been reached yet.

  8. A couple of timeSleepUntil() calls can be seen finding P=0 and the timerModifiedEarliest time set in step 4.

  9. A call to checkTimersNoP() finds the timerModifiedEarliest value set in step 4.

  10. findrunnable() uses the result from step 9 to calculate delay=0, since that timeout value has already expired, so the call to netpoll() does not block.

Steps 9 and 10 now repeat for ~2 minutes, which gives the high CPU load.
The "loop" seems to be interruptible by triggering a new connection via the HTTP server, or it stops by itself after ~1m54s for a currently unknown reason.
(A minimal user-level sketch of the sequence in steps 1-5 follows below.)
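For illustration, a minimal sketch of the user-level pattern behind steps 1-5 (the target address, deadline, and payload are placeholders, InsecureSkipVerify is only there to keep the sketch short, and this is not the actual exporter code):

// Sketch of steps 1-5: each scrape dials a TLS connection with a write
// deadline and closes it again. tls.Conn.Close() internally calls
// closeNotify(), which modifies and then resets the write-deadline timer wt,
// possibly on a different P than the one that armed wt during the dial.
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

func fetchMetrics() error {
	conn, err := tls.Dial("tcp", "metrics-target:6380", &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return err
	}
	// Steps 3-5: the close may be performed by a different P than the dial.
	defer conn.Close()

	// Step 2: arming a write deadline puts timer wt on the current P's heap.
	if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}
	_, err = conn.Write([]byte("INFO\r\n"))
	return err
}

func main() {
	// Step 1: a client connects to this HTTP server and triggers a scrape.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		if err := fetchMetrics(); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", nil)
}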

Since I'm not that familiar with the runtime internals I have no correction proposal yet,

  • but the switch of P that modifies the timer between steps 2 and 3 seems to be the initial cause of the issue.
    Maybe it was triggered by a stealWork() that hasn't taken over the timers?
  • Another finding is that timerModifiedEarliest is only updated when the new value is earlier than the current one,
    so changing the timer again after step 4 will not trigger a reset of timerModifiedEarliest even though the timer has changed.
    (A simplified model of this behaviour is sketched below.)
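The model (a standalone paraphrase of my understanding of updateTimerModifiedEarliest; the real field lives on the P in runtime/time.go and is updated with the runtime's internal atomics):

// Simplified model: the recorded value is only replaced when the new deadline
// is earlier, or when nothing is recorded yet. Nothing on this path ever
// clears the value, so once set it can outlive the timer that set it.
package main

import (
	"fmt"
	"sync/atomic"
)

var timerModifiedEarliest atomic.Int64 // 0 means "unset"

func updateTimerModifiedEarliest(nextwhen int64) {
	for {
		old := timerModifiedEarliest.Load()
		if old != 0 && old < nextwhen {
			return // an earlier deadline is already recorded; keep it
		}
		if timerModifiedEarliest.CompareAndSwap(old, nextwhen) {
			return
		}
	}
}

func main() {
	updateTimerModifiedEarliest(5000) // step 4: wt moved to "now + 5s"
	// The timer is later changed again and eventually deleted, but nothing
	// resets the recorded value, so it stays at 5000 even though no timer
	// with that deadline exists any more.
	fmt.Println(timerModifiedEarliest.Load()) // 5000
}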

Currently we can only reproduce the issue when running in containers via K8s/minikube, and it takes from 15 minutes up to hours for it to show.
We have attempted to reproduce it via a testcase but have not yet succeeded, probably due to timing/number of threads/other factors.

I will update the issue with links describing how I set up a minikube deployment for reproduction,
but I'm also happy to provide more details about my runs or fetch more information.
Any advice is appreciated.

Here is a link to the steps taken when reproducing the issue:
https://github.com/bjosv/golang-devenv/tree/master/issue/timer-cpu-spike

What did you expect to see?

I expected the CPU load not to be affected by timers.

What did you see instead?

pidstat -p 1437245 1 when fetching metrics every 5s

03:53:58 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:53:59 PM 59000   1437245    0,00    0,00    0,00    0,00    0,00     8  redis_exporter
03:54:00 PM 59000   1437245    1,00    0,00    0,00    0,00    1,00     8  redis_exporter
03:54:01 PM 59000   1437245    0,00    0,00    0,00    0,00    0,00     8  redis_exporter
03:54:02 PM 59000   1437245    0,00    0,00    0,00    0,00    0,00     8  redis_exporter
03:54:03 PM 59000   1437245   29,00   12,00    0,00    0,00   41,00     8  redis_exporter
03:54:04 PM 59000   1437245   75,00   26,00    0,00    0,00  101,00     8  redis_exporter
03:54:05 PM 59000   1437245   72,00   28,00    0,00    0,00  100,00     8  redis_exporter
03:54:06 PM 59000   1437245   75,00   25,00    0,00    0,00  100,00     8  redis_exporter
03:54:07 PM 59000   1437245   75,00   25,00    0,00    0,00  100,00     8  redis_exporter
03:54:08 PM 59000   1437245   42,00   20,00    0,00    0,00   62,00     8  redis_exporter
03:54:09 PM 59000   1437245    0,00    0,00    0,00    0,00    0,00     8  redis_exporter
03:54:10 PM 59000   1437245    0,00    0,00    0,00    0,00    0,00     8  redis_exporter

Running perf on a process during the CPU load, via pprof and perf_data_converter.

      File: redis_exporter
perf-version:5.4.166
perf-command:/usr/lib/linux-tools-5.4.0-104/perf record -F 999 -g -p 256702
Type: cycles_event
Showing nodes accounting for 397249690343, 100% of 397249690343 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                      158127683505 96.67% |   runtime.epollwait /usr/local/go/src/runtime/sys_linux_amd64.s:689
                                        2450571955  1.50% |   runtime.findrunnable /usr/local/go/src/runtime/proc.go:2767
                                        2362278681  1.44% |   runtime.findrunnable /usr/local/go/src/runtime/proc.go:2576
                                         289069589  0.18% |   runtime.usleep /usr/local/go/src/runtime/sys_linux_amd64.s:139
                                          70082980 0.043% |   runtime.(*randomEnum).next /usr/local/go/src/runtime/proc.go:6151
                                          36979179 0.023% |   runtime.stealWork /usr/local/go/src/runtime/proc.go:2892
                                          33108497  0.02% |   runtime.netpoll /usr/local/go/src/runtime/netpoll_epoll.go:124
                                          24963444 0.015% |   [x86_pkg_temp_thermal.ko]
                                          19626310 0.012% |   runtime.stealWork /usr/local/go/src/runtime/proc.go:2891
                                          16466417  0.01% |   runtime.pMask.read /usr/local/go/src/runtime/proc.go:5547
                                          12265674 0.0075% |   runtime.stealWork /usr/local/go/src/runtime/proc.go:2867
                                          12027724 0.0074% |   runtime.pMask.read /usr/local/go/src/runtime/proc.go:5544
                                           8397722 0.0051% |   runtime.(*randomEnum).next /usr/local/go/src/runtime/proc.go:6149
                                           8110383 0.005% |   runtime.(*randomOrder).start /usr/local/go/src/runtime/proc.go:6140
                                           8004265 0.0049% |   runtime.stealWork /usr/local/go/src/runtime/proc.go:2849
                                           8001281 0.0049% |   runtime.stealWork /usr/local/go/src/runtime/proc.go:2844
                                           7890341 0.0048% |   runtime.lock2 /usr/local/go/src/runtime/lock_futex.go:59
                                           ...
@heschi
Contributor

heschi commented Mar 14, 2022

cc @golang/runtime

@heschi added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) Mar 14, 2022
@heschi added this to the Go1.19 milestone Mar 14, 2022
@ianlancetaylor
Contributor

CC @mknyszek

@ianlancetaylor
Contributor

Seems to me that if findrunnable loops around after calling netpoll(0), it should make sure to call stealWork, which, if there is nothing else to do, should call checkTimers for each P. (I don't plan to work on this myself, though.)

@bjosv
Author

bjosv commented Mar 16, 2022

I have now added a link to instructions on how to reproduce the issue above. I understand that it might not be that helpful due to its complexity, so I will continue the effort to trigger the problem using a testcase instead.

@klauspost
Contributor

@bjosv
This very closely matches issues that we are observing, causing completely unresponsive servers. The CL you mention would be in all Go 1.17 versions, and not just v1.17.8, correct?

@prattmic
Member

Thanks for the detailed report. I still need to think more deeply, but @ianlancetaylor's suggestion sounds reasonable.

The "loop" seems to be interruptible by triggering a new connection via the HTTP server, or it stops by itself after ~1m54s for a currently unknown reason.

Just a note, we run a GC at least every 2 minutes. So if this workload isn't otherwise triggering GC, it could be the forced 2min GC fixing this (since it adds work to multiple Ps).

@bjosv
Author

bjosv commented Mar 22, 2022

@klauspost Ah, yes, you are right. I see now that the correction was included already in go1.17rc2. I'll run it to make sure.
@prattmic I have now distilled the problematic actions into a server and a client process that trigger the fault:

https://github.com/bjosv/golang-devenv/tree/master/issue/timer-cpu-spike/reproduction

I have only been "lucky" enough to reproduce it on Linux, not yet on a Mac M1, and it seems to be more frequent when running on a VM with more cores.

@bjosv
Author

bjosv commented Mar 22, 2022

@klauspost I have now verified that the issue exists in go1.17 and go1.17.1 by running the reproduction binaries with some printouts added in proc.go. I have updated the text at the top of the issue.

@harshavardhana
Contributor

A similar situation was observed at a large customer deployment that we have been battling with for the last 3 days: go1.17.8 compiled and deployed, extremely high CPU load.

        ffffffff9249ec5c sys_epoll_wait+0xcc ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec5c sys_epoll_wait+0xcc ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec5c sys_epoll_wait+0xcc ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249ec7d sys_epoll_wait+0xed ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])
	ffffffff9249edc5 sys_epoll_pwait+0x115 ([kernel.kallsyms])

Millions of them on all nodes.

@costela
Contributor

costela commented Apr 12, 2022

Just to be on the safe side: this issue is currently targeting the Go1.19 milestone, but would it also qualify for backports (specifically 1.18.*)? Or would that depend on the complexity of the eventual fix?

@ianlancetaylor
Contributor

@costela Yes, for a case like this it would depend on the complexity of the fix.

@totallyunknown

We have two applications in production that got hit by this bug. Here are some notes; maybe they are helpful.

@FlorianLoch

FlorianLoch commented Apr 21, 2022

As a workaround we downgraded specific services suffering from this issue to 1.16.6 - really hoping that we can bump the Go version again soon in order to benefit from the many great improvements added to the runtime and the language since 1.16.6. 😉

@ianlancetaylor
Contributor

@mknyszek What is the current status here? This issue is currently in the 1.19 milestone. Should it move to 1.20? To Backlog? Thanks.

@bradfitz
Contributor

bradfitz commented Jul 9, 2022

We think we're hitting this too: tailscale/tailscale#4760 (comment)

@bradfitz
Contributor

bradfitz commented Jul 9, 2022

@bjosv, that's a great repro. Thanks for making that! I can confirm it works to repro the issue on an 8-core linux/amd64 VM ...

[Screenshot: Screen Shot 2022-07-09 at 2 36 22 PM]

@prattmic
Member

Moving to 1.20; this shouldn't be on the backlog. In fact, it might warrant a backport.

@prattmic modified the milestones: Backlog, Go1.20 Jul 11, 2022
@bradfitz
Contributor

@ianlancetaylor wrote:

Seems to me that if findrunnable loops around after calling netpoll(0), it should make sure to call stealWork, which, if there is nothing else to do, should call checkTimers for each P. (I don't plan to work on this myself, though.)

Looking at a profile, stealWork and checkTimers are definitely there.

I guess the issue is that it only makes 4 tries?

func stealWork(now int64) (gp *g, inheritTime bool, rnow, pollUntil int64, newWork bool) {
...
        const stealTries = 4
        for i := 0; i < stealTries; i++ {

If so, the workaround for this would be to set GOMAXPROCS to 4 or less, perhaps?

Or is the fix as easy as making stealTries unbounded, to the number of Ps?

/cc @golang/runtime

@prattmic
Member

No, each of the 4 tries attempts to steal from all Ps (stealOrder will iterate over all Ps), though only the last try attempts to steal timers.

I have the repro working and instrumented (thanks @bjosv!) with extra debuglog, but haven't had a chance to fully investigate yet. Likely either timerModifiedEarliest is corrupt and causing checkTimers to bail out (as @bjosv suggests), or timerpMask is corrupt and causing us to not even check for the timers. I hope to pin it down tomorrow.

@prattmic
Member

I think I have a general understanding of the problem, though I still need to confirm. Paraphrasing:

  • Modifying a timer earlier sets p.timerModifiedEarliest.
  • p.timerModifiedEarliest is cleared only by adjusttimers and clearDeletedTimers, both called from checkTimers.
    • adjusttimers preconditions:
      • len(p.timers) > 0
      • now > p.timerModifiedEarliest
    • clearDeletedTimers preconditions:
      • Checking local P timers.

Event flow:

  • Timer modified earlier, setting p.timerModifiedEarliest.
  • Same timer deleted (prior to new expiration).
  • checkTimers runs (on P remote from timer).
    • Does not return early because >25% of timers are deleted.
    • adjusttimers does nothing because now < p.timerModifiedEarliest.
    • runtimer removes the deleted timers from p.timers.
  • Future calls to checkTimers will not call adjusttimers because len(p.timers) == 0.

The quick fix here may be to call adjusttimers if len(p.timers) > 0 || now > p.timerModifiedEarliest, but I need to think about it some more.
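To make the event flow concrete, here is a toy standalone model (all names are invented for illustration; the real logic is spread across runtime/proc.go and runtime/time.go):

// Toy model of the event flow above: modifying a timer earlier records
// modifiedEarliest, deleting the last timer empties the heap, and the
// checkTimers-style function never reaches the adjustment that would clear
// the stale value, because that adjustment is guarded by len(timers) > 0.
package main

import "fmt"

type toyP struct {
	timers           []int64 // timer deadlines (heap simplified to a slice)
	modifiedEarliest int64   // 0 means "unset"
}

// modEarlier stands in for "modify timer earlier".
func (p *toyP) modEarlier(when int64) {
	if p.modifiedEarliest == 0 || when < p.modifiedEarliest {
		p.modifiedEarliest = when
	}
}

// deleteTimer removes a timer; note that nothing here clears modifiedEarliest.
func (p *toyP) deleteTimer(i int) {
	p.timers = append(p.timers[:i], p.timers[i+1:]...)
}

// checkTimers reports whether a stale "work to do" signal remains. The
// adjustment that clears modifiedEarliest (standing in for adjusttimers)
// only runs while the heap is non-empty.
func (p *toyP) checkTimers(now int64) bool {
	if len(p.timers) > 0 {
		p.modifiedEarliest = 0
	}
	return p.modifiedEarliest != 0 && p.modifiedEarliest <= now
}

func main() {
	p := &toyP{timers: []int64{100}}
	p.modEarlier(50) // timer modified earlier
	p.deleteTimer(0) // same timer deleted before the new expiration
	// Every later check reports phantom work, so a findrunnable-style loop
	// keeps computing delay=0 and calling netpoll without blocking.
	fmt.Println("stale work:", p.checkTimers(60)) // true, and it stays true
}

Either loosening the guard as suggested above, or clearing the recorded value when the last timer is deleted, would break the loop in this model.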

@gopherbot
Contributor

Change https://go.dev/cl/417434 mentions this issue: runtime: clear timerModifiedEarliest when last timer is deleted

@prattmic
Member

The issue exists in 1.16.7 and later, 1.17 and later, 1.18, and on master.

1.16.7 added https://go.dev/cl/336432. Prior to that CL, timerModifiedEarliest was cleared a bit more aggressively, which probably reduced the likelihood of this bug.

@prattmic
Member

@gopherbot Please backport to 1.17 and 1.18. This is a serious issue (busy loop in scheduler for up to 2 minutes) with no reasonable workaround.

@gopherbot
Contributor

Backport issue(s) opened: #53846 (for 1.17), #53847 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

Repository owner moved this from Todo to Done in Go Compiler / Runtime Jul 13, 2022
@gopherbot moved this to Done in Release Blockers Jul 13, 2022
@gopherbot
Contributor

Change https://go.dev/cl/417475 mentions this issue: [release-branch.go1.18] runtime: clear timerModifiedEarliest when last timer is deleted

@gopherbot
Contributor

Change https://go.dev/cl/417476 mentions this issue: [release-branch.go1.17] runtime: clear timerModifiedEarliest when last timer is deleted

bradfitz pushed a commit to tailscale/go that referenced this issue Jul 14, 2022
…t timer is deleted

timerModifiedEarliest contains the lowest possible expiration for a
modified earlier timer, which may be earlier than timer0When because we
haven't yet updated the heap. Note "may", as the modified earlier timer
that set timerModifiedEarliest may have since been modified later or
deleted.

We can clear timerModifiedEarliest when the last timer is deleted
because by definition there must not be any modified earlier timers.

Why does this matter? checkTimersNoP claims that there is work to do if
timerModifiedEarliest has passed, causing findRunnable to loop back
around to checkTimers. But the code to clean up timerModifiedEarliest in
checkTimers (i.e., the call to adjusttimers) is conditional behind a
check that len(pp.timers) > 0.

Without clearing timerModifiedEarliest, a spinning M that would
otherwise go to sleep will busy loop in findRunnable until some other
work is available.

Note that changing the condition on the call to adjusttimers would also
be a valid fix. I took this approach because it feels a bit cleaner to
clean up timerModifiedEarliest as soon as it is known to be irrelevant.

For golang#51654.
Fixes golang#53847.

Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/417434
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
(cherry picked from commit c006b7a)
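For illustration, the approach described in this commit message amounts to the following in a small self-contained sketch (invented names; not the actual runtime/time.go diff):

// Simplified sketch of the fix: when the last timer is removed there cannot
// be any modified-earlier timers left, so the recorded earliest-modification
// time is reset to "unset" instead of being left stale.
package main

import "fmt"

type timerState struct {
	timers           []int64
	modifiedEarliest int64 // 0 means "unset"
}

func (s *timerState) deleteTimer(i int) {
	s.timers = append(s.timers[:i], s.timers[i+1:]...)
	if len(s.timers) == 0 {
		s.modifiedEarliest = 0 // no timers left, so none can be "modified earlier"
	}
}

func main() {
	s := &timerState{timers: []int64{100}, modifiedEarliest: 50}
	s.deleteTimer(0)
	fmt.Println(s.modifiedEarliest) // 0: no phantom "work to do" is left behind
}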
bradfitz pushed a commit to tailscale/go that referenced this issue Jul 14, 2022
…t timer is deleted

For golang#51654.
Fixes golang#53847.
Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/417434
(cherry picked from commit c006b7a)
gopherbot pushed a commit that referenced this issue Jul 25, 2022
…t timer is deleted

For #51654.
Fixes #53847.
Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
(cherry picked from commit c006b7a)
Reviewed-on: https://go-review.googlesource.com/c/go/+/417475
gopherbot pushed a commit that referenced this issue Jul 25, 2022
…t timer is deleted

For #51654.
Fixes #53846.
Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
(cherry picked from commit c006b7a)
Reviewed-on: https://go-review.googlesource.com/c/go/+/417476
@dmitshur added the NeedsFix label (The path to resolution is known, but the work has not been done) and removed the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) Aug 1, 2022
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022

Fixes golang#51654.
Change-Id: I3f3787c67781cac7ce87939c5706cef8db927dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/417434
@golang golang locked and limited conversation to collaborators Aug 1, 2023