Cumulative Run Time Silos in Go `pprof` CPU Profiles
The Go `pprof` CPU profiling tool lists both flat and cum (cumulative) run-times and percentages for routines. The flat numbers refer to time spent in the routine itself. The cum numbers refer to time spent in the routine and all of its callees. This article discusses how to interpret the cumulative run time information in the context of different phases of the execution of a Go program.
When the Go `pprof` profiling interrupt occurs, the profiler walks the currently active stack and records all of the routines that appear in the call chain. This allows the profiler to correctly charge cumulative run-time to each caller in the chain. At the bottom of the stack is a stack frame that records the first caller in the chain. These bottom-most frames do not record calls from user code. Instead, only a few selected system routines will ever appear at the bottom of the call stack, and together these few routines record 100% of the cumulative run time of the application.
Thus the total run-time is partitioned based on which routine is found at the bottom of the stack. In the following, the term silo is used to refer to these partitions since, as we discuss below, the different silos record more-or-less independent parts of the execution of the application.
Note: Recent results from profiling test applications suggest that the information contained in this article may not be fully complete or correct; however, we have found the approach described here very useful in interpreting Hyperledger Fabric profiles. We will update the article as we learn more.
These are the routines that may appear at the bottom of the call stack, and which parts of the execution they register. Note that this list was derived from empirical observation of profiles and reading the Go source code, and may not be exhaustive.
- `runtime.goexit`

  Go goroutines start with a stack context that makes it appear that the routine was called from `runtime.goexit`. This way, when the goroutine "returns to its caller," the goroutine is cleaned up and terminated. Therefore the cumulative run-time for `runtime.goexit` records all of the CPU time spent in goroutines, including the `main` program, which is also executed as a goroutine.

- `runtime.mcall`

  This routine silos code that runs on a system stack associated with an operating system thread, outside the context of any goroutine. In our profiles almost all of the `runtime.mcall` silo is associated with `runtime.schedule`, the goroutine scheduler.

- `runtime.morestack`

  The sizes of goroutine stacks are dynamic, and Go implements stack-capacity checking on most function calls. If a goroutine needs more stack space than its current allocation, `runtime.morestack` is the routine that arranges this. While this allocation is occurring, the stack is modified such that `runtime.morestack` effectively occupies the bottom frame. This is useful in that the cumulative time for `runtime.morestack` records the total overhead of dynamic stack allocation. It is unfortunate, though, in that we have no record of which routines are causing the stack growth.

- `runtime.mstart`

  This routine is used to start goroutines. The routine also appears to be used as a bottom-marker in the stack during certain stack-manipulation operations. We typically see only a small amount of system overhead charged to `runtime.mstart`. [Needs further study.]

- `runtime._System`

  This is a dummy routine that exists only for profile accounting. Time is charged to `runtime._System` when a stack traceback "is impossible or has failed" for some reason. We typically see a small amount of system overhead accounted here.

- `runtime._GC`

  Similar to `runtime._System`, this dummy routine is charged with time when a traceback is not possible but the system appears to be in the midst of garbage collection.
Knowledge of the way that CPU run-time is partitioned can help focus attention on different parts of the profile. For example, `go tool pprof` includes the `-focus` option, which can be used to restrict the profile output to only that portion of the run-time attributable to individual silos. As documented, `-focus` "restricts to paths going through the node...". So, for example, since `runtime.goexit` is always a terminal node, focusing on `runtime.goexit` reports on the goroutine silo.
The first example creates a focused list on the terminal, and the second example reports the same information as a call-graph PDF.

```
go tool pprof -top -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz
go tool pprof -pdf -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz > prof.pdf
```
Similarly, if one wanted to focus on all of the system overhead not associated with goroutines, one could `-ignore` `runtime.goexit`:

```
go tool pprof -top -ignore=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz
```
The `runtime.morestack` partition is self-contained. High cumulative percentages recorded under `runtime.morestack` suggest that the default goroutine stack size may be too small. Modifying the default stack size currently requires modifying the source code of the Go `runtime` package and rebuilding it. As luck would have it, instructions for doing this have been documented.
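For reference, the default initial goroutine stack size is set by a constant in the runtime source, in `src/runtime/stack.go`. The name and value shown below are from memory and vary by Go version, so check the source tree of your installed toolchain:

```go
// In $GOROOT/src/runtime/stack.go (named _StackMin in older
// releases, stackMin in newer ones):
stackMin = 2048 // initial goroutine stack size in bytes
```

After changing this value, the `runtime` package must be rebuilt for the new default to take effect.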