
Cumulative Run-Time Silos in Go pprof CPU Profiles

Introduction

The Go pprof CPU profiling tool lists both flat and cum (cumulative) run-times and percentages for routines. The flat numbers refer to time spent in the routine itself. The cum numbers refer to time spent in the routine and all of its callees. This article discusses how to interpret the cumulative run-time information in the context of different phases of the execution of a Go program.
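For context, the profiles discussed here can be produced in several ways; the following is a minimal, hypothetical sketch using the standard runtime/pprof package (the file name cpu.pprof and the doWork function are placeholders, not part of the fabric code). The resulting file can then be examined with go tool pprof just like the profile files shown later.

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Create the output file; the name is arbitrary, go tool pprof accepts any path.
    f, err := os.Create("cpu.pprof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Start sampling; the profiling interrupt fires periodically while profiling is active.
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    doWork() // application code to be profiled
}

// doWork stands in for the real application; it just burns CPU.
func doWork() {
    sum := 0
    for i := 0; i < 100000000; i++ {
        sum += i
    }
    _ = sum
}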

When the Go pprof profiling interrupt occurs, the profiler walks the currently active stack and records all of the routines that appear in the call chain. This allows the profiler to correctly charge cumulative run-time to each caller in the chain. At the bottom of the stack is a stack frame that records the first caller in the chain. These bottom-most frames do not record calls from user code. Instead, only a few selected system routines will ever appear at the bottom of the call stack, and together these few routines will record 100% of the cumulative run-time of the application.

Thus the total run-time is partitioned based on which routine is found at the bottom of the stack. In the following the term silo is used to refer to these partitions, since, as we discuss below, the different silos record more-or-less independent parts of the execution of the application.

Note: Recent results from profiling test applications suggest that the information contained in this article may not be complete or fully correct; however, we have found the approach described here very useful in interpreting Hyperledger fabric profiles. We will update the article as we learn more.

Silo Routines

The following routines may appear at the bottom of the call stack; each entry describes which part of the execution it registers. Note that this list was derived from empirical observation of profiles and from reading the Go source code, and may not be exhaustive.

  • runtime.goexit Go goroutines start with a stack context that makes it appear that the routine was called from runtime.goexit. This way, when the goroutine "returns to its caller" the goroutine is cleaned up and terminated. Therefore the cumulative run-time for runtime.goexit records all of the CPU time spent in goroutines, including the main program, which is also executed as a goroutine (see the sketch after this list).

  • runtime.mcall This routine silos code that runs on a system stack associated with an operating system thread, outside the context of any goroutine. In our profiles almost all of the runtime.mcall silo is associated with runtime.schedule, the goroutine scheduler.

  • runtime.morestack The sizes of goroutine stacks are dynamic, and Go performs a stack-capacity check on most function calls. If a goroutine needs more stack space than its current allocation, runtime.morestack is the routine that arranges the reallocation. While this reallocation is occurring the stack is modified such that runtime.morestack effectively occupies the bottom frame. This is useful in that the cumulative time for runtime.morestack records the total overhead of dynamic stack allocation; it is unfortunate in that we lose the record of which routines are causing the reallocations.

  • runtime.mstart This routine is the entry point for new operating system threads created by the runtime. The routine also appears to be used as a bottom-marker in the stack during certain stack manipulation operations. We typically see only a small amount of system overhead charged to runtime.mstart. [Needs further study.]

  • runtime._System This is a dummy routine that exists only for profile accounting. Time is charged to runtime._System when a stack traceback "is impossible or has failed" for some reason. We typically see a small amount of system overhead accounted here.

  • runtime._GC Similar to runtime._System, this dummy routine is charged with time when a traceback is not possible but the system appears to be in the midst of garbage collection.
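As a concrete, hypothetical illustration of the runtime.goexit silo described above, consider the sketch below. When a program like this is profiled (for example with the runtime/pprof setup shown earlier), the CPU time burned both directly in main and in the spawned goroutine should bottom out at runtime.goexit, since main itself runs as a goroutine. The spin helper is just a stand-in for real work.

package main

import "sync"

// spin burns CPU so that the profiler has something to sample.
func spin(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i
    }
    return sum
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        spin(200000000) // samples here are charged to the runtime.goexit silo
    }()
    spin(200000000) // so are samples taken while main itself is running
    wg.Wait()
}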

Using CPU Run-Time Silos

Knowledge of the way that CPU run-time is partitioned can help focus attention on different parts of the profile. For example, go tool pprof includes the -focus option, which can be used to restrict the profile output to only that portion of the run-time attributable to an individual silo. As documented, -focus "restricts to paths going through the node...". So, for example, since runtime.goexit is always a terminal node, focusing on runtime.goexit reports on the goroutine silo.

The first example creates a focused list on the terminal, and the second example reports the same information as a call-graph PDF.

go tool pprof -top -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz
go tool pprof -pdf -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz > prof.pdf

Similarly, if one wanted to focus on all of the system overhead not associated with goroutines, one could -ignore runtime.goexit:

go tool pprof -top -ignore=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz

The runtime.morestack partition is self-contained. High cumulative percentages recorded under runtime.morestack suggest that the default goroutine stack size may be too small. Modifying the default stack size currently requires modifying the source code and rebuilding the Go runtime package. As luck would have it, instructions for doing this have been documented.
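For reference, in the Go runtime sources of this era the initial goroutine stack size appears to be governed by a constant named _StackMin in the runtime's stack source file (src/runtime/stack.go in recent releases, stack2.go in some older ones). The excerpt below is illustrative only; the value shown is the upstream default, not a recommendation.

// Illustrative excerpt; assumed location is the Go runtime's stack source file.
// Increasing _StackMin and rebuilding the runtime raises the initial goroutine
// stack size, which can reduce runtime.morestack overhead at the cost of more
// memory per goroutine.
package runtime

const (
    // _StackMin is the minimum (initial) size of a goroutine stack, in bytes.
    _StackMin = 2048
)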