
obs: export metrics about Go GC Assist work #88178

Closed
nvanbenschoten opened this issue Sep 19, 2022 · 7 comments · Fixed by #118875
Labels
  • A-kv-observability
  • A-observability-inf
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs

Comments

nvanbenschoten (Member) commented Sep 19, 2022

CockroachDB currently exports some timeseries metrics about the Go runtime's garbage collector (GC):

  • sys.gc.pause.percent: "Current GC pause percentage"
  • sys.gc.pause.ns: "Total GC pause"
  • sys.gc.count: "Total number of GC runs"

These metrics are computed using debug.ReadGCStats. They are useful for understanding how long the Go GC is stopping the world ("pause" refers to STW work, not background work).
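
For reference, here is a minimal sketch (not CockroachDB's actual RuntimeStatSampler code) of how the GC count and pause totals can be read from the runtime:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	var gc debug.GCStats
	debug.ReadGCStats(&gc)
	fmt.Printf("gc.count=%d gc.pause.ns=%d\n", gc.NumGC, gc.PauseTotal.Nanoseconds())
	// The pause *percentage* requires dividing the pause-time delta by the
	// wall-clock delta between two samples; that bookkeeping is omitted here.
}
```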

However, the Go GC has other costs beyond its STW sweep termination and mark termination phases. In general, the concurrent mark and scan phase can run without pushing back on foreground goroutines. But when goroutines allocate memory faster than the GC can clean it up (because of heavy allocation, a slow GC, or both), GC work is pushed onto foreground goroutines in line with their heap allocations. This is known as "GC Assist".

We've seen in cases like this one that GC assist can lead to large spikes in latency that are difficult to understand using other observability tools.

We should find a way to expose this information. Unfortunately, it is not exported by the Go runtime, except through the GODEBUG=gctrace tooling. We may need to patch the runtime or upstream a fix to get at the information programmatically.

Jira issue: CRDB-19718

Epic CRDB-34227

nvanbenschoten added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)), A-kv-observability, and T-kv-observability labels Sep 19, 2022
nvanbenschoten changed the title from "kv-obs: export metrics about Go GC assist work" to "kv-obs: export metrics about Go GC Assist work" Sep 19, 2022
nvanbenschoten (Member, Author) commented:
golang/go#55159 is a start here. If we can land that upstream, we can define a timeseries metric over this value. Ideally, we'd also be able to set up some alerting on it to detect when goroutines are over-assisting (e.g. normalized_rate(assist_ns) > 0.05).

nvanbenschoten (Member, Author) commented:
It turns out that https://go-review.googlesource.com/c/go/+/404307 landed 3 days ago and added a new "/cpu/classes/gc/mark/assist:cpu-seconds" metric (among others) to the https://pkg.go.dev/runtime/metrics package. This contains the same information I was trying to expose in golang/go#55159. So we'll need to wait for Go 1.20 and can then integrate that into our RuntimeStatSampler logic.
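
A hedged sketch (not the eventual CockroachDB integration) of reading the new metric on Go 1.20+ and deriving an assist CPU fraction between two samples; the 5% threshold mirrors the alerting idea above and is purely illustrative:

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

const (
	assistMetric = "/cpu/classes/gc/mark/assist:cpu-seconds"
	totalMetric  = "/cpu/classes/total:cpu-seconds"
)

// readCPU samples cumulative GC assist CPU time and total CPU time.
func readCPU() (assist, total float64) {
	samples := []metrics.Sample{{Name: assistMetric}, {Name: totalMetric}}
	metrics.Read(samples)
	return samples[0].Value.Float64(), samples[1].Value.Float64()
}

func main() {
	a0, t0 := readCPU()
	time.Sleep(10 * time.Second) // illustrative sampling interval
	a1, t1 := readCPU()

	frac := (a1 - a0) / (t1 - t0)
	fmt.Printf("gc assist CPU fraction over interval: %.4f\n", frac)
	if frac > 0.05 {
		fmt.Println("goroutines appear to be over-assisting the GC")
	}
}
```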

That may be a good time to replace calls to debug.ReadGCStats and runtime.ReadMemStats with use of the new runtime/metrics package, which is also mentioned in #87823.

nvanbenschoten (Member, Author) commented:
> That may be a good time to replace calls to debug.ReadGCStats and runtime.ReadMemStats with use of the new runtime/metrics package, which is also mentioned in #87823.

Doing so will avoid the stop-the-world pause during calls to runtime.ReadMemStats. These pauses take about 1.5ms, and occur every 10s.

lyang24 added a commit to lyang24/cockroach that referenced this issue Feb 7, 2024
This commit introduces functions to extract metrics exposed by the Go runtime
metrics API. The runtime metrics are sampled in the SampleEnvironment call
every 10 seconds. The new metric GcAssitSecond is captured in this PR to
represent an estimate of the effort goroutines spend assisting GC
activities.

Fixes: cockroachdb#88178

Release note: None
lyang24 added a commit to lyang24/cockroach that referenced this issue Feb 16, 2024
This commit introduces functions to extract metrics exposed by the Go runtime
metrics API. The runtime metrics are sampled in the SampleEnvironment call
every 10 seconds. The new metric GcAssistNS is captured in this PR to
represent an estimate, in nanoseconds, of the effort user goroutines spend
assisting GC activities.

Fixes: cockroachdb#88178

Release note: None
nvanbenschoten (Member, Author) commented:
> That may be a good time to replace calls to debug.ReadGCStats and runtime.ReadMemStats with use of the new runtime/metrics package, which is also mentioned in #87823.

@lyang24 determined that the mapping for this will look like:

GoAllocBytes:      ms.HeapAlloc => /gc/heap/allocs:bytes
HeapFragmentBytes: ms.HeapInuse - ms.HeapAlloc => /memory/classes/heap/unused:bytes
HeapReservedBytes: ms.HeapIdle - ms.HeapReleased => /memory/classes/heap/free:bytes
HeapReleasedBytes: ms.HeapReleased => /memory/classes/heap/released:bytes
MemStackSysBytes:  ms.StackSys => /memory/classes/heap/stacks:bytes
goTotal := ms.Sys - ms.HeapReleased => /memory/classes/total:bytes

craig bot pushed a commit that referenced this issue Feb 21, 2024
118875: obs: export metrics about Go GC Assist work r=lyang24 a=lyang24

This commit introduces functions to extract metrics exposed by the Go runtime
metrics API. The runtime metrics are sampled in the SampleEnvironment call
every 10 seconds. The new metric GcAssistNS is added as an estimate, in
nanoseconds, of the effort user goroutines spend assisting GC activities.

Fixes: #88178

Release note: None


Co-authored-by: lyang24 <[email protected]>
craig bot closed this as completed in 4188ee0 Feb 21, 2024
nvanbenschoten (Member, Author) commented:
> That may be a good time to replace calls to debug.ReadGCStats and runtime.ReadMemStats with use of the new runtime/metrics package, which is also mentioned in #87823.

I've extracted this into #119461 and #119462.

lyang24 (Contributor) commented Feb 23, 2024

> That may be a good time to replace calls to debug.ReadGCStats and runtime.ReadMemStats with use of the new runtime/metrics package, which is also mentioned in #87823.
>
> @lyang24 determined that the mapping for this will look like:
>
> GoAllocBytes:      ms.HeapAlloc => /gc/heap/allocs:bytes
> HeapFragmentBytes: ms.HeapInuse - ms.HeapAlloc => /memory/classes/heap/unused:bytes
> HeapReservedBytes: ms.HeapIdle - ms.HeapReleased => /memory/classes/heap/free:bytes
> HeapReleasedBytes: ms.HeapReleased => /memory/classes/heap/released:bytes
> MemStackSysBytes:  ms.StackSys => /memory/classes/heap/stacks:bytes
> goTotal := ms.Sys - ms.HeapReleased => /memory/classes/total:bytes

After testing, I found a mistake in the GoAllocBytes mapping; it needs to be updated to:

GoAllocBytes: ms.HeapAlloc => /memory/classes/heap/objects:bytes
("Memory occupied by live objects and dead objects that have not
yet been marked free by the garbage collector.")

/gc/heap/allocs:bytes is a cumulative total of allocated bytes, so it does not decrease when memory is released.
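
A minimal sketch of what the corrected mapping could look like as a single batched runtime/metrics read (which also sidesteps the runtime.ReadMemStats stop-the-world pause mentioned earlier); the names in the comments follow the MemStats-derived metric names above, and the actual RuntimeStatSampler wiring is more involved:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/heap/objects:bytes"},  // GoAllocBytes (corrected; was /gc/heap/allocs:bytes)
		{Name: "/memory/classes/heap/unused:bytes"},   // HeapFragmentBytes (ms.HeapInuse - ms.HeapAlloc)
		{Name: "/memory/classes/heap/free:bytes"},     // HeapReservedBytes (ms.HeapIdle - ms.HeapReleased)
		{Name: "/memory/classes/heap/released:bytes"}, // HeapReleasedBytes (ms.HeapReleased)
		{Name: "/memory/classes/heap/stacks:bytes"},   // MemStackSysBytes (ms.StackSys)
		{Name: "/memory/classes/total:bytes"},         // basis for goTotal (ms.Sys - ms.HeapReleased)
	}
	metrics.Read(samples) // one batched read, no stop-the-world pause
	for _, s := range samples {
		fmt.Printf("%-40s %d\n", s.Name, s.Value.Uint64())
	}
}
```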

andrewbaptist added the O-support (Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs) label Feb 26, 2024
andrewbaptist (Collaborator) commented:
A few recent support escalations would have been much easier to debug with this information exposed.

nicktrav pushed a commit to nicktrav/cockroach that referenced this issue Sep 5, 2024
This commit introduces functions to extract metrics exposed by the Go runtime
metrics API. The runtime metrics are sampled in the SampleEnvironment call
every 10 seconds. The new metric GcAssistNS is added as an estimate, in
nanoseconds, of the effort user goroutines spend assisting GC activities.

Fixes: cockroachdb#88178

Release note: None