-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
obs: export metrics about Go GC Assist work #118875
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
I have added a few people who may be able to assist in reviewing: 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
checking out the test failures |
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go
line 344 at r1 (raw file):
// in the metricSamples field. func (grm *GoRuntimeSampler) sampleRuntimeMetrics() { metrics.Read(grm.metricSamples)
The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.
I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.
pkg/server/status/runtime.go
line 357 at r1 (raw file):
metricSamples := make([]metrics.Sample, len(metricNames)) metricIndexes := make(map[string]int, len(metricNames)) for i, n := range metricNames {
One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go
line 704 at r1 (raw file):
gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines() gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)
The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
cannot figure out the experimental test failures, seems like a directory on the github action host is missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)
-- commits
line 6 at r2:
GcAssistSecond
pkg/server/status/runtime.go
line 344 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.
I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.
RuntimeStatSampler
can be shared across tenants, but only so that each of those tenants has access to the metric
fields. What you have here seems appropriate.
pkg/server/status/runtime.go
line 357 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier
I think what you have here makes sense, because we don't want to pay for metrics in the metrics.Read
call that we aren't using.
pkg/server/status/runtime.go
line 704 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?
I think that would make sense, to more closely mirror metrics like sys.gc.pause.ns
.
pkg/server/status/runtime.go
line 98 at r2 (raw file):
Unit: metric.Unit_PERCENT, } metaGCAssitSeconds = metric.Metadata{
"Assist"
pkg/server/status/runtime.go
line 99 at r2 (raw file):
} metaGCAssitSeconds = metric.Metadata{ Name: "sys.gc.assit.seconds",
"assist"
pkg/server/status/runtime.go
line 100 at r2 (raw file):
metaGCAssitSeconds = metric.Metadata{ Name: "sys.gc.assit.seconds", Help: "Estimated total CPU time goroutines spent to assist the GC process",
"user goroutines"
pkg/server/status/runtime.go
line 303 at r2 (raw file):
var getCgoMemStats func(context.Context) (uint, uint, error) // Estimated total CPU time goroutines spent performing GC tasks to assist the GC and prevent it from falling behind the
nit: wrap at 80 chars.
pkg/server/status/runtime.go
line 306 at r2 (raw file):
// application. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare // only with other /cpu/classes metrics. const GcAssist = "/cpu/classes/gc/mark/assist:cpu-seconds"
This name is a little too broad to be in the global namespace of this package. And it also doesn't need to be exported. How about runtimeMetricGCAssist
?
pkg/server/status/runtime.go
line 314 at r2 (raw file):
// The collection of metrics we want to sample. metricSamples []metrics.Sample // The mapping to find metric slot in runtimeSamples by name.
s/runtimeSamples/metricSamples/
pkg/server/status/runtime.go
line 318 at r2 (raw file):
} // getIndex find postion of metrics in the sample array by name.
"finds the position"
pkg/server/status/runtime.go
line 327 at r2 (raw file):
} // getFloat64 get the sampled value by metrics name as getFloat64.
"gets"
pkg/server/status/runtime.go
line 327 at r2 (raw file):
} // getFloat64 get the sampled value by metrics name as getFloat64.
s/getFloat64/as a float64/
pkg/server/status/runtime.go
line 337 at r2 (raw file):
// in the metricSamples field. func (grm *GoRuntimeSampler) sampleRuntimeMetrics() { metrics.Read(grm.metricSamples)
Have you taken a look at how long this call takes?
pkg/server/status/runtime.go
line 485 at r2 (raw file):
log.Ops.Errorf(ctx, "could not get initial network stat counters: %v", err) } runtimeMetrics := []string{GcAssist}
Let's move this to a global, right up below GcAssist
.
pkg/server/status/runtime.go
line 587 at r2 (raw file):
// The CGoMemStats should be provided via GetCGoMemStats(). func (rsr *RuntimeStatSampler) SampleEnvironment( ctx context.Context, ms *GoMemStats, cs *CGoMemStats,
Have we taken a look at which runtime metrics we'll need to use to eliminate the use of runtime.ReadMemStats
, in order to address #88178 (comment)? Do all of those metrics exist in runtime/metrics
?
pkg/server/status/runtime.go
line 697 at r2 (raw file):
gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines() gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)
"assist" here and everywhere else 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed review. I will take one more look tmr and send it for another look.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go
line 337 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Have you taken a look at how long this call takes?
added benchmark in the comment
pkg/server/status/runtime.go
line 587 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Have we taken a look at which runtime metrics we'll need to use to eliminate the use of
runtime.ReadMemStats
, in order to address #88178 (comment)? Do all of those metrics exist inruntime/metrics
?
yes I think i have found the metrics that have the same semantics as those fields used in memstats but i haven't yet verified the go runtime impl to see if they have exactly the same impl.
This is the mapping of memstats usage to runtime metrics in my mind
GoAllocBytes: ms.HeapAlloc
=> /gc/heap/allocs:bytes
HeapFragmentBytes: ms.HeapInuse - ms.HeapAlloc
=> /memory/classes/heap/unused:bytes
HeapReservedBytes: ms.HeapIdle - ms.HeapReleased
=> /memory/classes/heap/free:bytes
HeapReleasedBytes: ms.HeapReleased
=> /memory/classes/heap/released:bytes
MemStackSysBytes: ms.StackSys
=> /memory/classes/heap/stacks:bytes
goTotal := ms.Sys - ms.HeapReleased
=> /memory/classes/total:bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/util/log/eventpb/health_events.proto
line 69 at r3 (raw file):
// Estimated total CPU time user goroutines spent performing GC tasks to // assist the GC. Expressed in nanoseconds. uint64 gc_assist_ns = 20 [(gogoproto.customname) = "GCAssistNs", (gogoproto.jsontag) = ",omitempty"];
does it make sense to add this metrics to health_events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)
-- commits
line 6 at r3:
"capture represent"
pkg/server/status/runtime.go
line 337 at r2 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
added benchmark in the comment
Thanks for doing that!
pkg/util/log/eventpb/health_events.proto
line 69 at r3 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
does it make sense to add this metrics to health_events
I don't think it needs to be added here. gc_pause_ns is not part of this either. It also looks like there are some compatibility concerns we'll need to contend with if we add the metric to health_events — see doc.go. So I'm 👍 on removing it from the PR.
This commit introduced functions to extract exposed metrics in go runtime metrics api. The runtime metrics is sampled along in SampleEnvironment call every 10 seconds. New metric GcAssistNS is added as an estimate to amount of effort of user go routines assist in gc activities in nanoseconds. Fixes: cockroachdb#88178 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! I will replace go memstats with runtime metrics in the following pr
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"capture represent"
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier)
bors r+ |
Build succeeded: |
@lyang24 do you want to open a follow-up PR to add this new metric to the "Runtime" dashboard in the UI, right below the "GC Pause Time" graph? |
This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: cockroachdb#118875 Release note: None
119506: ui: add charts on gc assist metric r=lyang24 a=lyang24 This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: #118875 Release note: None gc assist on crdb single node with kv workload gcttl=5s for database in (kv, system) ![image](https://github.com/cockroachdb/cockroach/assets/20375035/b88882d0-8390-4fe4-8343-eb2a7497bc9f) 119877: roachtest: make pg_regress weekly and its failures non-blocking r=yuzefovich a=yuzefovich This should reduce the amount of toil for us until we iron out most of the differences with postgres. Epic: None Release note: None Co-authored-by: lyang24 <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: cockroachdb#118875 Release note: None
This commit introduced functions to extract exposed metrics in go runtime
metrics api. The runtime metrics is sampled along in SampleEnvironment call
every 10 seconds. New metric GcAssistNS is added as an estimate to amount
of effort of user go routines assist in gc activities in nanoseconds.
Fixes: #88178
Relase note: None