Overhaul the telemetry collection for Stackdriver Nozzle #152
Conversation
I've not reviewed in depth here because I have a question for you: why invent your own implementation of metric counters in package telemetry when Go has them in the standard library (package expvar) and there are many other implementations on GitHub? I can't find any that have deep Stackdriver API integration, but I think that's an orthogonal problem to the one the Stackdriver nozzle is trying to solve. I can see that package telemetry needs to exist, because it also has the responsibility of writing the set of counters to the firehose and recording them to any configured Sink, but I think that modelling all the metrics inside a map[string]int is going to cause you problems. I also find it a bit confusing that you create multiple Counters when each Counter can Count multiple things -- why not just have one that counts everything? Review status: 0 of 21 files reviewed at latest revision, all discussions resolved, some commit checks failed. Comments from Reviewable
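For reference, the standard-library approach the reviewer mentions looks roughly like this; the metric names here are illustrative, not the nozzle's actual names:

```go
package main

import (
	"expvar"
	"fmt"
)

// Illustrative counters only. Each exported variable is also visible
// on /debug/vars as JSON when an HTTP server is running.
var (
	eventsReceived = expvar.NewInt("events.received")
	firehoseErrors = expvar.NewMap("firehose.errors")
)

func main() {
	eventsReceived.Add(1)
	firehoseErrors.Add("unknown", 1)

	fmt.Println(eventsReceived.Value())  // 1
	fmt.Println(firehoseErrors.String()) // {"unknown": 1}
}
```

expvar handles the locking internally, which is a large part of what a hand-rolled counter package would otherwise have to reimplement.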
The internal map is super simplified to match what it's replacing. I'd like to see it evolve into a map[string]Counter:

```go
type Counter struct {
	FirstSeen       time.Time
	LastSeen        time.Time
	LastValue       int
	CumulativeValue int
}
```

That will allow us to emit cumulative values. I do not like the multiple counters thing. Here's why it exists:
The TelemetrySink requires an adapter and the adapter requires a counter. I was considering breaking this cycle by having the TelemetrySink write to a stackdriver.MetricClient instead. Interested in what your thoughts are there.
This implementation guarantees that Increment is non-blocking. This is achieved by having the emit call grab the mutex, take a handle on the data, swap in a fresh map, and release the mutex. No matter how long emit() takes, it will not block Increment() for longer than it takes to create a new map. This design is also considerably simpler. No log handler is needed and the running goroutine is no longer responsible for flushing, stopping, and keeping track of incoming Increment calls.
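A minimal sketch of the swap-under-mutex scheme described above (the names are illustrative, not the nozzle's actual types):

```go
package main

import (
	"fmt"
	"sync"
)

// counter keeps Increment cheap: the mutex is only ever held long
// enough to bump a map entry or (in emit) to swap in a fresh map.
type counter struct {
	mu   sync.Mutex
	data map[string]int
}

func newCounter() *counter {
	return &counter{data: make(map[string]int)}
}

func (c *counter) Increment(name string) {
	c.mu.Lock()
	c.data[name]++
	c.mu.Unlock()
}

// emit swaps the live map for an empty one while holding the lock,
// then runs the (potentially slow) report outside the lock, so a slow
// sink can never block Increment for longer than a map swap.
func (c *counter) emit(report func(map[string]int)) {
	c.mu.Lock()
	snapshot := c.data
	c.data = make(map[string]int)
	c.mu.Unlock()

	report(snapshot)
}

func main() {
	c := newCounter()
	c.Increment("events")
	c.Increment("events")
	c.emit(func(m map[string]int) { fmt.Println(m["events"]) }) // 2
}
```

Because emit replaces the map rather than clearing it in place, each reported snapshot is immutable once handed to the sink.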
Move the MetricHandler into the stackdriver package. This removes the need for redefining the Heartbeat interface in the stackdriver package. The MetricHandler has much in common with the MetricAdapter and is more suited to live nearby.
- Rename class/associated variables.
- Extract "heartbeater" action string into a constant.

- Fix up mocks/ names to match refactoring.
- Replace uint with int; it makes for awkward conversions and is unnecessary.
- Reduce telemetry.Sink to Record() instead of expecting the sinks to keep track of metrics and then flush.

- Remove telemetry.Counter and use the expvar package to perform all telemetry collection.
- Remove reporting to Stackdriver Monitoring. This will be reintroduced in a future change.
- Introduce telemetry.Reporter/telemetry.Sink to periodically report on all registered metrics.
Force-pushed 12c43b6 to f84e807
@fluffle new PR description & ready for review. Overhaul the telemetry collection for Stackdriver Nozzle. This change removes the heartbeat.Heartbeater. The new implementation relies on the expvar package. The metric descriptors are now created on boot instead of when they are first encountered, which is important for metrics that are seen infrequently. The metrics are now reported as cumulative values captured between {NozzleBootTime, ReportingTime}. This allows Stackdriver to detect zero values, detect missing data due to shutdown (instead of extrapolating), and accurately add values across nozzle instances.
Code-wise this looks great, and using expvar gives me the /varz page I've been wanting too, so thanks for that! I've left a number of very opinionated comments on how to name and organize metrics, because getting this right from the start makes building consoles that display meaningful information much easier. I'm sorry it's so prescriptive :-/ Reviewed 11 of 25 files at r1, 19 of 21 files at r2. src/stackdriver-nozzle/nozzle/nozzle.go, line 53 at r2 (raw file):
My other comment applies to all these errors too. Even though you don't have a "requests" counter, these errors are all related because they come from the same source, and we may well want to aggregate the total number of firehose errors together while disregarding the exact error type. src/stackdriver-nozzle/nozzle/nozzle.go, line 60 at r2 (raw file):
Naming nit here: "nozzle.events.received", because otherwise it is not immediately clear whether nozzle.events is a total that includes the dropped count and the received count. For extra credit, it'd be good to export a "nozzle.events.total" so it's easy to calculate the percentage of total events that were dropped by the nozzle. src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 51 at r2 (raw file):
Now we're doing things like this I have what amounts to very specific instructions on metric naming and organization. Sorry this is so prescriptive, but doing things in certain ways makes it much, much easier to build good monitoring dashboards.
So here I would recommend you have:

```go
timeSeriesCount = expvar.NewInt("nozzle.metrics.timeseries.count")
timeSeriesReqs  = expvar.NewInt("nozzle.metrics.timeseries.requests")
timeSeriesErrs  = expvar.NewMap("nozzle.metrics.timeseries.errors") // two map values: "unknown" and "out_of_order"
descriptorReqs  = expvar.NewInt("nozzle.metrics.descriptor.requests")
descriptorErrs  = expvar.NewMap("nozzle.metrics.descriptor.errors")
```

You should increment descriptorReqs/Errs in CreateMetricDescriptor rather than PostMetricEvents, because that's where you're actually making a request to Stackdriver. In general, metric increments should be close to the thing they are intended to measure. src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 81 at r2 (raw file):
I think that keeping nozzle internal metrics logically separate from those derived from the firehose might be a good idea, so I'm not sure you need to use the path prefix here. I'd argue that two separate hierarchies are better: custom.googleapis.com/firehose/gorouter.total_requests etc. and custom.googleapis.com/stackdriver-nozzle/... Otherwise you have stackdriver-nozzle as the lone "subdirectory" of custom.googleapis.com/firehose, because PCF doesn't do path-style metric names. Alternatively, if you are going to prefix all metrics with a "nozzle" origin for filtering purposes (see other comment) then that may be enough to distinguish the nozzle metrics from other firehose metrics, thus you could put them under the same path prefix without the "subdirectory". src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 82 at r2 (raw file):
Since we expect to be monitoring more than one stackdriver nozzle, we need the "index" label to contain the GCE VM instance name or ID for all the metrics the nozzle exports. I can't give 100% concrete advice on how best to do this right now because I am working with [email protected] to figure out why the current "index" from the firehose envelope is a UUID that appears to be completely disconnected from anything in GCE, but the desire is that we have "index" be something that we can directly relate back to the GCE VM instance this binary is running on. Prefer instance name for now, because that's the route I currently expect we will go down? src/stackdriver-nozzle/telemetry/reporter.go, line 75 at r2 (raw file):
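One way to attach instance identity to each exported metric is a monitored resource; this sketch uses a plain struct standing in for the Stackdriver MonitoredResource proto (the real code would use the generated proto type), and the label set shown is the one gce_instance actually carries:

```go
package main

import "fmt"

// monitoredResource is a stand-in for the Stackdriver
// MonitoredResource proto, shown only to illustrate the label set.
type monitoredResource struct {
	Type   string
	Labels map[string]string
}

// gceInstance builds a "gce_instance" resource. Its three labels give
// exactly the project/zone/instance granularity discussed here, which
// is why it is preferable to a free-form "index" label.
func gceInstance(projectID, zone, instanceID string) monitoredResource {
	return monitoredResource{
		Type: "gce_instance",
		Labels: map[string]string{
			"project_id":  projectID,
			"zone":        zone,
			"instance_id": instanceID,
		},
	}
}

func main() {
	r := gceInstance("my-project", "us-central1-a", "1234567890")
	fmt.Println(r.Type, r.Labels["zone"]) // gce_instance us-central1-a
}
```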
It might be easier to enforce that all nozzle metrics begin with "nozzle.", a bit like the way we prepend the event origin to firehose metrics to end up with "gorouter.total_requests", then explicitly whitelist those.
Awesome feedback! That's exactly what I want. I left the names alone because I figured we'd want an all up discussion. Now to implement. Review status: all files reviewed at latest revision, 6 unresolved discussions. src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 81 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 82 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
I was hoping to accomplish this by associating the TimeSeries with the gce_instance monitored resource. If we're going to add an additional label to the firehose metrics as well then we should probably add it here too.
old: custom.googleapis.com/metricPrefix/stackdriver-nozzle/.. new: custom.googleapis.com/stackdriver-nozzle/..
- All naming follows stackdriver-nozzle/nozzle.
- More consistent naming for .requests, .count.
- Added nozzle.events.total for easier calculations.
- Errors now roll up as a single map. Added map support to stackdriver.telemetrySink. The key is reported as the "kind" value in the metric label.

TODO: Fix batching in stackdriver.telemetrySink

Snapshot of reported metrics:

```json
{
  "nozzle.events.dropped": 0,
  "nozzle.events.received": 5303,
  "nozzle.events.total": 5303,
  "nozzle.firehose.errors": {"close_normal_closure": 0, "close_policy_violation": 0, "close_unknown": 0, "empty": 0, "unknown": 0},
  "nozzle.logs.count": 8,
  "nozzle.metrics.descriptor.errors": 0,
  "nozzle.metrics.descriptor.requests": 0,
  "nozzle.metrics.firehose_events.count": 0,
  "nozzle.metrics.firehose_events.sampled": 3599,
  "nozzle.metrics.timeseries.count": 0,
  "nozzle.metrics.timeseries.errors": {"out_of_order": 0, "unknown": 0},
  "nozzle.metrics.timeseries.requests": 0
}
```
Lots of naming/data type fixes. Let me know what you think! I've got a small TODO before it can be merged and I want to make sure the way I'm doing map reporting to stackdriver is useful. Review status: 12 of 25 files reviewed at latest revision, 6 unresolved discussions. src/stackdriver-nozzle/nozzle/nozzle.go, line 53 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
Done. src/stackdriver-nozzle/nozzle/nozzle.go, line 60 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
Done. src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 51 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
Done. src/stackdriver-nozzle/telemetry/reporter.go, line 75 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
Done.
Looking good! Minor nits / commentary but nothing that needs another RT. Reviewed 1 of 25 files at r1, 13 of 13 files at r3. src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 60 at r3 (raw file):
expvar will create Ints for you if you Add("key", 1) to a nonexistent key, so what you're doing here is not strictly necessary. But I like this approach, because it explicitly exports zero values for the map data when no errors have been seen. src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 82 at r2 (raw file): Previously, johnsonj (Jeff Johnson) wrote…
That looks useful (and #TIL, so thanks!) We should definitely use the gce_instance monitored resource instead of an index label, you made the right call here. It looks like that provides 3 levels of granularity: project, zone and instance. This is good \o/ I guess the big problem is that we can't directly attribute firehose metrics to a gce_instance yet. I'm not sure that matters, or whether Stackdriver requires the instance_id field to be set to an actual instance ID. Maybe we can get away with creating monitored resource protos with the index uuid BOSH provides as the instance_id and dropping the index label. I don't know how we would figure out what zone a given uuid is in without asking the BOSH director, though. I wonder if it's safe for us to conflate diego containers with GKE ones and use the gke_container resource instead of instanceIndex. Probably not :-/ src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 115 at r3 (raw file):
This should be LabelDescriptor_STRING, because the value for this label is the (string) map key e.g. "out_of_order", and the value for the metric will be the expvar's value. Also I think "kind" is a bit generic as a label name, though I can understand why you don't want to use "error", just in case you want to export maps of Other Things. Naming is hard, and expvar doesn't provide any way of associating metadata with a metric :-( src/stackdriver-nozzle/stackdriver/telemetry_sink_test.go, line 145 at r3 (raw file):
STRING here too. src/stackdriver-nozzle/telemetry/reporter.go, line 42 at r3 (raw file):
Now you've got this constant here, is it possible to use it wherever metrics are created? Otherwise, if you want to change it you have a fun multi-file search-and-replace to do. Or does this re-introduce the circular dependency problem?
Review status: all files reviewed at latest revision, 3 unresolved discussions. src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 60 at r3 (raw file): Previously, fluffle (Alex Bee) wrote…
I went that route first but settled on this one for that reason. Got to have zeros! src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 82 at r2 (raw file): Previously, fluffle (Alex Bee) wrote…
Probably not src/stackdriver-nozzle/stackdriver/telemetry_sink.go, line 115 at r3 (raw file): Previously, fluffle (Alex Bee) wrote…
Thanks! Fixing _STRING. I was thinking the best way to do this would be to implement something like this:

```go
type Group struct {
	expvar.Map
	Category string
}
```

But it felt like overkill for now. src/stackdriver-nozzle/stackdriver/telemetry_sink_test.go, line 145 at r3 (raw file): Previously, fluffle (Alex Bee) wrote…
Done.
Kind is a categorical value, not an integer.
This change makes this part of the pipeline consistent with nozzle.metrics.firehose_events to take the guesswork out of tracking events through the nozzle.
Be more explicit about values we want to track. This hack is dropped in telemetry/reporter.go:

```go
expvar.Do(func(val expvar.KeyValue) {
	if strings.HasPrefix(val.Key, "nozzle") {
		// ...
	}
})
```

We can now specify the category of map values instead of using the generic "kind".

/debug/vars after this change:

```json
{
  "stackdriver-nozzle/firehose.errors": {"close_normal_closure": 0, "close_policy_violation": 0, "close_unknown": 0, "empty": 0, "unknown": 0},
  "stackdriver-nozzle/firehose_events.dropped": 0,
  "stackdriver-nozzle/firehose_events.received": 0,
  "stackdriver-nozzle/firehose_events.total": 0,
  "stackdriver-nozzle/logs.count": 0,
  "stackdriver-nozzle/metrics.descriptor.errors": 0,
  "stackdriver-nozzle/metrics.descriptor.requests": 0,
  "stackdriver-nozzle/metrics.firehose_events.emitted.count": 0,
  "stackdriver-nozzle/metrics.firehose_events.sampled.count": 0,
  "stackdriver-nozzle/metrics.timeseries.count": 0,
  "stackdriver-nozzle/metrics.timeseries.errors": {"out_of_order": 0, "unknown": 0},
  "stackdriver-nozzle/metrics.timeseries.requests": 0
}
```
Thanks @fluffle! I've made a big code change with only minor functional changes by introducing telemetry.Counter and telemetry.CounterMap. These wrap the native expvar types and allow us to be explicit about which metrics we gather. I really like this approach now. I wonder if the telemetry/ package would make a good Go library. Review status: all files reviewed at latest revision, 3 unresolved discussions. src/stackdriver-nozzle/telemetry/reporter.go, line 42 at r3 (raw file): Previously, fluffle (Alex Bee) wrote…
The circular dependency was resolved because the telemetry package doesn't rely on stackdriver. Instead it defines the interface it needs (telemetry.Sink) and lets the stackdriver package satisfy it. Much more idiomatic Go.
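The telemetry.Counter/telemetry.CounterMap wrappers described above could look roughly like this; a sketch under assumed names, not the nozzle's actual code:

```go
package main

import (
	"expvar"
	"fmt"
)

// Counter and CounterMap are thin wrappers over the native expvar
// types, letting the nozzle register metrics explicitly instead of
// scraping everything out of expvar.Do.
type Counter struct{ expvar.Int }

type CounterMap struct {
	expvar.Map
	Category string // label name used when reporting map keys
}

// registered holds every explicitly created metric, so a reporter can
// iterate exactly the metrics the nozzle owns.
var registered []expvar.Var

func NewCounter(name string) *Counter {
	c := new(Counter)
	expvar.Publish(name, c) // still visible on /debug/vars
	registered = append(registered, c)
	return c
}

func main() {
	c := NewCounter("firehose_events.received")
	c.Add(5)
	fmt.Println(c.Value(), len(registered)) // 5 1
}
```

Embedding the expvar types keeps the familiar Add/Value API while the explicit registry replaces the prefix-whitelist hack.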
This design replaces the heartbeat.Heartbeater with a simplified implementation that matches its functionality. This reduces complexity and provides a simple interface for keeping track of events within the nozzle. This work was done to fix #150 and as pre-work to allow the nozzle to write cumulative/delta metrics for its own telemetry instead of gauges.

- Remove the heartbeat.Heartbeater interface in the stackdriver package by moving the MetricHandler into the stackdriver package.
- Reduce heartbeat.Handler to recording values (instead of storing and flushing) and rename it to telemetry.Sink.
- Replace heartbeat.Heartbeater with telemetry.Counter.