Skip to content
This repository has been archived by the owner on Jun 25, 2020. It is now read-only.

Report CounterEvents as cumulative metrics to Stackdriver #162

Merged
merged 1 commit into from
Dec 12, 2017

Conversation

knyar
Copy link
Collaborator

@knyar knyar commented Dec 6, 2017

This adds a "counter tracker" which is used to maintain start time and a separate cumulative value for each counter metric, which need to be reported to Stackdriver.

As the result, each counter is properly exported as a cumulative Stackdriver metric instead of two gauge metrics (.total and .delta).


This change is Reviewable

@knyar knyar requested review from fluffle and johnsonj December 6, 2017 18:27
@johnsonj
Copy link
Contributor

johnsonj commented Dec 6, 2017

:lgtm: structure is good, just nit stuff.

With this change do we need to say only run a single instance of the nozzle (until consistent nozzling lands)?


Reviewed 9 of 9 files at r1.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


src/stackdriver-nozzle/messages/metric.go, line 31 at r1 (raw file):

	// Extract keys to a slice and sort it
	numKeys := len(m.Labels) + 1
	keys := make([]string, 0, numKeys)

good catch since we're appending below.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 101 at r1 (raw file):

// At least two values need to be observed for a given counter to determine the total value, so for the first observed
// value, `nil` will be returned instead of the start time. In that case, the returned value should be discarded.
func (t *counterTracker) GetTotal(name string, value float64, eventTime *time.Time) (float64, *time.Time) {

suggest naming: UpdateCounter, Track, Update, ... Get isn't generally stateful.

Curious why you went with passing/returning eventTime as a pointer?


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 114 at r1 (raw file):

			lastSeenTime: time.Now(),
		}
		return 0, nil

suggest adding another return parameter, ok bool. I don't think the caller should make the decision to publish the metric based on the metric/time value.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 117 at r1 (raw file):

	}
	if c.lastValue > value {
		// Counter has been reset.

is there a reason not to reset our counter as well in this scenario?


src/stackdriver-nozzle/nozzle/metric_sink_test.go, line 241 at r1 (raw file):

		expectedTotals := []float64{10, 20, 25, 45}
		for idx, total := range expectedTotals {
			Expect(metrics[idx]).To(MatchFields(IgnoreExtras, Fields{

IgnoreExtras- very awesome.


Comments from Reviewable

@fluffle
Copy link
Collaborator

fluffle commented Dec 7, 2017

Review status: all files reviewed at latest revision, 10 unresolved discussions.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 31 at r1 (raw file):

func init() {
	countersExpiredCount = telemetry.NewCounter(telemetry.Nozzle, "metrics.counters.expired.count")

Naming nit: "total" instead of "count" for consistency? or just drop the "count" (it's cleaner!)


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 66 at r1 (raw file):

	counters map[string]*counterData
	mu       *sync.Mutex // protects `counters`
	ttl      *time.Duration

Similarly here time.Duration is really just an int64 so this should probably not be a pointer. I can't find any "pointer is nil" overloading for this particular field either.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 79 at r1 (raw file):

		ttl:      &ttl,
		logger:   logger,
		ticker:   time.NewTicker(time.Duration(ttl.Nanoseconds() / 2)),

Is this frequent enough? It means that the default TTL of 70s really means "between 70 and 105 seconds", which is a relatively large interval.

I understand why basing it on the TTL makes sense -- to scale down to small TTLs effectively -- but maybe there needs to be an upper limit on the tick time to prevent large TTLs having a lot of variance?


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 101 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

suggest naming: UpdateCounter, Track, Update, ... Get isn't generally stateful.

Curious why you went with passing/returning eventTime as a pointer?

+1 to time not being a pointer. https://golang.org/pkg/time/#Time says:

"Programs using times should typically store and pass them as values, not pointers. That is, time variables and struct fields should be of type time.Time, not *time.Time."


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 114 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

suggest adding another return parameter, ok bool. I don't think the caller should make the decision to publish the metric based on the metric/time value.

I agree that a bool that indicates whether the metric should be published is a better approach than having the nil-ness of a time.Time pointer convey the same information.


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 42 at r1 (raw file):

	It("increments counters and handles counter resets", func() {
		ctx, cancel := context.WithCancel(context.TODO())

Isn't context.Background() OK to use in tests?


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 69 at r1 (raw file):

		subject = NewCounterTracker(ctx, counterTTL, logger)

		runTest := func(baseTime time.Time) {

I think with a little effort you could significantly reduce the amount of copy/paste code here...


src/stackdriver-nozzle/nozzle/metric_sink_test.go, line 208 at r1 (raw file):

		for idx, values := range eventValues {
			ts := eventTime.UnixNano() + int64(idx*1e9) // Events are 1 second apart.

Ugh, many integer types. I was going to say that you should probably prefer to use time.Second here, but the amount of typecasting required for this to work is awkward. Even so,

eventTime.UnixNano() + int64(time.Second)*int64(idx)

or similar might be better.


src/stackdriver-nozzle/nozzle/metric_sink_test.go, line 209 at r1 (raw file):

		for idx, values := range eventValues {
			ts := eventTime.UnixNano() + int64(idx*1e9) // Events are 1 second apart.
			delta := uint64(values[0])

Is the cast required here? eventValues is [][]uint64 already.


Comments from Reviewable

@knyar
Copy link
Collaborator Author

knyar commented Dec 7, 2017

Oh yes, good point: this requires running a single copy of the nozzle. Should I document that somewhere, or maybe keep the old behavior making the counterTracker path enabled by a configuration option?

I have also discussed this with @fluffle this morning and made some significant changes:

  • since counter values are uint64 in Firehose, I've changed cumulative metrics to be integers as well (int64). This required some wider changes, and creating Stackdriver protos seems to be more convenient in the Metric now. This seems like it also moves us slightly closer to Use native Stackdriver Monitoring Protos #142.
  • expvar is now used to store actual counters. I've changed metric.Hash() to be slightly more human readable to make metrics easy to read in /debug/vars.

Review status: 2 of 9 files reviewed at latest revision, 10 unresolved discussions.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 31 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Naming nit: "total" instead of "count" for consistency? or just drop the "count" (it's cleaner!)

Done.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 66 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Similarly here time.Duration is really just an int64 so this should probably not be a pointer. I can't find any "pointer is nil" overloading for this particular field either.

Done.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 79 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Is this frequent enough? It means that the default TTL of 70s really means "between 70 and 105 seconds", which is a relatively large interval.

I understand why basing it on the TTL makes sense -- to scale down to small TTLs effectively -- but maybe there needs to be an upper limit on the tick time to prevent large TTLs having a lot of variance?

Good point, thanks. Capped at 10 seconds.
Also, default TTL increased to 130 seconds since some metrics are only emitted every minute and I'd like to ensure a single lost point does not result in a counter reset.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 101 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

+1 to time not being a pointer. https://golang.org/pkg/time/#Time says:

"Programs using times should typically store and pass them as values, not pointers. That is, time variables and struct fields should be of type time.Time, not *time.Time."

Renamed to Update().

Switched time types to be passed as values. Instead of interpreting a nil pointer as indication of the first value seen for a counter, metric sink just uses the cumulative value of 0 (which seems more intuitive anyway).


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 114 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

I agree that a bool that indicates whether the metric should be published is a better approach than having the nil-ness of a time.Time pointer convey the same information.

Currently that decision is made based on returned cumulative value being more than 0.

This seems logical, since the returned value is in fact correct and it could be used if Stackdriver accepted points where event time == start time. So the decision of whether to publish that value or not should be made by the caller, I think. Please let me know if you disagree.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 117 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

is there a reason not to reset our counter as well in this scenario?

Well, I don't see a reason to reset it, given that we have all the state to continue adding to the existing total.
Also, I suspect detecting counter resets has a non-zero (albeit probably quite small) cost in Stackdriver.


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 42 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Isn't context.Background() OK to use in tests?

It does seem like it should be OK, however other tests are using context.TODO().
@johnsonj, any reason for that, or should I change the other tests as well (with a separate commit)?


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 69 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

I think with a little effort you could significantly reduce the amount of copy/paste code here...

Done. Not sure if this is the right way to do such things in Ginkgo.


src/stackdriver-nozzle/nozzle/metric_sink_test.go, line 208 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Ugh, many integer types. I was going to say that you should probably prefer to use time.Second here, but the amount of typecasting required for this to work is awkward. Even so,

eventTime.UnixNano() + int64(time.Second)*int64(idx)

or similar might be better.

Done.


src/stackdriver-nozzle/nozzle/metric_sink_test.go, line 209 at r1 (raw file):

Previously, fluffle (Alex Bee) wrote…

Is the cast required here? eventValues is [][]uint64 already.

Done.


Comments from Reviewable

@johnsonj
Copy link
Contributor

johnsonj commented Dec 7, 2017

I think we should make it enabled by configuration. This is the ideal behavior but if we enforce single node deployments then we break scaling where it works fine today (for Stackdriver Logging).


Reviewed 7 of 7 files at r2.
Review status: all files reviewed at latest revision, 13 unresolved discussions.


src/stackdriver-nozzle/messages/metric.go, line 38 at r2 (raw file):

// NeedsMetricDescriptor determines whether a custom metric descriptor needs to be created for this metric in Stackdriver.
// We do that if we need to set a custom unit, or mark metric as a cumulative.

very helpful comment!


src/stackdriver-nozzle/messages/metric.go, line 70 at r2 (raw file):

		Unit:        m.Unit,
		Description: "stackdriver-nozzle created custom metric.",
		DisplayName: m.Name, // TODO

is this TODO relevant or copy/pasted?


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 79 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

Good point, thanks. Capped at 10 seconds.
Also, default TTL increased to 130 seconds since some metrics are only emitted every minute and I'd like to ensure a single lost point does not result in a counter reset.

I think a fixed period would be just fine and more predicable (1? 5? 10 seconds?). Agree with the TTL increase being larger than two metric periods.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 101 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

Renamed to Update().

Switched time types to be passed as values. Instead of interpreting a nil pointer as indication of the first value seen for a counter, metric sink just uses the cumulative value of 0 (which seems more intuitive anyway).

New header looks good. Resolving as the 'ok' discussion is below.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 114 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

Currently that decision is made based on returned cumulative value being more than 0.

This seems logical, since the returned value is in fact correct and it could be used if Stackdriver accepted points where event time == start time. So the decision of whether to publish that value or not should be made by the caller, I think. Please let me know if you disagree.

0 seems like a potential real value to me. If a counter was 'outOfMemoryErrors' then it very well may emit a cumulative 0 for a long period of time. I think we want to record that value in Stackdriver Monitoring and not hide it.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 165 at r2 (raw file):

		if time.Now().Sub(counter.lastSeenTime) > t.ttl {
			t.counters[name].totalValue.Set(-1)
			t.logger.Info("counterTracker", lager.Data{

suggest logging before resetting the value. is it necessary to reset the value? it looks like newCounterData will take care of that if it reappears.


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 42 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

It does seem like it should be OK, however other tests are using context.TODO().
@johnsonj, any reason for that, or should I change the other tests as well (with a separate commit)?

agree Background is correct. Would appreciate that clean up in another commit.


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 69 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

Done. Not sure if this is the right way to do such things in Ginkgo.

It's totally reasonable and a good clean up. Ginkgo allows you to create custom matchers[1] but it may not be the right fit here.

[1] https://onsi.github.io/gomega/#adding-your-own-matchers


src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 155 at r2 (raw file):

	req := &monitoringpb.CreateMetricDescriptorRequest{
		Name:             projectName,
		MetricDescriptor: metric.MetricDescriptor(projectName),

works for me, seems like a positive change


Comments from Reviewable

@knyar knyar force-pushed the pr-counters branch 2 times, most recently from ea04f33 to 69ecbb1 Compare December 8, 2017 13:35
@knyar
Copy link
Collaborator Author

knyar commented Dec 8, 2017

Sounds good. I've added a new configuration option for this, which can be enabled in the manifest.


Review status: 1 of 11 files reviewed at latest revision, 12 unresolved discussions.


src/stackdriver-nozzle/messages/metric.go, line 70 at r2 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

is this TODO relevant or copy/pasted?

It has been copy/pasted from metric_adapter.go. Should I remove it?


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 79 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

I think a fixed period would be just fine and more predicable (1? 5? 10 seconds?). Agree with the TTL increase being larger than two metric periods.

As Alex mentioned, there's value in having period decrease for short TTLs, mostly to make tests pass faster. In production it will probably always be capped by maxExpirePeriod.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 114 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

0 seems like a potential real value to me. If a counter was 'outOfMemoryErrors' then it very well may emit a cumulative 0 for a long period of time. I think we want to record that value in Stackdriver Monitoring and not hide it.

Very good point, thanks. I've actually changed the condition in the metric sink to check for non-zero time interval, which is what Stackdriver actually expects.


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 165 at r2 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

suggest logging before resetting the value. is it necessary to reset the value? it looks like newCounterData will take care of that if it reappears.

Added logging. Resetting to -1 makes expired counters distinguishable from active ones on the /debug/vars HTTP page. I've added a comment clarifying this.


src/stackdriver-nozzle/nozzle/counter_tracker_test.go, line 69 at r1 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

It's totally reasonable and a good clean up. Ginkgo allows you to create custom matchers[1] but it may not be the right fit here.

[1] https://onsi.github.io/gomega/#adding-your-own-matchers

Thanks!


src/stackdriver-nozzle/stackdriver/metric_adapter.go, line 155 at r2 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

works for me, seems like a positive change

Cool, thanks for the feedback.


Comments from Reviewable

@johnsonj
Copy link
Contributor

johnsonj commented Dec 8, 2017

:lgtm:


Reviewed 10 of 10 files at r3.
Review status: all files reviewed at latest revision, 12 unresolved discussions.


jobs/stackdriver-nozzle/spec, line 71 at r3 (raw file):

    default: cf

  nozzle.enable_cumulative_counters:

Just a note- I think it's good to leave this out of the tile.yaml.erb for now. Likely we'll want to just enable this feature once the upstream work lands in PCF.


src/stackdriver-nozzle/messages/metric.go, line 70 at r2 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

It has been copy/pasted from metric_adapter.go. Should I remove it?

yes please! no clue what it's a TODO - to do :)


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 79 at r1 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

As Alex mentioned, there's value in having period decrease for short TTLs, mostly to make tests pass faster. In production it will probably always be capped by maxExpirePeriod.

Ah ok thanks!


src/stackdriver-nozzle/nozzle/counter_tracker.go, line 165 at r2 (raw file):

Previously, knyar (Anton Tolchanov) wrote…

Added logging. Resetting to -1 makes expired counters distinguishable from active ones on the /debug/vars HTTP page. I've added a comment clarifying this.

thanks! that makes sense


Comments from Reviewable

This adds a "counter tracker" which is used to maintain start time and a
separate cumulative value for each counter metric, which need to be
reported to Stackdriver.
@fluffle
Copy link
Collaborator

fluffle commented Dec 11, 2017

:lgtm: Nice :-)


Review status: 10 of 11 files reviewed at latest revision, 11 unresolved discussions.


Comments from Reviewable

@fluffle
Copy link
Collaborator

fluffle commented Dec 11, 2017

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@knyar
Copy link
Collaborator Author

knyar commented Dec 11, 2017

Review status: all files reviewed at latest revision, 3 unresolved discussions.


jobs/stackdriver-nozzle/spec, line 71 at r3 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

Just a note- I think it's good to leave this out of the tile.yaml.erb for now. Likely we'll want to just enable this feature once the upstream work lands in PCF.

Sounds good.


src/stackdriver-nozzle/messages/metric.go, line 70 at r2 (raw file):

Previously, johnsonj (Jeff Johnson) wrote…

yes please! no clue what it's a TODO - to do :)

Done.


Comments from Reviewable

@knyar
Copy link
Collaborator Author

knyar commented Dec 11, 2017

@johnsonj, I believe this is now ready for you to merge. I don't seem to be able to merge myself because of some pending comments in Reviewable, even though I believe I've addressed all of them.

@johnsonj johnsonj merged commit 5baad64 into cloudfoundry-community:develop Dec 12, 2017
@johnsonj johnsonj mentioned this pull request Feb 20, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants