
eventbus: expose metrics #2020

Closed
marten-seemann opened this issue Jan 27, 2023 · 6 comments · Fixed by #2038
Labels: effort/hours · exp/novice · good first issue · help wanted · kind/enhancement

Comments

marten-seemann (Contributor) commented Jan 27, 2023

This is very limited in scope, but might be a very useful indicator that things are going wrong. We should track:

  • events sent (by event type; might require reflection or a type assertion)
  • time from sending an event until all subscribers were notified (by event type)
  • channels filled up (when events are sent faster than they're consumed)
  • number of subscribers (by event type); this would be a good use of a Prometheus gauge
marten-seemann added the kind/enhancement, help wanted, exp/novice, good first issue, and effort/hours labels on Jan 27, 2023
sukunrt (Member) commented Feb 1, 2023

events sent (by event type; might require reflection or a type assertion)

Since an emitter is already associated with a type, we would not need reflection. Something like this would do, right?

import (
	"reflect"

	"github.com/prometheus/client_golang/prometheus"
)

// Counter of emitted events, labeled by event type.
var eventsEmitted = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: metricNamespace,
		Name:      "events_emitted_total",
		Help:      "Events Emitted",
	},
	[]string{"event"},
)

// EventEmitted increments the counter for the emitted event's type.
func (m *metricsTracer) EventEmitted(typ reflect.Type) {
	eventsEmitted.WithLabelValues(typ.String()).Inc()
}

This is the approach taken in PR #2038.

time from sending an event until all subscribers were notified (by event type)

By "subscribers were notified", do you mean that the message was successfully pushed onto the channel and the Emit method returned?

channels filled up (when events are sent faster than they're consumed)

This can be done either by

  1. starting a monitoring goroutine for the subscription which reports the channel length periodically, or
  2. sending the metric when an event is pushed onto the channel.

I think option 2 is simpler: we don't spawn any goroutines that we need to clean up, and the case where we are unable to send the metric because the emitter is stuck on a slow channel will be caught by the time taken to notify all subscriptions.

For either method to be useful for monitoring, we'd need to add a name to the subscription so that we know which subscription is lagging.

marten-seemann (Contributor, Author) commented:

time from sending an event until all subscribers were notified (by event type)

By "subscribers were notified", do you mean that the message was successfully pushed onto the channel and the Emit method returned?

Yes, exactly. This would help detect if one of the consumers (the subscribers) is too slow at consuming events.
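
A minimal sketch of what this could look like, assuming the same metricNamespace and metricsTracer as in the earlier snippet plus the standard time package (the metric name, buckets, and the EventNotified hook are placeholders for illustration, not the PR's actual API):

var eventNotificationTime = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: metricNamespace,
		Name:      "event_notification_seconds",
		Help:      "Time from Emit until all subscriber channels received the event",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 4, 10),
	},
	[]string{"event"},
)

// Hypothetical hook: the bus calls this with the measured fan-out duration.
func (m *metricsTracer) EventNotified(typ reflect.Type, d time.Duration) {
	eventNotificationTime.WithLabelValues(typ.String()).Observe(d.Seconds())
}

// At the emit site, roughly:
//	start := time.Now()
//	// ... push the event onto every subscriber channel ...
//	m.EventNotified(typ, time.Since(start))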

channels filled up (when events are sent faster than they're consumed)

This can be done either by

  1. starting a monitoring goroutine for the subscription which reports the channel length periodically, or
  2. sending the metric when an event is pushed onto the channel.

I think option 2 is simpler: we don't spawn any goroutines that we need to clean up, and the case where we are unable to send the metric because the emitter is stuck on a slow channel will be caught by the time taken to notify all subscriptions.

For either method to be useful for monitoring, we'd need to add a name to the subscription so that we know which subscription is lagging.

Agreed, we should add the name here. I'd definitely prefer option 2.
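
A rough sketch of option 2, assuming each subscription carries a name (the metric and hook names are illustrative, not what the PR actually ships): whenever the bus pushes an event onto a subscriber's channel, it also records how full that channel currently is.

var subscriberQueueLength = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: metricNamespace,
		Name:      "subscriber_queue_length",
		Help:      "Events waiting in a subscriber's channel, sampled at emit time",
	},
	[]string{"subscriber"},
)

// Hypothetical hook, called right after an event is pushed onto a subscriber's channel.
func (m *metricsTracer) SubscriberQueueLength(name string, n int) {
	subscriberQueueLength.WithLabelValues(name).Set(float64(n))
}

// At the emit site, roughly:
//	sub.ch <- evt
//	m.SubscriberQueueLength(sub.name, len(sub.ch))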

marten-seemann (Contributor, Author) commented:

  • number of subscribers (by event type); this would be a good use of a Prometheus gauge

I added another useful metric to the list above. @sukunrt, what do you think? Happy to hear your thoughts on these. Also feel free to suggest other metrics that might be interesting :)
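
For the subscriber-count gauge, something along these lines could work (again a sketch under the same assumptions, with hypothetical hook names): increment when a subscription for an event type is added, decrement when it is closed.

var subscriberCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: metricNamespace,
		Name:      "subscribers",
		Help:      "Current number of subscribers per event type",
	},
	[]string{"event"},
)

func (m *metricsTracer) AddSubscriber(typ reflect.Type) {
	subscriberCount.WithLabelValues(typ.String()).Inc()
}

func (m *metricsTracer) RemoveSubscriber(typ reflect.Type) {
	subscriberCount.WithLabelValues(typ.String()).Dec()
}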

sukunrt (Member) commented Feb 2, 2023

Number of subscribers is nice!
I want to track time spent in the queue (channel), but I can't think of a simple way to measure that directly. Can we rely on Little's law, Queue Length = Wait Time × Arrival Rate? Since we have the queue length and the arrival rate, we can track the average wait time over a period.
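
To make that concrete with made-up numbers: if a subscriber's channel holds 20 events on average and events arrive at 10 per second, Little's law gives an average wait of 20 / 10 = 2 seconds in the queue.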

marten-seemann (Contributor, Author) commented:

Smart idea! I'd like to avoid hard-coding that, though. Maybe we can track the channel length instead? The only caveat is that we can only do that when we're actually emitting an event.

sukunrt (Member) commented Feb 2, 2023

I updated the PR:

The wait-time metric isn't working as I'd have liked; we can remove it, but I'm leaving it in for you to take a look.
The problem is that when no events are sent on a channel, the arrival rate goes down, so queue length / arrival rate shoots up, which is very misleading.

I also added a measurement of events emitted, broken down by subscriber.

marten-seemann linked a pull request on Feb 4, 2023 that will close this issue