support/db: Add round_trip_time_seconds metric #4009
Nice. One suggestion (💡), regardless LGTM.
support/db/metrics.go
Outdated
	go func() {
		for {
			select {
			case <-time.After(time.Second):
				ctx, cancel := context.WithTimeout(context.Background(), time.Second)
				startTime := time.Now()
				_, err := s.ExecRaw(ctx, "select 1")
				if err == nil {
					s.roundTripTimeSummary.Observe(time.Since(startTime).Seconds())
				}
				cancel()
			case <-s.close:
				return
			}
		}
	}()
💡 I think this code would benefit from being isolated into its own type. The type can be constructed with a reference to a session, and a summary to write observations to, and it can hold and hide the chan and the sync.Once from this code by providing its own Close function.
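A minimal sketch of what such a type could look like, not the code that ended up in the PR. The names `probe`, `pinger`, `observer`, `fakeSession`, `recorder` and `runProbeDemo` are all hypothetical, and the two small interfaces stand in for the real session and summary so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pinger is a hypothetical stand-in for the session: one cheap round trip.
type pinger interface {
	Ping() error
}

// observer is a hypothetical stand-in for the summary metric.
type observer interface {
	Observe(seconds float64)
}

// probe owns the goroutine, the close channel and the sync.Once,
// so the session type does not need to know about any of them.
type probe struct {
	session   pinger
	summary   observer
	interval  time.Duration
	closeChan chan struct{}
	closeOnce sync.Once
}

func newProbe(s pinger, o observer, interval time.Duration) *probe {
	return &probe{session: s, summary: o, interval: interval, closeChan: make(chan struct{})}
}

// start launches the probing goroutine; it exits once Close is called.
func (p *probe) start() {
	go func() {
		for {
			select {
			case <-time.After(p.interval):
				begin := time.Now()
				if err := p.session.Ping(); err == nil {
					p.summary.Observe(time.Since(begin).Seconds())
				}
			case <-p.closeChan:
				return
			}
		}
	}()
}

// Close stops the goroutine and is safe to call more than once.
func (p *probe) Close() {
	p.closeOnce.Do(func() { close(p.closeChan) })
}

// fakeSession and recorder exist only to exercise the sketch.
type fakeSession struct{}

func (fakeSession) Ping() error { return nil }

type recorder struct {
	mu    sync.Mutex
	count int
}

func (r *recorder) Observe(float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.count++
}

func (r *recorder) Count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.count
}

// runProbeDemo probes every millisecond for a short while, closes the
// probe twice, and reports how many observations were recorded.
func runProbeDemo() int {
	rec := &recorder{}
	p := newProbe(fakeSession{}, rec, time.Millisecond)
	p.start()
	time.Sleep(50 * time.Millisecond)
	p.Close()
	p.Close()
	return rec.Count()
}

func main() {
	fmt.Println("observations:", runProbeDemo())
}
```

The caller only ever sees `newProbe`, `start` and `Close`; the channel and the `sync.Once` stay private to the type.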
support/db/metrics.go
Outdated
	s.closeOnce.Do(func() {
		close(s.close)
	})
🎉 Nice safe handling of the channel close to make sure we never panic on a double close.
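For context, a small illustration of why the guard matters: closing an already closed channel panics in Go, while wrapping the close in `sync.Once` makes repeated calls harmless. The helper names `closeTwiceRaw` and `closeTwiceGuarded` are made up for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// closeTwiceRaw closes the same channel twice and reports whether that panicked.
func closeTwiceRaw() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch := make(chan struct{})
	close(ch)
	close(ch) // second close of the same channel panics
	return false
}

// closeTwiceGuarded does the same through sync.Once, so the close runs only once.
func closeTwiceGuarded() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch := make(chan struct{})
	var once sync.Once
	closeFn := func() { once.Do(func() { close(ch) }) }
	closeFn()
	closeFn() // no-op: the Once has already fired
	return false
}

func main() {
	fmt.Println("raw double close panicked:", closeTwiceRaw())
	fmt.Println("guarded double close panicked:", closeTwiceGuarded())
}
```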
@leighmcculloch @tamirms could you take a look again? I found multiple concurrency issues in
Also moved the round-trip goroutine to a separate type as suggested.
The logger changes are not obvious or clear to me, I can't see where the race condition was.
The round trip probe looks good, albeit one question (❓). I also left one suggestion (💡), but it is only a suggestion and regardless it looks good.
support/db/round_trip_probe.go
Outdated
	go func() {
		for {
			select {
			case <-time.After(time.Second):
💡 This is probably not a significant concern, but this will create a new channel on every call. For a repeating timer I typically use time.Ticker. You create it with the duration, and it gives you a single chan that'll have an event to read every duration period that passes.
The one difference is that the way you're using time.After here will result in ~1 second of time between the end of the last probe and beginning of the next, but if you use Ticker it'll result in ~1 second of time between the start of each probe.
Good call!
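A rough sketch of the `time.Ticker` variant discussed in this thread, assuming nothing about the final PR code. `runTicks` and `demoTicks` are hypothetical names, and the probe is reduced to a plain callback. The ticker is created once, its single channel `ticker.C` is reused on every iteration, and `Stop` releases it when the loop ends.

```go
package main

import (
	"fmt"
	"time"
)

// runTicks calls probe once per tick until it has run n times or stop is
// closed, and returns how many times probe was called.
func runTicks(interval time.Duration, n int, stop <-chan struct{}, probe func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	count := 0
	for count < n {
		select {
		case <-ticker.C:
			// Ticks are scheduled from the ticker's start, so the spacing
			// is between probe starts, not between one probe's end and
			// the next probe's start.
			probe()
			count++
		case <-stop:
			return count
		}
	}
	return count
}

// demoTicks runs three probes one millisecond apart with a stop channel
// that is never closed.
func demoTicks() int {
	stop := make(chan struct{})
	return runTicks(time.Millisecond, 3, stop, func() {})
}

func main() {
	fmt.Println("probes run:", demoTicks())
}
```

One thing to keep in mind with this design: if a probe takes longer than the interval, the ticker adjusts or drops ticks rather than queueing them, so probes never pile up.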
support/db/metrics.go
Outdated
	SessionInterface: s.SessionInterface.Clone(),

	// Note that clonned Session will point at the same roundTripProbe
nit: clonned -> cloned
support/db/round_trip_probe.go
Outdated
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	startTime := time.Now()
	_, err := p.session.ExecRaw(ctx, "select 1")
	if err == nil {
Should we check for context deadline exceeded errors? If the select is taking longer than 1 second, we would be blind to this if we're looking just at Prometheus. Maybe we should still call p.roundTripTimeSummary.Observe() on deadline exceeded errors?
Good call. Added Observe(1) in case of errors and also updated the metric comment to inform that this is the max value.
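One way the agreed behaviour could be sketched, under the assumption that any error (including a deadline exceeded) is recorded as the timeout value so it stays visible in the metric. `probeOnce`, `demoTimeout` and `demoFast` are hypothetical names, and the query is replaced by a callback so the example runs without a database.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probeOnce runs exec with a timeout and reports a duration in seconds.
// On success it reports the measured time; on any error it reports the
// timeout itself, which therefore acts as the maximum value of the metric.
func probeOnce(timeout time.Duration, exec func(context.Context) error, observe func(float64)) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	if err := exec(ctx); err != nil {
		observe(timeout.Seconds())
		return
	}
	observe(time.Since(start).Seconds())
}

// demoTimeout simulates a query that never finishes before the deadline.
// It returns the observed value and the ceiling it is expected to equal.
func demoTimeout() (observed, ceiling float64) {
	timeout := 20 * time.Millisecond
	slow := func(ctx context.Context) error {
		<-ctx.Done() // block until the deadline fires
		return ctx.Err()
	}
	probeOnce(timeout, slow, func(v float64) { observed = v })
	return observed, timeout.Seconds()
}

// demoFast simulates a query that returns immediately, using a 1 second timeout.
func demoFast() (observed float64) {
	fast := func(context.Context) error { return nil }
	probeOnce(time.Second, fast, func(v float64) { observed = v })
	return observed
}

func main() {
	got, ceiling := demoTimeout()
	fmt.Println("timed out, observed:", got, "ceiling:", ceiling)
	fmt.Println("fast path, observed:", demoFast())
}
```

A variant could use `errors.Is(err, context.DeadlineExceeded)` to treat timeouts differently from other failures; the sketch follows the simpler "observe the maximum on any error" approach described in the reply.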
@leighmcculloch @tamirms I extracted
@leighmcculloch @tamirms this is now ready for review again.
What
Adds round_trip_time_seconds to support/db metrics.
Why
Round-trip time very often causes issues, e.g. it makes Horizon ingestion slow. The new metric should help us understand what the most common round-trip time is.
Known limitations
[TODO or N/A]