Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support/db: Add round_trip_time_seconds metric #4009

Merged
merged 12 commits into from
Oct 18, 2021

Conversation

bartekn
Copy link
Contributor

@bartekn bartekn commented Oct 14, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Adds round_trip_time_seconds to support/db metrics.

Why

Very often round-trip time is causing some issues, ex. makes Horizon ingestion slow. New metric should help understand what is the most common round-trip time required.

Known limitations

[TODO or N/A]

@bartekn bartekn requested a review from a team October 14, 2021 20:32
Copy link
Member

@leighmcculloch leighmcculloch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. One suggestion (💡), regardless LGTM.

Comment on lines 241 to 256
go func() {
for {
select {
case <-time.After(time.Second):
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
startTime := time.Now()
_, err := s.ExecRaw(ctx, "select 1")
if err == nil {
s.roundTripTimeSummary.Observe(time.Since(startTime).Seconds())
}
cancel()
case <-s.close:
return
}
}
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 I think this code would benefit from being isolated into its own type. The type can be constructed with a reference to a session, and a summary to write observations to, and it can hold and hide the chan and the sync.Once from this code by providing its own Close function.

Comment on lines 262 to 264
s.closeOnce.Do(func() {
close(s.close)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Nice safe handling of the channel close to make sure we never panic on a double.

@bartekn bartekn marked this pull request as draft October 14, 2021 21:52
@bartekn
Copy link
Contributor Author

bartekn commented Oct 15, 2021

@leighmcculloch @tamirms could you take a look again?

I found multiple concurrency issues in support/log:

  • Writing directly to log.DefaultLogger.Logger fields is not thread safe (obviously) but fortunately there are getters and setters that are thread safe to I updated the code to use these.
  • In services/horizon/internal/test/T we did some pretty complicated logger but there is StartTest method that is thread-safe and just easier to use. I updated the code to use it.

Also moved round trip go routine to a separate type as suggested.

@bartekn bartekn marked this pull request as ready for review October 15, 2021 11:49
Copy link
Member

@leighmcculloch leighmcculloch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logger changes are not obvious or clear to me, I can't see where the race condition was.

The round trip probe looks good, albeit one question (❓). I also left one suggestion (💡), but it is only a suggestion and regardless it looks good.

support/db/round_trip_probe.go Show resolved Hide resolved
go func() {
for {
select {
case <-time.After(time.Second):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This is probably not a significant concern, but this will create a new channel on every call. For a repeating timer I typically use time.Ticker. You create it with the duration, and it gives you a single chan that'll have an event to read every duration period that passes.

The one difference is that the way you're using time.After here will result in ~1 second of time between the end of the last probe and beginning of the next, but if you use Ticker it'll result in ~1 second of time between the start of each probe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

support/log/entry.go Outdated Show resolved Hide resolved
SessionInterface: s.SessionInterface.Clone(),
SessionInterface: s.SessionInterface.Clone(),

// Note that clonned Session will point at the same roundTripProbe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: clonned -> cloned

ctx, cancel := context.WithTimeout(context.Background(), time.Second)
startTime := time.Now()
_, err := p.session.ExecRaw(ctx, "select 1")
if err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check for context deadline exceeded errors? If the select is taking longer than 1 second we would be blind to this if we're looking just at prometheus. maybe we should still call p.roundTripTimeSummary.Observe() on deadline exceeded errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Added Observe(1) in case of errors and also updated the metric comment to inform that this is the max value.

@bartekn bartekn marked this pull request as draft October 18, 2021 15:21
@bartekn
Copy link
Contributor Author

bartekn commented Oct 18, 2021

@leighmcculloch @tamirms I extracted support/log fix and added explanations to #4012. Going to rebase this PR when the other is merged.

@bartekn
Copy link
Contributor Author

bartekn commented Oct 18, 2021

@leighmcculloch @tamirms this is now ready for review again.

@bartekn bartekn marked this pull request as ready for review October 18, 2021 19:47
@bartekn bartekn merged commit 234cadf into stellar:master Oct 18, 2021
@bartekn bartekn deleted the horizon_round_trip_metrics branch October 18, 2021 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants