Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Raft trailing logs and snapshot timing reloadable #10129

Merged
merged 10 commits into from
May 4, 2021

Conversation

banks
Copy link
Member

@banks banks commented Apr 27, 2021

This is a partial fix for #9609.

The description in that issue gives lots of background although much of it is also present in the telemetry docs proposed in this PR too.

This PR:

  • Updates Raft and go-metrics to pull in new telemetry and hot reloadable features
  • Makes the raft options reloadable in Consul
  • Updates our pre-registered prometheus metrics to cover the new raft ones and the old ones I've not added to the telemetry guide as important
  • Updates telemetry and config docs with the changes in metrics/reload behaviour
    • Sorry, but I also re-organised the raft metrics to be alphabetical - I've made the mistake of thinking we missed metrics on two separate occasions because they were not grouped alphabetically and even not hierarchically (e.g. some raft.fsm metrics are not alongside others... Probably makes the diff more onerous than it should be.)
  • Adds a "Key Metric" section to the telemetry docs that describes this issue in more detail and how to use the new metrics to monitor for it.

Feedback Requested

The Key Metrics section is verbose but it's pretty hard to simplify describing how to deal with this case without that detail. I feel like it would be better suited for more of a "operator runbook" section where we can go into a bit more detail on debugging common or serious failure modes of Consul, but we have no such thing yet and I wanted to get this available at least before taking on the project of starting a whole new type of content in the docs.

Would love feedback on whether it feels sufficient/appropriate/clear enough or if there are other ideas to improve. I'd probably defer any major changes to docs structure to later rather than take them on in this PR though as I want to get this into the next 1.10 beta.

Testing

The reloadable stuff has tests here but I also spent a while trying this out for real to make sure the metrics actually work and are useful. For example:

image

Here the leader has an artificially low raft_trailing_logs of 1 initially. It is writing our 1GB snapshots roughly every 2 minutes and is handling about 200 writes a second.

At 12:01 I changed raft_trailing_logs to 100k and used consul_reload the same leader (s1) continued to snapshot another couple of times but without truncating logs.

I also tested changes to snapshot threshold and timing.

TODO

  • Add changelogs for these changes, but also check for other fixes in Raft 1.3.0 that could impact Consul users and add those too

@banks banks added this to the 1.10.0 milestone Apr 27, 2021
@banks banks requested a review from a team April 27, 2021 15:34
@github-actions github-actions bot added pr/dependencies PR specifically updates dependencies of project type/docs Documentation needs to be created/updated/clarified labels Apr 27, 2021
@hashicorp-ci
Copy link
Contributor

🤔 This PR has changes in the website/ directory but does not have a type/docs-cherrypick label. If the changes are for the next version, this can be ignored. If they are updates to current docs, attach the label to auto cherrypick to the stable-website branch after merging.

@vercel vercel bot temporarily deployed to Preview – consul-ui-staging April 27, 2021 15:45 Inactive
@vercel vercel bot temporarily deployed to Preview – consul April 27, 2021 15:45 Inactive
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, found on stray word in some docs but otherwise things seem perfect.

website/content/docs/agent/options.mdx Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 4, 2021 14:35 Inactive
@banks banks merged commit 3ad754c into master May 4, 2021
@banks banks deleted the raft-replication-config branch May 4, 2021 14:36
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361903.

@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361915.

@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit 3ad754c onto release/1.10.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request May 4, 2021
* WIP reloadable raft config

* Pre-define new raft gauges

* Update go-metrics to change gauge reset behaviour

* Update raft to pull in new metric and reloadable config

* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important

* Update telemetry docs

* Update config and telemetry docs

* Add note to oldestLogAge on when it is visible

* Add changelog entry

* Update website/content/docs/agent/options.mdx

Co-authored-by: Matt Keeler <[email protected]>

Co-authored-by: Matt Keeler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/dependencies PR specifically updates dependencies of project theme/reliability type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants