Make Raft trailing logs and snapshot timing reloadable #10129

banks · 2021-04-27T15:34:20Z

This is a partial fix for #9609.

The description in that issue gives lots of background, although much of it is also present in the telemetry docs proposed in this PR.

This PR:

- Updates Raft and go-metrics to pull in new telemetry and hot-reloadable features
- Makes the Raft options reloadable in Consul
- Updates our pre-registered Prometheus metrics to cover the new Raft ones and the old ones I've not added to the telemetry guide as important
- Updates telemetry and config docs with the changes in metrics/reload behaviour
  - Sorry, but I also re-organised the Raft metrics to be alphabetical. I've made the mistake of thinking we missed metrics on two separate occasions because they were not grouped alphabetically or even hierarchically (e.g. some raft.fsm metrics are not alongside others). This probably makes the diff more onerous than it should be.
- Adds a "Key Metrics" section to the telemetry docs that describes this issue in more detail and how to use the new metrics to monitor for it.
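For readers unfamiliar with what "reloadable" implies here, the following is a minimal Python sketch of the general pattern: settings are held behind a lock and swapped as a whole when a reload arrives, so readers never see a half-applied change. It is not how Consul or hashicorp/raft implement this. The names `RaftTuning` and `ReloadableTuning` are hypothetical, the three fields only mirror the trailing logs, snapshot threshold and snapshot interval options, and the numeric values are arbitrary.

```python
import threading
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RaftTuning:
    """Hypothetical immutable snapshot of the three reloadable settings."""
    trailing_logs: int          # log entries kept after a snapshot
    snapshot_threshold: int     # minimum new entries between snapshots
    snapshot_interval_s: float  # seconds between "do we need a snapshot" checks


class ReloadableTuning:
    """Holds the current settings and swaps them as a whole on reload."""

    def __init__(self, initial: RaftTuning) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def get(self) -> RaftTuning:
        # Readers always receive a complete, immutable settings object.
        with self._lock:
            return self._current

    def reload(self, **changes) -> RaftTuning:
        # Build the candidate first; only publish it if it validates,
        # so a bad reload leaves the previous settings in place.
        with self._lock:
            candidate = replace(self._current, **changes)
            if (candidate.trailing_logs < 0
                    or candidate.snapshot_threshold < 0
                    or candidate.snapshot_interval_s <= 0):
                raise ValueError(f"invalid tuning: {candidate}")
            self._current = candidate
            return candidate


if __name__ == "__main__":
    # Arbitrary example values, not recommendations.
    tuning = ReloadableTuning(RaftTuning(1, 16384, 30.0))
    tuning.reload(trailing_logs=100_000)
    print(tuning.get())
```

Using a frozen dataclass means a caller that fetched the settings before a reload keeps a consistent view, while later calls to `get()` see the new values.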

Feedback Requested

The Key Metrics section is verbose but it's pretty hard to simplify describing how to deal with this case without that detail. I feel like it would be better suited for more of an "operator runbook" section where we can go into a bit more detail on debugging common or serious failure modes of Consul, but we have no such thing yet and I wanted to get this available at least before taking on the project of starting a whole new type of content in the docs.

Would love feedback on whether it feels sufficient/appropriate/clear enough or if there are other ideas to improve. I'd probably defer any major changes to docs structure to later rather than take them on in this PR though as I want to get this into the next 1.10 beta.

Testing

The reloadable stuff has tests here but I also spent a while trying this out for real to make sure the metrics actually work and are useful. For example:

Here the leader has an artificially low raft_trailing_logs of 1 initially. It is writing out 1GB snapshots roughly every 2 minutes and is handling about 200 writes a second.

At 12:01 I changed raft_trailing_logs to 100k and ran consul reload. The same leader (s1) continued to snapshot another couple of times, but without truncating logs.
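As a rough back-of-the-envelope check on those numbers, the sketch below estimates how much write history the trailing logs cover at a given write rate. It assumes each write produces exactly one Raft log entry and that the write rate is steady; both are simplifications, and `retention_window_s` is a hypothetical helper, not part of Consul.

```python
def retention_window_s(trailing_logs: int, writes_per_s: float) -> float:
    """Approximate seconds of write history retained after a snapshot.

    Assumes one log entry per write and a constant write rate.
    """
    if writes_per_s <= 0:
        raise ValueError("writes_per_s must be positive")
    return trailing_logs / writes_per_s


if __name__ == "__main__":
    # The two settings used in the test above, at about 200 writes a second.
    for trailing in (1, 100_000):
        print(trailing, retention_window_s(trailing, 200))
```

Under those assumptions, 100,000 trailing logs at 200 writes a second cover about 500 seconds of writes, whereas a single trailing log covers about 5 milliseconds.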

I also tested changes to snapshot threshold and timing.

TODO

Add changelogs for these changes, but also check for other fixes in Raft 1.3.0 that could impact Consul users and add those too


hashicorp-ci · 2021-04-27T15:34:58Z

🤔 This PR has changes in the website/ directory but does not have a type/docs-cherrypick label. If the changes are for the next version, this can be ignored. If they are updates to current docs, attach the label to auto cherrypick to the stable-website branch after merging.

mkeeler

LGTM, found one stray word in some docs but otherwise things seem perfect.

website/content/docs/agent/options.mdx

Co-authored-by: Matt Keeler <[email protected]>

hc-github-team-consul-core · 2021-05-04T14:39:22Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361903.

hc-github-team-consul-core · 2021-05-04T14:40:38Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361915.

hc-github-team-consul-core · 2021-05-04T14:40:42Z

🍒✅ Cherry pick of commit 3ad754c onto release/1.10.x succeeded!

* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx

Co-authored-by: Matt Keeler <[email protected]>

banks added 7 commits April 26, 2021 21:13

WIP reloadable raft config

86c0a22

Pre-define new raft gauges

955c189

Update go-metrics to change gauge reset behaviour

f172360

Update raft to pull in new metric and reloadable config

ba77bbf

Add snapshot persistance timing and installSnapshot to our 'protected…

026132c

…' list as they can be infrequent but are important

Update telemetry docs

c3dbc24

Update config and telemetry docs

0df78c1

banks added the theme/reliability label Apr 27, 2021

banks added this to the 1.10.0 milestone Apr 27, 2021

banks requested a review from a team April 27, 2021 15:34

github-actions bot added pr/dependencies PR specifically updates dependencies of project type/docs Documentation needs to be created/updated/clarified labels Apr 27, 2021

Add note to oldestLogAge on when it is visible

e399bb7

vercel bot deployed to Preview – consul April 27, 2021 15:37 View deployment

vercel bot temporarily deployed to Preview – consul-ui-staging April 27, 2021 15:37 Inactive

Add changelog entry

0a496ff

vercel bot temporarily deployed to Preview – consul-ui-staging April 27, 2021 15:45 Inactive

vercel bot temporarily deployed to Preview – consul April 27, 2021 15:45 Inactive

mkeeler approved these changes May 4, 2021


Update website/content/docs/agent/options.mdx

e1d0dcd

Co-authored-by: Matt Keeler <[email protected]>

vercel bot deployed to Preview – consul May 4, 2021 14:35 View deployment

vercel bot temporarily deployed to Preview – consul-ui-staging May 4, 2021 14:35 Inactive

banks merged commit 3ad754c into master May 4, 2021

banks deleted the raft-replication-config branch May 4, 2021 14:36

banks added the backport/1.10 label May 4, 2021
