Make Raft trailing logs and snapshot timing reloadable #10129
LGTM, found one stray word in some docs, but otherwise things seem perfect.
Co-authored-by: Matt Keeler <[email protected]>
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361903. |
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/361915. |
🍒✅ Cherry pick of commit 3ad754c onto |
* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistence timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx
Co-authored-by: Matt Keeler <[email protected]>
This is a partial fix for #9609.
The description in that issue gives lots of background, although much of it is also present in the telemetry docs proposed in this PR.
This PR:
`raft.fsm` metrics are not alongside others... Probably makes the diff more onerous than it should be.)
Feedback Requested
The Key Metrics section is verbose, but it's pretty hard to describe how to deal with this case more simply without that detail. I feel like it would be better suited to more of an "operator runbook" section where we can go into a bit more detail on debugging common or serious failure modes of Consul, but we have no such thing yet and I wanted to get this available at least before taking on the project of starting a whole new type of content in the docs.
Would love feedback on whether it feels sufficient/appropriate/clear enough or if there are other ideas to improve. I'd probably defer any major changes to docs structure to later rather than take them on in this PR though as I want to get this into the next 1.10 beta.
Testing
The reloadable stuff has tests here but I also spent a while trying this out for real to make sure the metrics actually work and are useful. For example:
Here the leader initially has an artificially low `raft_trailing_logs` of `1`. It is writing out 1GB snapshots roughly every 2 minutes and is handling about 200 writes a second.
At 12:01 I changed `raft_trailing_logs` to `100k` and used `consul_reload`. The same leader (s1) continued to snapshot another couple of times but without truncating logs.
I also tested changes to snapshot threshold and timing.
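To illustrate why truncation stops once the trailing-logs window is raised, here is a minimal Go sketch. `compactionRange` is a hypothetical helper written for this example, loosely modelled on how a Raft implementation might choose which logs to delete after a snapshot. It is not the code from hashicorp/raft or from this PR. The write rate and snapshot interval are the approximate figures from the test above, and `100k` is assumed to mean 100,000 entries.

```go
package main

import "fmt"

// compactionRange is a hypothetical, simplified model of post-snapshot log
// truncation. It returns the inclusive range of log indexes that could be
// deleted after a snapshot at snapIdx while always keeping at least
// trailingLogs of the most recent entries. ok is false when nothing can be
// truncated.
func compactionRange(firstIdx, lastIdx, snapIdx, trailingLogs uint64) (lo, hi uint64, ok bool) {
	// Fewer entries than the trailing window: keep everything.
	if lastIdx <= trailingLogs {
		return 0, 0, false
	}
	// Never delete into the trailing window...
	hi = lastIdx - trailingLogs
	// ...and never delete entries the snapshot does not cover.
	if snapIdx < hi {
		hi = snapIdx
	}
	if firstIdx > hi {
		return 0, 0, false
	}
	return firstIdx, hi, true
}

func main() {
	// Approximate figures from the test: ~200 writes/s, snapshot every ~2 min.
	const writesPerSec, snapshotEverySec = 200, 120
	perSnapshot := uint64(writesPerSec * snapshotEverySec)

	for _, trailing := range []uint64{1, 100000} {
		lo, hi, ok := compactionRange(1, perSnapshot, perSnapshot, trailing)
		fmt.Printf("trailing_logs=%d truncate=%v range=[%d,%d]\n", trailing, ok, lo, hi)
	}
}
```

Under these assumptions, about 24,000 entries accumulate between snapshots, so a window of 100,000 takes roughly 500 seconds of writes to fill. That is consistent with the leader snapshotting a couple more times without truncating anything.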
TODO