-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make snapshot timing and trailing logs hot-reloadable in raft #444
Conversation
We are still having some issues with tests in CI, but the suite passes locally for me and the new tests I added pass using |
config.go
Outdated
// TrailingLogs controls how many logs we leave after a snapshot. This is used | ||
// so that we can quickly replay logs on a follower instead of being forced to | ||
// send an entire snapshot. The value passed here is the initial setting used. | ||
// This can be tuned during operation using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO:
- fix this comment and docs generally if we are happy with this API approach (duplicating fields)
f4c5b92
to
81887b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One optional comment, but looks good otherwise!
SnapshotInterval time.Duration | ||
|
||
// SnapshotThreshold controls how many outstanding logs there must be before | ||
// we perform a snapshot. This is to prevent excessive snapshots when we can | ||
// just replay a small set of logs. | ||
// just replay a small set of logs. The value passed here is the initial | ||
// setting used. This can be tuned during operation using ReloadConfig. | ||
SnapshotThreshold uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of duplicating these fields here and in ReloadableConfig can we embed ReloadableConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stated below,
" We choose to duplicate fields over embedding or accepting a Config but only using specific fields to keep the API clear."
I'm not entirely sold on the idea of embedding the ReloadableConfig into the "regular" Config because the separation of "this changes, and this doesn't" is clear and "feels safer" (which is probably not a good reason, I admit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was on the fence there - I wasn't sure if it was technically binary compatible to embed which tipped the balance although in practice most reasonable usages would probably have been unaffected. I tend to agree with Sarah though - the separation ends up being kind of OK with me in the end for keeping things clearly delineated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits
config.go
Outdated
// TrailingLogs controls how many logs we leave after a snapshot. This is used | ||
// so that we can quickly replay logs on a follower instead of being forced to | ||
// send an entire snapshot. The value passed here is the initial setting used. | ||
// This can be tuned during operation using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this isn't finished here, but I may be misunderstanding it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good catch. I thought I'd fixed this but only in one of the places!
SnapshotInterval time.Duration | ||
|
||
// SnapshotThreshold controls how many outstanding logs there must be before | ||
// we perform a snapshot. This is to prevent excessive snapshots when we can | ||
// just replay a small set of logs. | ||
// just replay a small set of logs. The value passed here is the initial | ||
// setting used. This can be tuned during operation using ReloadConfig. | ||
SnapshotThreshold uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stated below,
" We choose to duplicate fields over embedding or accepting a Config but only using specific fields to keep the API clear."
I'm not entirely sold on the idea of embedding the ReloadableConfig into the "regular" Config because the separation of "this changes, and this doesn't" is clear and "feels safer" (which is probably not a good reason, I admit)
Thanks for the reviews folks! Waiting for CI to be green after typo fixes and I'll merge. |
These parameters are reasonably easy to make reloadable in a running raft instance.
Other parameters like the timeouts may be too although some require more thought about whether they are safe to change at runtime, however I'm sticking with these three right now as they are the critical parameters needed to get out of an otherwise unrecoverable situation detailed in hashicorp/consul#9609.
There are other possible mitigations and solutions to that problem including my previous WIP PR that prevents snapshotting during replication entirely, but this has fewer uncertainties about safety and is the simplest change that gives a manual intervention option where almost none exists today.
If others are happy with this change, we can merge it and then allow Consul (and others) to have a way to re-configure snapshotting and log pruning on a leader without restarting it and loosing leadership.
Open Questions
The API/Config change was the main gross part here. I went with this as the least-bad of a few bad options but I'm open to input if others thing there is a preferable way.
I didn't want to accept a whole
Config
even though some fields are not reloadable because then you either have to explicitly copy only some fields, or require the user ensure the rest of the config is identical to the current config (which they might not know as they can't read it) otherwise Bad Things would happen.I also didn't want to make a separate
SetX
method for every field we make reloadable as I could see why at least a handful of others might be good to be reloadable later.I also didn't want the internal state to continue reading from a plane
conf Config
field but just have to remember not to read reloadable fields from there but somewhere else because that seems error prone. This way forces all readers to access the most recent config and hopefully helps them consider whether the thing they are reading is reloadable or not so they can make sure the logic takes that into account (i.e. use same value in multiple operations in case it changes between them).