Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make snapshot timing and trailing logs hot-reloadable in raft #444

Merged
merged 4 commits into from
Mar 31, 2021

Conversation

banks
Copy link
Member

@banks banks commented Feb 26, 2021

These parameters are reasonably easy to make reloadable in a running raft instance.

Other parameters like the timeouts may be too although some require more thought about whether they are safe to change at runtime, however I'm sticking with these three right now as they are the critical parameters needed to get out of an otherwise unrecoverable situation detailed in hashicorp/consul#9609.

There are other possible mitigations and solutions to that problem including my previous WIP PR that prevents snapshotting during replication entirely, but this has fewer uncertainties about safety and is the simplest change that gives a manual intervention option where almost none exists today.

If others are happy with this change, we can merge it and then allow Consul (and others) to have a way to re-configure snapshotting and log pruning on a leader without restarting it and loosing leadership.

Open Questions

The API/Config change was the main gross part here. I went with this as the least-bad of a few bad options but I'm open to input if others thing there is a preferable way.

I didn't want to accept a whole Config even though some fields are not reloadable because then you either have to explicitly copy only some fields, or require the user ensure the rest of the config is identical to the current config (which they might not know as they can't read it) otherwise Bad Things would happen.

I also didn't want to make a separate SetX method for every field we make reloadable as I could see why at least a handful of others might be good to be reloadable later.

I also didn't want the internal state to continue reading from a plane conf Config field but just have to remember not to read reloadable fields from there but somewhere else because that seems error prone. This way forces all readers to access the most recent config and hopefully helps them consider whether the thing they are reading is reloadable or not so they can make sure the logic takes that into account (i.e. use same value in multiple operations in case it changes between them).

@banks banks requested review from schristoff and a team February 26, 2021 16:11
@banks
Copy link
Member Author

banks commented Feb 26, 2021

We are still having some issues with tests in CI, but the suite passes locally for me and the new tests I added pass using -race too FWIW.

config.go Outdated
// TrailingLogs controls how many logs we leave after a snapshot. This is used
// so that we can quickly replay logs on a follower instead of being forced to
// send an entire snapshot. The value passed here is the initial setting used.
// This can be tuned during operation using
Copy link
Member Author

@banks banks Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO:

  • fix this comment and docs generally if we are happy with this API approach (duplicating fields)

@banks banks force-pushed the hot-reload-trailing branch from f4c5b92 to 81887b6 Compare March 26, 2021 12:18
@banks banks marked this pull request as ready for review March 26, 2021 12:18
Copy link
Contributor

@briankassouf briankassouf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One optional comment, but looks good otherwise!

SnapshotInterval time.Duration

// SnapshotThreshold controls how many outstanding logs there must be before
// we perform a snapshot. This is to prevent excessive snapshots when we can
// just replay a small set of logs.
// just replay a small set of logs. The value passed here is the initial
// setting used. This can be tuned during operation using ReloadConfig.
SnapshotThreshold uint64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of duplicating these fields here and in ReloadableConfig can we embed ReloadableConfig?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stated below,
" We choose to duplicate fields over embedding or accepting a Config but only using specific fields to keep the API clear."
I'm not entirely sold on the idea of embedding the ReloadableConfig into the "regular" Config because the separation of "this changes, and this doesn't" is clear and "feels safer" (which is probably not a good reason, I admit)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the fence there - I wasn't sure if it was technically binary compatible to embed which tipped the balance although in practice most reasonable usages would probably have been unaffected. I tend to agree with Sarah though - the separation ends up being kind of OK with me in the end for keeping things clearly delineated?

Copy link
Contributor

@schristoff schristoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits

config.go Outdated
// TrailingLogs controls how many logs we leave after a snapshot. This is used
// so that we can quickly replay logs on a follower instead of being forced to
// send an entire snapshot. The value passed here is the initial setting used.
// This can be tuned during operation using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this isn't finished here, but I may be misunderstanding it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good catch. I thought I'd fixed this but only in one of the places!

snapshot.go Outdated Show resolved Hide resolved
SnapshotInterval time.Duration

// SnapshotThreshold controls how many outstanding logs there must be before
// we perform a snapshot. This is to prevent excessive snapshots when we can
// just replay a small set of logs.
// just replay a small set of logs. The value passed here is the initial
// setting used. This can be tuned during operation using ReloadConfig.
SnapshotThreshold uint64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stated below,
" We choose to duplicate fields over embedding or accepting a Config but only using specific fields to keep the API clear."
I'm not entirely sold on the idea of embedding the ReloadableConfig into the "regular" Config because the separation of "this changes, and this doesn't" is clear and "feels safer" (which is probably not a good reason, I admit)

config.go Outdated Show resolved Hide resolved
@banks
Copy link
Member Author

banks commented Mar 31, 2021

Thanks for the reviews folks! Waiting for CI to be green after typo fixes and I'll merge.

@banks banks merged commit f52bb63 into master Mar 31, 2021
@banks banks deleted the hot-reload-trailing branch March 31, 2021 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants