-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot restore progress #490
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool feature @rboyer!! I had a small comment.
@@ -650,6 +635,38 @@ func (r *Raft) restoreSnapshot() error { | |||
return nil | |||
} | |||
|
|||
func (r *Raft) tryRestoreSingleSnapshot(snapshot *SnapshotMeta) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't be better to make tryRestoreSingleSnapshot
return an error and create the logger outside of it and log the error when it's returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I wanted to have a derived logger here (Logger.With()
) so that all of the progress logs and the errors got the same set of bonus KV data logged.
Given that, I'd end up creating the logger outside of this method to pass in for progress in fsmRestoreAndMeasure
, but returning an error that is immediately logged by the caller, which seemed differently strange.
Also the two current logs are slightly different today:
snapLogger.Error("failed to open snapshot", "error", err)
snapLogger.Error("failed to restore snapshot", "error", err)
and are carried over from the existing code. If these were changed to fmt.Errorf
return values there'd be a slight change of output as we'd end up having to do snapLogger.Error(err.Error())
instead, and we'd lose the "error" hclog attribute.
I don't mind inverting that logic if you think it is warranted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one comment about what looks to be dead code. Otherwise LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When restoring a snapshot (on startup, installed from the leader, or during recovery) the logs are extremely terse. There are typically bookend messages indicating that a restore is going to happen, and that it is complete, but there's a big dead space in the middle.
For small snapshots this is probably fine, but for larger multi-GB snapshots this can stretch out and can be unnerving as an operator to know if it's stuck or still making progress.
This PR adjusts the logging to indicate a simple progress log message every 10s about overall completion in bytes-consumed.
Example of it in use while loading a large snapshot:
TODO: need to figure out a test