Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot restore progress #490

Merged
merged 8 commits into from
Feb 11, 2022
Merged

Snapshot restore progress #490

merged 8 commits into from
Feb 11, 2022

Conversation

rboyer
Copy link
Member

@rboyer rboyer commented Feb 10, 2022

When restoring a snapshot (on startup, installed from the leader, or during recovery) the logs are extremely terse. There are typically bookend messages indicating that a restore is going to happen, and that it is complete, but there's a big dead space in the middle.

For small snapshots this is probably fine, but for larger multi-GB snapshots this can stretch out and can be unnerving as an operator to know if it's stuck or still making progress.

This PR adjusts the logging to indicate a simple progress log message every 10s about overall completion in bytes-consumed.

Example of it in use while loading a large snapshot:

2022-02-09T17:26:17.050-0600 [INFO]  agent.server.snapshot: creating new snapshot: path=data/raft/snapshots/600-1900000-1444444444444.tmp
2022-02-09T17:26:19.602-0600 [INFO]  agent.server.raft: snapshot network transfer progress: read-bytes=1610612736 percent-complete=100.00%
2022-02-09T17:26:20.307-0600 [INFO]  agent.server.raft: copied to local snapshot: bytes=1610612736
2022-02-09T17:26:27.816-0600 [INFO]  agent: Synced node info
2022-02-09T17:26:31.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=133680854 percent-complete=8.30%
2022-02-09T17:26:41.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=212278758 percent-complete=13.18%
2022-02-09T17:26:51.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=1293483088 percent-complete=80.31%
2022-02-09T17:26:56.361-0600 [INFO]  agent.server.raft: Installed remote snapshot

TODO: need to figure out a test

@rboyer rboyer requested review from banks, dnephin and mkeeler February 10, 2022 18:03
@rboyer rboyer self-assigned this Feb 10, 2022
Copy link
Contributor

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool feature @rboyer!! I had a small comment.

@@ -650,6 +635,38 @@ func (r *Raft) restoreSnapshot() error {
return nil
}

func (r *Raft) tryRestoreSingleSnapshot(snapshot *SnapshotMeta) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't be better to make tryRestoreSingleSnapshot return an error and create the logger outside of it and log the error when it's returned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? I wanted to have a derived logger here (Logger.With()) so that all of the progress logs and the errors got the same set of bonus KV data logged.

Given that, I'd end up creating the logger outside of this method to pass in for progress in fsmRestoreAndMeasure, but returning an error that is immediately logged by the caller, which seemed differently strange.

Also the two current logs are slightly different today:

snapLogger.Error("failed to open snapshot", "error", err)
snapLogger.Error("failed to restore snapshot", "error", err)

and are carried over from the existing code. If these were changed to fmt.Errorf return values there'd be a slight change of output as we'd end up having to do snapLogger.Error(err.Error()) instead, and we'd lose the "error" hclog attribute.

I don't mind inverting that logic if you think it is warranted.

progress.go Outdated Show resolved Hide resolved
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one comment about what looks to be dead code. Otherwise LGTM

@rboyer rboyer requested review from dhiaayachi and mkeeler February 10, 2022 19:38
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants