Snapshot restore progress #490

rboyer · 2022-02-10T18:03:14Z

When restoring a snapshot (on startup, installed from the leader, or during recovery) the logs are extremely terse. There are typically bookend messages indicating that a restore is going to happen, and that it is complete, but there's a big dead space in the middle.

For small snapshots this is probably fine, but for larger multi-GB snapshots the restore can take a long time, and it is unnerving for an operator not to know whether it's stuck or still making progress.

This PR adjusts the logging to emit a simple progress message every 10s reporting overall completion in bytes consumed.

Example of it in use while loading a large snapshot:

2022-02-09T17:26:17.050-0600 [INFO]  agent.server.snapshot: creating new snapshot: path=data/raft/snapshots/600-1900000-1444444444444.tmp
2022-02-09T17:26:19.602-0600 [INFO]  agent.server.raft: snapshot network transfer progress: read-bytes=1610612736 percent-complete=100.00%
2022-02-09T17:26:20.307-0600 [INFO]  agent.server.raft: copied to local snapshot: bytes=1610612736
2022-02-09T17:26:27.816-0600 [INFO]  agent: Synced node info
2022-02-09T17:26:31.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=133680854 percent-complete=8.30%
2022-02-09T17:26:41.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=212278758 percent-complete=13.18%
2022-02-09T17:26:51.746-0600 [INFO]  agent.server.raft: snapshot restore progress: id=600-1900000-1444444444444 last-index=1900000 last-term=600 size-in-bytes=1610612736 read-bytes=1293483088 percent-complete=80.31%
2022-02-09T17:26:56.361-0600 [INFO]  agent.server.raft: Installed remote snapshot
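
For readers who want a feel for the mechanism, here is a minimal sketch of periodic progress reporting over a byte stream. It is not the code in this PR: `countingReader`, `percentComplete`, and `restoreWithProgress` are hypothetical names, the restore itself is stood in for by draining the reader, and only the standard library is used.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync/atomic"
	"time"
)

// countingReader wraps an io.Reader and counts the bytes read through it.
// Hypothetical helper for illustration; not the type used in the PR.
type countingReader struct {
	read int64 // accessed atomically; kept first for 64-bit alignment
	r    io.Reader
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	atomic.AddInt64(&c.read, int64(n))
	return n, err
}

// percentComplete returns read/total as a percentage, or 0 if total is unknown.
func percentComplete(read, total int64) float64 {
	if total <= 0 {
		return 0
	}
	return float64(read) / float64(total) * 100
}

// restoreWithProgress drains src (a stand-in for the real restore) and calls
// report once per interval with the bytes consumed so far.
func restoreWithProgress(src io.Reader, total int64, interval time.Duration,
	report func(read int64, pct float64)) (int64, error) {

	cr := &countingReader{r: src}
	done := make(chan struct{})
	stopped := make(chan struct{})

	go func() {
		defer close(stopped)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				n := atomic.LoadInt64(&cr.read)
				report(n, percentComplete(n, total))
			case <-done:
				return
			}
		}
	}()

	n, err := io.Copy(io.Discard, cr)
	close(done)
	<-stopped // wait for the reporter goroutine to exit
	return n, err
}

func main() {
	data := bytes.Repeat([]byte{0}, 1<<20)
	// A small in-memory input usually finishes before the first tick,
	// so this demo may print only the final line.
	n, err := restoreWithProgress(bytes.NewReader(data), int64(len(data)), 10*time.Second,
		func(read int64, pct float64) {
			fmt.Printf("snapshot restore progress: read-bytes=%d percent-complete=%.2f%%\n", read, pct)
		})
	fmt.Println("restored bytes:", n, "err:", err)
}
```

Counting at the reader means the consumer of the stream needs no changes, which is why the sketch wraps the reader rather than instrumenting the restore logic.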

TODO: need to figure out a test

mergeback

dhiaayachi

Really cool feature @rboyer!! I had a small comment.

dhiaayachi · 2022-02-10T19:05:02Z

api.go

@@ -650,6 +635,38 @@ func (r *Raft) restoreSnapshot() error {
 	return nil
 }

+func (r *Raft) tryRestoreSingleSnapshot(snapshot *SnapshotMeta) bool {


Wouldn't it be better to make tryRestoreSingleSnapshot return an error, create the logger outside of it, and log the error when it's returned?

rboyer · 2022-02-10

Maybe? I wanted to have a derived logger here (Logger.With()) so that all of the progress logs and the errors got the same set of bonus KV data logged.

Given that, I'd end up creating the logger outside of this method to pass in for progress in fsmRestoreAndMeasure, but returning an error that is immediately logged by the caller, which seemed differently strange.

Also the two current logs are slightly different today:

snapLogger.Error("failed to open snapshot", "error", err)
snapLogger.Error("failed to restore snapshot", "error", err)

and are carried over from the existing code. If these were changed to fmt.Errorf return values there'd be a slight change of output as we'd end up having to do snapLogger.Error(err.Error()) instead, and we'd lose the "error" hclog attribute.

I don't mind inverting that logic if you think it is warranted.
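
To make the tradeoff above concrete, here is a small sketch comparing the two logging styles. It uses the standard library's `log/slog` as a stand-in for hclog (an assumption, chosen so the example has no external dependencies); the helper names `newLogger` and `logBoth` are hypothetical. The first call keeps the error as a separate `error` attribute on a derived logger; the second folds it into the message, as would happen when logging a wrapped `fmt.Errorf` value via `err.Error()`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
)

// newLogger returns a text logger that writes to buf. The timestamp is
// dropped so the two rendered lines are easy to compare.
func newLogger(buf *bytes.Buffer) *slog.Logger {
	h := slog.NewTextHandler(buf, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{} // a zero Attr is discarded
			}
			return a
		},
	})
	return slog.New(h)
}

// logBoth logs the same failure in both styles and returns the rendered lines.
func logBoth(id, cause string) (structured, flat string) {
	err := errors.New(cause)
	var a, b bytes.Buffer

	// Style 1: derived logger carries the snapshot ID; the error is its own attribute.
	snapLogger := newLogger(&a).With("id", id)
	snapLogger.Error("failed to open snapshot", "error", err)

	// Style 2: the callee returns a wrapped error and the caller logs only its text.
	wrapped := fmt.Errorf("failed to open snapshot: %w", err)
	newLogger(&b).With("id", id).Error(wrapped.Error())

	return a.String(), b.String()
}

func main() {
	structured, flat := logBoth("600-1900000-1444444444444", "file missing")
	fmt.Print(structured)
	fmt.Print(flat)
}
```

In the second style the cause is still visible, but only inside the message string, so anything that filters or indexes on the `error` key no longer sees it.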

progress.go

mkeeler

Just the one comment about what looks to be dead code. Otherwise LGTM

mkeeler

LGTM

rboyer added 6 commits February 9, 2022 16:57

extract logger creation block of code

68f5bca

extract single snapshot restore method

193eb18

add progress restoration logger tooling

9ddd4e5

thread a logger down into fsmRestoreAndMeasure

4577041

mergeback

integrate progress reporting for the snapshot restore function

9901a92

also check on times for snapshot transfer

03d8115

rboyer requested review from banks, dnephin and mkeeler February 10, 2022 18:03

rboyer self-assigned this Feb 10, 2022

dhiaayachi reviewed Feb 10, 2022

mkeeler reviewed Feb 10, 2022

mkeeler reviewed Feb 10, 2022

remove dead code

582f2a6

rboyer requested review from dhiaayachi and mkeeler February 10, 2022 19:38

mkeeler approved these changes Feb 11, 2022

add a simple test for the snapshot restoration progress

3bd3071

rboyer merged commit 3cb47c5 into main Feb 11, 2022

rboyer deleted the snapshot-restore-progress branch February 11, 2022 21:37

rboyer mentioned this pull request Feb 11, 2022

raft: update to v1.3.5 hashicorp/consul#12325

Merged

ncabatoff mentioned this pull request Apr 26, 2022

Ensure Vault can access the underlying snapshotInstaller. #501

Merged

banks mentioned this pull request Nov 15, 2022

Improve observability around snapshot size vs trailing logs hashicorp/consul#9609

Closed

5 tasks
