API for atomic snapshot backups #8047
Effectively copied from https://github.com/hashicorp/consul/tree/v1.8.0-beta1/snapshot, with the addition of an overall snapshot checksum file.
err = r.forwardLeader(remoteServer, method, args, reply)
return true, err
}
Even though it isn't exported, a comment here explaining that nil means we are the leader would be helpful.
nomad/operator_endpoint.go
Outdated
op.srv.setQueryMeta(&reply.QueryMeta)

// Take the snapshot and capture the index.
snap, err := snapshot.New(op.logger, op.srv.raft)
nit: it might be helpful to rename the logger that's passed in, to differentiate it from the server logger
nomad/structs/operator.go
Outdated
// SnapshotSaveResponse is the header for the streaming snapshot endpoint,
// and followed by the snapshot file content.
// It is written to the
unfinished comment?
Looks good to me, just a few small things.
Co-authored-by: Drew Bailey <[email protected]>
The callers of `forward` and the old implementation expect failures to be accompanied by a true value. This fixes the issue and gets the tests passing.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This introduces a snapshot generation API to ease taking backups of the Nomad state, along with a CLI command.
The PR introduces an endpoint (i.e. `/v1/operator/snapshot`) and new CLI subcommands (i.e. `nomad operator snapshot save` for generating the snapshot and `nomad operator snapshot inspect` for inspecting a snapshot file). These generate atomic, consistent snapshots of the Nomad state, representing the latest view the servers have.

The snapshots include all cluster data: job definitions, ACL policies, etc. They include sensitive information (e.g. ACL tokens, Vault tokens), so they need to be handled with care. They may also contain ephemeral information about the external world that is no longer relevant; for example, allocations and evaluations for nodes from two days ago may be stale.
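As a rough illustration of how a client might use the new endpoint, the sketch below issues a GET against `/v1/operator/snapshot` and writes the response to a file readable only by its owner, since the snapshot contains secrets. Several things here are assumptions rather than details taken from this PR: that the HTTP response body is the snapshot archive itself, that the ACL token is passed in the `X-Nomad-Token` header, and that the address and token come from the `NOMAD_ADDR` and `NOMAD_TOKEN` environment variables. The helper names and the `backup.snap` output path are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// newSnapshotRequest builds a GET request for the snapshot endpoint.
// addr is a base URL such as "http://127.0.0.1:4646" (an assumption).
func newSnapshotRequest(addr, token string) (*http.Request, error) {
	u := strings.TrimRight(addr, "/") + "/v1/operator/snapshot"
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		// Assumed: the ACL token travels in the X-Nomad-Token header.
		req.Header.Set("X-Nomad-Token", token)
	}
	return req, nil
}

// saveSnapshot streams the response body to path with 0600 permissions,
// because the snapshot includes ACL and Vault tokens.
func saveSnapshot(client *http.Client, addr, token, path string) error {
	req, err := newSnapshotRequest(addr, token)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("snapshot request failed: %s", resp.Status)
	}
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		fmt.Println("set NOMAD_ADDR to take a snapshot")
		return
	}
	if err := saveSnapshot(http.DefaultClient, addr, os.Getenv("NOMAD_TOKEN"), "backup.snap"); err != nil {
		fmt.Fprintln(os.Stderr, "snapshot failed:", err)
	}
}
```

Streaming with `io.Copy` avoids holding the whole snapshot in memory, which matters if the cluster state is large.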
I'm mulling over some design considerations for a restore capability that accommodates a few use cases (disaster recovery vs. provisioning a new cluster with recovered jobs but targeting new nodes) and properly handles ephemeral state. I'll follow up with a new PR.