Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve observability around snapshot size vs trailing logs #9609

Closed
3 of 5 tasks
banks opened this issue Jan 21, 2021 · 2 comments
Closed
3 of 5 tasks

Improve observability around snapshot size vs trailing logs #9609

banks opened this issue Jan 21, 2021 · 2 comments
Assignees
Labels
theme/reliability theme/telemetry Anything related to telemetry or observability

Comments

@banks
Copy link
Member

banks commented Jan 21, 2021

Large Snapshot Replication Failure Problem Statement

Currently, raft writes out a complete snapshot of the state store to disk for every 16k writes to the DB, checked every 30s. After a successful snapshot, we truncate the logs to prevent the logs growing unboundedly. Instead of truncating all logs up to the index we just wrote the snapshot at, we always ensure we have at least 10k logs left in the log store (these parameters can be tuned).

The reason for leaving these "trailing logs" is so that a follower who goes down for a restart just before the snapshot completes can continue replicating once it comes backup (which may take a few seconds).

If a follower does get behind (e.g. it's partitioned for a few minutes) it's possible that there will have been more than 10k updates and a snapshot+truncate on the leader in the interim meaning the follower is unable to continue replicating from the logs. In this case the leader will deliver it an entire new snapshot over the wire before resuming log-based replication.

In general this works well for typical uses of Consul, however it has a hard assumption that the time required to transmit the most recent snapshot to a peer, and have the peer restore it into it's state store ready to resume accepting logs is much smaller than the time taking to append 10k more log updates on the leader.

On clusters that have both a high write rate (e.g. 500 writes per second) and a large state snapshot (e.g. larger than 1GB) even on reasonably fast hardware that assumption can be violated resulting in a permanent error condition where followers can never catch up to the leader again.

A worked example:

  • At 500 writes/second, the snapshot threshold is met within 32 seconds and with an interval of 30 seconds that means the leader will be writing out a full snapshot (1GB) file every minute.
  • Every minute after the snapshot, logs are truncated to 10k entries which is only the last 20 seconds worth of updates.
  • If sending 1GB snapshot and restoring it on a follower takes more than 20 seconds, the follower will never be able to get healthy after a restart unless the write volume drops.

Note that raft_trailing_logs can be increased, but since this requires a restart and only once the current leader has the new value will it improve things, it's often not possible to recover from this situation without downtime or somehow reducing the write rate on the cluster enough to allow followers to catch up and then be reconfigured with a rolling restart. Reducing snapshot size helps in the medium term but often isn't possible immediately and requires more writes to do it so is rarely an effective step that can be taken during an incident.

There is also currently a good reason to not increase raft_trailing_logs to something very high pre-emptively. Most deployments have ample disk space and so may be OK with additional disk usage that retaining lots more logs involves, however with the current log store implementation which is a grow-only copy-on-write Btree, there is a limit to how large that file can be before it starts to impact log "append" and truncate performance. This might change in the future with a different log store implementation which could reduce the throughput impact of a high raft_trailing_logs.

Observability Improvements

There are a number of things we can do to mitigate this issue which will be broken into several different issues. This one will act as a reference for the problem statement above.

One significant issue when encountering this scenario in the wild is that even if you know to look, working out if this is going to happen is very hard. There are a number of simple ways we can improve our metrics and logging that will help diagnose and monitor to prevent this situation occurring.

To resolve this issue I propose we consider the following. If we decide on investigation not to do any immediately, we can split them to separate issues or drop them. These should all be relatively small code/docs changes. The majority of the work would likely be around validating they work well for this intended case perhaps by simulating it (could do that by writing lots of large KV entries, potentially with an artificially small raft_trailing_logs config.

  1. Add a gauge metric that shows age in seconds of the oldest entry still in the logs
    • This will allow easy monitoring of how the current write rate, snapshot interval and trailing logs settings combine. Used along with the restore and snapshot timing metrics (consul.raft.fsm.restore, consul.raft.replication.installSnapshot) operators will be able to see if their cluster is heading towards a "danger zone" [edit: see note in comment below].
  2. We already have timing for making snapshots consul.raft.snapshot.takeSnapshot but it can be hard to quickly see how much of the time is spent writing out snapshots:
    image this was a very lightly loaded cluster with a small snapshot so it only snapshotted twice while loading lots of values in and wasn't snapshotting constantly like our example.
    • It would be good to export a metric for "percentage of time snapshotting" which is basically time taken to create snapshot divided by the time between snapshot starts. For example if you have a huge snapshot and it takes 30 seconds to write to disk, that might be fine if you only have a trickle of updates and so only have to complete the snapshot every 2 hours (< 0.01% percent of time writing snapshots). However if you are snapshotting every minute at that size and so every server is spending 50% of it's time writing snapshots to disk, the cluster is IO constrained and needs tuning and/or new hardware. It also significantly increases the risk of leader instability since it will be relatively common for a quorum of servers to all be writing snapshots at the same time and so delay their log appends causing the leader to loose its lease and step down. As a semi-scientific guess, I'd say if you're servers more than 5-10% of the time snapshotting there is an IO or tuning issue. Monitoring growth of this metric could be useful for capacity planning to avoid that outage.
    • We should document this well as with the above task.
  3. Document on our telemetry page how to use the above metrics to monitor for danger
    • If restore time is ever close to the minimum log age right after a snapshot is taken on the leader, the cluster is in danger of this condition. For example if max log age can drop to 20 seconds (write rate of 500/s, trailing logs of 10k) and restore is taking 15s the cluster is likely under provisioned in terms of disk IO available for the current write rate, or requires tuning (e.g. increasing the number of trailing logs). It should be noted that the time reported by restore is only part of the issue - if restore + transmit from leader is longer then the cluster is in danger, so a healthy overhead should be maintain - perhaps if restores took more than half of the minimum log age, operators should consider tuning or upgrading hardware.
    • notes for snapshotting time in 2. above.
    • Also document that raft.leader.dispatchLog and raft.rpc.appendEntries.storeLogs are important to look at in possibly diskIO issues as they will show if the diskIO from snapshotting is causing appends to be slowed down which is a direct cause of leader instability. Being able to see these metrics for all servers at once can be useful to see if multiple servers are being slowed down by snapshotting at the same time which is when commits start to stall and leadership becomes threatened.
  4. Review the logs emitted by raft in this state and see if there are easy ways to detect repeatedly not being able to catch up and log that more loudly - on busy servers it's pretty hard to work out from logs currently if this is even happening. It's possible if you know what to look for, but I'm sure we can make it more explicit and hopefully reasonably easily.
  5. Consider updates to our example Grafana dashboard that include some of these things. TBD cc @mkcp .
@banks
Copy link
Member Author

banks commented Apr 8, 2021

One issue discovered with raft metrics:

@banks
Copy link
Member Author

banks commented Nov 15, 2022

hashicorp/raft#490 added progress messages during snapshot restore which was one of the outstanding issues here.

I'm going to call this done for now. There were other items in the list that could possibly be picked up but we hope to have a more complete solution to this issue available in the near future so it's unlikely these other small items will be picked up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/reliability theme/telemetry Anything related to telemetry or observability
Projects
None yet
Development

No branches or pull requests

2 participants