Large Snapshot Replication Failure Problem Statement
Currently, raft writes out a complete snapshot of the state store to disk after every 16k writes to the DB, checked every 30s. After a successful snapshot, we truncate the logs to prevent them from growing unboundedly. Instead of truncating all logs up to the index we just snapshotted at, we always ensure at least 10k logs are left in the log store (these parameters can be tuned).
The reason for leaving these "trailing logs" is so that a follower which goes down for a restart just before the snapshot completes can continue replicating once it comes back up (which may take a few seconds).
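For reference, these knobs correspond to fields on the `Config` struct in hashicorp/raft (Consul wires the trailing-logs one through its `raft_trailing_logs` agent option). A minimal sketch using the values quoted above, not a copy of Consul's actual wiring:

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// snapshotTuning returns a raft.Config using the values described above:
// snapshot after ~16k new writes (checked every 30s) and keep ~10k trailing
// logs after each truncation.
func snapshotTuning() *raft.Config {
	conf := raft.DefaultConfig()
	conf.SnapshotThreshold = 16384           // logs written before a snapshot is taken
	conf.SnapshotInterval = 30 * time.Second // how often the threshold is checked
	conf.TrailingLogs = 10240                // logs retained after truncation
	return conf
}

func main() { _ = snapshotTuning() }
```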
If a follower does get behind (e.g. it's partitioned for a few minutes), it's possible that there will have been more than 10k updates and a snapshot+truncate on the leader in the interim, meaning the follower is unable to continue replicating from the logs. In this case the leader will deliver it an entire new snapshot over the wire before resuming log-based replication.
In general this works well for typical uses of Consul, however it makes a hard assumption: that the time required to transmit the most recent snapshot to a peer, and have the peer restore it into its state store ready to resume accepting logs, is much smaller than the time taken to append 10k more log updates on the leader.
On clusters that have both a high write rate (e.g. 500 writes per second) and a large state snapshot (e.g. larger than 1GB), even on reasonably fast hardware, that assumption can be violated, resulting in a permanent error condition where followers can never catch up to the leader again.
A worked example:
At 500 writes/second, the snapshot threshold is met within about 32 seconds, and with a check interval of 30 seconds that means the leader will be writing out a full 1GB snapshot file roughly every minute.
Every minute, after the snapshot, the logs are truncated to 10k entries, which is only the last 20 seconds' worth of updates.
If sending the 1GB snapshot and restoring it on a follower takes more than 20 seconds, the follower will never be able to get healthy after a restart unless the write volume drops.
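To make that arithmetic explicit, here is a small standalone sketch of the same numbers (the 25-second transmit-plus-restore figure is purely illustrative):

```go
package main

import "fmt"

func main() {
	const (
		writesPerSec        = 500.0
		snapshotThreshold   = 16384.0 // writes between snapshots
		trailingLogs        = 10240.0 // logs kept after truncation
		transmitPlusRestore = 25.0    // seconds to ship + restore a 1GB snapshot (illustrative)
	)

	// Roughly how often the leader snapshots at this write rate.
	fmt.Printf("snapshot threshold hit every ~%.0fs\n", snapshotThreshold/writesPerSec)

	// How far back the trailing logs reach immediately after truncation.
	catchUpWindow := trailingLogs / writesPerSec
	fmt.Printf("catch-up window after truncation: ~%.0fs\n", catchUpWindow)

	// If shipping and restoring a full snapshot takes longer than that window,
	// a lagging follower can never rejoin log-based replication.
	fmt.Println("follower can catch up via logs:", transmitPlusRestore < catchUpWindow)
}
```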
Note that `raft_trailing_logs` can be increased, but since this requires a restart and only helps once the current leader has the new value, it's often not possible to recover from this situation without downtime, or without somehow reducing the write rate on the cluster enough to allow followers to catch up and then reconfiguring with a rolling restart. Reducing snapshot size helps in the medium term but often isn't possible immediately, and requires more writes to do, so it is rarely an effective step during an incident.
There is also currently a good reason not to increase `raft_trailing_logs` to something very high pre-emptively. Most deployments have ample disk space and so may be OK with the additional disk usage that retaining many more logs involves; however, with the current log store implementation, which is a grow-only copy-on-write B-tree, there is a limit to how large that file can get before it starts to impact log append and truncate performance. This might change in the future with a different log store implementation, which could reduce the throughput impact of a high `raft_trailing_logs`.
Observability Improvements
There are a number of things we can do to mitigate this, which will be broken out into several separate issues. This one will act as the reference for the problem statement above.
One significant problem when encountering this scenario in the wild is that, even if you know to look, working out whether it is going to happen is very hard. There are a number of simple ways we can improve our metrics and logging that will help diagnose and monitor for this situation before it occurs.
To resolve this issue I propose we consider the following. If on investigation we decide not to do any of them immediately, we can split them into separate issues or drop them. These should all be relatively small code/docs changes. The majority of the work would likely be around validating that they work well for this intended case, perhaps by simulating it (we could do that by writing lots of large KV entries, potentially with an artificially small `raft_trailing_logs` config).
Add a gauge metric that shows age in seconds of the oldest entry still in the logs
This will allow easy monitoring of how the current write rate, snapshot interval and trailing logs settings combine. Used along with the restore and snapshot timing metrics (`consul.raft.fsm.restore`, `consul.raft.replication.installSnapshot`), it will let operators see if their cluster is heading towards a "danger zone" [edit: see note in comment below].
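As a rough illustration only (not a commitment to a name or implementation), assuming log entries carry an append timestamp as `raft.Log.AppendedAt` does in recent hashicorp/raft versions, the gauge could be derived from the log store like this:

```go
package raftmetrics

import (
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/hashicorp/raft"
)

// emitOldestLogAge publishes a gauge with the age, in seconds, of the oldest
// entry still held in the raft log store. Sketch only; the metric name
// "raft.oldestLogAge" is illustrative.
func emitOldestLogAge(store raft.LogStore) error {
	first, err := store.FirstIndex()
	if err != nil || first == 0 {
		return err // error, or an empty store with nothing to report
	}
	var entry raft.Log
	if err := store.GetLog(first, &entry); err != nil {
		return err
	}
	age := time.Since(entry.AppendedAt).Seconds()
	metrics.SetGauge([]string{"raft", "oldestLogAge"}, float32(age))
	return nil
}
```

In practice this would be emitted periodically (and re-read after each truncation), so operators can watch how low the gauge dips right after a snapshot.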
We already have timing for making snapshots (`consul.raft.snapshot.takeSnapshot`), but it can be hard to quickly see how much of the time is spent writing out snapshots: for example, a very lightly loaded cluster with a small snapshot might only snapshot twice while loading lots of values in, rather than snapshotting constantly like our example above.
It would be good to export a metric for "percentage of time snapshotting", which is essentially the time taken to create a snapshot divided by the time between snapshot starts. For example, if you have a huge snapshot and it takes 30 seconds to write to disk, that might be fine if you only have a trickle of updates and so only complete a snapshot every 2 hours (well under 1% of time spent writing snapshots). However, if you are snapshotting every minute at that size, every server is spending 50% of its time writing snapshots to disk; the cluster is IO constrained and needs tuning and/or new hardware. It also significantly increases the risk of leader instability, since it will be relatively common for a quorum of servers to all be writing snapshots at the same time and so delay their log appends, causing the leader to lose its lease and step down. As a semi-scientific guess, I'd say if your servers spend more than 5-10% of their time snapshotting there is an IO or tuning issue. Monitoring growth of this metric could be useful for capacity planning to avoid that outage.
We should document this well, as with the task above.
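For illustration, the ratio could be tracked with something like the sketch below; the hook point and metric name are assumptions, not existing Consul code:

```go
package raftmetrics

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// snapshotLoad derives "percentage of time spent snapshotting" from the
// start time and duration of successive snapshots.
type snapshotLoad struct {
	lastStart time.Time
}

// observe is called when a snapshot that began at start finishes after took.
// It emits the snapshot's duration as a percentage of the time elapsed since
// the previous snapshot started (the "time between snapshot starts").
func (s *snapshotLoad) observe(start time.Time, took time.Duration) {
	if !s.lastStart.IsZero() {
		betweenStarts := start.Sub(s.lastStart)
		if betweenStarts > 0 {
			pct := 100 * took.Seconds() / betweenStarts.Seconds()
			metrics.SetGauge([]string{"raft", "snapshot", "percentTime"}, float32(pct))
		}
	}
	s.lastStart = start
}
```

In the 30-seconds-every-minute scenario above this gauge would sit around 50%, while the trickle-of-writes case stays well under 1%.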
Document on our telemetry page how to use the above metrics to monitor for danger
If restore time is ever close to the minimum oldest-log age right after a snapshot is taken on the leader, the cluster is in danger of this condition. For example, if the oldest-log age can drop to 20 seconds (write rate of 500/s, trailing logs of 10k) and restore is taking 15s, the cluster is likely under-provisioned in terms of disk IO available for the current write rate, or requires tuning (e.g. increasing the number of trailing logs). It should be noted that the time reported by restore is only part of the issue: if restore plus transmit from the leader is longer, the cluster is in danger, so a healthy overhead should be maintained. Perhaps if restores take more than half of the minimum log age, operators should consider tuning or upgrading hardware.
See also the notes on snapshotting time under the "percentage of time snapshotting" item above.
Also document that `raft.leader.dispatchLog` and `raft.rpc.appendEntries.storeLogs` are important to look at for possible disk IO issues, as they will show if the disk IO from snapshotting is causing appends to be slowed down, which is a direct cause of leader instability. Being able to see these metrics for all servers at once can be useful to see whether multiple servers are being slowed down by snapshotting at the same time, which is when commits start to stall and leadership becomes threatened.
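A trivial sketch of the rule of thumb above, purely to pin down the comparison being suggested (the one-half threshold is the guess from above, not a tested constant):

```go
package raftmetrics

import "time"

// restoreTimeDangerous reports whether snapshot restore time is uncomfortably
// close to the minimum oldest-log age seen right after truncation, i.e. the
// window a restarted follower has to catch up from logs.
func restoreTimeDangerous(restore, minOldestLogAge time.Duration) bool {
	return restore > minOldestLogAge/2
}
```

For instance, `restoreTimeDangerous(15*time.Second, 20*time.Second)` is true, matching the under-provisioned example above.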
Review the logs emitted by raft in this state and see if there are easy ways to detect a follower repeatedly failing to catch up and log that more loudly: on busy servers it's currently pretty hard to work out from the logs whether this is even happening. It's possible if you know what to look for, but I'm sure we can make it more explicit, hopefully reasonably easily.
Consider updates to our example Grafana dashboard that include some of these things. TBD, cc @mkcp.
Note that on a regular restart we don't currently emit any metric to show how long the restore took; we only do so on leader-initiated snapshot installs (and in that case `raft.replication.installSnapshot` is a better measure to compare against).
In typical operation, before the error condition is hit, you'd not get many opportunities to observe snapshot installs since replication is keeping up, so we really need to emit a metric for the restore on startup too.
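A sketch of what emitting that on a plain restart could look like, simply wrapping the FSM restore in a go-metrics timer (the metric name is an assumption, not an existing Consul metric):

```go
package raftmetrics

import (
	"io"
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/hashicorp/raft"
)

// restoreWithTiming restores a snapshot into the FSM and records how long it
// took, so the duration is visible on startup too, not just when the leader
// installs a snapshot on a follower.
func restoreWithTiming(fsm raft.FSM, snap io.ReadCloser) error {
	defer metrics.MeasureSince([]string{"raft", "fsm", "restoreOnStart"}, time.Now())
	return fsm.Restore(snap)
}
```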
hashicorp/raft#490 added progress messages during snapshot restore which was one of the outstanding issues here.
I'm going to call this done for now. There were other items in the list that could still be picked up, but we hope to have a more complete solution to this issue available in the near future, so it's unlikely these other small items will be tackled.