Background
This is a follow-up to several incidents in which the failure described in #9609 was the root cause.
Currently raft periodically writes out complete state snapshots. For large state stores (e.g. 1GB or more) this can result in large amounts of disk IO. Since Consul has only a single configurable "data dir", the snapshot file is almost always written to the same physical device as the raft log store. Large snapshot writes that require a lot of IO operations therefore often contend with raft log appends for disk IO and slow them down. Slower log appends slow down commit times across the cluster, especially when multiple servers are snapshotting at the same time, which becomes increasingly likely at high write rates.
While raft attempts to prevent slow log appends from impacting heartbeat processing, we've observed that disk IO issues do still often cause cluster instability. Even if we find out why that is and fix it, snapshotting can still reduce overall write throughput, as has been demonstrated in some public Consul KV benchmarks in the past.
This proposal is to be investigated as a possible "easy win" to reduce the impact snapshot writing has on log appends. Before we bother doing all the extra plumbing work, it is important to do some time-boxed research, using a very basic implementation, on whether this reduces appendEntries tail latency while snapshots are being written. We might combine the investigation with #9620, as the test environment and the code paths to change are essentially the same.
Proposal
Introduce a configurable rate limit for writing the snapshot. I imagine we'd need to choose a "chunk size" and then have the FileSnapshot return an io.WriteCloser that wraps the existing buffered file in another type that checks the write rate and sleeps so that writes to the underlying file do not exceed the configured rate.
It's not clear how well this would work, and it will certainly need to be tuned by operators, so it is important to investigate whether it has an obvious effect on append latency or overall throughput when the snapshot is large. If we do it, we should also document some guidance on tuning the parameters.
If we decide this is worth adding, either to improve performance or just as a way to get control over a cluster in overload situations, we should try to make the throttling configuration hot-reloadable, so that no restart is needed when the cluster is already in an unhealthy state.