forked from cockroachdb/cockroach
kvserver: avoid redundant liveness heartbeats under a thundering herd
When a node's liveness expires, either because its liveness record's epoch is incremented or it is just slow to heartbeat its record, all of its epoch-based leases immediately become invalid. As a result, we often see a thundering herd of requests attempt to synchronously heartbeat the node's liveness record, on the order of the number of ranges that lost their lease. We already limit the concurrency of these heartbeats to 1, so there is not a huge concern that this will lead to overwhelming the liveness range, but it does cause other issues.

For one, it means that we end up heartbeating the liveness record very rapidly, which causes large growth in MVCC history. It also means that heartbeats at the end of the queue have to wait for all other heartbeats in front of them to complete. Even if these heartbeats only take 5ms each, if there are 100 of them waiting, then the last one in line will wait for 500ms and its range will be unavailable during this time. This also has the potential to starve the liveness heartbeat loop, which isn't a problem in and of itself as long as other synchronous heartbeats are succeeding, but leads to concerning log warnings.

Finally, this was an instance where we were adding additional load to a cluster once it was close to being overloaded. That's generally a bad property for a system that wants to stay stable, and this change helps avoid it.

The solution here is to detect redundant heartbeats and make them no-ops where possible. This has a similar effect to explicitly coalescing heartbeats, but it's easier to reason about and requires us to maintain less state.

The commit is conservative about this, providing a fairly strong guarantee that a heartbeat attempt, if successful, will ensure that the liveness record's expiration will be at least the liveness threshold above the time that the method was called. We may be able to relax this and say that the heartbeat attempt will just ensure that the expiration is now above that of the oldLiveness provided, but this weakened guarantee seems harder to reason about as a consumer of this interface.

Release note (performance improvement): ranges recover moderately faster when their leaseholder is briefly down before becoming live again.
1 parent 09804ff, commit 1dc18df
Showing 8 changed files with 193 additions and 45 deletions.