-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage/reports: change node liveness considerations #43825
storage/reports: change node liveness considerations #43825
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hold off the review, I want to do something else
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)
2974b45
to
6ab7b7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to go
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 4 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)
pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):
// Under-replicated. {key: "/Table/t1/pk/102", stores: []int{1, 2}}, // Under-replicated because 3 is dead.
Under-replicated because 4 is dead.
?
Release note: None
Localize liveness state. Release note: None
Release note: None
Before this patch, whether a node was "alive" or "dead" did not matter for the under-replicated ranges count in system.replication_stats. This patch makes replicas on dead stores be ignored when counting replicas for purposes of the underreplicated counter. Liveness used to matter for the unavailable count and for the system.critical_localities report (and continues to matter). Note that this change interestingly means that a range can be considered both under-replicated and over-replicated at the same time - if there's too many replicas, but sufficiently many of them are dead. This patch also changes the liveness criteria across the board for the reports that cared about liveness (unavailable ranges, under-replicated ranges and critical localities). The code was buggy in that a decomissioning node was considered live for 5 minutes after it stopped heartbeating its liveness record, whereas a non-decomissioning one was only considered live for a few seconds. The patch fixes it by using the same logic to make liveness determinations as the underreplicated metric does - you're dead as soon as the liveness record expires regardless of decomissioning status. Release note (sql change): Ranges are now considered under-replicated by the system.replication_stats report when one of the replicas is unresponsive (or the respective node is not running).
6ab7b7d
to
bd9a8af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)
pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):
Previously, ajwerner wrote…
Under-replicated because 4 is dead.
?
done
43825: storage/reports: change node liveness considerations r=andreimatei a=andreimatei Before this patch, whether a node was "alive" or "dead" did not matter for the under-replicated ranges count in system.replication_stats. This patch makes replicas on dead stores be ignored when counting replicas for purposes of the underreplicated counter. Liveness used to matter for the unavailable count and for the system.critical_localities report (and continues to matter). Note that this change interestingly means that a range can be considered both under-replicated and over-replicated at the same time - if there's too many replicas, but sufficiently many of them are dead. This patch also changes the liveness criteria across the board for the reports that cared about liveness (unavailable ranges, under-replicated ranges and critical localities). It used to be that a node was considered dead if its liveness record had not been pinged for server.time_until_store_dead (5 min by default). Now, a node is considered dead as soon as its liveness record expires (seconds). So all the reports become much more sensitive to node unresponsiveness. Release note (sql change): Ranges are now considered under-replicated by the system.replication_stats report when one of the replicas is unresponsive (or the respective node is not running). Release note (sql change): The system.critical_localities and system.replication_stats (fields unavailable_ranges and under_replicated_ranges) are now quicker to reflect dead or unresponsive nodes in their accounting. A node used to be considered dead if it was unresponsive for server.time_until_store_dead (5 min by default), now it is considered dead if it's been unresponsive for a few seconds. 43943: Revert "roachprod: Make multiple set [provider]-zones always geo-distribute nodes" r=ajwerner a=jlinder This reverts commit d24e40e. Reverting back to the prior roachprod functionality for how --geo and --zones worked because nightly roachtest running from teamcity is failing numerous tests and this will allow fixing roachtest without the time pressure. Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: James H. Linder <[email protected]>
Build succeeded |
Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
makes replicas on dead stores be ignored when counting replicas for
purposes of the underreplicated counter.
Liveness used to matter for the unavailable count and for the
system.critical_localities report (and continues to matter).
Note that this change interestingly means that a range can be considered
both under-replicated and over-replicated at the same time - if there's
too many replicas, but sufficiently many of them are dead.
This patch also changes the liveness criteria across the board for the
reports that cared about liveness (unavailable ranges, under-replicated
ranges and critical localities). It used to be that a node was
considered dead if its liveness record had not been pinged for
server.time_until_store_dead (5 min by default).
Now, a node is considered dead as soon as its liveness record expires
(seconds). So all the reports become much more sensitive to node
unresponsiveness.
Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
Release note (sql change): The system.critical_localities and
system.replication_stats (fields unavailable_ranges and
under_replicated_ranges) are now quicker to reflect dead or unresponsive
nodes in their accounting. A node used to be considered dead if it was
unresponsive for server.time_until_store_dead (5 min by default), now it
is considered dead if it's been unresponsive for a few seconds.