
storage/reports: change node liveness considerations #43825

Merged: 4 commits from reports.underreplicated-on-dead into cockroachdb:master on Jan 14, 2020

Conversation

andreimatei
Contributor

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
causes replicas on dead stores to be ignored when counting replicas for
the purposes of the under-replicated counter.
Liveness already mattered, and continues to matter, for the unavailable
count and for the system.critical_localities report.

Note that, interestingly, this change means a range can be considered
both under-replicated and over-replicated at the same time: there can be
too many replicas overall while sufficiently many of them are dead.
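
To make that paradox concrete, here is a minimal Go sketch of the counting rule. rangeStatus and its bool-slice representation of replica liveness are illustrative inventions, not CockroachDB's actual reports code:

```go
package main

import "fmt"

// rangeStatus sketches the new counting rule: dead replicas are excluded
// from the under-replication check but still count toward over-replication.
func rangeStatus(replicaAlive []bool, target int) (under, over bool) {
	live := 0
	for _, alive := range replicaAlive {
		if alive {
			live++
		}
	}
	under = live < target             // fewer *live* replicas than the zone's target
	over = len(replicaAlive) > target // more *total* replicas than the target
	return under, over
}

func main() {
	// Five replicas with a target of three, three of them dead: the range
	// is simultaneously under-replicated and over-replicated.
	replicas := []bool{true, true, false, false, false}
	under, over := rangeStatus(replicas, 3)
	fmt.Printf("under=%t over=%t\n", under, over) // under=true over=true
}
```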

This patch also changes the liveness criteria across the board for the
reports that care about liveness (unavailable ranges, under-replicated
ranges, and critical localities). It used to be that a node was
considered dead if its liveness record had not been pinged for
server.time_until_store_dead (5 minutes by default). Now a node is
considered dead as soon as its liveness record expires (a matter of
seconds), so all the reports become much more sensitive to node
unresponsiveness.
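
For illustration, a small Go sketch contrasting the two rules. livenessRecord, deadOld, and deadNew are hypothetical names; only the server.time_until_store_dead setting comes from the text above:

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a hypothetical stand-in for a node's liveness entry.
type livenessRecord struct {
	expiration time.Time // when the last heartbeat's claim to liveness runs out
}

// deadOld sketches the previous rule: a node counted as dead only once its
// record had gone un-pinged for server.time_until_store_dead (5m default).
func deadOld(l livenessRecord, now time.Time, timeUntilStoreDead time.Duration) bool {
	return now.Sub(l.expiration) > timeUntilStoreDead
}

// deadNew sketches the new rule: a node counts as dead the moment its
// liveness record expires, typically seconds after it stops heartbeating.
func deadNew(l livenessRecord, now time.Time) bool {
	return now.After(l.expiration)
}

func main() {
	now := time.Now()
	rec := livenessRecord{expiration: now.Add(-30 * time.Second)} // expired 30s ago
	fmt.Println(deadOld(rec, now, 5*time.Minute)) // false: still "live" under the old rule
	fmt.Println(deadNew(rec, now))                // true: dead as soon as the record expires
}
```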

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
Release note (sql change): The system.critical_localities and
system.replication_stats reports (specifically the unavailable_ranges and
under_replicated_ranges fields) are now quicker to reflect dead or
unresponsive nodes in their accounting. A node used to be considered dead
if it had been unresponsive for server.time_until_store_dead (5 minutes
by default); now it is considered dead once it has been unresponsive for
a few seconds.
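
As a usage example, here is a hedged Go sketch that reads the report over the Postgres wire protocol with lib/pq. Only unavailable_ranges and under_replicated_ranges come from the release note above; the connection string and the zone_id column are assumptions to adjust for your cluster:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection string is an assumption: adjust host, port, and user.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// unavailable_ranges and under_replicated_ranges are the fields named
	// in the release note; zone_id is an assumed column.
	rows, err := db.Query(
		"SELECT zone_id, unavailable_ranges, under_replicated_ranges FROM system.replication_stats")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var zoneID, unavailable, underReplicated int
		if err := rows.Scan(&zoneID, &unavailable, &underReplicated); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("zone %d: unavailable=%d under-replicated=%d\n",
			zoneID, unavailable, underReplicated)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```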


Contributor Author

@andreimatei andreimatei left a comment


hold off the review, I want to do something else

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

Contributor Author

@andreimatei andreimatei left a comment


good to go

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

Contributor

@ajwerner ajwerner left a comment


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 4 of 4 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

					// Under-replicated.
					{key: "/Table/t1/pk/102", stores: []int{1, 2}},
					// Under-replicated because 3 is dead.

Should this comment be "Under-replicated because 4 is dead."?

Localize liveness state.

Release note: None

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
causes replicas on dead stores to be ignored when counting replicas for
the purposes of the under-replicated counter.
Liveness already mattered, and continues to matter, for the unavailable
count and for the system.critical_localities report.

Note that, interestingly, this change means a range can be considered
both under-replicated and over-replicated at the same time: there can be
too many replicas overall while sufficiently many of them are dead.

This patch also changes the liveness criteria across the board for the
reports that care about liveness (unavailable ranges, under-replicated
ranges, and critical localities). The code was buggy in that a
decommissioning node was considered live for 5 minutes after it stopped
heartbeating its liveness record, whereas a non-decommissioning one was
only considered live for a few seconds. The patch fixes this by making
liveness determinations with the same logic the under-replicated metric
uses: a node is dead as soon as its liveness record expires, regardless
of decommissioning status (see the sketch after this commit message).

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
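
A minimal Go sketch of that unified rule, assuming a hypothetical status type; the field names are illustrative, not CockroachDB's actual liveness types:

```go
package main

import (
	"fmt"
	"time"
)

// status is a hypothetical liveness record for a node.
type status struct {
	expiration      time.Time
	decommissioning bool
}

// isLive applies one rule for every node: alive iff the liveness record
// has not expired. Decommissioning status no longer buys a node the old
// 5-minute grace period.
func isLive(s status, now time.Time) bool {
	return now.Before(s.expiration)
}

func main() {
	now := time.Now()
	expired := status{expiration: now.Add(-10 * time.Second), decommissioning: true}
	healthy := status{expiration: now.Add(4 * time.Second), decommissioning: false}
	fmt.Println(isLive(expired, now)) // false: dead despite decommissioning
	fmt.Println(isLive(healthy, now)) // true
}
```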
@andreimatei andreimatei force-pushed the reports.underreplicated-on-dead branch from 6ab7b7d to bd9a8af on January 10, 2020 21:51
Contributor Author

@andreimatei andreimatei left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

Previously, ajwerner wrote…

Should this comment be "Under-replicated because 4 is dead."?

done

craig bot pushed a commit that referenced this pull request Jan 14, 2020
43825: storage/reports: change node liveness considerations r=andreimatei a=andreimatei


43943: Revert "roachprod: Make multiple set [provider]-zones always geo-distribute nodes" r=ajwerner a=jlinder

This reverts commit d24e40e.

This reverts to the prior roachprod behavior for how --geo and --zones
worked, because the nightly roachtest runs from TeamCity are failing
numerous tests; reverting allows fixing roachtest without time pressure.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
@craig
Contributor

craig bot commented Jan 14, 2020

Build succeeded

@craig craig bot merged commit bd9a8af into cockroachdb:master Jan 14, 2020
@andreimatei andreimatei deleted the reports.underreplicated-on-dead branch January 14, 2020 20:09