
storage/reports: change node liveness considerations #43825

Merged: 4 commits from reports.underreplicated-on-dead into cockroachdb:master on Jan 14, 2020

Conversation

andreimatei
Contributor

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
causes replicas on dead stores to be ignored when counting replicas for
the purposes of the under-replicated counter.
Liveness already mattered, and continues to matter, for the unavailable
count and for the system.critical_localities report.

Note that, interestingly, this change means a range can be considered
both under-replicated and over-replicated at the same time: there can be
too many replicas overall while sufficiently many of them are dead.
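
To make that paradox concrete, here is a minimal Go sketch of the counting rule. rangeStatus and its bool-slice representation of replica liveness are illustrative inventions, not CockroachDB's actual reports code:

```go
package main

import "fmt"

// rangeStatus sketches the new counting rule: dead replicas are excluded
// from the under-replication check but still count toward over-replication.
func rangeStatus(replicaAlive []bool, target int) (under, over bool) {
	live := 0
	for _, alive := range replicaAlive {
		if alive {
			live++
		}
	}
	under = live < target             // fewer *live* replicas than the zone's target
	over = len(replicaAlive) > target // more *total* replicas than the target
	return under, over
}

func main() {
	// Five replicas with a target of three, three of them dead: the range
	// is simultaneously under-replicated and over-replicated.
	replicas := []bool{true, true, false, false, false}
	under, over := rangeStatus(replicas, 3)
	fmt.Printf("under=%t over=%t\n", under, over) // under=true over=true
}
```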

This patch also changes the liveness criteria across the board for the
reports that care about liveness (unavailable ranges, under-replicated
ranges, and critical localities). It used to be that a node was
considered dead if its liveness record had not been pinged for
server.time_until_store_dead (5 minutes by default). Now a node is
considered dead as soon as its liveness record expires (a matter of
seconds), so all the reports become much more sensitive to node
unresponsiveness.
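
For illustration, a small Go sketch contrasting the two rules. livenessRecord, deadOld, and deadNew are hypothetical names; only the server.time_until_store_dead setting comes from the text above:

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a hypothetical stand-in for a node's liveness entry.
type livenessRecord struct {
	expiration time.Time // when the last heartbeat's claim to liveness runs out
}

// deadOld sketches the previous rule: a node counted as dead only once its
// record had gone un-pinged for server.time_until_store_dead (5m default).
func deadOld(l livenessRecord, now time.Time, timeUntilStoreDead time.Duration) bool {
	return now.Sub(l.expiration) > timeUntilStoreDead
}

// deadNew sketches the new rule: a node counts as dead the moment its
// liveness record expires, typically seconds after it stops heartbeating.
func deadNew(l livenessRecord, now time.Time) bool {
	return now.After(l.expiration)
}

func main() {
	now := time.Now()
	rec := livenessRecord{expiration: now.Add(-30 * time.Second)} // expired 30s ago
	fmt.Println(deadOld(rec, now, 5*time.Minute)) // false: still "live" under the old rule
	fmt.Println(deadNew(rec, now))                // true: dead as soon as the record expires
}
```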

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
Release note (sql change): The system.critical_localities and
system.replication_stats reports (specifically the unavailable_ranges and
under_replicated_ranges fields) are now quicker to reflect dead or
unresponsive nodes in their accounting. A node used to be considered dead
if it had been unresponsive for server.time_until_store_dead (5 minutes
by default); now it is considered dead once it has been unresponsive for
a few seconds.
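
As a usage example, here is a hedged Go sketch that reads the report over the Postgres wire protocol with lib/pq. Only unavailable_ranges and under_replicated_ranges come from the release note above; the connection string and the zone_id column are assumptions to adjust for your cluster:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection string is an assumption: adjust host, port, and user.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// unavailable_ranges and under_replicated_ranges are the fields named
	// in the release note; zone_id is an assumed column.
	rows, err := db.Query(
		"SELECT zone_id, unavailable_ranges, under_replicated_ranges FROM system.replication_stats")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var zoneID, unavailable, underReplicated int
		if err := rows.Scan(&zoneID, &unavailable, &underReplicated); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("zone %d: unavailable=%d under-replicated=%d\n",
			zoneID, unavailable, underReplicated)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```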


Contributor Author

@andreimatei andreimatei left a comment


hold off the review, I want to do something else

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

Contributor Author

@andreimatei andreimatei left a comment


good to go

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)

Contributor

@ajwerner ajwerner left a comment


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 4 of 4 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

					// Under-replicated.
					{key: "/Table/t1/pk/102", stores: []int{1, 2}},
					// Under-replicated because 3 is dead.

Should this comment be "Under-replicated because 4 is dead."?

Localize liveness state.

Release note: None

Before this patch, whether a node was "alive" or "dead" did not matter
for the under-replicated ranges count in system.replication_stats. This patch
causes replicas on dead stores to be ignored when counting replicas for
the purposes of the under-replicated counter.
Liveness already mattered, and continues to matter, for the unavailable
count and for the system.critical_localities report.

Note that, interestingly, this change means a range can be considered
both under-replicated and over-replicated at the same time: there can be
too many replicas overall while sufficiently many of them are dead.

This patch also changes the liveness criteria across the board for the
reports that care about liveness (unavailable ranges, under-replicated
ranges, and critical localities). The code was buggy in that a
decommissioning node was considered live for 5 minutes after it stopped
heartbeating its liveness record, whereas a non-decommissioning one was
only considered live for a few seconds. The patch fixes this by making
liveness determinations with the same logic the under-replicated metric
uses: a node is dead as soon as its liveness record expires, regardless
of decommissioning status (see the sketch after this commit message).

Release note (sql change): Ranges are now considered under-replicated by
the system.replication_stats report when one of the replicas is
unresponsive (or the respective node is not running).
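
A minimal Go sketch of that unified rule, assuming a hypothetical status type; the field names are illustrative, not CockroachDB's actual liveness types:

```go
package main

import (
	"fmt"
	"time"
)

// status is a hypothetical liveness record for a node.
type status struct {
	expiration      time.Time
	decommissioning bool
}

// isLive applies one rule for every node: alive iff the liveness record
// has not expired. Decommissioning status no longer buys a node the old
// 5-minute grace period.
func isLive(s status, now time.Time) bool {
	return now.Before(s.expiration)
}

func main() {
	now := time.Now()
	expired := status{expiration: now.Add(-10 * time.Second), decommissioning: true}
	healthy := status{expiration: now.Add(4 * time.Second), decommissioning: false}
	fmt.Println(isLive(expired, now)) // false: dead despite decommissioning
	fmt.Println(isLive(healthy, now)) // true
}
```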
@andreimatei andreimatei force-pushed the reports.underreplicated-on-dead branch from 6ab7b7d to bd9a8af on January 10, 2020 21:51
Contributor Author

@andreimatei andreimatei left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @darinpp)


pkg/storage/reports/replication_stats_report_test.go, line 292 at r4 (raw file):

Previously, ajwerner wrote…

Should this comment be "Under-replicated because 4 is dead."?

done

craig bot pushed a commit that referenced this pull request Jan 14, 2020
43825: storage/reports: change node liveness considerations r=andreimatei a=andreimatei


43943: Revert "roachprod: Make multiple set [provider]-zones always geo-distribute nodes" r=ajwerner a=jlinder

This reverts commit d24e40e.

This reverts to the prior roachprod behavior for how --geo and --zones
worked, because the nightly roachtest runs from TeamCity are failing
numerous tests; reverting allows fixing roachtest without time pressure.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
@craig
Contributor

craig bot commented Jan 14, 2020

Build succeeded

@craig craig bot merged commit bd9a8af into cockroachdb:master Jan 14, 2020
@andreimatei andreimatei deleted the reports.underreplicated-on-dead branch January 14, 2020 20:09