server: general correctness fixes for ListSessions / Ranges / upgradeStatus #26821

Merged: 1 commit, Jun 29, 2018

Conversation

knz
Contributor

@knz knz commented Jun 19, 2018

Prior to this patch, various parts of the code that were attempting to
"iterate over all nodes in the cluster" were confused and would
either:

  • skip over nodes that are known to exist in KV but for which gossip
    information was currently unavailable, or
  • fail to skip over removed nodes (dead + decommissioned) or
  • incorrectly skip over temporarily failed nodes.

This patch comprehensively addresses this class of issues by:

  • tweaking (*NodeLiveness).GetIsLiveMap() to exclude decommissioned
    nodes upfront.
  • documenting (*NodeLiveness).GetLivenessStatusMap() better, to
    mention it includes decommissioned nodes and does not include
    all nodes known to KV.
  • providing a new method (*statusServer).NodesWithLiveness() which
    always includes all nodes known in KV, not just those for which
    gossip information is available.
  • ensuring that (*Server).upgradeStatus() does not incorrectly skip
    over temporarily dead nodes or nodes yet unknown to gossip by
    using the new NodesWithLiveness() method appropriately.
  • providing a new method (*statusServer).iterateNodes() which does
    "the right thing" via NodesWithLiveness() and using it for
    the status RPCs ListSessions() and Range().

Additionally this patch makes SHOW QUERIES and SHOW SESSIONS report
error details when it fails to query information from a node.

Release note (bug fix): the server will not finalize a version upgrade
automatically and erroneously if there are nodes temporarily inactive
in the cluster.

Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report
the details of the error upon failing to gather data from other nodes.

Fixes #22863.
Fixes #26852.
Fixes #26897.
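For readers less familiar with the server internals, here is a minimal, self-contained sketch of the iteration model the bullets above describe: the node list comes from KV, liveness information is layered on top (falling back to an "unavailable" status when gossip has nothing yet), and only decommissioned nodes are excluded. The types and helper names below (`NodeID`, `LivenessStatus`, `nodesWithLiveness`, `iterateNodes`) are simplified stand-ins for illustration, not the actual CockroachDB signatures.

```go
package example

import (
	"context"
	"fmt"
	"sync"
)

type NodeID int32

type LivenessStatus int

const (
	LivenessDecommissioned LivenessStatus = iota
	LivenessUnavailable
	LivenessLive
)

// nodesWithLiveness pairs every node known to KV with a liveness status,
// falling back to "unavailable" when gossip has no record for the node yet,
// so that no node silently drops out of the iteration.
func nodesWithLiveness(kvNodeIDs []NodeID, gossip map[NodeID]LivenessStatus) map[NodeID]LivenessStatus {
	out := make(map[NodeID]LivenessStatus, len(kvNodeIDs))
	for _, id := range kvNodeIDs {
		status, ok := gossip[id]
		if !ok {
			status = LivenessUnavailable // known to KV, not (yet) to gossip
		}
		out[id] = status
	}
	return out
}

// iterateNodes runs nodeFn concurrently on every non-removed node and
// collects per-node errors instead of silently skipping failures.
func iterateNodes(
	ctx context.Context,
	nodes map[NodeID]LivenessStatus,
	nodeFn func(context.Context, NodeID) error,
) map[NodeID]error {
	var (
		mu   sync.Mutex
		errs = make(map[NodeID]error)
		wg   sync.WaitGroup
	)
	for id, status := range nodes {
		if status == LivenessDecommissioned {
			continue // removed nodes are the only ones excluded
		}
		wg.Add(1)
		go func(id NodeID) {
			defer wg.Done()
			if err := nodeFn(ctx, id); err != nil {
				mu.Lock()
				errs[id] = fmt.Errorf("n%d: %w", id, err)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return errs
}
```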

@knz knz requested a review from tbg June 19, 2018 13:02
@knz knz requested review from a team June 19, 2018 13:02
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented Jun 19, 2018

Any suggestion as to how to force a node failure in a test?

@tbg
Member

tbg commented Jun 19, 2018

LGTM

If you wanted to test this, why not a unit test?

Also, while you're in that area of the code -- could you check that the RPC timeouts aren't too aggressive, and that it's aware of decommissioned/dead nodes? I'm seeing these null columns regularly in snippets from production users, even though their clusters are presumably healthy.

@knz
Contributor Author

knz commented Jun 19, 2018

could you check that the RPC timeouts aren't too aggressive

the code uses the standard base.NetworkTimeout, which is currently a constant set to 3s. Is this appropriate?

and that it's aware of decommissioned/dead nodes?

The way the code currently works is:

  1. issues a KV scan on the node status range to determine node IDs -- the other fields of the node descriptors are ignored at that point
  2. for each node ID, it uses dialNode which in turn checks gossip to determine whether the node is alive. For every node that's not alive according to gossip, an "address not found" error is generated.

This raises a couple of questions for me, given that I am not very familiar with these APIs:

  • why use a KV scan on the node status range to establish the list of node IDs when s.gossip.nodeDescs already contains these node IDs?
  • is it possible to have a node ID in the KV range but not in gossip? (So that the nodeID-to-address resolution fails although the node is known in KV?)
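A compact sketch of the two-step flow described above, to make the failure mode concrete; kvScanNodeIDs and gossipAddr are hypothetical stand-ins for the KV scan and the gossip lookup, not real CockroachDB APIs:

```go
package example

import (
	"context"
	"fmt"
)

type NodeID int32

// Illustrative stubs: the real code scans the node status range in KV and
// resolves node addresses through gossip.
func kvScanNodeIDs(ctx context.Context) ([]NodeID, error) { return []NodeID{1, 2, 3}, nil }
func gossipAddr(id NodeID) (string, bool)                 { return "", false }

func fanOutToAllNodes(ctx context.Context) error {
	nodeIDs, err := kvScanNodeIDs(ctx) // step 1: node IDs come from KV
	if err != nil {
		return err
	}
	for _, id := range nodeIDs {
		addr, ok := gossipAddr(id) // step 2: address resolution via gossip
		if !ok {
			// A node known to KV but not (yet) present in gossip surfaces
			// here as an "address not found" error.
			return fmt.Errorf("n%d: address not found", id)
		}
		_ = addr // dial addr and issue the per-node RPC here
	}
	return nil
}
```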

@tbg
Member

tbg commented Jun 19, 2018

why use a KV scan on the node status range to establish the list of node IDs when s.gossip.nodeDescs already contains these node IDs?

The KV information is more authoritative, you could theoretically be missing nodes in Gossip (for example if they recently started and the update hasn't propagated to you yet). Probably the code just tried really hard not to miss any nodes, or it was just to make tests not flaky.

I think a reasonable change is

  1. keep using the kv information to determine node IDs
  2. filter the retrieved node ids through s.nodeLiveness and keep only those that are !decommissioning || !dead, where a node is dead when it hasn't heartbeat within storage.TimeUntilStoreDead.

This means that if you have a down/dead/partitioned node, you will try to connect and generate an error row. If you decommission the node, it will vanish because by doing so you promise that the node is in fact offline.
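A hedged sketch of that filtering rule; Liveness here is a simplified stand-in carrying only the two fields the rule needs, not the real liveness record:

```go
package example

import "time"

type NodeID int32

// Liveness is a simplified stand-in for a node's liveness record.
type Liveness struct {
	Decommissioning bool
	LastHeartbeat   time.Time
}

// filterRemovedNodes keeps every node ID retrieved from KV except those that
// are both decommissioning and dead, where "dead" means no heartbeat within
// timeUntilStoreDead. Nodes without a liveness record are kept so that the
// connection attempt (and the resulting error row) still happens.
func filterRemovedNodes(
	kvNodeIDs []NodeID,
	liveness map[NodeID]Liveness,
	now time.Time,
	timeUntilStoreDead time.Duration,
) []NodeID {
	var kept []NodeID
	for _, id := range kvNodeIDs {
		l, ok := liveness[id]
		if !ok {
			kept = append(kept, id)
			continue
		}
		dead := now.Sub(l.LastHeartbeat) > timeUntilStoreDead
		if l.Decommissioning && dead {
			continue // decommissioned: the operator promised the node is offline
		}
		kept = append(kept, id)
	}
	return kept
}
```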

@knz knz requested a review from a team June 19, 2018 18:30
@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from a0a8db3 to 1f06482 on June 20, 2018 15:40
@knz knz changed the title from "sql: make SHOW QUERIES and SHOW SESSIONS report error details" to "server: general correctness fixes for ListSessions / Ranges / upgradeStatus" on Jun 20, 2018
@knz
Contributor Author

knz commented Jun 20, 2018

I have reworked the patch entirely and expanded its scope. PTAL.

@knz
Contributor Author

knz commented Jun 20, 2018

I have written a new test, TestLivenessStatusMap, to check the semantics of the new function, but that test times out. I suspect it is because when I shut down a store it does not register as dead in gossip until 5 minutes have elapsed (the standard time until store dead), which is longer than the test timeout.

How can I decrease the time until store dead in a test in a way that makes sense? @BramGruneir could you make suggestions?

@BramGruneir
Member

By the way, I really like this change.

LGTM if you can get the test working.

As for the test, take a look at TestStoreRangeRemoveDead in pkg/storage/client_raft_test.go. This sets the time until store dead and does the re-gossiping of the alive store manually.
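A rough sketch of the test shape described here, assuming hypothetical helpers (overrideTimeUntilStoreDead, stopStore, livenessStatus) in place of the real test utilities and knobs:

```go
package example

import (
	"testing"
	"time"
)

// Hypothetical helpers; in a real test these would map onto the store-dead
// knob, the test cluster, and the liveness status map.
func overrideTimeUntilStoreDead(t *testing.T, d time.Duration) { t.Helper() }
func stopStore(t *testing.T, nodeID int)                       { t.Helper() }
func livenessStatus(t *testing.T, nodeID int) string           { t.Helper(); return "DEAD" }

func TestLivenessStatusMapSketch(t *testing.T) {
	const downNode = 2

	overrideTimeUntilStoreDead(t, 5*time.Millisecond) // instead of the 5m default
	stopStore(t, downNode)

	// Poll rather than sleep: the status only flips once the shortened
	// store-dead window has elapsed and the surviving nodes have re-gossiped.
	deadline := time.Now().Add(30 * time.Second)
	for livenessStatus(t, downNode) != "DEAD" {
		if time.Now().After(deadline) {
			t.Fatalf("n%d never reported as DEAD", downNode)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```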


Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 1294 at r1 (raw file):

}

// iterateNodes iterates nodeFn over all non-removed nodes concurrently.

This is wonderful. We should use it everywhere else throughout status.


Comments from Reviewable

@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from 6c51b91 to 7e65dde on June 21, 2018 12:29
@knz
Contributor Author

knz commented Jun 21, 2018

I have rewritten the test and it succeeds on my local machine (even stress runs succeed). However:

  1. the test takes 10 seconds to recognize a node is dead, although I have set TimeUntilStoreDead to 5 milliseconds. What else should I change to make the thing complete faster?

  2. it succeeds on my local machine but fails on CI sometimes (I think?). I think the issue is that I want to check that a node is marked as "decommissioning" but when I do so it also proceeds to shut down on its own. My test thus non-deterministically observes that the node is either "decommissioning" or "decommissioned". Is there a way to mark the node as decommissioning but not let it shut down until the test has observed the intermediate status?

  3. Is spawning an entire TestCluster a good approach? I have tried to use the multiTestContext thing used by other tests but to no avail - when I run stopStore() on it the test deadlocks.

@knz
Contributor Author

knz commented Jun 21, 2018

I tweaked the test to remove the race condition / flake. Now the only remaining drawback is that the test takes 10 seconds.

@knz
Contributor Author

knz commented Jun 21, 2018

I am going to merge this now since it fixes an important bug and addresses our show queries shortcoming. Filed #26897 separately to make the test faster.

@knz
Contributor Author

knz commented Jun 21, 2018

bors r+

@knz
Contributor Author

knz commented Jun 21, 2018

bors r-

@knz
Contributor Author

knz commented Jun 25, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why are we passing true here and then discarding decommissioned nodes separately, instead of using the same rule as GetLivenessStatusMap(false)? Any differences between the two are subtle and likely to lead to problems.

As Tobias explained, we may have nodes that have reported their existence in KV but that are not yet known to gossip. These should not "disappear" just because gossip doesn't know about them yet.

What would you suggest to do instead?


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

I think I'd rather have a unit test than an end-to-end test here.

I have tried, trust me, and I have failed. And I have no idea how to do better than what's here (other than using the default tick interval but then the test takes ~11s to run).

I don't mind doing something else here but please guide me and tell me precisely what needs to happen. I have zero time available to investigate this myself at this time. Thanks.


Comments from Reviewable

@bdarnell
Contributor

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 866 at r3 (raw file):

Previously, knz (kena) wrote…

As Tobias explained, we may have nodes that have reported their existence in KV but that are not yet known to gossip. These should not "disappear" just because gossip doesn't know about them yet.

What would you suggest to do instead?

What does "reported their existence in KV" mean? Should NodeLiveness be fetching these nodes from KV as well?

Is it that bad that nodes that are not yet in gossip are not included? Should s.Nodes only include nodes that are present in NodeLiveness gossip?

Basically I want to minimize the number of different pathways that incidentally return subtly different results. If liveness gossip and node status KV are unavoidably different, so be it, but let's not introduce a third hybrid of the two here.


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

Previously, knz (kena) wrote…

I think I'd rather have a unit test than an end-to-end test here.

I have tried, trust me, and I have failed. And I have no idea how to do better than what's here (other than using the default tick interval but then the test takes ~11s to run).

I don't mind doing something else here but please guide me and tell me precisely what needs to happen. I have zero time available to investigate this myself at this time. Thanks.

I don't have any specific suggestions, so as long as this test is stable under stress/stressrace I think we should leave it as is.


Comments from Reviewable

@tbg
Member

tbg commented Jun 26, 2018 via email

@windchan7
Contributor

windchan7 commented Jun 26, 2018

Regarding the upgradeStatus method: if a node is temporarily down, the auto upgrade is not going to happen. This case is tested by the upgrade roachtest as well. It's mainly because upgradeStatus will only return "do upgrade" if the last check (no non-decommissioned dead nodes) passes. There's a more detailed explanation on the issue page. In all other cases, auto upgrade will simply quit without doing anything or keep rechecking these conditions. @benesch Nikhil, please correct me if I'm wrong, thanks.
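For illustration, a minimal sketch of that decision assuming the comprehensive (KV-backed) node map from this PR; the names and statuses are simplified stand-ins, not the real upgradeStatus signature:

```go
package example

type NodeID int32

type LivenessStatus int

const (
	LivenessDecommissioned LivenessStatus = iota
	LivenessUnavailable
	LivenessLive
)

type upgradeDecision int

const (
	doUpgrade upgradeDecision = iota
	waitAndRecheck
)

// upgradeDecisionFor finalizes only when every non-decommissioned node in the
// comprehensive node map is live. A node that exists in KV but is missing from
// gossip must appear in this map (as not live); if it were simply absent, the
// check would pass and the upgrade could be finalized erroneously, which is
// the failure mode this PR addresses.
func upgradeDecisionFor(nodes map[NodeID]LivenessStatus) upgradeDecision {
	for _, status := range nodes {
		switch status {
		case LivenessDecommissioned:
			continue // removed nodes do not block finalization
		case LivenessLive:
			continue
		default:
			return waitAndRecheck // temporarily dead or unknown: hold off
		}
	}
	return doUpgrade
}
```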

@knz
Contributor Author

knz commented Jun 27, 2018

I agree with Ben that my suggestion to try to return "more correct" results here is not worth it. Let's just skip the KV reads.

I fear this is not an option at all because of #26852.

Either you go up to that issue and convince me with a solid argument that skipping the KV reads keeps the version upgrade logic correct, or we'll have to do the KV reads here.

@knz
Contributor Author

knz commented Jun 27, 2018

@windchan7 I have responded to your comment on the linked issue. You're still assuming that the check in upgradeStatus is correct because you're assuming the liveness map is comprehensive. This is not the case, which is precisely what this PR is addressing.

@bdarnell
Contributor

:lgtm:

Got it. Part of my concern here was a misunderstanding on my part - I thought the rest of statusServer was gossip-driven but it was actually KV-driven. I apologize for the extra back-and-forth.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does "reported their existence in KV" mean? Should NodeLiveness be fetching these nodes from KV as well?

Is it that bad that nodes that are not yet in gossip are not included? Should s.Nodes only include nodes that are present in NodeLiveness gossip?

Basically I want to minimize the number of different pathways that incidentally return subtly different results. If liveness gossip and node status KV are unavoidably different, so be it, but let's not introduce a third hybrid of the two here.

Every use of GetLivenessStatusMap is passing true for includeRemovedNodes (except the test of the false case). I suggest removing the argument and making statusServer.NodesWithLiveness the standard way to get a list of non-removed nodes (minimizing direct access to NodeLiveness because it may be incomplete).
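A hypothetical before/after sketch of that suggestion; the real signatures may differ, this only illustrates dropping the boolean knob and funneling callers through the comprehensive helper:

```go
package example

import "context"

type NodeID int32
type LivenessStatus int
type NodeStatusWithLiveness struct{ Status LivenessStatus }

type NodeLiveness struct{}
type statusServer struct{}

// Before (sketch): a boolean decided whether removed (decommissioned) nodes appear.
//   func (nl *NodeLiveness) GetLivenessStatusMap(includeRemovedNodes bool) map[NodeID]LivenessStatus
//
// After (sketch): GetLivenessStatusMap always includes removed nodes, and callers
// that want "all non-removed nodes known to KV" go through NodesWithLiveness.
func (nl *NodeLiveness) GetLivenessStatusMap() map[NodeID]LivenessStatus { return nil }

func (s *statusServer) NodesWithLiveness(ctx context.Context) (map[NodeID]NodeStatusWithLiveness, error) {
	return nil, nil
}
```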


Comments from Reviewable

@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from 1365b89 to af65663 on June 29, 2018 11:16
@knz
Contributor Author

knz commented Jun 29, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Every use of GetLivenessStatusMap is passing true for includeRemovedNodes (except the test of the false case). I suggest removing the argument and making statusServer.NodesWithLiveness the standard way to get a list of non-removed nodes (minimizing direct access to NodeLiveness because it may be incomplete).

Done.


pkg/sql/crdb_internal.go, line 742 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/query ID/query/

Done.


pkg/sql/crdb_internal.go, line 861 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/session ID/active queries/

Done.


Comments from Reviewable

@knz
Contributor Author

knz commented Jun 29, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't have any specific suggestions, so as long as this test is stable under stress/stressrace I think we should leave it as is.

I have run it under stress + stressrace for multiple minutes and have not seen it fail so far.


Comments from Reviewable

@knz
Contributor Author

knz commented Jun 29, 2018

All right merging this, we can do finishing touches in a later PR if need be.

Thanks for all the reviews!

bors r+

@knz
Contributor Author

knz commented Jun 29, 2018

welp, found a stress failure.

bors r-

@craig
Contributor

craig bot commented Jun 29, 2018

Canceled

@knz
Contributor Author

knz commented Jun 29, 2018

All right I have relaxed the timing in the test and that makes the stress happy.

bors r+

craig bot pushed a commit that referenced this pull request Jun 29, 2018
26821: server: general correctness fixes for ListSessions / Ranges / upgradeStatus r=knz a=knz

Prior to this patch, various parts of the code that were attempting to
"iterate over all nodes in the cluster" were confused and would
either:

- skip over nodes that are known to exist in KV but for which gossip
  information was currently unavailable, or
- fail to skip over removed nodes (dead + decommissioned) or
- incorrectly skip over temporarily failed nodes.

This patch comprehensively addresses this class of issues by:

- tweaking `(*NodeLiveness).GetIsLiveMap()` to exclude decommissioned
  nodes upfront.
- tweaking `(*NodeLiveness).GetLivenessStatusMap()` to conditionally
  exclude decommissioned nodes upfront, and changing all callers
  accordingly.
- providing a new method `(*statusServer).NodesWithLiveness()` which
  always includes all nodes known in KV, not just those for which
  gossip information is available.
- ensuring that `(*Server).upgradeStatus()` does not incorrectly skip
  over temporarily dead nodes or nodes yet unknown to gossip by
  using the new `NodesWithLiveness()` method appropriately.
- providing a new method `(*statusServer).iterateNodes()` which does
  "the right thing" via `NodesWithLiveness()` and using it for
  the status RPCs `ListSessions()` and `Range()`.

Additionally this patch makes SHOW QUERIES and SHOW SESSIONS report
error details when it fails to query information from a node.

Release note (bug fix): the server will not finalize a version upgrade
automatically and erroneously if there are nodes temporarily inactive
in the cluster.

Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report
the details of the error upon failing to gather data from other nodes.

Fixes #22863.
Fixes #26852.
Fixes #26897.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig
Contributor

craig bot commented Jun 29, 2018

Build succeeded
