server: general correctness fixes for ListSessions / Ranges / upgradeStatus #26821
Conversation
Any suggestions as to how to force a node failure in a test?
LGTM. If you wanted to test this, why not a unit test? Also, while you're in that area of the code: could you check that the RPC timeouts aren't too aggressive, and that it's aware of decommissioned/dead nodes? I'm seeing these null columns regularly in snippets from production users, even though their clusters are presumably healthy.
the code uses the standard
The way the code currently works is:
This raises a couple of questions for me, given that I am not knowledgeable about these APIs:
The KV information is more authoritative: you could theoretically be missing nodes in gossip (for example, if they recently started and the update hasn't propagated to you yet). Probably the code just tried really hard not to miss any nodes, or it was just to make tests not flaky. I think a reasonable change is
This means that if you have a down/dead/partitioned node, you will try to connect and generate an error row. If you decommission the node, it will vanish, because by doing so you promise that the node is in fact offline.
Force-pushed from 3fdd1cf to fea11f8.
Force-pushed from a0a8db3 to 1f06482.
I have reworked the patch entirely and expanded its scope. PTAL.
Force-pushed from 1f06482 to ed53ee4.
I have written a new test. How can I decrease the time until store dead in a test in a way that makes sense? @BramGruneir, could you make suggestions?
By the way, I really like this change. LGTM if you can get the test working. As for the test, take a look at
Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
pkg/server/status.go, line 1294 at r1 (raw file):
This is wonderful. We should use it everywhere else throughout status.
Comments from Reviewable
Force-pushed from 6c51b91 to 7e65dde.
I have rewritten the test and it succeeds on my local machine (even stress runs succeed). However:
Force-pushed from 7e65dde to 11fbb5a.
I tweaked the test to remove the race condition / flake. Now the only remaining drawback is that the test takes 10 seconds.
I am going to merge this now since it fixes an important bug and addresses our
bors r+
bors r-
Review status: complete! 0 of 0 LGTMs obtained
pkg/server/status.go, line 866 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
As Tobias explained, we may have nodes that have reported their existence in KV but are not yet known to gossip. These should not "disappear" just because gossip doesn't know about them yet. What would you suggest doing instead?
pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):
I have tried, trust me, and I have failed. And I have no idea how to do better than what's here (other than using the default tick interval, but then the test takes ~11s to run). I don't mind doing something else here, but please guide me and tell me precisely what needs to happen. I have zero time available to investigate this myself at this time. Thanks.
Review status: complete! 0 of 0 LGTMs obtained
pkg/server/status.go, line 866 at r3 (raw file): Previously, knz (kena) wrote…
What does "reported their existence in KV" mean? Should NodeLiveness be fetching these nodes from KV as well? Is it that bad that nodes that are not yet in gossip are not included? Should s.Nodes only include nodes that are present in NodeLiveness gossip? Basically, I want to minimize the number of different pathways that incidentally return subtly different results. If liveness gossip and node status KV are unavoidably different, so be it, but let's not introduce a third hybrid of the two here.
pkg/storage/node_liveness_test.go, line 781 at r3 (raw file): Previously, knz (kena) wrote…
I don't have any specific suggestions, so as long as this test is stable under stress/stressrace I think we should leave it as is.
I agree with Ben that my suggestion to try to return "more correct" results here is not worth it. Let's just skip the KV reads.
Regarding the
I fear this is not an option at all because of #26852. Either you go up to that issue and convince me, with a solid argument, that skipping the KV reads keeps the version upgrade logic correct, or we'll have to do the KV reads here.
@windchan7 I have responded to your comment on the linked issue. You're still assuming that the check in
Got it. Part of my concern here was a misunderstanding on my part: I thought the rest of
Review status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/server/status.go, line 866 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Every use of GetLivenessStatusMap is passing
Force-pushed from 1365b89 to af65663.
Review status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/server/status.go, line 866 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
pkg/sql/crdb_internal.go, line 742 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
pkg/sql/crdb_internal.go, line 861 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Review status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/node_liveness_test.go, line 781 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I have run it under stress + stressrace for multiple minutes and have not seen it fail so far.
All right, merging this; we can do finishing touches in a later PR if need be. Thanks for all the reviews! bors r+
welp, found a stress failure. bors r-
Canceled
…Version
Prior to this patch, various parts of the code that were attempting to "iterate over all nodes in the cluster" were confused and would either:
- skip over nodes that are known to exist in KV but for which gossip information was currently unavailable, or
- fail to skip over removed nodes (dead + decommissioned), or
- incorrectly skip over temporarily failed nodes.
This patch comprehensively addresses this class of issues by:
- tweaking `(*NodeLiveness).GetIsLiveMap()` to exclude decommissioned nodes upfront.
- documenting `(*NodeLiveness).GetLivenessStatusMap()` better, to mention it includes decommissioned nodes and does not include all nodes known to KV.
- providing a new method `(*statusServer).NodesWithLiveness()` which always includes all nodes known in KV, not just those for which gossip information is available.
- ensuring that `(*Server).upgradeStatus()` does not incorrectly skip over temporarily dead nodes or nodes yet unknown to gossip by using the new `NodesWithLiveness()` method appropriately.
- providing a new method `(*statusServer).iterateNodes()` which does "the right thing" via `NodesWithLiveness()` and using it for the status RPCs `ListSessions()` and `Range()`.
Additionally, this patch makes SHOW QUERIES and SHOW SESSIONS report error details when they fail to query information from a node.
Release note (bug fix): the server will not finalize a version upgrade automatically and erroneously if there are nodes temporarily inactive in the cluster.
Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report the details of the error upon failing to gather data from other nodes.
Force-pushed from af65663 to 1ab9cb5.
All right, I have relaxed the timing in the test and that makes stress happy. bors r+
26821: server: general correctness fixes for ListSessions / Ranges / upgradeStatus r=knz a=knz
Prior to this patch, various parts of the code that were attempting to "iterate over all nodes in the cluster" were confused and would either:
- skip over nodes that are known to exist in KV but for which gossip information was currently unavailable, or
- fail to skip over removed nodes (dead + decommissioned), or
- incorrectly skip over temporarily failed nodes.
This patch comprehensively addresses this class of issues by:
- tweaking `(*NodeLiveness).GetIsLiveMap()` to exclude decommissioned nodes upfront.
- tweaking `(*NodeLiveness).GetLivenessStatusMap()` to conditionally exclude decommissioned nodes upfront, and changing all callers accordingly.
- providing a new method `(*statusServer).NodesWithLiveness()` which always includes all nodes known in KV, not just those for which gossip information is available.
- ensuring that `(*Server).upgradeStatus()` does not incorrectly skip over temporarily dead nodes or nodes yet unknown to gossip by using the new `NodesWithLiveness()` method appropriately.
- providing a new method `(*statusServer).iterateNodes()` which does "the right thing" via `NodesWithLiveness()` and using it for the status RPCs `ListSessions()` and `Range()`.
Additionally, this patch makes SHOW QUERIES and SHOW SESSIONS report error details when they fail to query information from a node.
Release note (bug fix): the server will not finalize a version upgrade automatically and erroneously if there are nodes temporarily inactive in the cluster.
Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report the details of the error upon failing to gather data from other nodes.
Fixes #22863. Fixes #26852. Fixes #26897.
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
Prior to this patch, various parts of the code that were attempting to "iterate over all nodes in the cluster" were confused and would either:
- skip over nodes that are known to exist in KV but for which gossip information was currently unavailable, or
- fail to skip over removed nodes (dead + decommissioned), or
- incorrectly skip over temporarily failed nodes.
This patch comprehensively addresses this class of issues by:
- tweaking `(*NodeLiveness).GetIsLiveMap()` to exclude decommissioned nodes upfront.
- documenting `(*NodeLiveness).GetLivenessStatusMap()` better, to mention it includes decommissioned nodes and does not include all nodes known to KV.
- providing a new method `(*statusServer).NodesWithLiveness()` which always includes all nodes known in KV, not just those for which gossip information is available.
- ensuring that `(*Server).upgradeStatus()` does not incorrectly skip over temporarily dead nodes or nodes yet unknown to gossip by using the new `NodesWithLiveness()` method appropriately.
- providing a new method `(*statusServer).iterateNodes()` which does "the right thing" via `NodesWithLiveness()` and using it for the status RPCs `ListSessions()` and `Range()`.
Additionally, this patch makes SHOW QUERIES and SHOW SESSIONS report error details when they fail to query information from a node.
Release note (bug fix): the server will not finalize a version upgrade automatically and erroneously if there are nodes temporarily inactive in the cluster.
Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report the details of the error upon failing to gather data from other nodes.
Fixes #22863.
Fixes #26852.
Fixes #26897.