server: general correctness fixes for ListSessions / Ranges / upgradeStatus #26821

Merged: 1 commit, Jun 29, 2018

Conversation

knz
Contributor

@knz knz commented Jun 19, 2018

Prior to this patch, various parts of the code that were attempting to
"iterate over all nodes in the cluster" were confused and would
either:

  • skip over nodes that are known to exist in KV but for which gossip
    information was currently unavailable, or
  • fail to skip over removed nodes (dead + decommissioned) or
  • incorrectly skip over temporarily failed nodes.

This patch comprehensively addresses this class of issues by:

  • tweaking (*NodeLiveness).GetIsLiveMap() to exclude decommissioned
    nodes upfront.
  • documenting (*NodeLiveness).GetLivenessStatusMap() better, to
    mention it includes decommissioned nodes and does not include
    all nodes known to KV.
  • providing a new method (*statusServer).NodesWithLiveness() which
    always includes all nodes known in KV, not just those for which
    gossip information is available.
  • ensuring that (*Server).upgradeStatus() does not incorrectly skip
    over temporarily dead nodes or nodes yet unknown to gossip by
    using the new NodesWithLiveness() method appropriately.
  • providing a new method (*statusServer).iterateNodes() which does
    "the right thing" via NodesWithLiveness() and using it for
    the status RPCs ListSessions() and Range().

Additionally this patch makes SHOW QUERIES and SHOW SESSIONS report
error details when it fails to query information from a node.

Release note (bug fix): the server will not finalize a version upgrade
automatically and erroneously if there are nodes temporarily inactive
in the cluster.

Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report
the details of the error upon failing to gather data from other nodes.

Fixes #22863.
Fixes #26852.
Fixes #26897.
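For readers less familiar with the server internals, here is a minimal, self-contained sketch of the iteration model the bullets above describe: the node list comes from KV, liveness information is layered on top (falling back to an "unavailable" status when gossip has nothing yet), and only decommissioned nodes are excluded. The types and helper names below (`NodeID`, `LivenessStatus`, `nodesWithLiveness`, `iterateNodes`) are simplified stand-ins for illustration, not the actual CockroachDB signatures.

```go
package example

import (
	"context"
	"fmt"
	"sync"
)

type NodeID int32

type LivenessStatus int

const (
	LivenessDecommissioned LivenessStatus = iota
	LivenessUnavailable
	LivenessLive
)

// nodesWithLiveness pairs every node known to KV with a liveness status,
// falling back to "unavailable" when gossip has no record for the node yet,
// so that no node silently drops out of the iteration.
func nodesWithLiveness(kvNodeIDs []NodeID, gossip map[NodeID]LivenessStatus) map[NodeID]LivenessStatus {
	out := make(map[NodeID]LivenessStatus, len(kvNodeIDs))
	for _, id := range kvNodeIDs {
		status, ok := gossip[id]
		if !ok {
			status = LivenessUnavailable // known to KV, not (yet) to gossip
		}
		out[id] = status
	}
	return out
}

// iterateNodes runs nodeFn concurrently on every non-removed node and
// collects per-node errors instead of silently skipping failures.
func iterateNodes(
	ctx context.Context,
	nodes map[NodeID]LivenessStatus,
	nodeFn func(context.Context, NodeID) error,
) map[NodeID]error {
	var (
		mu   sync.Mutex
		errs = make(map[NodeID]error)
		wg   sync.WaitGroup
	)
	for id, status := range nodes {
		if status == LivenessDecommissioned {
			continue // removed nodes are the only ones excluded
		}
		wg.Add(1)
		go func(id NodeID) {
			defer wg.Done()
			if err := nodeFn(ctx, id); err != nil {
				mu.Lock()
				errs[id] = fmt.Errorf("n%d: %w", id, err)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return errs
}
```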

@knz knz requested a review from tbg June 19, 2018 13:02
@knz knz requested review from a team June 19, 2018 13:02
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented Jun 19, 2018

Any suggestion as to how to force a node failure in a test?

@tbg
Member

tbg commented Jun 19, 2018

LGTM

If you wanted to test this, why not a unit test?

Also, while you're in that area of the code -- could you check that the RPC timeouts aren't too aggressive, and that it's aware of decommissioned/dead nodes? I'm seeing these null columns regularly in snippets from production users, even though their clusters are presumably healthy.

@knz
Contributor Author

knz commented Jun 19, 2018

could you check that the RPC timeouts aren't too aggressive

the code uses the standard base.NetworkTimeout, which is currently a constant set to 3s. Is this appropriate?

and that it's aware of decommissioned/dead nodes?

The way the code currently works is:

  1. issues a KV scan on the node status range to determine node IDs -- the other fields of the node descriptors are ignored at that point
  2. for each node ID, it uses dialNode which in turn checks gossip to determine whether the node is alive. For every node that's not alive according to gossip, an "address not found" error is generated.

This raises a couple of questions for me, given that I am not very familiar with these APIs:

  • why use a KV scan on the node status range to establish the list of node IDs when s.gossip.nodeDescs already contains these node IDs?
  • is it possible to have a node ID in the KV range but not in gossip? (So that the nodeID-to-address resolution fails although the node is known in KV?)
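A compact sketch of the two-step flow described above, to make the failure mode concrete; kvScanNodeIDs and gossipAddr are hypothetical stand-ins for the KV scan and the gossip lookup, not real CockroachDB APIs:

```go
package example

import (
	"context"
	"fmt"
)

type NodeID int32

// Illustrative stubs: the real code scans the node status range in KV and
// resolves node addresses through gossip.
func kvScanNodeIDs(ctx context.Context) ([]NodeID, error) { return []NodeID{1, 2, 3}, nil }
func gossipAddr(id NodeID) (string, bool)                 { return "", false }

func fanOutToAllNodes(ctx context.Context) error {
	nodeIDs, err := kvScanNodeIDs(ctx) // step 1: node IDs come from KV
	if err != nil {
		return err
	}
	for _, id := range nodeIDs {
		addr, ok := gossipAddr(id) // step 2: address resolution via gossip
		if !ok {
			// A node known to KV but not (yet) present in gossip surfaces
			// here as an "address not found" error.
			return fmt.Errorf("n%d: address not found", id)
		}
		_ = addr // dial addr and issue the per-node RPC here
	}
	return nil
}
```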

@tbg
Member

tbg commented Jun 19, 2018

why use a KV scan on the node status range to establish the list of node IDs when s.gossip.nodeDescs already contains these node IDs?

The KV information is more authoritative, you could theoretically be missing nodes in Gossip (for example if they recently started and the update hasn't propagated to you yet). Probably the code just tried really hard not to miss any nodes, or it was just to make tests not flaky.

I think a reasonable change is

  1. keep using the kv information to determine node IDs
  2. filter the retrieved node ids through s.nodeLiveness and keep only those that are !decommissioning || !dead, where a node is dead when it hasn't heartbeat within storage.TimeUntilStoreDead.

This means that if you have a down/dead/partitioned node, you will try to connect and generate an error row. If you decommission the node, it will vanish because by doing so you promise that the node is in fact offline.
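A hedged sketch of that filtering rule; Liveness here is a simplified stand-in carrying only the two fields the rule needs, not the real liveness record:

```go
package example

import "time"

type NodeID int32

// Liveness is a simplified stand-in for a node's liveness record.
type Liveness struct {
	Decommissioning bool
	LastHeartbeat   time.Time
}

// filterRemovedNodes keeps every node ID retrieved from KV except those that
// are both decommissioning and dead, where "dead" means no heartbeat within
// timeUntilStoreDead. Nodes without a liveness record are kept so that the
// connection attempt (and the resulting error row) still happens.
func filterRemovedNodes(
	kvNodeIDs []NodeID,
	liveness map[NodeID]Liveness,
	now time.Time,
	timeUntilStoreDead time.Duration,
) []NodeID {
	var kept []NodeID
	for _, id := range kvNodeIDs {
		l, ok := liveness[id]
		if !ok {
			kept = append(kept, id)
			continue
		}
		dead := now.Sub(l.LastHeartbeat) > timeUntilStoreDead
		if l.Decommissioning && dead {
			continue // decommissioned: the operator promised the node is offline
		}
		kept = append(kept, id)
	}
	return kept
}
```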

@knz knz requested a review from a team June 19, 2018 18:30
@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from a0a8db3 to 1f06482 on June 20, 2018 15:40
@knz knz changed the title from "sql: make SHOW QUERIES and SHOW SESSIONS report error details" to "server: general correctness fixes for ListSessions / Ranges / upgradeStatus" on Jun 20, 2018
@knz
Contributor Author

knz commented Jun 20, 2018

I have reworked the patch entirely and expanded its scope. PTAL.

@knz
Contributor Author

knz commented Jun 20, 2018

I have written a new test, TestLivenessStatusMap, to check the semantics of the new function, but that test times out. I suspect it is because when I shut down a store it does not register as dead in gossip until 5 minutes have elapsed (the standard time until store dead), which is longer than the test timeout.

How can I decrease the time until store dead in a test in a way that makes sense? @BramGruneir could you make suggestions?

@BramGruneir
Member

By the way, I really like this change.

LGTM if you can get the test working.

As for the test, take a look at TestStoreRangeRemoveDead in pkg/storage/client_raft_test.go. This sets the time until store dead and does the re-gossiping of the alive store manually.
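A rough sketch of the test shape described here, assuming hypothetical helpers (overrideTimeUntilStoreDead, stopStore, livenessStatus) in place of the real test utilities and knobs:

```go
package example

import (
	"testing"
	"time"
)

// Hypothetical helpers; in a real test these would map onto the store-dead
// knob, the test cluster, and the liveness status map.
func overrideTimeUntilStoreDead(t *testing.T, d time.Duration) { t.Helper() }
func stopStore(t *testing.T, nodeID int)                       { t.Helper() }
func livenessStatus(t *testing.T, nodeID int) string           { t.Helper(); return "DEAD" }

func TestLivenessStatusMapSketch(t *testing.T) {
	const downNode = 2

	overrideTimeUntilStoreDead(t, 5*time.Millisecond) // instead of the 5m default
	stopStore(t, downNode)

	// Poll rather than sleep: the status only flips once the shortened
	// store-dead window has elapsed and the surviving nodes have re-gossiped.
	deadline := time.Now().Add(30 * time.Second)
	for livenessStatus(t, downNode) != "DEAD" {
		if time.Now().After(deadline) {
			t.Fatalf("n%d never reported as DEAD", downNode)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```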


Reviewed 6 of 6 files at r1, 3 of 3 files at r2.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 1294 at r1 (raw file):

}

// iterateNodes iterates nodeFn over all non-removed nodes concurrently.

This is wonderful. We should use it everywhere else throughout status.


Comments from Reviewable

@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from 6c51b91 to 7e65dde on June 21, 2018 12:29
@knz
Contributor Author

knz commented Jun 21, 2018

I have rewritten the test and it succeeds on my local machine (even stress runs succeed). However:

  1. the test takes 10 seconds to recognize a node is dead, although I have set TimeUntilStoreDead to 5 milliseconds. What else should I change to make the thing complete faster?

  2. it succeeds on my local machine but fails on CI sometimes (I think?). I think the issue is that I want to check that a node is marked as "decommissioning" but when I do so it also proceeds to shut down on its own. My test thus non-deterministically observes that the node is either "decommissioning" or "decommissioned". Is there a way to mark the node as decommissioning but not let it shut down until the test has observed the intermediate status?

  3. Is spawning an entire TestCluster a good approach? I have tried to use the multiTestContext thing used by other tests but to no avail - when I run stopStore() on it the test deadlocks.

@knz
Contributor Author

knz commented Jun 21, 2018

I tweaked the test to remove the race condition / flake. Now the only remaining drawback is that the test takes 10 seconds.

@knz
Contributor Author

knz commented Jun 21, 2018

I am going to merge this now since it fixes an important bug and addresses our show queries shortcoming. Filed #26897 separately to make the test faster.

@knz
Contributor Author

knz commented Jun 21, 2018

bors r+

@knz
Contributor Author

knz commented Jun 21, 2018

bors r-

@knz
Contributor Author

knz commented Jun 25, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why are we passing true here and then discarding decommissioned nodes separately, instead of using the same rule as GetLivenessStatusMap(false)? Any differences between the two are subtle and likely to lead to problems.

As Tobias explained, we may have nodes that have reported their existence in KV but that are not yet known to gossip. These should not "disappear" just because gossip doesn't know about them yet.

What would you suggest to do instead?


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

I think I'd rather have a unit test than an end-to-end test here.

I have tried, trust me, and I have failed. And I have no idea how to do better than what's here (other than using the default tick interval but then the test takes ~11s to run).

I don't mind doing something else here but please guide me and tell me precisely what needs to happen. I have zero time available to investigate this myself at this time. Thanks.


Comments from Reviewable

@bdarnell
Contributor

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status.go, line 866 at r3 (raw file):

Previously, knz (kena) wrote…

As Tobias explained, we may have nodes that have reported their existence in KV but that are not yet known to gossip. These should not "disappear" just because gossip doesn't know about them yet.

What would you suggest to do instead?

What does "reported their existence in KV" mean? Should NodeLiveness be fetching these nodes from KV as well?

Is it that bad that nodes that are not yet in gossip are not included? Should s.Nodes only include nodes that are present in NodeLiveness gossip?

Basically I want to minimize the number of different pathways that incidentally return subtly different results. If liveness gossip and node status KV are unavoidably different, so be it, but let's not introduce a third hybrid of the two here.


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

Previously, knz (kena) wrote…

I think I'd rather have a unit test than an end-to-end test here.

I have tried, trust me, and I have failed. And I have no idea how to do better than what's here (other than using the default tick interval but then the test takes ~11s to run).

I don't mind doing something else here but please guide me and tell me precisely what needs to happen. I have zero time available to investigate this myself at this time. Thanks.

I don't have any specific suggestions, so as long as this test is stable under stress/stressrace I think we should leave it as is.


Comments from Reviewable

@tbg
Member

tbg commented Jun 26, 2018 via email

@windchan7
Contributor

windchan7 commented Jun 26, 2018

Regarding the upgradeStatus method: if a node is temporarily down, the auto upgrade is not going to happen. This case is tested by the upgrade roachtest as well. It's mainly because upgradeStatus will only return "do upgrade" if the last check (no non-decommissioned dead nodes) passes. There's a more detailed explanation on the issue page. In all other cases, auto upgrade will simply quit without doing anything or keep rechecking these conditions. @benesch Nikhil, please correct me if I'm wrong, thanks.
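For illustration, a minimal sketch of that decision assuming the comprehensive (KV-backed) node map from this PR; the names and statuses are simplified stand-ins, not the real upgradeStatus signature:

```go
package example

type NodeID int32

type LivenessStatus int

const (
	LivenessDecommissioned LivenessStatus = iota
	LivenessUnavailable
	LivenessLive
)

type upgradeDecision int

const (
	doUpgrade upgradeDecision = iota
	waitAndRecheck
)

// upgradeDecisionFor finalizes only when every non-decommissioned node in the
// comprehensive node map is live. A node that exists in KV but is missing from
// gossip must appear in this map (as not live); if it were simply absent, the
// check would pass and the upgrade could be finalized erroneously, which is
// the failure mode this PR addresses.
func upgradeDecisionFor(nodes map[NodeID]LivenessStatus) upgradeDecision {
	for _, status := range nodes {
		switch status {
		case LivenessDecommissioned:
			continue // removed nodes do not block finalization
		case LivenessLive:
			continue
		default:
			return waitAndRecheck // temporarily dead or unknown: hold off
		}
	}
	return doUpgrade
}
```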

@knz
Contributor Author

knz commented Jun 27, 2018

I agree with Ben that my suggestion to try to return "more correct" results here is not worth it. Let's just skip the KV reads.

I fear this is not an option at all because of #26852.

Either you go up to that issue and convince me with a solid argument that skipping the KV reads keeps the version upgrade logic correct, or we'll have to do the KV reads here.

@knz
Contributor Author

knz commented Jun 27, 2018

@windchan7 I have responded to your comment on the linked issue. You're still assuming that the check in upgradeStatus is correct because you're assuming the liveness map is comprehensive. This is not the case, which is precisely what this PR is addressing.

@bdarnell
Contributor

:lgtm:

Got it. Part of my concern here was a misunderstanding on my part - I thought the rest of statusServer was gossip-driven but it was actually KV-driven. I apologize for the extra back-and-forth.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does "reported their existence in KV" mean? Should NodeLiveness be fetching these nodes from KV as well?

Is it that bad that nodes that are not yet in gossip are not included? Should s.Nodes only include nodes that are present in NodeLiveness gossip?

Basically I want to minimize the number of different pathways that incidentally return subtly different results. If liveness gossip and node status KV are unavoidably different, so be it, but let's not introduce a third hybrid of the two here.

Every use of GetLivenessStatusMap is passing true for includeRemovedNodes (except the test of the false case). I suggest removing the argument and making statusServer.NodesWithLiveness the standard way to get a list of non-removed nodes (minimizing direct access to NodeLiveness because it may be incomplete).
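A hypothetical before/after sketch of that suggestion; the real signatures may differ, this only illustrates dropping the boolean knob and funneling callers through the comprehensive helper:

```go
package example

import "context"

type NodeID int32
type LivenessStatus int
type NodeStatusWithLiveness struct{ Status LivenessStatus }

type NodeLiveness struct{}
type statusServer struct{}

// Before (sketch): a boolean decided whether removed (decommissioned) nodes appear.
//   func (nl *NodeLiveness) GetLivenessStatusMap(includeRemovedNodes bool) map[NodeID]LivenessStatus
//
// After (sketch): GetLivenessStatusMap always includes removed nodes, and callers
// that want "all non-removed nodes known to KV" go through NodesWithLiveness.
func (nl *NodeLiveness) GetLivenessStatusMap() map[NodeID]LivenessStatus { return nil }

func (s *statusServer) NodesWithLiveness(ctx context.Context) (map[NodeID]NodeStatusWithLiveness, error) {
	return nil, nil
}
```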


Comments from Reviewable

@knz knz force-pushed the 20180619-show-queries branch 2 times, most recently from 1365b89 to af65663 on June 29, 2018 11:16
@knz
Contributor Author

knz commented Jun 29, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/status.go, line 866 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Every use of GetLivenessStatusMap is passing true for includeRemovedNodes (except the test of the false case). I suggest removing the argument and making statusServer.NodesWithLiveness the standard way to get a list of non-removed nodes (minimizing direct access to NodeLiveness because it may be incomplete).

Done.


pkg/sql/crdb_internal.go, line 742 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/query ID/query/

Done.


pkg/sql/crdb_internal.go, line 861 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/session ID/active queries/

Done.


Comments from Reviewable

@knz
Contributor Author

knz commented Jun 29, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/node_liveness_test.go, line 781 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't have any specific suggestions, so as long as this test is stable under stress/stressrace I think we should leave it as is.

I have run it under stress + stressrace for multiple minutes and have not seen it fail so far.


Comments from Reviewable

@knz
Contributor Author

knz commented Jun 29, 2018

All right merging this, we can do finishing touches in a later PR if need be.

Thanks for all the reviews!

bors r+

@knz
Contributor Author

knz commented Jun 29, 2018

welp, found a stress failure.

bors r-

@craig
Contributor

craig bot commented Jun 29, 2018

Canceled

@knz
Contributor Author

knz commented Jun 29, 2018

All right I have relaxed the timing in the test and that makes the stress happy.

bors r+

craig bot pushed a commit that referenced this pull request Jun 29, 2018
26821: server: general correctness fixes for ListSessions / Ranges / upgradeStatus r=knz a=knz

Prior to this patch, various parts of the code that were attempting to
"iterate over all nodes in the cluster" were confused and would
either:

- skip over nodes that are known to exist in KV but for which gossip
  information was currently unavailable, or
- fail to skip over removed nodes (dead + decommissioned) or
- incorrectly skip over temporarily failed nodes.

This patch comprehensively addresses this class of issues by:

- tweaking `(*NodeLiveness).GetIsLiveMap()` to exclude decommissioned
  nodes upfront.
- tweaking `(*NodeLiveness).GetLivenessStatusMap()` to conditionally
  exclude decommissioned nodes upfront, and changing all callers
  accordingly.
- providing a new method `(*statusServer).NodesWithLiveness()` which
  always includes all nodes known in KV, not just those for which
  gossip information is available.
- ensuring that `(*Server).upgradeStatus()` does not incorrectly skip
  over temporarily dead nodes or nodes yet unknown to gossip by
  using the new `NodesWithLiveness()` method appropriately.
- providing a new method `(*statusServer).iterateNodes()` which does
  "the right thing" via `NodesWithLiveness()` and using it for
  the status RPCs `ListSessions()` and `Range()`.

Additionally this patch makes SHOW QUERIES and SHOW SESSIONS report
error details when it fails to query information from a node.

Release note (bug fix): the server will not finalize a version upgrade
automatically and erroneously if there are nodes temporarily inactive
in the cluster.

Release note (sql change): SHOW CLUSTER QUERIES/SESSIONS now report
the details of the error upon failing to gather data from other nodes.

Fixes #22863.
Fixes #26852.
Fixes #26897.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig
Contributor

craig bot commented Jun 29, 2018

Build succeeded
