QPS Drops to Zero during rolling upgrade #22424

Closed
nstewart opened this issue Feb 6, 2018 · 23 comments

@nstewart
Contributor

nstewart commented Feb 6, 2018

Running the kv workload on a 6-node k8s cluster (5 CockroachDB nodes). I triggered a rolling update from 1.1.5 to v2.0-alpha.20180129 and saw QPS temporarily drop to zero.

   6m53s        0          861.8          626.5      2.5   3892.3   4026.5
   6m54s        0          560.9          626.3    226.5    335.5    352.3
   6m55s        0           25.0          624.9    436.2    436.2    436.2
   6m56s        0            0.0          623.4      0.0      0.0      0.0
   6m57s        0            0.0          621.9      0.0      0.0      0.0
   6m58s        0            0.0          620.4      0.0      0.0      0.0
   6m59s        0            0.0          618.9      0.0      0.0      0.0
    7m0s        0            0.0          617.4      0.0      0.0      0.0
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
    7m1s        0            0.0          616.0      0.0      0.0      0.0
    7m2s        0            0.0          614.5      0.0      0.0      0.0
    7m3s        0            0.0          613.1      0.0      0.0      0.0
    7m4s        0            0.0          611.6      0.0      0.0      0.0
    7m5s        0            0.0          610.2      0.0      0.0      0.0
    7m6s        0            0.0          608.8      0.0      0.0      0.0
    7m7s        0            0.0          607.3      0.0      0.0      0.0
    7m8s        0            0.0          605.9      0.0      0.0      0.0
    7m9s        0            0.0          604.5      0.0      0.0      0.0
   7m10s        0            0.0          603.1      0.0      0.0      0.0
   7m11s        0            0.0          601.7      0.0      0.0      0.0
   7m12s        0            0.0          600.3      0.0      0.0      0.0

I also saw pq: server is not accepting clients once, which incremented the error count to 1.

(screenshot omitted)

I'm using the k8s internal load balancer, which checks /health on nodes every 5 seconds.
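
For context, the load balancer side amounts to roughly the following (an illustrative Go sketch, not the actual Kubernetes probe code; the failure threshold and helper names are assumptions):

package lbsketch

import (
	"net/http"
	"time"
)

// pollHealth models a load balancer probing a node's /health endpoint every
// 5 seconds and dropping the backend after a few consecutive failures. The
// threshold of 3 is an assumed value, not taken from the k8s configuration.
func pollHealth(nodeAddr string, removeBackend func()) {
	const (
		period    = 5 * time.Second
		threshold = 3
	)
	failures := 0
	for range time.Tick(period) {
		resp, err := http.Get("http://" + nodeAddr + "/health")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			failures = 0
			continue
		}
		if err == nil {
			resp.Body.Close()
		}
		failures++
		if failures >= threshold {
			removeBackend() // stop routing new connections to this node
			return
		}
	}
}

With a 5s period and a small failure threshold like this, a draining node has to keep failing /health for roughly period x threshold before the LB stops sending it new connections, which is what the grace-period discussion below is about.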

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Note from @bdarnell re: the server not accepting clients:

our health endpoint may need to be updated to support our draining state, or we need to augment the draining protocol to wait long enough for the LB's health checks to occur before going into this draining state

@rjnn
Contributor

rjnn commented Feb 6, 2018

Pinging @jordanlewis for triage. It would be great to get a fix in before 2.0, and also cherry-pick it into 1.1, so that we can perform a zero-downtime rolling upgrade.

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Also including latencies if that's helpful:
(latency graph screenshot omitted)

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Thanks @arjunravinarayan, I'll hold off on recording a demo until we have a fix or workaround.

@petermattis
Collaborator

@nstewart Can you provide step-by-step instructions for what you did? Doing so will make it easier for an engineer to reproduce and fix.

@a-robinson
Contributor

For what it's worth, I'd put like 95% odds on this being an issue with the Kubernetes configuration, not with Cockroach.

@bdarnell
Contributor

bdarnell commented Feb 6, 2018

I think it's in the code: we're setting the flag on the pgwire server to refuse connections at the beginning of the draining process, instead of waiting for a period during which we fail health checks but still serve new connections.

The process should look something like this:

  1. Health checks fail, but nothing else changes. Stay in this state long enough for load balancers to notice. (I've generally used 30s for this phase)
  2. Begin returning errors on new connections and close existing connections as they become idle. I'd stay in this state for ~10s
  3. Transfer leases away. Exit this state as soon as our lease count reaches zero (also set an upper limit - don't stay in this state longer than about a minute)
  4. Close rocksdb and terminate the process.
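
Roughly, in Go terms, the sequence would be something like this (a sketch only; the helper names and the drainer interface are placeholders, not the actual server API, and the durations come straight from the steps above):

package drainsketch

import (
	"os"
	"time"
)

// drainer is a hypothetical view of the server; none of these methods exist
// under these names, they just mirror the four steps above.
type drainer interface {
	SetHealthCheckFailing(bool) // step 1: /health starts returning errors
	RejectNewSQLConns(bool)     // step 2: pgwire refuses new connections
	CloseIdleSQLConns()         //         existing conns closed as they idle
	LeaseCount() int            // step 3: leases still held by this node
	TransferLeases()            //         move a batch of leases elsewhere
	CloseEngine()               // step 4: shut down RocksDB
}

func drain(s drainer) {
	// 1. Keep serving, but fail health checks long enough for LBs to notice.
	s.SetHealthCheckFailing(true)
	time.Sleep(30 * time.Second)

	// 2. Stop accepting new SQL connections; give existing ones ~10s to drain.
	s.RejectNewSQLConns(true)
	s.CloseIdleSQLConns()
	time.Sleep(10 * time.Second)

	// 3. Transfer leases away, with an upper bound of about a minute.
	deadline := time.Now().Add(time.Minute)
	for s.LeaseCount() > 0 && time.Now().Before(deadline) {
		s.TransferLeases()
		time.Sleep(100 * time.Millisecond)
	}

	// 4. Close storage and terminate the process.
	s.CloseEngine()
	os.Exit(0)
}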

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

@petermattis detailed repro:

  1. Create a 5-node m4.xl cluster with the 'stable' release and the KV workload, using the CloudFormation template: https://amzn.to/2CZjJLZ
  2. SSH into the machine (proxy command is in the outputs section of the template)
  3. sudo nano /tmp/cockroachdb-statefulset.yaml and update image to image: cockroachdb/cockroach-unstable:latest
  4. kubectl apply -f /tmp/cockroachdb-statefulset.yaml

You'll see k8s start the rolling update.

You can check logs with kubectl logs [kv-podid] -f

The admin UI link is also in the CloudFormation template's outputs section. Note that when I reproduced this time, QPS still went to zero, but not for long enough to register in the admin UI. I still saw the 99th-percentile latency spike in the graph and the refused-connection error in the logs, though.

The initial StatefulSet config is here: https://github.com/cockroachdb/cockroachdb-cloudformation/blob/master/scripts/cockroachdb-statefulset.yaml though some fields get modified based on the template parameters you use (they don't change the load balancer settings, though).

@petermattis
Collaborator

@asubiotto Can you take a look at this tomorrow? Is Ben's diagnosis correct, or is something else going on? Should be easy to reproduce with Nate's instructions.

@asubiotto
Contributor

Ben's diagnosis is correct given the configuration; the main issue is that we don't report a node as unavailable through the Health endpoint when it is draining. The idea I had when writing the draining code was that the health check would be performed as a SQL-level check through the pgwire server. We might be able to change the Kubernetes configuration to do this check, but I am not certain this is an option.

Regardless, I think it would be good to leave a grace period in which we accept new clients but health checks fail. My biggest question is: is it correct to return nil, error from adminServer.Health when a node is unavailable or decommissioning? Will this have any unwanted effects on the admin UI? If not, I'll make the change to make draining a bit more graceful.
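
A rough sketch of what that change could look like (the types, fields, and the draining hook here are assumptions for illustration, not the actual adminServer code):

package healthsketch

import (
	"context"
	"errors"
)

// HealthRequest and HealthResponse stand in for the real request/response
// types; the draining callback is likewise an assumption about where that
// state would come from.
type HealthRequest struct{}
type HealthResponse struct{}

type adminServer struct {
	draining func() bool // reports whether this node has started draining
}

// Health returns a nil response and an error while the node is draining, so
// that load balancers polling this endpoint stop sending new connections
// during the grace period described above.
func (s *adminServer) Health(
	ctx context.Context, req *HealthRequest,
) (*HealthResponse, error) {
	if s.draining() {
		return nil, errors.New("node is draining")
	}
	return &HealthResponse{}, nil
}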

@a-robinson
Contributor

The idea I had when writing the draining code was that the health check would be performed as a SQL-level check through the pgwire server. We might be able to change the Kubernetes configuration to do this check, but I am not certain this is an option.

Yes, that could be done.

@asubiotto
Contributor

I think I confused myself. The load balancer/draining behavior above is a problem that I will fix, but it's odd that only an error or two is reported during the rolling update due to connection attempts to a draining node; it seems that the queries are hanging somehow. I don't think it's a lease issue, since all nodes seem to drain properly. I'm going to take a closer look at this.

@asubiotto
Contributor

Status update:

I can reproduce this easily during a rolling upgrade. My initial thought was that, since we don't move ranges off of draining nodes, we could have a situation in which two consecutively restarted nodes that are part of the same raft groups are unable to service raft requests. This was based on seeing messages like this in the logs:

W180209 19:39:43.147945 140 vendor/github.com/coreos/etcd/raft/raft.go:825 [n1,s1,r10/1:/Table/1{3-4}] 1 stepped down to follower since quorum is not active

Table 13 seems to be the rangelog table. Since we're doing a rolling upgrade (only one node should be down at a time), we should never see messages like this.

The time from shutdown until a node can receive raft requests again was around 1 minute, so we might simply not be giving nodes enough time to come back up. I changed the readiness probe to be much stricter (the probe must pass before the upgrade moves on to the next node), and although the amount of time kv spent at 0 QPS was reduced, the drop still happened.

I can reproduce a drop in QPS by simply shutting down and restarting a node running v2.0-alpha.20180129. I start seeing:

I180209 22:52:28.288427 150 storage/node_liveness.go:627  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=rpc error: code = DeadlineExceeded desc = context deadline exceeded [propagate])
W180209 22:52:28.288650 150 storage/node_liveness.go:426  [n1,hb] slow heartbeat took 4.5s
W180209 22:52:28.288692 150 storage/node_liveness.go:365  [n1,hb] failed node liveness heartbeat: context deadline exceeded

on all other nodes.
Trying to check the /reports/network page also seems to hang during this period. I think there might be a networking issue.

@a-robinson
Contributor

I can reproduce a drop in QPS by simply shutting down and restarting a node running v2.0-alpha.20180129. I start seeing:

Did you shut it down gracefully or forcefully? If the former, that's definitely a bug.

If it's the latter, it brings up an interesting issue with our fault tolerance. Because our expiration-based range leases last 9 seconds (and we renew them when there's 4.5 seconds left), if the node liveness range's leaseholder gets forcefully killed, it'll typically be 4.5-9 seconds before any other node is able to ping their liveness record. And because node liveness records also only last 9 seconds before they're considered expired, that 4.5-9 seconds could often be enough of a delay for nodes to lose their liveness. In other words, the node liveness range's leaseholder going down can cause all nodes to be non-live for a short period of time. It shouldn't be for long, but could make for short 0 qps periods across the cluster.
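
To make the timing concrete, a small sketch of the arithmetic (the constants are the values cited above, not read from any cluster settings):

package livenesssketch

import "time"

const (
	leaseDuration    = 9 * time.Second   // expiration-based range lease
	renewalThreshold = leaseDuration / 2 // renewed when 4.5s remain
	livenessTTL      = 9 * time.Second   // liveness records expire after 9s
)

// LivenessOutageWindow returns how long liveness heartbeats can be blocked if
// the liveness range's leaseholder is killed forcefully: between 4.5s (it died
// just before renewing its lease) and 9s (it died right after renewing),
// another node has to wait for the old lease to expire before it can take
// over and serve liveness writes.
func LivenessOutageWindow() (minGap, maxGap time.Duration) {
	return leaseDuration - renewalThreshold, leaseDuration
}

Any node whose liveness record has less remaining validity (out of livenessTTL) than that window briefly goes non-live, which is what can push QPS to 0 across the whole cluster for a few seconds.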

@asubiotto
Contributor

It was a graceful shutdown. I can reproduce the drop in QPS with a 3-node local cluster, so it definitely seems like there's something wrong with the draining. I'll keep investigating.

@vivekmenezes
Contributor

Possibly related issue: #22630

@vivekmenezes
Contributor

@bobvawter to QA this once it's fixed.

@asubiotto
Contributor

asubiotto commented Feb 15, 2018

I just want to consolidate the items of work to be done here into one list:

@asubiotto
Contributor

asubiotto commented Mar 2, 2018

We have two types of changes:

  1. Improvements to the draining process ( storage: transfer raft leadership and wait grace period when draining #22767 and storage: Avoid transferring leases to draining stores #23265)
  2. Improvements to the interaction with load balancers (server: fail readiness checks for server.drain.unready_wait #23233 and server: split health endpoint into health and readiness endpoints #22911)

All of these changes have been cherry-picked into 2.0; however, only the first type has been cherry-picked into 1.1, because backporting the second would require commits that make it more of a feature change than a bug fix. This means that for rolling upgrades from 1.1.6 to 2.0, the load balancer integration won't be there, so clients may see a dip in QPS plus the errors observed in the initial comment, caused by retrying connections to draining servers (however, it won't be a drop to 0).

For 2.0, there are still some extra items of work to be done:

@bobvawter, for your QA, you might want to check what the effect on QPS is when doing a rolling upgrade from 1.1 to 2.0 with these changes.

To close this issue, I think we should just do a rolling restart of the new 2.0 version and verify that QPS doesn't dip.
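
For QA, something like the following can be used to watch both endpoints on a node during a rolling restart (a sketch; the readiness path is an assumption to be checked against #22911, the expectation being that the readiness check fails while the node is draining, whereas the plain health check keeps passing):

package qasketch

import (
	"fmt"
	"net/http"
	"time"
)

// Placeholder paths: /health is the existing check; the readiness path below
// stands in for whatever route #22911 exposes and should be adjusted to match.
const (
	healthPath    = "/health"
	readinessPath = "/health?ready=1" // assumption, verify against #22911
)

// watchNode polls both endpoints once a second so that, during a rolling
// restart, you can see when the readiness check starts failing on a draining
// node and when both come back after the restart.
func watchNode(baseURL string) {
	check := func(path string) int {
		resp, err := http.Get(baseURL + path)
		if err != nil {
			return -1
		}
		defer resp.Body.Close()
		return resp.StatusCode
	}
	for range time.Tick(time.Second) {
		fmt.Printf("%s health=%d ready=%d\n",
			time.Now().Format("15:04:05"), check(healthPath), check(readinessPath))
	}
}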

@vivekmenezes
Contributor

Excellent update, @asubiotto!

@asubiotto
Contributor

@bobvawter, when you QA, note that the default drain_wait should be set to something reasonable (#23333) and that there might also be something to do with regard to configuring the readiness/liveness check period and threshold.

@a-robinson
Contributor

I looked into re-adding the liveness/readiness checks last night, and the liveness check isn't re-addable yet, at least not without additional work on the kubernetes config file to run cockroach init immediately: #22468 (comment)

@asubiotto
Contributor

Closing this as @a-robinson has verified that QPS doesn't dip anymore.
