QPS Drops to Zero during rolling upgrade #22424

Closed
nstewart opened this issue Feb 6, 2018 · 23 comments

@nstewart
Contributor

nstewart commented Feb 6, 2018

Running the kv workload on a 6-node k8s cluster (5 CockroachDB nodes). I triggered a rolling update from 1.1.5 to v2.0-alpha.20180129 and saw QPS temporarily drop to zero.

   6m53s        0          861.8          626.5      2.5   3892.3   4026.5
   6m54s        0          560.9          626.3    226.5    335.5    352.3
   6m55s        0           25.0          624.9    436.2    436.2    436.2
   6m56s        0            0.0          623.4      0.0      0.0      0.0
   6m57s        0            0.0          621.9      0.0      0.0      0.0
   6m58s        0            0.0          620.4      0.0      0.0      0.0
   6m59s        0            0.0          618.9      0.0      0.0      0.0
    7m0s        0            0.0          617.4      0.0      0.0      0.0
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
    7m1s        0            0.0          616.0      0.0      0.0      0.0
    7m2s        0            0.0          614.5      0.0      0.0      0.0
    7m3s        0            0.0          613.1      0.0      0.0      0.0
    7m4s        0            0.0          611.6      0.0      0.0      0.0
    7m5s        0            0.0          610.2      0.0      0.0      0.0
    7m6s        0            0.0          608.8      0.0      0.0      0.0
    7m7s        0            0.0          607.3      0.0      0.0      0.0
    7m8s        0            0.0          605.9      0.0      0.0      0.0
    7m9s        0            0.0          604.5      0.0      0.0      0.0
   7m10s        0            0.0          603.1      0.0      0.0      0.0
   7m11s        0            0.0          601.7      0.0      0.0      0.0
   7m12s        0            0.0          600.3      0.0      0.0      0.0

I also saw pq: server is not accepting clients once, which incremented the error count to 1.

(screenshot omitted)

I'm using the k8s internal load balancer, which checks /health on nodes every 5 seconds.
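
For context, the load balancer side amounts to roughly the following (an illustrative Go sketch, not the actual Kubernetes probe code; the failure threshold and helper names are assumptions):

package lbsketch

import (
	"net/http"
	"time"
)

// pollHealth models a load balancer probing a node's /health endpoint every
// 5 seconds and dropping the backend after a few consecutive failures. The
// threshold of 3 is an assumed value, not taken from the k8s configuration.
func pollHealth(nodeAddr string, removeBackend func()) {
	const (
		period    = 5 * time.Second
		threshold = 3
	)
	failures := 0
	for range time.Tick(period) {
		resp, err := http.Get("http://" + nodeAddr + "/health")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			failures = 0
			continue
		}
		if err == nil {
			resp.Body.Close()
		}
		failures++
		if failures >= threshold {
			removeBackend() // stop routing new connections to this node
			return
		}
	}
}

With a 5s period and a small failure threshold like this, a draining node has to keep failing /health for roughly period x threshold before the LB stops sending it new connections, which is what the grace-period discussion below is about.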

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Note from @bdarnell re: the server not accepting clients:

our health endpoint may need to be updated to support our draining state, or we need to augment the draining protocol to wait long enough for the LB's health checks to occur before going into this draining state

@rjnn
Contributor

rjnn commented Feb 6, 2018

Pinging @jordanlewis for triage. It would be great to get a fix in before 2.0, and also cherry-pick it into 1.1, so that we can perform a zero-downtime rolling upgrade.

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Also including latencies if that's helpful:
(latency graph screenshot omitted)

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

Thanks @arjunravinarayan, I'll hold off on recording a demo until we have a fix or workaround.

@petermattis
Collaborator

@nstewart Can you provide step-by-step instructions for what you did? Doing so will make it easier for an engineer to reproduce and fix.

@a-robinson
Contributor

For what it's worth, I'd put like 95% odds on this being an issue with the Kubernetes configuration, not with Cockroach.

@bdarnell
Contributor

bdarnell commented Feb 6, 2018

I think it's in the code: we're setting the flag on the pgwire server to refuse connections at the beginning of the draining process, instead of waiting for a period during which we fail health checks but still serve new connections.

The process should look something like this:

  1. Health checks fail, but nothing else changes. Stay in this state long enough for load balancers to notice. (I've generally used 30s for this phase)
  2. Begin returning errors on new connections and close existing connections as they become idle. I'd stay in this state for ~10s
  3. Transfer leases away. Exit this state as soon as our lease count reaches zero (also set an upper limit - don't stay in this state longer than about a minute)
  4. Close rocksdb and terminate the process.
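
Roughly, in Go terms, the sequence would be something like this (a sketch only; the helper names and the drainer interface are placeholders, not the actual server API, and the durations come straight from the steps above):

package drainsketch

import (
	"os"
	"time"
)

// drainer is a hypothetical view of the server; none of these methods exist
// under these names, they just mirror the four steps above.
type drainer interface {
	SetHealthCheckFailing(bool) // step 1: /health starts returning errors
	RejectNewSQLConns(bool)     // step 2: pgwire refuses new connections
	CloseIdleSQLConns()         //         existing conns closed as they idle
	LeaseCount() int            // step 3: leases still held by this node
	TransferLeases()            //         move a batch of leases elsewhere
	CloseEngine()               // step 4: shut down RocksDB
}

func drain(s drainer) {
	// 1. Keep serving, but fail health checks long enough for LBs to notice.
	s.SetHealthCheckFailing(true)
	time.Sleep(30 * time.Second)

	// 2. Stop accepting new SQL connections; give existing ones ~10s to drain.
	s.RejectNewSQLConns(true)
	s.CloseIdleSQLConns()
	time.Sleep(10 * time.Second)

	// 3. Transfer leases away, with an upper bound of about a minute.
	deadline := time.Now().Add(time.Minute)
	for s.LeaseCount() > 0 && time.Now().Before(deadline) {
		s.TransferLeases()
		time.Sleep(100 * time.Millisecond)
	}

	// 4. Close storage and terminate the process.
	s.CloseEngine()
	os.Exit(0)
}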

@nstewart
Contributor Author

nstewart commented Feb 6, 2018

@petermattis detailed repro:

  1. Create a 5-node m4.xl cluster with the 'stable' release and the KV workload, using the CloudFormation template: https://amzn.to/2CZjJLZ
  2. SSH into the machine (proxy command is in the outputs section of the template)
  3. sudo nano /tmp/cockroachdb-statefulset.yaml and update image to image: cockroachdb/cockroach-unstable:latest
  4. kubectl apply -f /tmp/cockroachdb-statefulset.yaml

You'll see k8s start the rolling update.

You can check logs with kubectl logs [kv-podid] -f

The admin UI link is also in the CloudFormation template's outputs section. Note that when I reproduced this time, QPS still went to zero, but not for long enough to register in the admin UI. I still saw the 99th-percentile latency spike in the graph and the refused-connection error in the logs, though.

The initial StatefulSet config is here: https://github.com/cockroachdb/cockroachdb-cloudformation/blob/master/scripts/cockroachdb-statefulset.yaml though some fields get modified based on the template parameters you use (they don't change the load balancer settings, though).

@petermattis
Collaborator

@asubiotto Can you take a look at this tomorrow? Is Ben's diagnosis correct, or is something else going on? Should be easy to reproduce with Nate's instructions.

@asubiotto
Contributor

Ben's diagnosis is correct given the configuration; the main issue is that we don't report a node as unavailable through the Health endpoint when it is draining. The idea I had when writing the draining code was that the health check would be performed as a SQL-level check through the pgwire server. We might be able to change the Kubernetes configuration to do this check, but I am not certain this is an option.

Regardless, I think it would be good to leave a grace period in which we accept new clients but health checks fail. My biggest question is: is it correct to return nil, error from adminServer.Health when a node is unavailable or decommissioning? Will this have any unwanted effects on the admin UI? If not, I'll make the change to make draining a bit more graceful.
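
A rough sketch of what that change could look like (the types, fields, and the draining hook here are assumptions for illustration, not the actual adminServer code):

package healthsketch

import (
	"context"
	"errors"
)

// HealthRequest and HealthResponse stand in for the real request/response
// types; the draining callback is likewise an assumption about where that
// state would come from.
type HealthRequest struct{}
type HealthResponse struct{}

type adminServer struct {
	draining func() bool // reports whether this node has started draining
}

// Health returns a nil response and an error while the node is draining, so
// that load balancers polling this endpoint stop sending new connections
// during the grace period described above.
func (s *adminServer) Health(
	ctx context.Context, req *HealthRequest,
) (*HealthResponse, error) {
	if s.draining() {
		return nil, errors.New("node is draining")
	}
	return &HealthResponse{}, nil
}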

@a-robinson
Contributor

The idea I had when writing the draining code was that the health check would be performed as a SQL-level check through the pgwire server. We might be able to change the Kubernetes configuration to do this check, but I am not certain this is an option.

Yes, that could be done.

@asubiotto
Contributor

I think I confused myself. The load balancer/draining behavior above is a problem that I will fix, but it's odd that only an error or two is reported during the rolling update due to connection attempts to a draining node; it seems that the queries are hanging somehow. I don't think it's a lease issue, since all nodes seem to drain properly. I'm going to take a closer look at this.

@asubiotto
Contributor

Status update:

I can reproduce this easily during a rolling upgrade. My initial thought was that, since we don't move ranges off of draining nodes, we could have a situation in which two consecutively restarted nodes that are part of the same raft groups are unable to service raft requests. This was based on seeing messages like this in the logs:

W180209 19:39:43.147945 140 vendor/github.com/coreos/etcd/raft/raft.go:825 [n1,s1,r10/1:/Table/1{3-4}] 1 stepped down to follower since quorum is not active

Table 13 seems to be the rangelog table. Since we're doing a rolling upgrade (only one node should be down at a time), we should never see messages like this.

The time from shutdown until a node can receive raft requests again was around 1 minute, so we might simply not be giving nodes enough time to come back up. I changed the readiness probe to be much stricter (the probe must pass before the upgrade moves on to the next node), and although the amount of time kv spent at 0 QPS was reduced, the drop still happened.

I can reproduce a drop in QPS by simply shutting down and restarting a node running v2.0-alpha.20180129. I start seeing:

I180209 22:52:28.288427 150 storage/node_liveness.go:627  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=rpc error: code = DeadlineExceeded desc = context deadline exceeded [propagate])
W180209 22:52:28.288650 150 storage/node_liveness.go:426  [n1,hb] slow heartbeat took 4.5s
W180209 22:52:28.288692 150 storage/node_liveness.go:365  [n1,hb] failed node liveness heartbeat: context deadline exceeded

on all other nodes.
Trying to check the /reports/network page also seems to hang during this period. I think there might be a networking issue.

@a-robinson
Contributor

I can reproduce a drop in QPS by simply shutting down and restarting a node running v2.0-alpha.20180129. I start seeing:

Did you shut it down gracefully or forcefully? If the former, that's definitely a bug.

If it's the latter, it brings up an interesting issue with our fault tolerance. Because our expiration-based range leases last 9 seconds (and we renew them when there's 4.5 seconds left), if the node liveness range's leaseholder gets forcefully killed, it'll typically be 4.5-9 seconds before any other node is able to ping their liveness record. And because node liveness records also only last 9 seconds before they're considered expired, that 4.5-9 seconds could often be enough of a delay for nodes to lose their liveness. In other words, the node liveness range's leaseholder going down can cause all nodes to be non-live for a short period of time. It shouldn't be for long, but could make for short 0 qps periods across the cluster.
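
To make the timing concrete, a small sketch of the arithmetic (the constants are the values cited above, not read from any cluster settings):

package livenesssketch

import "time"

const (
	leaseDuration    = 9 * time.Second   // expiration-based range lease
	renewalThreshold = leaseDuration / 2 // renewed when 4.5s remain
	livenessTTL      = 9 * time.Second   // liveness records expire after 9s
)

// LivenessOutageWindow returns how long liveness heartbeats can be blocked if
// the liveness range's leaseholder is killed forcefully: between 4.5s (it died
// just before renewing its lease) and 9s (it died right after renewing),
// another node has to wait for the old lease to expire before it can take
// over and serve liveness writes.
func LivenessOutageWindow() (minGap, maxGap time.Duration) {
	return leaseDuration - renewalThreshold, leaseDuration
}

Any node whose liveness record has less remaining validity (out of livenessTTL) than that window briefly goes non-live, which is what can push QPS to 0 across the whole cluster for a few seconds.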

@asubiotto
Contributor

It was a graceful shutdown. I can reproduce the drop in QPS with a 3-node local cluster, so it definitely seems like there's something wrong with the draining. I'll keep investigating.

@vivekmenezes
Contributor

Possibly related issue: #22630

@vivekmenezes
Contributor

@bobvawter to QA this once it's fixed.

@asubiotto
Contributor

asubiotto commented Feb 15, 2018

I just want to consolidate the items of work to be done here into one list:

@asubiotto
Contributor

asubiotto commented Mar 2, 2018

We have two types of changes:

  1. Improvements to the draining process ( storage: transfer raft leadership and wait grace period when draining #22767 and storage: Avoid transferring leases to draining stores #23265)
  2. Improvements to the interaction with load balancers (server: fail readiness checks for server.drain.unready_wait #23233 and server: split health endpoint into health and readiness endpoints #22911)

All of these changes have been cherry-picked into 2.0; however, only the first type has been cherry-picked into 1.1, because backporting the second would require commits that make it more of a feature change than a bug fix. This means that for rolling upgrades from 1.1.6 to 2.0, the load balancer integration won't be there, so clients may see a dip in QPS plus the errors observed in the initial comment, caused by retrying connections to draining servers (however, it won't be a drop to 0).

For 2.0, there are still some extra items of work to be done:

@bobvawter, for your QA, you might want to check what the effect on QPS is when doing a rolling upgrade from 1.1 to 2.0 with these changes.

To close this issue, I think we should just do a rolling restart of the new 2.0 version and verify that QPS doesn't dip.
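
For QA, something like the following can be used to watch both endpoints on a node during a rolling restart (a sketch; the readiness path is an assumption to be checked against #22911, the expectation being that the readiness check fails while the node is draining, whereas the plain health check keeps passing):

package qasketch

import (
	"fmt"
	"net/http"
	"time"
)

// Placeholder paths: /health is the existing check; the readiness path below
// stands in for whatever route #22911 exposes and should be adjusted to match.
const (
	healthPath    = "/health"
	readinessPath = "/health?ready=1" // assumption, verify against #22911
)

// watchNode polls both endpoints once a second so that, during a rolling
// restart, you can see when the readiness check starts failing on a draining
// node and when both come back after the restart.
func watchNode(baseURL string) {
	check := func(path string) int {
		resp, err := http.Get(baseURL + path)
		if err != nil {
			return -1
		}
		defer resp.Body.Close()
		return resp.StatusCode
	}
	for range time.Tick(time.Second) {
		fmt.Printf("%s health=%d ready=%d\n",
			time.Now().Format("15:04:05"), check(healthPath), check(readinessPath))
	}
}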

@vivekmenezes
Contributor

Excellent update, @asubiotto!

@asubiotto
Contributor

@bobvawter, when you QA, note that the default drain_wait should be set to something reasonable (#23333) and that there might also be something to do with regard to configuring the readiness/liveness check period and threshold.

@a-robinson
Contributor

I looked into re-adding the liveness/readiness checks last night, and the liveness check isn't re-addable yet, at least not without additional work on the kubernetes config file to run cockroach init immediately: #22468 (comment)

@asubiotto
Contributor

Closing this as @a-robinson has verified that QPS doesn't dip anymore.
