QPS Drops to Zero during rolling upgrade #22424
Comments
note from @bdarnell re: the server not accepting clients
Pinging @jordanlewis for triage. It would be great to get a fix in before 2.0, and also to cherry-pick it into 1.1 so that we can perform a zero-downtime rolling upgrade.
Thanks @arjunravinarayan, I'll hold off on recording a demo until we have a fix or workaround.
@nstewart Can you provide step-by-step instructions for what you did? Doing so will make it easier for an engineer to reproduce and fix.
For what it's worth, I'd put like 95% odds on this being an issue with the Kubernetes configuration, not with Cockroach.
I think it's in the code: we're setting the flag on the pgwire server to refuse connections at the beginning of the draining process, instead of waiting for a period during which we fail health checks but still serve new connections. The process should look something like this: start failing health checks while still accepting new connections, wait long enough for load balancers to take the node out of rotation, and only then refuse new connections and drain the existing ones.
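A minimal sketch of that ordering, assuming hypothetical stopAcceptingClients, drainConnections, and transferLeases hooks standing in for CockroachDB's actual drain code:

```go
package main

import (
	"sync/atomic"
	"time"
)

// draining is the flag a health endpoint would consult; it is flipped first
// so that health checks start failing while new SQL connections are still accepted.
var draining atomic.Bool

// Hypothetical hooks; CockroachDB's real drain code is structured differently.
func stopAcceptingClients() { /* flip the pgwire "accepting clients" flag */ }
func drainConnections()     { /* wait for active SQL sessions to finish */ }
func transferLeases()       { /* move range leases off this node */ }

func gracefulDrain(lbGracePeriod time.Duration) {
	// 1. Start failing health checks, but keep serving new connections so the
	//    load balancer has time to take this node out of rotation.
	draining.Store(true)
	time.Sleep(lbGracePeriod)

	// 2. Only now refuse new connections and drain the existing ones.
	stopAcceptingClients()
	drainConnections()

	// 3. Finally, shed leases and shut down.
	transferLeases()
}

func main() {
	gracefulDrain(10 * time.Second)
}
```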
@petermattis detailed repro:
You'll see k8s start the rolling update; you can check the logs, and the admin UI link is also in the CloudFormation templates. The initial StatefulSet config is here: https://github.com/cockroachdb/cockroachdb-cloudformation/blob/master/scripts/cockroachdb-statefulset.yaml, though some fields get modified based on the template parameters you use (they don't change the load balancer settings, though).
@asubiotto Can you take a look at this tomorrow? Is Ben's diagnosis correct, or is something else going on? Should be easy to reproduce with Nate's instructions.
Ben's diagnosis is correct given the configuration; the main issue is that we don't report a node as unavailable through the health endpoint while it is draining. Regardless, I think it would be good to leave a grace period in which we accept new clients but health checks fail. My biggest question is: is it correct to report the node as unhealthy while it is still accepting new connections?
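As an illustration of that grace period, one common pattern is for the health endpoint to start returning 503 as soon as draining begins, while the SQL listener keeps accepting connections until the grace period elapses. A hedged sketch (the draining flag and handler are illustrative, not CockroachDB's actual /health implementation):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// draining would be set at the start of the drain, before any clients are refused.
var draining atomic.Bool

// healthHandler is an illustrative stand-in for a /health endpoint: it starts
// failing as soon as draining begins so the load balancer stops routing new
// traffic here, even though the node is still serving clients.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		http.Error(w, "node is draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/health", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With checks every 5 seconds, as in the load balancer config mentioned in this issue, a grace period covering a couple of check intervals should be enough for the node to drop out of rotation before connections are refused.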
Yes, that could be done.
I think I confused myself. Although the load balancer/draining behavior described above is a problem that I will fix, it's weird that only an error or two is reported during the rolling update due to a connection attempt to a draining node, so it seems that the queries are hanging somehow. I don't think it's a lease issue, since all nodes seem to drain properly. I'm going to take a closer look at this.
Status update: I can reproduce this easily during a rolling upgrade. My initial thought was that since we don't move ranges off of draining nodes, we could have a situation in which two consecutively upgraded nodes that are part of the same raft groups are unable to service raft requests; messages in the logs suggested as much. The time taken from shutdown until a node could receive raft requests again was around 1 minute, and we might simply not be giving nodes enough time to come back up. I changed the readiness probe to be much stricter (a probe that must pass before the upgrade moves on to the next node), and although that reduced the amount of time kv dropped to 0 QPS, the drop still happened. I can also reproduce a drop in QPS by simply shutting down and restarting a single node while running the kv workload against all the other nodes.
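A stricter readiness probe effectively amounts to blocking the rollout until the restarted node answers its health endpoint again. A rough sketch of that wait, with a placeholder URL and timings rather than the actual Kubernetes probe configuration:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilReady polls a node's health endpoint until it reports healthy,
// approximating what a stricter Kubernetes readiness probe enforces before a
// rolling update moves on to the next pod.
func waitUntilReady(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second) // placeholder probe interval
	}
	return fmt.Errorf("node at %s not ready within %s", url, timeout)
}

func main() {
	// Placeholder address; a real probe would target the pod's HTTP port.
	if err := waitUntilReady("http://localhost:8080/health", 2*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```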
Did you shut it down gracefully or forcefully? If the former, that's definitely a bug. If it's the latter, it brings up an interesting issue with our fault tolerance. Because our expiration-based range leases last 9 seconds (and we renew them when there's 4.5 seconds left), if the node liveness range's leaseholder gets forcefully killed, it'll typically be 4.5-9 seconds before any other node is able to ping its liveness record. And because node liveness records also only last 9 seconds before they're considered expired, that 4.5-9 seconds could often be enough of a delay for nodes to lose their liveness. In other words, the node liveness range's leaseholder going down can cause all nodes to be non-live for a short period of time. It shouldn't be for long, but could make for short 0 QPS periods across the cluster.
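Rough arithmetic behind that window, treating the numbers quoted above (9 s leases, renewal with 4.5 s remaining, 9 s liveness expiration) as given:

```go
package main

import "fmt"

func main() {
	const (
		leaseDuration  = 9.0 // seconds an expiration-based lease lasts
		renewAt        = 4.5 // seconds remaining when renewal normally happens
		livenessExpiry = 9.0 // seconds before a liveness record counts as expired
	)

	// If the liveness range's leaseholder is killed forcefully, its lease can
	// have anywhere from renewAt to leaseDuration seconds left, during which no
	// other node can update its liveness record.
	fmt.Printf("liveness pings blocked for %.1f to %.1f seconds\n", renewAt, leaseDuration)

	// A node whose own liveness record (valid for livenessExpiry seconds) was
	// already close to expiring can therefore briefly be considered non-live,
	// which shows up as a short 0 QPS dip across the cluster.
	fmt.Printf("liveness records expire after %.1f seconds\n", livenessExpiry)
}
```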
It was a graceful shutdown. I can reproduce the drop in QPS with a 3-node local cluster; it definitely seems like there's something wrong with the draining. I'll keep investigating.
Possibly related issue: #22630
@bobvawter to QA this once it's fixed.
I just want to consolidate the items of work to be done here into one list:
We have two types of changes:
All of these changes have been cherry-picked into 2.0; however, only the first type has been cherry-picked into 1.1, because the rest depend on commits that would make this more of a feature change than a bug fix. This means that rolling upgrades from 1.1 only get the first type of improvement. For 2.0, there are still some extra items of work to be done:
@bobvawter, for your QA, you might want to check what the effect on QPS is when doing a rolling upgrade from 1.1 to 2.0. To close this issue, I think we should just do a rolling restart of the new version.
Excellent update, @asubiotto!
@bobvawter, when you QA, note that the default
I looked into re-adding the liveness/readiness checks last night, and the liveness check isn't re-addable yet, at least not without additional work on the Kubernetes config file to run
Closing this as @a-robinson has verified that QPS doesn't dip anymore. |
Running the kv workload on a 6-node k8s cluster (5 CockroachDB nodes), I triggered a rolling update from 1.1.5 to v2.0-alpha.20180129 and saw QPS temporarily drop to zero. I also saw "pq: server is not accepting clients" once, which incremented the error count to 1. I'm using the k8s internal load balancer, which checks /health on the nodes every 5 seconds.