core: Latency spike when starting a previously killed node using pkill #36397
Comments
@awoods187 what did your workload report through all of this? Did it also show a high p99 latency and/or a drop in qps? I'm seeing indications in the logs that things weren't quite right when n2 came back. Just to double-check, you did not restart the cluster to get out of the problematic state, but only the workload? The problems that I'm seeing below don't look related to the workload at all, so perhaps restarting the workload coincided with something in the cluster "fixing itself". I would suspect something like #36413 but that would persist through a workload restart and would require a restart of the affected nodes instead.
@tbg I was testing using TPC-C. It ran the whole time (although with increased latency). I never restarted the cluster/nodes (except for the one node I pkilled). I only Ctrl-C'd the workload and then restarted it in the other terminal. As far as I can tell the cluster never fixed itself; it just had a brief moment in which it appeared to do so.
Heh, wonderful:
The event triggering this is a Raft snapshot coming in from n4, which it looks like we had been waiting on for a long time. Given the sequence of events, this suggests the following happened:
This definitely explains the high p99 latencies. It's tempting to somehow detect that and to send requests to the leader more proactively, but I don't see a very good signal. The Raft group is a follower and doesn't know it needs a snapshot; that's something the leader decides when the follower rejects an append. The leader will still communicate an updated commit index via the (coalesced) heartbeats; we could try to use that to get a signal that we're behind and probably shouldn't try to get the lease until we're caught up.
No, the commit index in the heartbeat will be capped at what the leader knows the follower has acked. We could add a separate "you're behind and I need to send you a snapshot" flag, but it would be separate from what raft sends. My suggestion for a heuristic would be that if we have a lease request that has been outstanding for more than base.NetworkTimeout, we don't wait for it and return NotLeaseHolderError (with no hint) immediately. Longer term, this could also tie into the ErrProposalDropped work (#21849). If the proposal becomes a round-trip RPC to the leader instead of a fire-and-forget raft MsgProp, the leader could reject proposals coming from followers that aren't up to date.
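A minimal Go sketch of that fail-fast heuristic, with hypothetical names: networkTimeout, pendingLease, and errNotLeaseHolder stand in for base.NetworkTimeout, the per-Replica pending lease state, and NotLeaseHolderError, and are not the actual CockroachDB code paths:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// networkTimeout stands in for base.NetworkTimeout; the real value and
// plumbing in CockroachDB differ.
const networkTimeout = 3 * time.Second

// errNotLeaseHolder is a placeholder for NotLeaseHolderError (with no hint),
// which pushes the request back to the client-side retry loop.
var errNotLeaseHolder = errors.New("not lease holder; try another replica")

// pendingLease is a hypothetical stand-in for the per-Replica state around
// an in-flight lease acquisition.
type pendingLease struct {
	startedAt time.Time // when the lease acquisition was proposed
}

// maybeFailFast returns an error instead of queueing the caller behind a
// lease request that has already been outstanding longer than the timeout.
func (p *pendingLease) maybeFailFast(now time.Time) error {
	if now.Sub(p.startedAt) > networkTimeout {
		return errNotLeaseHolder
	}
	return nil // otherwise keep waiting on the in-flight lease request
}

func main() {
	p := &pendingLease{startedAt: time.Now().Add(-5 * time.Second)}
	if err := p.maybeFailFast(time.Now()); err != nil {
		fmt.Println("redirecting:", err)
	}
}
```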
Ah, right.
I just worry that this heuristic doesn't get us far enough. We'll get random requests ending up at these replicas regularly, and can't afford to spend three seconds for each of them if we want to control p99. It's still a lot better than what we have today, though, so we should just do it and see where we land.
To be clear, I'm not suggesting a 3s wait per request, I'm suggesting 3s per Replica (at which point we'd set a flag on the pendingLeaseRequest or something to fail fast). That's still not great, but might be enough that we'd see a brief spike in p99 instead of something that is sustained until all the snapshots have caught up. We may also want to reevaluate the aggressiveness of log truncation. Should a ~4 minute outage really be enough to cause this much snapshot activity?
Log truncation got so aggressive because of the inclusion of log entries in snapshots. Now that this isn't an issue any more, we may want to dial it back significantly (but then need to do work to make sure log truncations don't get too large, I think they're not chunked right now).
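As a rough illustration of what chunked truncation could look like (the names and chunk size are made up; this is not the raft log queue's actual logic):

```go
package main

import "fmt"

// maxChunk caps how many log entries a single truncation pass removes; the
// constant and loop below are illustrative only.
const maxChunk = 1000

// truncateInChunks walks from firstIndex up to targetIndex in bounded steps,
// so that no single truncation has to delete an unbounded amount of log.
func truncateInChunks(firstIndex, targetIndex uint64, truncate func(upTo uint64)) {
	for firstIndex < targetIndex {
		upTo := firstIndex + maxChunk
		if upTo > targetIndex {
			upTo = targetIndex
		}
		truncate(upTo)
		firstIndex = upTo
	}
}

func main() {
	truncateInChunks(10, 2510, func(upTo uint64) {
		fmt.Println("truncating raft log up to index", upTo)
	})
}
```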
Ah, yeah, setting a flag is probably the way to go, but I'm worried that if we're not careful we can end up in a situation where everyone has the flag and so nobody gets to do anything.
Even when the flag is set the lease request would keep going in the background. The flag would be cleared when it completes, so the node whose lease request completes would eventually be able to proceed. This would effectively move the blocking from redirectOnOrAcquireLeaderLease to the DistSender's retry loop.
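To sketch the flag semantics being discussed (hypothetical names; not the actual pendingLeaseRequest or redirectOnOrAcquireLeaderLease code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotLeaseHolder = errors.New("not lease holder; retry via DistSender")

// slowLeaseTracker is a hypothetical per-Replica helper: the in-flight lease
// request keeps running in the background, but once it has been pending for
// too long, new requests fail fast instead of queueing behind it.
type slowLeaseTracker struct {
	mu   sync.Mutex
	slow bool // set once the pending lease request exceeds the timeout
}

// markSlow is called when the pending lease request passes the timeout.
func (t *slowLeaseTracker) markSlow() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.slow = true
}

// leaseAcquired clears the flag once the background lease request completes,
// letting this replica serve requests again.
func (t *slowLeaseTracker) leaseAcquired() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.slow = false
}

// checkCanServe is what the lease-acquisition path would consult: instead of
// blocking here, the error pushes the request back into the client-side retry
// loop, which can try other replicas in the meantime.
func (t *slowLeaseTracker) checkCanServe() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.slow {
		return errNotLeaseHolder
	}
	return nil
}

func main() {
	t := &slowLeaseTracker{}
	t.markSlow()
	fmt.Println(t.checkCanServe()) // fails fast while the snapshot is pending
	t.leaseAcquired()
	fmt.Println(t.checkCanServe()) // nil: the lease was finally acquired
}
```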
We have marked this issue as stale because it has been inactive for |
Hi, are you still considering this improvement for a future release?
Hi @cindyzqtnew, there isn't any particular action proposed by this issue. We are generally improving the behavior of CRDB around restarts. In fact, I believe we have already addressed the problem that was likely the root cause of the original post.
Hi @tbg, I've opened up another thread (#68225) which I think is related to this. We're encountering high latency whenever a node is taken down, and the root cause is that the range descriptor cache is not updated instantly; instead, a request waits for the 3s (gRPC) timeout and then tries the next replica until it finds a valid leaseholder, at which point the cache is updated accordingly. Is there any way to update the cache proactively, without waiting for queries?
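For illustration, the retry behavior described above boils down to something like the following sketch (made-up names, not the real DistSender or range descriptor cache): the sender tries the cached leaseholder first, pays the RPC timeout if that node is down, and only then moves on and refreshes the cache.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFn models sending a request to one replica; on failure it may return
// the leaseholder that replica believes is current.
type sendFn func(replica string) (leaseholder string, err error)

// sendWithCache tries the cached leaseholder first and only updates the cache
// after an attempt fails, which is where the multi-second stall comes from
// when the cached node is down. Illustrative only.
func sendWithCache(cache map[string]string, rangeID string, replicas []string, send sendFn) error {
	ordered := replicas
	if lh, ok := cache[rangeID]; ok {
		ordered = append([]string{lh}, replicas...) // cached leaseholder goes first
	}
	for _, r := range ordered {
		lh, err := send(r)
		if err == nil {
			cache[rangeID] = r // remember who actually served us
			return nil
		}
		if lh != "" {
			cache[rangeID] = lh // learn the new leaseholder from the error
		}
	}
	return errors.New("no replica could serve the request")
}

func main() {
	cache := map[string]string{"r1": "n2"} // stale: n2 was just taken down
	if err := sendWithCache(cache, "r1", []string{"n3"}, func(r string) (string, error) {
		if r == "n2" {
			// In practice this attempt costs the ~3s gRPC timeout.
			return "n3", errors.New("replica unavailable")
		}
		return "", nil
	}); err != nil {
		fmt.Println("send failed:", err)
	}
	fmt.Println("cached leaseholder is now:", cache["r1"])
}
```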
Describe the problem
After killing a node, I saw a sustained 10s latency spike when restarting the same node:
The first QPS dip is where I killed the node. The giant latency spike is when I started the node back up again.
To Reproduce
export CLUSTER=andy-schema
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=3800 --tolerate-errors --duration=10h --split --scatter {pgurl:1-6}"
Then, in another terminal I created an index:
While creating that index, in a third window I pkilled cockroach:
ubuntu@ip-172-31-33-34:~$ pkill cockroach
Cockroach handled this well. I went to clean up the cluster to try a new test:
And then started the node I pkilled:
Andrews-MBP-2:~ andrewwoods$ roachprod start $CLUSTER:2
This is where I saw the large latency spike.
I then verified that node 2 is reconnected with:
Environment:
v19.1.0-beta.20190318-547-gac1ec6a