
core: Latency spike when starting a node previously killed using pkill #36397

Closed
awoods187 opened this issue Apr 1, 2019 · 17 comments

Labels: A-kv-distribution (relating to rebalancing and leasing), C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), S-2 (medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors), no-issue-activity, X-stale
Milestone: 19.2

Comments

@awoods187
Contributor

awoods187 commented Apr 1, 2019

Describe the problem

After killing a node, I saw a sustained 10s latency spike when restarting the same node:
[screenshot: QPS and latency graph]
The first QPS dip is where I killed the node. The giant latency spike is when I started the node back up again.

To Reproduce

export CLUSTER=andy-schema
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=3800 --tolerate-errors --duration=10h --split --scatter {pgurl:1-6}"

Then, in another terminal I created an index:

root@localhost:26257/tpcc> CREATE INDEX ON item (i_price);
CREATE INDEX

Time: 2m26.368406171s

While creating that index, in a third window I pkilled cockroach:
ubuntu@ip-172-31-33-34:~$ pkill cockroach

Cockroach handled this well. I went to clean up the cluster to try a new test:

root@localhost:26257/tpcc> ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
CONFIGURE ZONE 1

Time: 20.736338ms

root@localhost:26257/tpcc> DROP INDEX item@item_i_price_idx;
DROP INDEX

Time: 494.814586ms

And then started the node I pkilled:
Andrews-MBP-2:~ andrewwoods$ roachprod start $CLUSTER:2

This is where I saw the large latency spike.

I then verified that node 2 was reconnected with:

Connection to 3.19.27.150 closed.
Andrews-MBP-2:~ andrewwoods$ roachprod sql $CLUSTER:2
Warning: Permanently added '3.19.27.150' (ECDSA) to the list of known hosts.
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
#
# Server version: CockroachDB CCL v19.1.0-beta.20190318-547-gac1ec6a (x86_64-unknown-linux-gnu, built 2019/04/01 17:15:23, go1.11.6) (same version as client)
# Cluster ID: 880796a4-277b-4b2e-8c26-bced066a1fda
# Organization: Cockroach Labs - Production Testing
#
# Enter \? for a brief introduction.
#
root@localhost:26257/defaultdb>

Environment:
v19.1.0-beta.20190318-547-gac1ec6a

@awoods187 awoods187 added the C-bug, A-kv-distribution, and S-2 labels Apr 1, 2019
@awoods187 awoods187 changed the title from "latency spike when starting a previously node killed using pkill" to "core: latency spike when starting a previously node killed using pkill" Apr 1, 2019
@awoods187 awoods187 changed the title from "core: latency spike when starting a previously node killed using pkill" to "core: Latency spike when starting a previously node killed using pkill" Apr 1, 2019
@awoods187
Contributor Author

Attached: cockroach.log

@awoods187
Contributor Author

At some point it figured out the node was now alive, but that didn't seem to help much:
[three screenshots]

@awoods187
Contributor Author

After stopping the load, pausing, then restarting it, the cluster returned to decent performance with latency under 1s:
[screenshot]

@awoods187
Contributor Author

The 10s latency came back when I got home:
[screenshot]

@tbg
Member

tbg commented Apr 2, 2019

@awoods187 what did your workload report through all of this? Did it also show a high p99 latency and/or a drop in qps?

I'm seeing indications in the logs that things weren't quite right when n2 came back. Just to double-check, you did not restart the cluster to get out of the problematic state, but only the workload? The problems that I'm seeing below don't look related to the workload at all, so perhaps restarting the workload coincided with something in the cluster "fixing itself". I would suspect something like #36413 but that would persist through a workload restart and would require a restart of the affected nodes instead.

ip-172-31-33-34> W190401 21:18:04.563209 1685 storage/replica_range_lease.go:951  [n2,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
ip-172-31-33-34> W190401 21:18:04.564888 1541 storage/replica_write.go:199  [n2,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

        https://yourhost:8080/#/reports/range/2

and the following Raft status: {"id":"3","term":9,"vote":"5","commit":19286,"lead":"5","raftState":"StateFollower","applied":19286,"progress":{},"leadtransferee":"0"}
ip-172-31-33-20> W190401 21:21:05.495628 20842079 storage/txnwait/txnqueue.go:498  [n6,s6,r195/1:/Table/58/1/36{29/3/…-66/2/…}] pusher 7eb956a2: have been waiting 60.00s for pushee 8a539f06
ip-172-31-33-34> W190401 21:26:05.443289 1725246 storage/replica_range_lease.go:951  [n2,replicate,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
ip-172-31-33-34> I190401 21:26:05.443323 1725246 storage/replica_range_lease.go:956  [n2,replicate,s2,r2/3:/System/NodeLiveness{-Max}] slow lease acquisition finished after 1m0.000093223s with error [NotLeaseHolderError] r2: replica (n2,s2):3 not lease holder; lease holder unknown after 1 attempts

@awoods187
Contributor Author

@tbg I was testing using TPC-C. It ran the whole time (although with increased latency).
I think I was wrong that things "fixed themselves." I only ran a short test before leaving to go home. When I got home and re-ran the test (thinking it was fixed, so I could test something else), I saw that the 10s latency came back fast (my last comment above).

I never restarted the cluster/nodes (except for the one node I pkilled). I only Ctrl-C'd the workload and then restarted it in the other terminal.

As far as I can tell, the cluster never fixed itself; it just had a brief moment in which it appeared to do so.

@tbg
Member

tbg commented Apr 2, 2019

Heh, wonderful:

ip-172-31-33-34> I190401 22:40:23.873611 1541 storage/replica_write.go:218 [n2,s2,r2/3:/System/NodeLiveness{-Max}] slow command RequestLease [/System/NodeLiveness,/Min) finished after 4999.31s with error result is ambiguous (unable to determine whether command was applied via snapshot)
ip-172-31-33-34> I190401 22:40:23.873873 1685 storage/replica_range_lease.go:956 [n2,s2,r2/3:/System/NodeLiveness{-Max}] slow lease acquisition finished after 1h23m19.310759549s with error after 1 attempts

The event triggering this is a Raft snapshot coming in from n4, which it looks like we had been waiting on for a long time.

The sequence of events suggests the following happened:

  1. n2 was taken down
  2. the other nodes started truncating their logs since n2 stopped being live
  3. n2 came back, now with lots of replicas needing raft snapshots
    $ cockroach debug merge-logs nodes/*/logs/*.log | grep -F 'sending Raft' | wc -l
    1395
    
  4. these take over an hour to go through
  5. while they go through, requests arriving at replicas on n2 waiting for a snapshot will get stuck waiting to see an updated lease (the local state won't update until the snapshot is applied)

This definitely explains the high p99 latencies.

It's tempting to somehow detect that and send requests to the leader more proactively, but I don't see a very good signal. The Raft group is a follower and doesn't know it needs a snapshot; that's something the leader decides when the follower rejects an append. The leader will still communicate an updated commit index via the (coalesced) heartbeats; we could try to use that as a signal that we're behind and probably shouldn't try to acquire the lease until we're caught up.

@tbg tbg added this to the 19.2 milestone Apr 2, 2019
@bdarnell
Contributor

bdarnell commented Apr 2, 2019

The leader will still communicate an updated commit index via the (coalesced) heartbeats

No, the commit index in the heartbeat will be capped at what the leader knows the follower has acked. We could add a separate "you're behind and I need to send you a snapshot" flag, but it would be separate from what Raft sends.

My suggestion for a heuristic would be that if we have a lease request that has been outstanding for more than base.NetworkTimeout, we don't wait for it and return NotLeaseHolderError (with no hint) immediately.
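
To make that heuristic concrete, here is a minimal sketch in Go. It is purely illustrative, not CockroachDB's actual lease code: the names (replica, pendingLeaseStarted, errNotLeaseHolder) are hypothetical stand-ins, and networkTimeout stands in for base.NetworkTimeout.

// Hypothetical sketch only; not the real storage package types.
package lease

import (
	"errors"
	"time"
)

// networkTimeout stands in for base.NetworkTimeout (~3s).
const networkTimeout = 3 * time.Second

// errNotLeaseHolder stands in for a NotLeaseHolderError with no hint; the
// client's retry layer would use it to try another replica.
var errNotLeaseHolder = errors.New("not lease holder; lease holder unknown")

type replica struct {
	// pendingLeaseStarted is when the current lease acquisition began;
	// zero if no acquisition is in flight.
	pendingLeaseStarted time.Time
}

// maybeFailFast returns an error for requests that would otherwise queue
// behind a lease acquisition that has already been outstanding longer than
// networkTimeout, instead of blocking until the acquisition resolves.
func (r *replica) maybeFailFast(now time.Time) error {
	if !r.pendingLeaseStarted.IsZero() && now.Sub(r.pendingLeaseStarted) > networkTimeout {
		return errNotLeaseHolder
	}
	return nil
}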

Longer term, this could also tie into the ErrProposalDropped work (#21849). If proposing becomes a round-trip RPC to the leader instead of a fire-and-forget Raft MsgProp, the leader could reject proposals coming from followers that aren't up to date.

@tbg
Member

tbg commented Apr 2, 2019

No, the commit index in the heartbeat will be capped at what the leader knows the follower has acked.

Ah, right.

My suggestion for a heuristic would be that if we have a lease request that has been outstanding for more than base.NetworkTimeout, we don't wait for it and return NotLeaseHolderError (with no hint) immediately.

I just worry that this heuristic doesn't get us far enough. We'll get random requests ending up at these replicas regularly, and can't afford to spend three seconds for each of them if we want to control p99. It's still a lot better than what we have today, though, so we should just do it to see where we land.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

To be clear, I'm not suggesting a 3s wait per request; I'm suggesting 3s per Replica (at which point we'd set a flag on the pendingLeaseRequest or something to fail fast). That's still not great, but it might be enough that we'd see a brief spike in p99 instead of something that is sustained until all the snapshots have caught up.

We may also want to reevaluate the aggressiveness of log truncation. Should a ~4 minute outage really be enough to cause this much snapshot activity?

@tbg
Member

tbg commented Apr 2, 2019

Log truncation got so aggressive because of the inclusion of log entries in snapshots. Now that this isn't an issue any more, we may want to dial it back significantly (but then we'd need to do work to make sure log truncations don't get too large; I think they're not chunked right now).

I'm suggesting 3s per Replica (at which point we'd set a flag on the pendingLeaseRequest or something to fail fast)

Ah, yeah, setting a flag seems like the way to go, but I'm worried that if we're not careful we can end up in a situation where everyone has the flag set and so nobody gets to do anything.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

Even when the flag is set, the lease request would keep going in the background. The flag would be cleared when it completes, so the node whose lease request completes would eventually be able to proceed. This would effectively move the blocking from redirectOnOrAcquireLeaderLease to the DistSender's retry loop.
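
A hedged sketch of that flag lifecycle, again with made-up names (pendingLeaseRequest here is just an illustrative struct, not the real one in the storage package); it only shows the set/check/clear flow described above.

// Illustrative only; field and method names are invented for this sketch.
package lease

import (
	"errors"
	"sync"
)

var errNotLeaseHolder = errors.New("not lease holder; lease holder unknown")

type pendingLeaseRequest struct {
	mu       sync.Mutex
	failFast bool // set once the in-flight lease request is deemed slow
}

// markSlow would be called once the acquisition has been outstanding past
// the ~3s threshold, so later requests fail fast instead of queueing.
func (p *pendingLeaseRequest) markSlow() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failFast = true
}

// errIfSlow lets incoming requests bail out immediately with a
// NotLeaseHolderError; the blocking effectively moves to the client's
// retry loop (DistSender), which can try other replicas.
func (p *pendingLeaseRequest) errIfSlow() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.failFast {
		return errNotLeaseHolder
	}
	return nil
}

// finish clears the flag when the background lease request completes, so a
// replica whose acquisition eventually succeeds can serve again.
func (p *pendingLeaseRequest) finish() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failFast = false
}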

@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@cindyzqtnew

Hi guys,

Are you still considering this improvement for a future release?

@tbg
Member

tbg commented Jul 29, 2021

Hi @cindyzqtnew, there isn't any particular action proposed by this issue. We are generally improving the behavior of CRDB around restarts. In fact, I believe we have already addressed the problem that was likely the root cause of the original post.

@cindyzqtnew

Hi @tbg, I've opened another thread (#68225) which I think is related to this. We're encountering high latency whenever a node is taken down; the root cause is that the range descriptor cache is not updated immediately, but instead waits for the 3s (gRPC) timeout and then tries the next replica until it finds a valid leaseholder, at which point the cache is updated. Is there any way to update the cache proactively without waiting for queries?
