
core: Latency spike when starting a node previously killed using pkill #36397

Closed
awoods187 opened this issue Apr 1, 2019 · 17 comments

Labels: A-kv-distribution (relating to rebalancing and leasing), C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), S-2 (medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors), no-issue-activity, X-stale
Milestone: 19.2

Comments

@awoods187
Contributor

awoods187 commented Apr 1, 2019

Describe the problem

After killing a node, I saw a sustained 10s latency spike when restarting the same node:
[screenshot: QPS and latency graph]
The first QPS dip is where I killed the node. The giant latency spike is when I started the node back up again.

To Reproduce

export CLUSTER=andy-schema
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=3800 --tolerate-errors --duration=10h --split --scatter {pgurl:1-6}"

Then, in another terminal I created an index:

root@localhost:26257/tpcc> CREATE INDEX ON item (i_price);
CREATE INDEX

Time: 2m26.368406171s

While creating that index, in a third window I pkilled cockroach:
ubuntu@ip-172-31-33-34:~$ pkill cockroach

Cockroach handled this well. I went to clean up the cluster to try a new test:

root@localhost:26257/tpcc> ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
CONFIGURE ZONE 1

Time: 20.736338ms

root@localhost:26257/tpcc> DROP INDEX item@item_i_price_idx;
DROP INDEX

Time: 494.814586ms

And then started the node I pkilled:
Andrews-MBP-2:~ andrewwoods$ roachprod start $CLUSTER:2

This is where I saw the large latency spike.

I then verified that node 2 was reconnected with:

Connection to 3.19.27.150 closed.
Andrews-MBP-2:~ andrewwoods$ roachprod sql $CLUSTER:2
Warning: Permanently added '3.19.27.150' (ECDSA) to the list of known hosts.
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
#
# Server version: CockroachDB CCL v19.1.0-beta.20190318-547-gac1ec6a (x86_64-unknown-linux-gnu, built 2019/04/01 17:15:23, go1.11.6) (same version as client)
# Cluster ID: 880796a4-277b-4b2e-8c26-bced066a1fda
# Organization: Cockroach Labs - Production Testing
#
# Enter \? for a brief introduction.
#
root@localhost:26257/defaultdb>

Environment:
v19.1.0-beta.20190318-547-gac1ec6a

@awoods187 awoods187 added the C-bug, A-kv-distribution, and S-2 labels Apr 1, 2019
@awoods187 awoods187 changed the title from "latency spike when starting a previously node killed using pkill" to "core: latency spike when starting a previously node killed using pkill" Apr 1, 2019
@awoods187 awoods187 changed the title from "core: latency spike when starting a previously node killed using pkill" to "core: Latency spike when starting a previously node killed using pkill" Apr 1, 2019
@awoods187
Contributor Author

Attached: cockroach.log

@awoods187
Contributor Author

At some point it figured out the node was now alive, but that didn't seem to help much:
[three screenshots]

@awoods187
Contributor Author

After stopping the load, pausing, then restarting it, the cluster returned to decent performance with latency under 1s:
[screenshot]

@awoods187
Contributor Author

The 10s latency came back when I got home:
[screenshot]

@tbg
Member

tbg commented Apr 2, 2019

@awoods187 what did your workload report through all of this? Did it also show a high p99 latency and/or a drop in qps?

I'm seeing indications in the logs that things weren't quite right when n2 came back. Just to double-check, you did not restart the cluster to get out of the problematic state, but only the workload? The problems that I'm seeing below don't look related to the workload at all, so perhaps restarting the workload coincided with something in the cluster "fixing itself". I would suspect something like #36413 but that would persist through a workload restart and would require a restart of the affected nodes instead.

ip-172-31-33-34> W190401 21:18:04.563209 1685 storage/replica_range_lease.go:951  [n2,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
ip-172-31-33-34> W190401 21:18:04.564888 1541 storage/replica_write.go:199  [n2,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

        https://yourhost:8080/#/reports/range/2

and the following Raft status: {"id":"3","term":9,"vote":"5","commit":19286,"lead":"5","raftState":"StateFollower","applied":19286,"progress":{},"leadtransferee":"0"}
ip-172-31-33-20> W190401 21:21:05.495628 20842079 storage/txnwait/txnqueue.go:498  [n6,s6,r195/1:/Table/58/1/36{29/3/…-66/2/…}] pusher 7eb956a2: have been waiting 60.00s for pushee 8a539f06
ip-172-31-33-34> W190401 21:26:05.443289 1725246 storage/replica_range_lease.go:951  [n2,replicate,s2,r2/3:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
ip-172-31-33-34> I190401 21:26:05.443323 1725246 storage/replica_range_lease.go:956  [n2,replicate,s2,r2/3:/System/NodeLiveness{-Max}] slow lease acquisition finished after 1m0.000093223s with error [NotLeaseHolderError] r2: replica (n2,s2):3 not lease holder; lease holder unknown after 1 attempts

@awoods187
Contributor Author

@tbg I was testing using TPC-C. It ran the whole time (although with increased latency).
I think I was wrong that things "fixed themselves." I only ran a short test before leaving to go home. When I got home and re-ran the test (thinking it was fixed, so I could test something else), I saw that the 10s latency came back fast (my last comment above).

I never restarted the cluster/nodes (except for the one node I pkilled). I only Ctrl-C'd the workload and then restarted it in the other terminal.

As far as I can tell, the cluster never fixed itself; it just had a brief moment in which it appeared to do so.

@tbg
Member

tbg commented Apr 2, 2019

Heh, wonderful:

ip-172-31-33-34> I190401 22:40:23.873611 1541 storage/replica_write.go:218 [n2,s2,r2/3:/System/NodeLiveness{-Max}] slow command RequestLease [/System/NodeLiveness,/Min) finished after 4999.31s with error result is ambiguous (unable to determine whether command was applied via snapshot)
ip-172-31-33-34> I190401 22:40:23.873873 1685 storage/replica_range_lease.go:956 [n2,s2,r2/3:/System/NodeLiveness{-Max}] slow lease acquisition finished after 1h23m19.310759549s with error after 1 attempts

The event triggering this is a Raft snapshot coming in from n4, which it looks like we had been waiting on for a long time.

The sequence of events suggests the following happened:

  1. n2 was taken down
  2. the other nodes started truncating their logs since n2 stopped being live
  3. n2 came back, now with lots of replicas needing raft snapshots
    $ cockroach debug merge-logs nodes/*/logs/*.log | grep -F 'sending Raft' | wc -l
    1395
    
  4. these take over an hour to go through
  5. while they go through, requests arriving at replicas on n2 waiting for a snapshot will get stuck waiting to see an updated lease (the local state won't update until the snapshot is applied)

This definitely explains the high p99 latencies.

It's tempting to somehow detect that and send requests to the leader more proactively, but I don't see a very good signal. The Raft group is a follower and doesn't know it needs a snapshot; that's something the leader decides when the follower rejects an append. The leader will still communicate an updated commit index via the (coalesced) heartbeats; we could try to use that as a signal that we're behind and probably shouldn't try to acquire the lease until we're caught up.

@tbg tbg added this to the 19.2 milestone Apr 2, 2019
@bdarnell
Contributor

bdarnell commented Apr 2, 2019

The leader will still communicate an updated commit index via the (coalesced) heartbeats

No, the commit index in the heartbeat will be capped at what the leader knows the follower has acked. We could add a separate "you're behind and I need to send you a snapshot" flag, but it would be separate from what Raft sends.

My suggestion for a heuristic would be that if we have a lease request that has been outstanding for more than base.NetworkTimeout, we don't wait for it and return NotLeaseHolderError (with no hint) immediately.
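
To make that heuristic concrete, here is a minimal sketch in Go. It is purely illustrative, not CockroachDB's actual lease code: the names (replica, pendingLeaseStarted, errNotLeaseHolder) are hypothetical stand-ins, and networkTimeout stands in for base.NetworkTimeout.

// Hypothetical sketch only; not the real storage package types.
package lease

import (
	"errors"
	"time"
)

// networkTimeout stands in for base.NetworkTimeout (~3s).
const networkTimeout = 3 * time.Second

// errNotLeaseHolder stands in for a NotLeaseHolderError with no hint; the
// client's retry layer would use it to try another replica.
var errNotLeaseHolder = errors.New("not lease holder; lease holder unknown")

type replica struct {
	// pendingLeaseStarted is when the current lease acquisition began;
	// zero if no acquisition is in flight.
	pendingLeaseStarted time.Time
}

// maybeFailFast returns an error for requests that would otherwise queue
// behind a lease acquisition that has already been outstanding longer than
// networkTimeout, instead of blocking until the acquisition resolves.
func (r *replica) maybeFailFast(now time.Time) error {
	if !r.pendingLeaseStarted.IsZero() && now.Sub(r.pendingLeaseStarted) > networkTimeout {
		return errNotLeaseHolder
	}
	return nil
}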

Longer term, this could also tie into the ErrProposalDropped work (#21849). If proposing becomes a round-trip RPC to the leader instead of a fire-and-forget Raft MsgProp, the leader could reject proposals coming from followers that aren't up to date.

@tbg
Member

tbg commented Apr 2, 2019

No, the commit index in the heartbeat will be capped at what the leader knows the follower has acked.

Ah, right.

My suggestion for a heuristic would be that if we have a lease request that has been outstanding for more than base.NetworkTimeout, we don't wait for it and return NotLeaseHolderError (with no hint) immediately.

I just worry that this heuristic doesn't get us far enough. We'll get random requests ending up at these replicas regularly, and can't afford to spend three seconds for each of them if we want to control p99. It's still a lot better than what we have today, though, so we should just do it to see where we land.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

To be clear, I'm not suggesting a 3s wait per request; I'm suggesting 3s per Replica (at which point we'd set a flag on the pendingLeaseRequest or something to fail fast). That's still not great, but it might be enough that we'd see a brief spike in p99 instead of something that is sustained until all the snapshots have caught up.

We may also want to reevaluate the aggressiveness of log truncation. Should a ~4 minute outage really be enough to cause this much snapshot activity?

@tbg
Member

tbg commented Apr 2, 2019

Log truncation got so aggressive because of the inclusion of log entries in snapshots. Now that this isn't an issue any more, we may want to dial it back significantly (but then we'd need to do work to make sure log truncations don't get too large; I think they're not chunked right now).

I'm suggesting 3s per Replica (at which point we'd set a flag on the pendingLeaseRequest or something to fail fast)

Ah, yeah, setting a flag seems like the way to go, but I'm worried that if we're not careful we can end up in a situation where everyone has the flag set and so nobody gets to do anything.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

Even when the flag is set, the lease request would keep going in the background. The flag would be cleared when it completes, so the node whose lease request completes would eventually be able to proceed. This would effectively move the blocking from redirectOnOrAcquireLeaderLease to the DistSender's retry loop.
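
A hedged sketch of that flag lifecycle, again with made-up names (pendingLeaseRequest here is just an illustrative struct, not the real one in the storage package); it only shows the set/check/clear flow described above.

// Illustrative only; field and method names are invented for this sketch.
package lease

import (
	"errors"
	"sync"
)

var errNotLeaseHolder = errors.New("not lease holder; lease holder unknown")

type pendingLeaseRequest struct {
	mu       sync.Mutex
	failFast bool // set once the in-flight lease request is deemed slow
}

// markSlow would be called once the acquisition has been outstanding past
// the ~3s threshold, so later requests fail fast instead of queueing.
func (p *pendingLeaseRequest) markSlow() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failFast = true
}

// errIfSlow lets incoming requests bail out immediately with a
// NotLeaseHolderError; the blocking effectively moves to the client's
// retry loop (DistSender), which can try other replicas.
func (p *pendingLeaseRequest) errIfSlow() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.failFast {
		return errNotLeaseHolder
	}
	return nil
}

// finish clears the flag when the background lease request completes, so a
// replica whose acquisition eventually succeeds can serve again.
func (p *pendingLeaseRequest) finish() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failFast = false
}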

@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@cindyzqtnew

Hi guys,

Are you still considering this improvement for a future release?

@tbg
Member

tbg commented Jul 29, 2021

Hi @cindyzqtnew, there isn't any particular action proposed by this issue. We are generally improving the behavior of CRDB around restarts. In fact, I believe we have already addressed the problem that was likely the root cause of the original post.

@cindyzqtnew

Hi @tbg, I've opened another thread (#68225) which I think is related to this. We're encountering high latency whenever a node is taken down; the root cause is that the range descriptor cache is not updated immediately, but instead waits for the 3s (gRPC) timeout and then tries the next replica until it finds a valid leaseholder, at which point the cache is updated. Is there any way to update the cache proactively without waiting for queries?
