
stability: tail of under-replicated ranges lasts too long #11984

Closed

petermattis opened this issue Dec 3, 2016 · 17 comments

@petermattis
Collaborator

[screenshot: 2016-12-03 1:53:09 PM]

This graph shows the under-replicated ranges for a 6m chaos event. The tail lasts for almost exactly 10m, which is suspiciously close to the replica scanner interval. Regardless, we need to figure out what is preventing those under-replicated ranges from catching up more quickly.
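
For reference, the scanner paces its walk over all of a store's replicas across the target interval, so in the worst case a replica isn't even looked at until ~10m after the node comes back. A minimal sketch of that pacing (simplified; the names here are illustrative, not the actual storage code):

```go
package main

import (
	"fmt"
	"time"
)

// replica is a stand-in for a store replica; just an ID for this sketch.
type replica struct{ id int }

// scanLoop visits every replica once per targetInterval, sleeping between
// visits so a full pass is spread evenly across the interval. Each visited
// replica is handed to enqueue (in the real system the replicate queue then
// decides whether the range is under-replicated). Simplified sketch only.
func scanLoop(replicas []replica, targetInterval time.Duration, enqueue func(replica)) {
	pause := targetInterval / time.Duration(len(replicas))
	for {
		for _, r := range replicas {
			enqueue(r)
			time.Sleep(pause)
		}
	}
}

func main() {
	rs := make([]replica, 10)
	for i := range rs {
		rs[i] = replica{id: i}
	}
	// The interval under discussion is 10m; 10s here just to see output quickly.
	go scanLoop(rs, 10*time.Second, func(r replica) { fmt.Println("visiting replica", r.id) })
	time.Sleep(3 * time.Second)
}
```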

@petermattis petermattis self-assigned this Dec 3, 2016
@spencerkimball
Member

We had talked in the past about having nodes listen for changes in other nodes' liveness and visit the affected replicas to add them to the replicate queue as necessary. Something to keep in mind if we discover it's just the 10 minutes it takes for the scanner to discover all under-replicated ranges.
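
A rough sketch of that idea, assuming a liveness-change callback and a queue with a MaybeAdd method (both hypothetical names, not the actual CockroachDB APIs):

```go
// Sketch of reacting to liveness changes instead of waiting for the scanner:
// when a node's liveness flips, walk the local replicas and enqueue any range
// that has a replica on that node. Types and method names are illustrative.
package livenesswatch

type nodeID int

type rangeDesc struct {
	rangeID int
	nodes   []nodeID // nodes holding a replica of this range
}

type replicateQueue interface {
	MaybeAdd(rangeID int) // the queue decides whether any action is needed
}

func onLivenessChange(changed nodeID, localRanges []rangeDesc, q replicateQueue) {
	for _, r := range localRanges {
		for _, n := range r.nodes {
			if n == changed {
				q.MaybeAdd(r.rangeID)
				break
			}
		}
	}
}
```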

@petermattis
Collaborator Author

That would be useful if the ranges were idle, but given the load it's hard to imagine it would help here.

@petermattis
Collaborator Author

[screenshot: 2016-12-03 9:28:32 PM]

Here's a different 6m chaos event. The under-replicated metric is slightly different here in that the behindThreshold is 100 Raft log entries. I'm not sure whether that accounts for the different shape of the curve. It's very curious that this isn't just a steady drop; it suggests the recovering node experienced some sort of event that caused it to fall behind again.
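
For clarity, the metric as I understand it counts a range as behind when some follower trails the leader's last log index by more than behindThreshold. Roughly (simplified sketch, not the actual metric code):

```go
// Simplified sketch of the behindThreshold check: a range counts toward the
// under-replicated metric if any follower's match index trails the leader's
// last index by more than the threshold.
package raftmetrics

const behindThreshold = 100 // raft log entries; dropped to 10 in the experiment below

func rangeIsBehind(leaderLastIndex uint64, followerMatch []uint64) bool {
	for _, m := range followerMatch {
		if leaderLastIndex > m && leaderLastIndex-m > behindThreshold {
			return true
		}
	}
	return false
}
```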

@petermattis
Collaborator Author

I performed two chaos events just now to see what impact block_writer load has on node recovery. Both graphs show a 45m time period. I built a custom binary where behindThreshold is 10 Raft log entries to make the under-replicated metric more sensitive. The first graph shows node recovery with block_writer load continuing to run against the cluster:

[screenshot: 2017-01-09 2:36:06 PM]

The next graph shows node recovery with block_writer stopped when the down node is restarted (6m after the chaos event started):

[screenshot: 2017-01-09 2:36:48 PM]

In the first chaos event the cluster took ~40m to recover all of the replicas; in the second it took ~3m.

See also #12485 and #8659 which propose mechanisms for improving the behavior of the system during node recovery.

@petermattis
Collaborator Author

The graph below shows a chaos event where block_writer is stopped after the node restarts. The "raft log behind" metric is the number of raft log messages that followers on live replicas are behind.

[screenshot: 2017-01-09 4:26:22 PM]
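
In other words, the metric is roughly the following aggregation over followers on live nodes (illustrative sketch only, not the actual metric code):

```go
// Sketch of the "raft log behind" aggregation: sum, across followers on live
// nodes, of how many log entries each one trails the leader by.
package raftmetrics

func raftLogBehind(leaderLastIndex uint64, liveFollowerMatch []uint64) uint64 {
	var total uint64
	for _, m := range liveFollowerMatch {
		if leaderLastIndex > m {
			total += leaderLastIndex - m
		}
	}
	return total
}
```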

@spencerkimball
Member

Somehow a random restart of a node at 4m30s caused a particularly impressive restart profile, shown as the first recovery in the attached graph at 22:20.

This makes me suspect a bad interaction with the recovery activity that starts at the 5m replica-is-dead threshold. I tried again, this time purposefully restarting at 5m instead of 6m. That graph starts at 22:41.

I've now updated the time until dead (COCKROACH_TIME_UNTIL_STORE_DEAD=10m) and ran the same chaos recovery step with the node dead for each of {1,2,3,4,5,6,7,8,9,10}m, with 10m intervals between runs. I'll post the graph when it's done.

[screenshot: 2017-01-09 5:58:45 PM]
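
(For anyone reproducing this: the override is presumably just read from the environment at startup; a sketch of what that might look like, assuming a 5m default as discussed in this thread, is below. Not the actual settings code.)

```go
// Sketch of reading the time-until-store-dead override from the environment,
// assuming a 5m default. Illustrative only.
package settings

import (
	"os"
	"time"
)

func timeUntilStoreDead() time.Duration {
	const defaultDead = 5 * time.Minute
	if v := os.Getenv("COCKROACH_TIME_UNTIL_STORE_DEAD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultDead
}
```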

@petermattis
Collaborator Author

Replica removal is fairly expensive and we allow it to proceed as fast as possible when driven by the replica GC queue. Perhaps we need to throttle it. I'm interested to see what the graphs look like for 6-10m outages with that time-until-store-dead setting.
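
A minimal sketch of what that throttling could look like, i.e. enforcing a minimum delay between queue operations (purely illustrative, not the actual queue code):

```go
// Sketch of pacing a queue's processing loop so expensive operations (like
// replica removal) can't run back to back and saturate the store. The work
// channel and minInterval are illustrative stand-ins.
package queuepacing

import (
	"context"
	"time"
)

func processPaced(ctx context.Context, work <-chan func(), minInterval time.Duration) {
	ticker := time.NewTicker(minInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case fn := <-work:
			fn()
			// Wait out the rest of the interval before taking the next item.
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}
}
```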

@spencerkimball
Member

spencerkimball commented Jan 10, 2017

The graph below is with recoveries not starting until 10m of dead time, with chaos events starting at 1m and increasing in 1-minute increments up to 10m. I think it's fairly clear from this experiment that the issue isn't with replaying logs or doing raft-supplied snapshots while load is ongoing, but instead with our concurrent replicate_queue "recovery" activity, which is initiated for only 1m of the 6m chaos event (re: the graph 4 graphs above this comment). Once the missing node comes back, we ought to be stopping our concurrent snapshot recoveries via the replicate_queue, so why the horrific recovery profile when we have 5m for time-until-dead?

[screenshot: 2017-01-10 1:52:51 AM]

@spencerkimball
Member

Just pushed a build with the replica-gc queue disabled and time-until-dead set back to 5m. Going to run a 6m chaos event to verify that replica GC is what's causing the recovery badness.

@petermattis
Collaborator Author

It might be interesting to see the opposite experiment: setting COCKROACH_TIME_UNTIL_STORE_DEAD=1m and trying a 2m node outage. I'm missing some understanding here about what is causing the long node recovery and spiky throughput. Clearly it isn't Raft log replays, and it doesn't seem to be Raft snapshots. It's curious that node recovery proceeds quickly if load is removed from the system and that it also proceeds quickly (and without affecting load) if there is no rebalancing activity.

@spencerkimball
Member

Looks like the replica GC activity is what's most affecting the recovery profile. Here's a graph showing the activity during a 6m chaos event with the replica-gc queue disabled:

[screenshot: 2017-01-10 9:28:26 AM]

But with the replicate queue active, performance dips (unnecessarily badly, IMO). The top graph shows performance during the 6m chaos event with both the replicate queue and the replica-gc queue off. The bottom graph shows performance during the 6m chaos event with the replicate queue on and the replica-gc queue off.

[screenshot: 2017-01-10 9:25:00 AM]

[screenshot: 2017-01-10 9:25:51 AM]

@petermattis
Collaborator Author

Here are graphs from the experiment where we set COCKROACH_TIME_UNTIL_STORE_DEAD=1m and took a node down for 2m:

[screenshot: 2017-01-11 10:55:50 AM]

It's difficult to see in the above graph, but the count of under-replicated ranges dropped to 319 when the node restarted and took until 15:25 to drop to 0 (19m).

[screenshot: 2017-01-11 10:56:01 AM]

Performance doesn't fully recover until 15:50 (~40m after the node outage).

[screenshot: 2017-01-11 10:56:13 AM]

[screenshot: 2017-01-11 10:56:30 AM]

The replica GC queue and replicate queue are busy until shortly before 15:25.

[screenshot: 2017-01-11 10:56:38 AM]

And then we see the raft log queue kick in, trimming raft log entries.

@petermattis
Collaborator Author

[screenshot: 2017-01-11 4:21:04 PM]

The above shows a new raftlog.truncated metric vs the raftlog.behind metric. The spike in the truncated metric rose to 103k. Currently, the raft log queue truncates as many entries as possible with no regard for throttling. The second event is running a binary with a small patch to truncate a maximum of 200 entries at a time. Other than smoothing out the truncated metric, this doesn't seem to have provided any improvement.
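
The patch amounts to capping the truncation index, roughly like this (sketch only; names are illustrative, not the actual raft log queue code):

```go
// Sketch of capping a single raft log truncation at a fixed number of
// entries, as in the patch described above.
package rafttrunc

const maxTruncatedEntries = 200

// truncationIndex limits how far a single truncation advances past the
// current first index of the raft log.
func truncationIndex(firstIndex, desiredIndex uint64) uint64 {
	if desiredIndex > firstIndex+maxTruncatedEntries {
		return firstIndex + maxTruncatedEntries
	}
	return desiredIndex
}
```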

@spencerkimball
Member

Two things:

  • Add a metric to the same graph for keys GC'd
  • Remember to save the Grafana graph definitions!

@petermattis
Collaborator Author

[screenshot: 2017-01-12 1:30:27 PM]

I added a 5s delay between operations in the replica GC queue and this seemed to improve the behavior on node recovery. I also added some logging to the replica destroy-data operation:

I170112 18:27:48.790781 127 storage/replica.go:708  [replicaGC,n6,s6,r3285/5:/Table/51/1/336{18325…-29601…},@c42135c300] removed 401243 keys in 1956ms [clear=1665ms commit=289ms]
I170112 18:27:55.971792 127 storage/replica.go:708  [replicaGC,n6,s6,r2147/5:/Table/51/1/386{80605…-91822…},@c4213cec00] removed 398027 keys in 2152ms [clear=1873ms commit=279ms]
I170112 18:28:02.629724 127 storage/replica.go:708  [replicaGC,n6,s6,r1757/1:/Table/51/1/104{22793…-34099…},@c420a52f00] removed 402055 keys in 1426ms [clear=1231ms commit=191ms]
I170112 18:28:09.359959 127 storage/replica.go:708  [replicaGC,n6,s6,r3766/1:/Table/51/1/187{59903…-71116…},@c42115ef00] removed 399454 keys in 1720ms [clear=1372ms commit=345ms]
I170112 18:28:17.106822 127 storage/replica.go:708  [replicaGC,n6,s6,r6052/1:/Table/51/1/782{33496…-44791…},@c421444900] removed 401921 keys in 2742ms [clear=2538ms commit=204ms]
I170112 18:28:24.741533 127 storage/replica.go:708  [replicaGC,n6,s6,r4360/1:/Table/51/1/561{21605…-32822…},@c4215a2000] removed 399191 keys in 2627ms [clear=2346ms commit=281ms]

Above we see a half-dozen remove operations, each removing ~400k keys and taking 2-3s. It is very interesting that the bulk of the time is spent in the clear phase, i.e. generating the deletion batch. Why is the iteration so slow? Or perhaps it is something else.
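
For reference, the clear phase is essentially iterating the replica's span and adding one delete per key to a write batch; a simplified sketch (the iterator and batch interfaces are stand-ins, not the actual engine API):

```go
// Sketch of the per-key clear: one tombstone per key, accumulated into a
// batch and then committed. With ~400k keys per replica, the iteration and
// per-key deletes dominate the 2-3s seen in the log lines above.
package replicagc

type iterator interface {
	Valid() bool
	Next()
	Key() []byte
}

type batch interface {
	Delete(key []byte) error
	Commit() error
}

func clearRange(it iterator, b batch) (int, error) {
	removed := 0
	for ; it.Valid(); it.Next() {
		if err := b.Delete(it.Key()); err != nil {
			return removed, err
		}
		removed++
	}
	return removed, b.Commit()
}
```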

@petermattis
Collaborator Author

RocksDB 5.0.1 has been released with support for DeleteRange. I'm going to experiment with this when I get a chance.

@petermattis
Collaborator Author

With #12913, which utilizes the new RocksDB DeleteRange operation, replica recovery proceeds smoothly after node recovery.

[screenshot: 2017-01-23 4:05:34 PM]
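
For comparison with the per-key sketch above, a range deletion writes a single tombstone covering the replica's whole span, roughly as follows (again with a stand-in batch interface, not the actual engine API):

```go
// Sketch of the range-tombstone approach enabled by RocksDB's DeleteRange:
// a single tombstone covering [start, end) replaces hundreds of thousands
// of point deletes.
package replicagc

type rangeBatch interface {
	DeleteRange(start, end []byte) error
	Commit() error
}

func clearRangeFast(b rangeBatch, start, end []byte) error {
	if err := b.DeleteRange(start, end); err != nil {
		return err
	}
	return b.Commit()
}
```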
