
stability: tail of under-replicated ranges lasts too long #11984

Closed

petermattis opened this issue Dec 3, 2016 · 17 comments

@petermattis
Collaborator

[screenshot: 2016-12-03 1:53:09 PM]

This graph shows the under-replicated ranges for a 6m chaos event. The tail lasts for almost exactly 10m, which is suspiciously close to the replica scanner interval. Regardless, we need to figure out what is preventing those under-replicated ranges from catching up more quickly.
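
For reference, the scanner paces its walk over all of a store's replicas across the target interval, so in the worst case a replica isn't even looked at until ~10m after the node comes back. A minimal sketch of that pacing (simplified; the names here are illustrative, not the actual storage code):

```go
package main

import (
	"fmt"
	"time"
)

// replica is a stand-in for a store replica; just an ID for this sketch.
type replica struct{ id int }

// scanLoop visits every replica once per targetInterval, sleeping between
// visits so a full pass is spread evenly across the interval. Each visited
// replica is handed to enqueue (in the real system the replicate queue then
// decides whether the range is under-replicated). Simplified sketch only.
func scanLoop(replicas []replica, targetInterval time.Duration, enqueue func(replica)) {
	pause := targetInterval / time.Duration(len(replicas))
	for {
		for _, r := range replicas {
			enqueue(r)
			time.Sleep(pause)
		}
	}
}

func main() {
	rs := make([]replica, 10)
	for i := range rs {
		rs[i] = replica{id: i}
	}
	// The interval under discussion is 10m; 10s here just to see output quickly.
	go scanLoop(rs, 10*time.Second, func(r replica) { fmt.Println("visiting replica", r.id) })
	time.Sleep(3 * time.Second)
}
```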

@petermattis petermattis self-assigned this Dec 3, 2016
@spencerkimball
Member

We had talked in the past about having nodes listen for changes in other nodes' liveness and visit the affected replicas to add them to the replicate queue as necessary. Something to keep in mind if we discover it's just the 10 minutes it takes for the scanner to discover all under-replicated ranges.
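
A rough sketch of that idea, assuming a liveness-change callback and a queue with a MaybeAdd method (both hypothetical names, not the actual CockroachDB APIs):

```go
// Sketch of reacting to liveness changes instead of waiting for the scanner:
// when a node's liveness flips, walk the local replicas and enqueue any range
// that has a replica on that node. Types and method names are illustrative.
package livenesswatch

type nodeID int

type rangeDesc struct {
	rangeID int
	nodes   []nodeID // nodes holding a replica of this range
}

type replicateQueue interface {
	MaybeAdd(rangeID int) // the queue decides whether any action is needed
}

func onLivenessChange(changed nodeID, localRanges []rangeDesc, q replicateQueue) {
	for _, r := range localRanges {
		for _, n := range r.nodes {
			if n == changed {
				q.MaybeAdd(r.rangeID)
				break
			}
		}
	}
}
```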

@petermattis
Collaborator Author

That would be useful if the ranges were idle, but given the load it's hard to imagine it would help here.

@petermattis
Collaborator Author

[screenshot: 2016-12-03 9:28:32 PM]

Here's a different 6m chaos event. The under-replicated metric is slightly different here in that the behindThreshold is 100 Raft log entries. I'm not sure whether that accounts for the different shape of the curve. It's very curious that this isn't just a steady drop; it suggests the recovering node experienced some sort of event that caused it to fall behind again.
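
For clarity, the metric as I understand it counts a range as behind when some follower trails the leader's last log index by more than behindThreshold. Roughly (simplified sketch, not the actual metric code):

```go
// Simplified sketch of the behindThreshold check: a range counts toward the
// under-replicated metric if any follower's match index trails the leader's
// last index by more than the threshold.
package raftmetrics

const behindThreshold = 100 // raft log entries; dropped to 10 in the experiment below

func rangeIsBehind(leaderLastIndex uint64, followerMatch []uint64) bool {
	for _, m := range followerMatch {
		if leaderLastIndex > m && leaderLastIndex-m > behindThreshold {
			return true
		}
	}
	return false
}
```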

@petermattis
Collaborator Author

I performed two chaos events just now to see what impact block_writer load has on node recovery. Both graphs show a 45m time period. I built a custom binary where behindThreshold is 10 Raft log entries to make the under-replicated metric more sensitive. The first graph shows node recovery with block_writer load continuing to run against the cluster:

[screenshot: 2017-01-09 2:36:06 PM]

The next graph shows node recovery with block_writer stopped when the down node is restarted (6m after the chaos event started):

[screenshot: 2017-01-09 2:36:48 PM]

In the first chaos event the cluster took ~40m to recover all of the replicas; in the second it took ~3m.

See also #12485 and #8659 which propose mechanisms for improving the behavior of the system during node recovery.

@petermattis
Collaborator Author

The graph below shows a chaos event where block_writer is stopped after the node restarts. The "raft log behind" metric is the number of raft log messages that followers on live replicas are behind.

[screenshot: 2017-01-09 4:26:22 PM]
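
In other words, the metric is roughly the following aggregation over followers on live nodes (illustrative sketch only, not the actual metric code):

```go
// Sketch of the "raft log behind" aggregation: sum, across followers on live
// nodes, of how many log entries each one trails the leader by.
package raftmetrics

func raftLogBehind(leaderLastIndex uint64, liveFollowerMatch []uint64) uint64 {
	var total uint64
	for _, m := range liveFollowerMatch {
		if leaderLastIndex > m {
			total += leaderLastIndex - m
		}
	}
	return total
}
```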

@spencerkimball
Member

Somehow a random restart of a node at 4m30s caused a particularly impressive restart profile, shown as the first recovery in the attached graph at 22:20.

This makes me suspect a bad interaction with the recovery activity that starts at the 5m replica-is-dead threshold. I tried again, this time purposefully restarting at 5m instead of 6m. That graph starts at 22:41.

I've now updated the time until dead (COCKROACH_TIME_UNTIL_STORE_DEAD=10m) and ran the same chaos recovery step with the node dead for each of {1,2,3,4,5,6,7,8,9,10}m, with 10m intervals between runs. I'll post the graph when it's done.

[screenshot: 2017-01-09 5:58:45 PM]
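
(For anyone reproducing this: the override is presumably just read from the environment at startup; a sketch of what that might look like, assuming a 5m default as discussed in this thread, is below. Not the actual settings code.)

```go
// Sketch of reading the time-until-store-dead override from the environment,
// assuming a 5m default. Illustrative only.
package settings

import (
	"os"
	"time"
)

func timeUntilStoreDead() time.Duration {
	const defaultDead = 5 * time.Minute
	if v := os.Getenv("COCKROACH_TIME_UNTIL_STORE_DEAD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultDead
}
```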

@petermattis
Collaborator Author

Replica removal is fairly expensive and we allow it to proceed as fast as possible when driven by the replica GC queue. Perhaps we need to throttle it. I'm interested to see what the graphs look like for 6-10m outages with that time-until-store-dead setting.
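
A minimal sketch of what that throttling could look like, i.e. enforcing a minimum delay between queue operations (purely illustrative, not the actual queue code):

```go
// Sketch of pacing a queue's processing loop so expensive operations (like
// replica removal) can't run back to back and saturate the store. The work
// channel and minInterval are illustrative stand-ins.
package queuepacing

import (
	"context"
	"time"
)

func processPaced(ctx context.Context, work <-chan func(), minInterval time.Duration) {
	ticker := time.NewTicker(minInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case fn := <-work:
			fn()
			// Wait out the rest of the interval before taking the next item.
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}
}
```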

@spencerkimball
Member

spencerkimball commented Jan 10, 2017

The graph below is with recoveries not starting until 10m of dead time, with chaos events starting at 1m and increasing in 1-minute increments up to 10m. I think it's fairly clear from this experiment that the issue isn't with replaying logs or doing raft-supplied snapshots while load is ongoing, but instead with our concurrent replicate_queue "recovery" activity, which is initiated for only 1m of the 6m chaos event (re: the graph 4 graphs above this comment). Once the missing node comes back, we ought to be stopping our concurrent snapshot recoveries via the replicate_queue, so why the horrific recovery profile when we have 5m for time-until-dead?

[screenshot: 2017-01-10 1:52:51 AM]

@spencerkimball
Member

Just pushed a build with the replica-gc queue disabled and time-until-dead set back to 5m. Going to run a 6m chaos event to verify that replica GC is what's causing the recovery badness.

@petermattis
Collaborator Author

It might be interesting to see the opposite experiment: setting COCKROACH_TIME_UNTIL_STORE_DEAD=1m and trying a 2m node outage. I'm missing some understanding here about what is causing the long node recovery and spiky throughput. Clearly it isn't Raft log replays, and it doesn't seem to be Raft snapshots. It's curious that node recovery proceeds quickly if load is removed from the system and that it also proceeds quickly (and without affecting load) if there is no rebalancing activity.

@spencerkimball
Member

Looks like the replica GC activity is what's most affecting the recovery profile. Here's a graph showing the activity during a 6m chaos event with the replica-gc queue disabled:

[screenshot: 2017-01-10 9:28:26 AM]

But with the replicate queue active, performance dips (unnecessarily badly, IMO). The top graph shows performance during the 6m chaos event with both the replicate queue and the replica-gc queue off. The bottom graph shows performance during the 6m chaos event with the replicate queue on and the replica-gc queue off.

[screenshot: 2017-01-10 9:25:00 AM]

[screenshot: 2017-01-10 9:25:51 AM]

@petermattis
Collaborator Author

Here are graphs from the experiment where we set COCKROACH_TIME_UNTIL_STORE_DEAD=1m and took a node down for 2m:

[screenshot: 2017-01-11 10:55:50 AM]

It's difficult to see in the above graph, but the count of under-replicated ranges dropped to 319 when the node restarted and took until 15:25 to drop to 0 (19m).

[screenshot: 2017-01-11 10:56:01 AM]

Performance doesn't fully recover until 15:50 (~40m after the node outage).

[screenshot: 2017-01-11 10:56:13 AM]

[screenshot: 2017-01-11 10:56:30 AM]

The replica GC queue and replicate queue are busy until shortly before 15:25.

[screenshot: 2017-01-11 10:56:38 AM]

And then we see the raft log queue kick in, trimming raft log entries.

@petermattis
Collaborator Author

[screenshot: 2017-01-11 4:21:04 PM]

The above shows a new raftlog.truncated metric vs the raftlog.behind metric. The spike in the truncated metric rose to 103k. Currently, the raft log queue truncates as many entries as possible with no regard for throttling. The second event is running a binary with a small patch to truncate a maximum of 200 entries at a time. Other than smoothing out the truncated metric, this doesn't seem to have provided any improvement.
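
The patch amounts to capping the truncation index, roughly like this (sketch only; names are illustrative, not the actual raft log queue code):

```go
// Sketch of capping a single raft log truncation at a fixed number of
// entries, as in the patch described above.
package rafttrunc

const maxTruncatedEntries = 200

// truncationIndex limits how far a single truncation advances past the
// current first index of the raft log.
func truncationIndex(firstIndex, desiredIndex uint64) uint64 {
	if desiredIndex > firstIndex+maxTruncatedEntries {
		return firstIndex + maxTruncatedEntries
	}
	return desiredIndex
}
```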

@spencerkimball
Member

Two things:

  • Add a metric to the same graph for keys GC'd
  • Remember to save the Grafana graph definitions!

@petermattis
Collaborator Author

[screenshot: 2017-01-12 1:30:27 PM]

I added a 5s delay between operations in the replica GC queue and this seemed to improve the behavior on node recovery. I also added some logging to the replica destroy-data operation:

I170112 18:27:48.790781 127 storage/replica.go:708  [replicaGC,n6,s6,r3285/5:/Table/51/1/336{18325…-29601…},@c42135c300] removed 401243 keys in 1956ms [clear=1665ms commit=289ms]
I170112 18:27:55.971792 127 storage/replica.go:708  [replicaGC,n6,s6,r2147/5:/Table/51/1/386{80605…-91822…},@c4213cec00] removed 398027 keys in 2152ms [clear=1873ms commit=279ms]
I170112 18:28:02.629724 127 storage/replica.go:708  [replicaGC,n6,s6,r1757/1:/Table/51/1/104{22793…-34099…},@c420a52f00] removed 402055 keys in 1426ms [clear=1231ms commit=191ms]
I170112 18:28:09.359959 127 storage/replica.go:708  [replicaGC,n6,s6,r3766/1:/Table/51/1/187{59903…-71116…},@c42115ef00] removed 399454 keys in 1720ms [clear=1372ms commit=345ms]
I170112 18:28:17.106822 127 storage/replica.go:708  [replicaGC,n6,s6,r6052/1:/Table/51/1/782{33496…-44791…},@c421444900] removed 401921 keys in 2742ms [clear=2538ms commit=204ms]
I170112 18:28:24.741533 127 storage/replica.go:708  [replicaGC,n6,s6,r4360/1:/Table/51/1/561{21605…-32822…},@c4215a2000] removed 399191 keys in 2627ms [clear=2346ms commit=281ms]

Above we see a half-dozen remove operations, each removing ~400k keys and taking 2-3s. It is very interesting that the bulk of the time is spent in the clear phase, i.e. generating the deletion batch. Why is the iteration so slow? Or perhaps it is something else.
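
For reference, the clear phase is essentially iterating the replica's span and adding one delete per key to a write batch; a simplified sketch (the iterator and batch interfaces are stand-ins, not the actual engine API):

```go
// Sketch of the per-key clear: one tombstone per key, accumulated into a
// batch and then committed. With ~400k keys per replica, the iteration and
// per-key deletes dominate the 2-3s seen in the log lines above.
package replicagc

type iterator interface {
	Valid() bool
	Next()
	Key() []byte
}

type batch interface {
	Delete(key []byte) error
	Commit() error
}

func clearRange(it iterator, b batch) (int, error) {
	removed := 0
	for ; it.Valid(); it.Next() {
		if err := b.Delete(it.Key()); err != nil {
			return removed, err
		}
		removed++
	}
	return removed, b.Commit()
}
```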

@petermattis
Collaborator Author

RocksDB 5.0.1 has been released with support for DeleteRange. I'm going to experiment with this when I get a chance.

@petermattis
Collaborator Author

With #12913, which utilizes the new RocksDB DeleteRange operation, replica recovery proceeds smoothly after node recovery.

[screenshot: 2017-01-23 4:05:34 PM]
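
For comparison with the per-key sketch above, a range deletion writes a single tombstone covering the replica's whole span, roughly as follows (again with a stand-in batch interface, not the actual engine API):

```go
// Sketch of the range-tombstone approach enabled by RocksDB's DeleteRange:
// a single tombstone covering [start, end) replaces hundreds of thousands
// of point deletes.
package replicagc

type rangeBatch interface {
	DeleteRange(start, end []byte) error
	Commit() error
}

func clearRangeFast(b rangeBatch, start, end []byte) error {
	if err := b.DeleteRange(start, end); err != nil {
		return err
	}
	return b.Commit()
}
```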
