stability: tail of under-replicated ranges lasts too long #11984
We had talked in the past about having nodes listen for changes in other nodes' liveness and visit their replicas to add them to the replicate queue as necessary. Something to keep in mind if we discover it's just the 10 minutes it takes for the scanner to discover all under-replicated ranges.
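For concreteness, a minimal sketch of that idea, with made-up names (Liveness, Store, replicateQueue below are illustrative stand-ins, not the actual CockroachDB types): a callback fired on a liveness change walks the local replicas and enqueues any whose range has a replica on the affected node, rather than waiting for the periodic scanner.

```go
// Sketch only: these types and methods are hypothetical stand-ins,
// not the real CockroachDB APIs.
package main

type NodeID int

// Liveness describes whether a node is currently considered live.
type Liveness struct {
	NodeID NodeID
	Live   bool
}

// Replica is a local replica of a range; Nodes lists the nodes that
// hold replicas of that range.
type Replica struct {
	RangeID int64
	Nodes   []NodeID
}

// Store holds the local replicas and a replicate queue.
type Store struct {
	replicas       []*Replica
	replicateQueue chan *Replica // stand-in for the real priority queue
}

// onLivenessChange is the hypothetical callback: when a node's liveness
// changes, enqueue every local replica whose range has a replica on that
// node, instead of waiting for the periodic scanner to rediscover them.
func (s *Store) onLivenessChange(l Liveness) {
	for _, r := range s.replicas {
		for _, n := range r.Nodes {
			if n == l.NodeID {
				select {
				case s.replicateQueue <- r:
				default: // queue full; the scanner will pick it up later
				}
				break
			}
		}
	}
}

func main() {
	s := &Store{replicateQueue: make(chan *Replica, 1024)}
	s.replicas = []*Replica{{RangeID: 1, Nodes: []NodeID{1, 2, 3}}}
	s.onLivenessChange(Liveness{NodeID: 2, Live: false}) // node 2 went non-live
}
```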
That would be useful if the ranges were idle, but given the load it's hard to imagine that would be helpful here.
Here's a different 6m chaos event; the under-replicated metric is computed slightly differently here.
I performed two chaos events just now to see what impact this has; the next graphs show node recovery in each case. In the first chaos event the cluster took ~40m to recover all of the replicas, and in the second it took ~3m. See also #12485 and #8659, which propose mechanisms for improving the behavior of the system during node recovery.
Somehow a random restart of one of the nodes occurred. This makes me suspicious of a bad interaction with the recovery activity, which starts once the store is considered dead. I've now updated the time-until-dead setting.
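For context, the knob being tuned here controls how long a store can go without a liveness update before it is treated as dead and recovery (re-replication of its ranges) begins. A rough illustration of that check, with invented names (storeIsDead and timeUntilStoreDead are not the actual implementation):

```go
// Sketch only: storeIsDead and timeUntilStoreDead are illustrative names,
// not the actual implementation.
package main

import (
	"fmt"
	"time"
)

// storeIsDead treats a store as dead once its last liveness update is older
// than timeUntilStoreDead; only then does recovery (re-replication of its
// ranges) begin.
func storeIsDead(lastUpdated, now time.Time, timeUntilStoreDead time.Duration) bool {
	return now.Sub(lastUpdated) > timeUntilStoreDead
}

func main() {
	now := time.Now()
	last := now.Add(-6 * time.Minute) // store has been silent for 6 minutes
	fmt.Println(storeIsDead(last, now, 5*time.Minute))  // true: recovery has started
	fmt.Println(storeIsDead(last, now, 10*time.Minute)) // false: recovery still deferred
}
```

With, say, a 6m outage and a 5m threshold, recovery begins shortly before the node returns, which is the kind of bad interaction suspected above; raising the threshold past the expected outage length sidesteps it.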
Replica removal is fairly expensive and we allow it to proceed as fast as possible when driven by the replica GC queue. Perhaps we need to throttle it. I'm interested to see what the graphs look like for 6-10m outages with that time-until-store-dead setting.
The graph below is with recoveries not starting until the increased time-until-dead threshold has elapsed.
Just pushed a build with the replica-gc queue disabled and time-until-dead set back to its original value.
It might be interesting to see the opposite experiment with those two settings.
Looks like the replica GC activity is what's most affecting the recovery profile. Here's a graph showing the activity under each of the two configurations.
Here are graphs from the experiment with that configuration. It is difficult to see in the graph above, but the count of under-replicated ranges dropped to 319 when the node restarted and took until 15:25 to drop to 0 (19m). Performance doesn't fully recover until 15:50 (~40m after the node outage). The replica GC queue and replicate queue are busy until shortly before 15:25, and then we see the raft log queue kick in, trimming raft log entries.
The above shows a new set of results; two things stood out.
I added a 5s delay between operations in the replica GC queue and this seemed to improve the behavior on node recovery. I also added some logging to the replica destroy-data operation. The logging showed a half-dozen remove operations, each removing ~400k keys and taking 2-3s. It is very interesting that the bulk of the time is spent generating the batch. Why is the iteration so slow? Or perhaps it is something else.
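A sketch of the pacing described above, against a hypothetical queue-processing loop (processGCQueue and destroyData are stand-ins for the real code): a fixed sleep between destroy operations spreads the expensive removals out instead of letting them run back to back.

```go
// Sketch only: processGCQueue and destroyData are stand-ins for the real
// replica GC queue code.
package main

import (
	"fmt"
	"time"
)

// Replica stands in for a replica that the GC queue has decided to remove.
type Replica struct{ RangeID int64 }

// destroyData stands in for the expensive per-replica removal (iterating the
// replica's key span and batching deletes).
func destroyData(r *Replica) {
	fmt.Printf("destroying data for range %d\n", r.RangeID)
}

// processGCQueue drains the queue with a pause between operations, so the
// removal work is spread out instead of running as fast as possible.
func processGCQueue(queue <-chan *Replica, pause time.Duration) {
	for r := range queue {
		destroyData(r)
		time.Sleep(pause) // e.g. the 5s delay tried above
	}
}

func main() {
	q := make(chan *Replica, 3)
	for i := int64(1); i <= 3; i++ {
		q <- &Replica{RangeID: i}
	}
	close(q)
	processGCQueue(q, 10*time.Millisecond) // short pause so the example finishes quickly
}
```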
RocksDB 5.0.1 has been released with support for DeleteRange. I'm going to experiment with this when I get a chance.
With #12913, which utilizes the new RocksDB DeleteRange operation, replica recovery proceeds smoothly after node recovery.
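The win comes from replacing per-key deletion (iterate the replica's key span and write one delete per key into a batch) with a single range tombstone. A sketch of the difference against a toy key-value engine (the Engine interface and memEngine below are illustrative, not the RocksDB or CockroachDB storage APIs):

```go
// Sketch only: Engine and memEngine are a toy key-value interface, not the
// RocksDB or CockroachDB storage APIs.
package main

import (
	"fmt"
	"sort"
)

type Engine interface {
	// Scan returns all keys in [start, end), in order.
	Scan(start, end string) []string
	// Delete removes a single key.
	Delete(key string)
	// DeleteRange covers [start, end) with a single range tombstone.
	DeleteRange(start, end string)
}

// clearReplicaPerKey is the old approach: iterate every key in the replica's
// span and delete it individually. Building that batch is what showed up as
// 2-3s per replica above.
func clearReplicaPerKey(e Engine, start, end string) {
	for _, k := range e.Scan(start, end) {
		e.Delete(k)
	}
}

// clearReplicaRangeTombstone is the DeleteRange approach: one write covering
// the whole span, regardless of how many keys it contains.
func clearReplicaRangeTombstone(e Engine, start, end string) {
	e.DeleteRange(start, end)
}

// memEngine is a toy in-memory implementation so the example runs; its
// DeleteRange just loops, whereas a real engine writes one tombstone.
type memEngine struct{ keys map[string]bool }

func (m *memEngine) Scan(start, end string) []string {
	var out []string
	for k := range m.keys {
		if k >= start && k < end {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func (m *memEngine) Delete(key string) { delete(m.keys, key) }

func (m *memEngine) DeleteRange(start, end string) {
	for _, k := range m.Scan(start, end) {
		delete(m.keys, k)
	}
}

func main() {
	e := &memEngine{keys: map[string]bool{"r1/a": true, "r1/b": true, "r2/a": true, "r2/b": true}}
	clearReplicaPerKey(e, "r1/", "r2/")         // old: one delete per key
	clearReplicaRangeTombstone(e, "r2/", "r3/") // new: single range tombstone
	fmt.Println(len(e.Scan("", "\xff")))        // 0: both replicas' data gone
}
```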
This graph shows the under-replicated ranges for a 6m chaos event. The tail lasts for almost exactly 10m, which is suspiciously close to the replica scanner interval. Regardless, we need to figure out what is preventing those under-replicated ranges from catching up more quickly.
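The ~10m tail is consistent with how a periodic scanner paces itself: it visits each local replica once per target interval, spacing visits evenly, so a range that becomes under-replicated right after its visit may not be re-queued for up to a full interval. A rough sketch of that pacing (paceScan and its parameters are illustrative, not the actual scanner code):

```go
// Sketch only: paceScan is illustrative, not the actual replica scanner code.
package main

import (
	"fmt"
	"time"
)

// paceScan visits each replica once over roughly targetInterval, sleeping
// between visits so the work is spread evenly. A range that becomes
// under-replicated just after its visit waits up to a full interval before
// visit (which would enqueue it for repair) sees it again.
func paceScan(replicaIDs []int64, targetInterval time.Duration, visit func(int64)) {
	if len(replicaIDs) == 0 {
		return
	}
	pause := targetInterval / time.Duration(len(replicaIDs))
	for _, id := range replicaIDs {
		visit(id)
		time.Sleep(pause)
	}
}

func main() {
	ids := []int64{1, 2, 3, 4}
	// 400ms stands in for the 10m scan interval so the example finishes quickly.
	paceScan(ids, 400*time.Millisecond, func(id int64) {
		fmt.Println("scanned replica", id)
	})
}
```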