
storage: Decommissioning can get stuck by dormant replicas never getting GC'ed #17288

Closed
a-robinson opened this issue Jul 28, 2017 · 5 comments

@a-robinson
Contributor

While playing around with replica decommissioning, I was able to get the process stuck. It's stuck because even though all replicas have been officially replicated away from the node, it still has two dormant, non-GC'ed replicas on it, and thus it still shows up as not being empty:

(Two screenshots attached: "screen shot 2017-07-28 at 2 06 35 pm" and "screen shot 2017-07-28 at 2 07 02 pm".)

To be honest, I'm not quite sure how the two replicas left on the node never received the raft commands that removed them from the range. But now that they're in this state, they're stuck forever (or until the node is restarted, presumably), because their dormant state keeps them from sending any traffic to the other replicas and learning that they've been removed.

@tschottdorf @garvitjuniwal

a-robinson added this to the 1.1 milestone Jul 28, 2017
@tbg
Member

tbg commented Jul 28, 2017

Processing such a Replica requires a consistent RangeLookup, and we wanted to avoid hammering the metadata ranges, but with quiescence it looks like we're not going to try to GC the replica until after 10 days, which clearly isn't going to cut it. I think it's fine (at least for now) to wake up dormant replicas in shouldQueue.
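For illustration, here's a minimal Go sketch of the "wake up dormant replicas in shouldQueue" idea. All of the names (`replicaGCQueue`, `Replica.MaybeUnquiesce`, `replicaGCQueueInactivityThreshold`, and so on) are hypothetical stand-ins for this thread, not necessarily the real CockroachDB API:

```go
package storage

import "time"

// Hypothetical stand-ins for the real types; illustration only.
type Replica struct {
	quiescent   bool
	lastGCCheck time.Time
}

func (r *Replica) IsQuiescent() bool                 { return r.quiescent }
func (r *Replica) MaybeUnquiesce()                   { r.quiescent = false }
func (r *Replica) LastReplicaGCTimestamp() time.Time { return r.lastGCCheck }

// Roughly the "10 days" mentioned above.
const replicaGCQueueInactivityThreshold = 10 * 24 * time.Hour

type replicaGCQueue struct{}

// shouldQueue decides whether a replica should be considered for GC. The
// sketch adds the "wake up dormant replicas" step discussed in this thread.
func (q *replicaGCQueue) shouldQueue(now time.Time, repl *Replica) (bool, float64) {
	lastChecked := repl.LastReplicaGCTimestamp()

	// Only reconsider a replica after a long interval, to avoid hammering
	// the meta ranges with consistent RangeLookups.
	if now.Sub(lastChecked) < replicaGCQueueInactivityThreshold {
		return false, 0
	}

	// A quiesced ("dormant") replica that was removed from its range will
	// never hear about the removal on its own, so nudge it awake; the next
	// round of Raft traffic lets it discover it has been removed.
	if repl.IsQuiescent() {
		repl.MaybeUnquiesce()
	}
	return true, now.Sub(lastChecked).Seconds()
}
```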

@a-robinson
Contributor Author

Yeah, although with how fast the scanner runs on nodes that don't have many ranges, we definitely shouldn't do a consistent lookup or wake dormant replicas every time.

And to correct my initial post, even restarting the node doesn't wake them up, it turns out.

@tbg
Member

tbg commented Jul 28, 2017

It's not that expensive to wake them up though, is it? The group will go dormant again after ~1 round.

One perhaps better solution would be to signal to replicas that are about to be removed that the removal is happening. I'd have to page the details back in, but IIRC the new configuration in a replica change goes into effect pretty early, and that's why a removed replica often doesn't learn about its removal until later. We could just commit a Raft command (a direct RPC to the node would work too, but that doesn't seem any less onerous) that triggers "eager GC" for a while on the replica that is supposedly getting removed. Then the scanner would only do eager work for replicas with that flag set (as long as the flag is reasonably fresh, say 5min).
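A rough sketch of that flag, assuming the "you're being removed" signal simply records a timestamp on the outgoing replica and the scanner checks its freshness; `MarkPendingRemoval` and `ShouldGCEagerly` are made-up names for illustration, not actual cockroach code:

```go
package storage

import (
	"sync/atomic"
	"time"
)

// How long the "eager GC" hint stays fresh (the "say 5min" above).
const eagerGCWindow = 5 * time.Minute

type Replica struct {
	// pendingRemovalNanos is set when a "you are being removed" signal
	// (Raft command or RPC) is applied; zero means no signal seen yet.
	pendingRemovalNanos int64
}

// MarkPendingRemoval records that this replica was told it is being removed.
func (r *Replica) MarkPendingRemoval(now time.Time) {
	atomic.StoreInt64(&r.pendingRemovalNanos, now.UnixNano())
}

// ShouldGCEagerly reports whether the removal hint is recent enough that
// the scanner should do the expensive consistent RangeLookup right away.
func (r *Replica) ShouldGCEagerly(now time.Time) bool {
	nanos := atomic.LoadInt64(&r.pendingRemovalNanos)
	if nanos == 0 {
		return false
	}
	return now.Sub(time.Unix(0, nanos)) < eagerGCWindow
}
```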

@a-robinson
Contributor Author

Yeah, but they'll be getting woken up every 200ms * <number of replicas>, which effectively eliminates the point of dormancy on nodes with only tens or hundreds of replicas.

Do we need to worry about how fast GC happens in situations other than decommissioning? If not, we can just GC more eagerly when in a decommissioning state.

@tbg
Member

tbg commented Jul 28, 2017

> If not, we can just GC more eagerly when in a decommissioning state.

That's a good idea.

> Do we need to worry about how fast GC happens in situations other than decommissioning?

Not really, though it's one of those things that's often annoying. You're debugging something, and there are these old replicas lying around -- I'd say it'd be nice to smooth out this process, but it shouldn't be crucial.
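For completeness, a minimal sketch of the "GC more eagerly while decommissioning" idea: the replica GC check interval is chosen based on the node's decommissioning status. The one-minute eager interval and the `isDecommissioning` plumbing are assumptions for this sketch, not the actual implementation:

```go
package storage

import "time"

// Illustrative only; names and values are assumptions, not cockroach's code.
const (
	// Normal quiesced-replica case: roughly the "10 days" mentioned above.
	replicaGCCheckInterval = 10 * 24 * time.Hour
	// While the node is decommissioning, leftover replicas should be found
	// quickly, even at the cost of extra consistent RangeLookups.
	decommissionGCCheckInterval = time.Minute
)

// replicaGCInterval picks how long the replica GC queue waits between
// checks of a given replica, based on decommissioning status.
func replicaGCInterval(isDecommissioning bool) time.Duration {
	if isDecommissioning {
		return decommissionGCCheckInterval
	}
	return replicaGCCheckInterval
}
```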

a-robinson added a commit to a-robinson/cockroach that referenced this issue Jul 29, 2017
a-robinson self-assigned this Jul 29, 2017
a-robinson added a commit to tbg/cockroach that referenced this issue Aug 1, 2017
tbg pushed a commit to tbg/cockroach that referenced this issue Aug 3, 2017
tbg pushed a commit to tbg/cockroach that referenced this issue Aug 4, 2017
tbg pushed a commit to tbg/cockroach that referenced this issue Aug 4, 2017