storage: Fix bug where cluster gets stuck if node restarts without its data #10544
2ab7c56 to d827efb
Repushed to fix a broken test
Second commit needs a test?

Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 7 of 7 files at r3.

pkg/gossip/gossip.go, line 193 at r1 (raw file):
is this a superset of

pkg/gossip/gossip.go, line 584 at r1 (raw file):

pkg/gossip/gossip.go, line 605 at r1 (raw file):
how many tests? It might be better to add a test-only node descriptor factory method.

pkg/gossip/gossip.go, line 608 at r1 (raw file):
log the address as well?

pkg/gossip/gossip.go, line 610 at r1 (raw file):
i think you should add a comment here to explain what's going on and that this is recursive because this is a gossip callback.

pkg/gossip/gossip.go, line 620 at r1 (raw file):
this map grows without bound, yeah? probably fine, just sayin'.

pkg/gossip/gossip_test.go, line 70 at r1 (raw file):
shouldn't these magic numbers be

pkg/gossip/gossip_test.go, line 71 at r1 (raw file):
nit: all these

pkg/gossip/gossip_test.go, line 83 at r1 (raw file):
if it's the same as node1, use node1.Address rather than the same magic string.

pkg/gossip/gossip_test.go, line 92 at r1 (raw file):
nice!

pkg/gossip/keys.go, line 117 at r1 (raw file):
oddly, two calls to TrimPrefix would be more efficient, since you wouldn't allocate the concatenated string.

pkg/gossip/keys_test.go, line 50 at r1 (raw file):
nit: you could use subtests named after the keys instead of these indexes. that would produce better errors.

pkg/storage/store.go, line 2958 at r3 (raw file):
Just remove this comment and say that it implements whatever interface it implements.

pkg/storage/store.go, line 2970 at r3 (raw file):
NYC but this should really check for
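
The TrimPrefix remark on pkg/gossip/keys.go, line 117 can be illustrated with a small sketch. The key layout, the constant values, and both function names below are assumptions made for illustration; they are not taken from keys.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical key layout "<prefix><separator><id>". The real constants in
// pkg/gossip/keys.go may differ.
const separator = ":"

// trimConcat builds prefix+separator first and trims it in one call. The
// concatenation creates a temporary string, which can cost an allocation.
func trimConcat(key, prefix string) string {
	return strings.TrimPrefix(key, prefix+separator)
}

// trimTwice trims the prefix and then the separator in two calls, so no
// temporary string is built.
func trimTwice(key, prefix string) string {
	return strings.TrimPrefix(strings.TrimPrefix(key, prefix), separator)
}

func main() {
	for _, key := range []string{"node:5", "store:5"} {
		fmt.Println(key, trimConcat(key, "node"), trimTwice(key, "node"))
	}
}
```

The two versions are not strictly equivalent: for a key that starts with the separator but lacks the prefix, trimTwice still strips the separator while trimConcat leaves the key untouched. That only matters if such keys can reach the function.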

@spencerkimball might want to take a look at the gossip changes.

Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.

pkg/gossip/gossip.go, line 606 at r2 (raw file):
Might be cleaner to add an

Review status: 1 of 12 files reviewed at latest revision, 14 unresolved discussions.

pkg/gossip/gossip.go, line 193 at r1 (raw file):

Review status: 1 of 12 files reviewed at latest revision, 14 unresolved discussions, some commit checks pending.

pkg/gossip/keys.go, line 117 at r1 (raw file):

Reviewed 11 of 11 files at r4, 7 of 7 files at r5, 1 of 1 files at r6.

pkg/gossip/gossip.go, line 193 at r1 (raw file):

Reviewed 1 of 5 files at r1, 4 of 11 files at r4, 6 of 7 files at r5, 1 of 1 files at r6.

pkg/gossip/gossip.go, line 530 at r6 (raw file):
Use UnresolvedAddr.IsEmpty here.

pkg/gossip/gossip.go, line 605 at r6 (raw file):
s/tests//

pkg/gossip/gossip.go, line 619 at r6 (raw file):
As long as the proto has no

pkg/gossip/gossip.go, line 621 at r6 (raw file):
This should probably be a fatal (if it's not removed completely) since any failure here should never happen, but if it did it would happen every time and we'd have reintroduced the original bug.

pkg/gossip/keys.go, line 114 at r6 (raw file):
Mention that the key should have been constructed by

pkg/storage/store.go, line 2970 at r3 (raw file):

Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.

pkg/gossip/gossip.go, line 193 at r1 (raw file):

Reviewed 12 of 12 files at r7.

pkg/gossip/gossip.go, line 193 at r1 (raw file):

Review status: 5 of 12 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.

pkg/gossip/gossip.go, line 193 at r1 (raw file):

Reviewed 2 of 3 files at r8, 6 of 7 files at r9, 1 of 1 files at r10, 1 of 1 files at r11.

pkg/gossip/gossip.go, line 685 at r11 (raw file):
this is checking the empty, right? should mirror the other site that checks for empty descriptor

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

pkg/gossip/gossip.go, line 685 at r11 (raw file):

Reviewed 6 of 7 files at r13, 1 of 1 files at r14, 1 of 1 files at r15.
gossip: Don't allow two nodes to share the same address
When a new node registers with the same address as an existing node,
assume that the existing node has been replaced by the new one and
clear it out of the infoStore.
storage: Stop processing if target node doesn't contain desired store
Do so by adding a new roachpb error type that can be recognized in such
cases.
I'm open to suggestions on other approaches to the gossip commit. It looks as though our gossip depends on each piece of Info being immutable, so replacing the populated descriptor with an empty descriptor at a later TTL seemed like a better approach than trying to add a different way to remove an entry, but I may be missing a cleaner way.
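
A minimal sketch of the approach described for the gossip commit, assuming a toy in-memory registry in place of the real infoStore. The types nodeDescriptor and registry and the update method are hypothetical and much simpler than the code in pkg/gossip; in particular, timestamps and TTLs are left out.

```go
package main

import "fmt"

// nodeDescriptor is a simplified, hypothetical stand-in for the gossiped
// node descriptor.
type nodeDescriptor struct {
	NodeID  int
	Address string
}

// isEmpty reports whether d is the zero-value "tombstone" descriptor.
func (d nodeDescriptor) isEmpty() bool {
	return d == nodeDescriptor{}
}

// registry is a toy stand-in for the infoStore.
type registry struct {
	descs      map[int]nodeDescriptor // node ID -> latest descriptor
	addrOwners map[string]int         // address -> node ID that claimed it
}

func newRegistry() *registry {
	return &registry{
		descs:      map[int]nodeDescriptor{},
		addrOwners: map[string]int{},
	}
}

// update records d. If a different node previously claimed d.Address, that
// node's descriptor is overwritten with an empty one rather than deleted.
// It returns the ID of the replaced node, or 0 if none was replaced.
func (r *registry) update(d nodeDescriptor) int {
	// Release the address this node used before, if it has moved.
	if prev, ok := r.descs[d.NodeID]; ok && !prev.isEmpty() && prev.Address != d.Address {
		delete(r.addrOwners, prev.Address)
	}
	replaced := 0
	if old, ok := r.addrOwners[d.Address]; ok && old != d.NodeID {
		r.descs[old] = nodeDescriptor{} // tombstone for the replaced node
		replaced = old
	}
	r.descs[d.NodeID] = d
	r.addrOwners[d.Address] = d.NodeID
	return replaced
}

func main() {
	r := newRegistry()
	r.update(nodeDescriptor{NodeID: 1, Address: "host:26257"})
	replaced := r.update(nodeDescriptor{NodeID: 2, Address: "host:26257"})
	fmt.Println(replaced, r.descs[1].isEmpty(), r.descs[2].Address)
}
```

Overwriting with an empty value keeps every write an ordinary "newer info replaces older info" update, which is why readers of the descriptor need an explicit emptiness check.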
I'm still working on getting a testcluster test in place. Sending this out now for feedback in the meantime. I've manually verified that this fixes the scenario in practice.
Fixes #10266
@tamird