-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade raft lib and fix group checksum #3085
Conversation
worker/mutation.go
Outdated
ErrUnservedTabletMessage = "Tablet isn't being served by this instance" | ||
errUnservedTablet = x.Errorf(ErrUnservedTabletMessage) | ||
errUnservedTablet = x.Errorf("Tablet isn't being served by this instance.") | ||
errPredicateMoving = x.Errorf("Predicate is being moved. Please retry later") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errPredicateMoving
is unused (from varcheck
)
…sums for groups, because map does random ordering of keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 27 of 30 files at r1, 1 of 10 files at r3, 7 of 7 files at r4.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @golangcibot)
worker/mutation.go, line 43 at r3 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
errPredicateMoving
is unused (fromvarcheck
)
This error is unused. Should it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @golangcibot and @martinmr)
worker/mutation.go, line 43 at r3 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
This error is unused. Should it be removed?
Done.
Found 2 issues recently: 1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue. 2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking). Longer version for issue 1: The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent. This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely. Changes: * Update raft lib to use master * Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys. * Remove unused var
Found 2 issues recently: 1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue. 2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking). Longer version for issue 1: The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent. This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely. Changes: * Update raft lib to use master * Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys. * Remove unused var
Found 2 issues recently:
If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue.
Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking).
Longer version for issue 1:
The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent.
This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through
AmLeader()
checks, but then failed afterraft.Propose
. Because raft did not complain about the proposal, the proposal was getting retried indefinitely.This change is