Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade raft lib and fix group checksum #3085

Merged
merged 13 commits into from
Mar 3, 2019
Merged

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Mar 2, 2019

Found 2 issues recently:

  1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue.

  2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking).

Longer version for issue 1:

The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent.

This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through AmLeader() checks, but then failed after raft.Propose. Because raft did not complain about the proposal, the proposal was getting retried indefinitely.


This change is Reviewable

ErrUnservedTabletMessage = "Tablet isn't being served by this instance"
errUnservedTablet = x.Errorf(ErrUnservedTabletMessage)
errUnservedTablet = x.Errorf("Tablet isn't being served by this instance.")
errPredicateMoving = x.Errorf("Predicate is being moved. Please retry later")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errPredicateMoving is unused (from varcheck)

@manishrjain manishrjain changed the title Update etcd/raft library to master Upgrade raft lib and fix group checksum Mar 3, 2019
Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 27 of 30 files at r1, 1 of 10 files at r3, 7 of 7 files at r4.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @golangcibot)


worker/mutation.go, line 43 at r3 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

errPredicateMoving is unused (from varcheck)

This error is unused. Should it be removed?

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @golangcibot and @martinmr)


worker/mutation.go, line 43 at r3 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

This error is unused. Should it be removed?

Done.

@manishrjain manishrjain merged commit fe3da12 into master Mar 3, 2019
@manishrjain manishrjain deleted the mrjn/raft-master branch March 3, 2019 05:29
manishrjain added a commit that referenced this pull request Mar 3, 2019
Found 2 issues recently:
1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue.

2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking).

Longer version for issue 1:

The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent.

This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely.

Changes:

* Update raft lib to use master
* Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys.
* Remove unused var
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Found 2 issues recently:
1. If a Zero leader becomes a follower, but Dgraph already has a proposal for it, raft lib would silently drop the proposal, causing Zero to retry indefinitely. Better error handling was added in the master version of raft lib, so in this PR, I vendor that in. That fixes the issue.

2. Group checksum was being calculated by looping over map of tablets directly. This caused the checksum to be different for Zeros, causing queries to block indefinitely. This PR sorts the tablets before calculating the checksum, fixing the issue. Figured out this issue via blockade test. Also fixed up tracing, so waiting for checksum is better indicated (took me a while to realize that it was the group checksum, not the timestamp wait which was blocking).

Longer version for issue 1:

The master version has better error handling and reporting, so when a proposal is dropped, either because there's no leader or because there's no forwarding is not allowed, it can return an error instead of being silent.

This is very useful in a bug we saw today where Dgraph Zero indefinitely tries to send a proposal, which got passed through `AmLeader()` checks, but then failed after `raft.Propose`. Because raft did not complain about the proposal, the proposal was getting retried indefinitely.

Changes:

* Update raft lib to use master
* Fix the issue where different Zeros would end up with different checksums for groups, because map does random ordering of keys.
* Remove unused var
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants