
Fix TestRangeGCQueue* to correctly wait for range removal #1964

Merged: 1 commit merged into master from kkaneda/fix_gc_test on Aug 5, 2015

Conversation

kkaneda (Contributor) commented Aug 5, 2015

The test still fails with clusterIDGossipTTL = 1 * time.Millisecond. Here is a log:

  ...
I0805 04:50:29.402478 53224 storage/replica.go:1001  gossiping first range from store 1, range 1
W0805 04:50:29.402512 53224 storage/replica.go:968  could not acquire lease for range gossip: raft group deleted
W0805 04:50:29.402533 53224 storage/store.go:584  error gossiping first range data: raft group deleted
I0805 04:50:29.402544 53224 storage/replica.go:989  gossiping cluster id  from store 2, range 1
05 04:43:46.395109 42294 raft/log.go:187  applied(28) is out of range [prevApplied(10), committed(10)]
panic: applied(28) is out of range [prevApplied(10), committed(10)]

goroutine 37 [running]:
github.com/cockroachdb/cockroach/multiraft.(*raftLogger).Panicf(0x4f3bcb0, 0x4bff0b0, 0x3c, 0xc208180ed0, 0x3, 0x3)
        /Users/kaneda/Development/go/src/github.com/cockroachdb/cockroach/multiraft/raft.go:107 +0x188
github.com/coreos/etcd/raft.(*raftLog).appliedTo(0xc208186af0, 0x1c)
        /Users/kaneda/Development/go/src/github.com/coreos/etcd/raft/log.go:187 +0x1d5
github.com/coreos/etcd/raft.newRaft(0xc2081a21e0, 0xc208106f00)
        /Users/kaneda/Development/go/src/github.com/coreos/etcd/raft/raft.go:197 +0x73b
github.com/coreos/etcd/raft.(*multiNode).run(0xc208106ba0)
        /Users/kaneda/Development/go/src/github.com/coreos/etcd/raft/multinode.go:183 +0x2eb
created by github.com/coreos/etcd/raft.StartMultiNode
        /Users/kaneda/Development/go/src/github.com/coreos/etcd/raft/multinode.go:56 +0xb6
  ...
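
For context, the panic comes from raftLog.appliedTo (the raft/log.go:187 frame above): raft refuses to advance the applied index past what the log believes is committed. The following is a paraphrase of that check, reconstructed from the panic message rather than copied from that etcd version, and trimmed to just the fields it needs:

    // Paraphrase of the sanity check behind the panic above (raftLog.appliedTo
    // in etcd's raft/log.go); reconstructed from the panic message, not copied
    // from that etcd version.
    package appliedsketch

    import "log"

    type raftLog struct {
        committed uint64 // highest log index known to be committed
        applied   uint64 // highest log index already applied to the state machine
    }

    func (l *raftLog) appliedTo(i uint64) {
        if i == 0 {
            return
        }
        // In the failing test the store reported applied=28 to a freshly
        // recreated Raft group whose committed index was still 10, so this
        // check fires.
        if l.committed < i || i < l.applied {
            log.Panicf("applied(%d) is out of range [prevApplied(%d), committed(%d)]",
                i, l.applied, l.committed)
        }
        l.applied = i
    }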

tbg added the PTAL label on Aug 5, 2015
tamird (Contributor) commented Aug 5, 2015

LGTM

kkaneda added a commit that referenced this pull request Aug 5, 2015
Fix TestRangeGCQueue* to correctly wait for range removal
kkaneda merged commit e3705ef into master on Aug 5, 2015
tbg removed the PTAL label on Aug 5, 2015
kkaneda deleted the kkaneda/fix_gc_test branch on August 5, 2015 12:23
bdarnell (Contributor) commented Aug 5, 2015

LGTM. So this is what was actually happening in #1928?

tamird (Contributor) commented Aug 5, 2015

Yeah :(

kkaneda (Contributor, Author) commented Aug 6, 2015

It seems the test failure was fixed, but I think we still have a race when a store gossips info while a range is being removed:

  1. Store starts gossiping the first range.
  2. Store calls RemoveRange() to remove the range. A Raft group is removed.
  3. Before RemoveRange() completes and the replica is removed from the range set, Store creates the Raft group again for gossip.
  4. Raft checks whether the applied index lies between the first index and the last index. Store.AppliedIndex() returns a non-error value (28 in the above log) since the replica still exists in the range set. Replica.FirstIndex() returns 10 because RaftTruncatedState returns raftInitialLogIndex (= 10). The last index is also 10 since the store hasn't made any commits since unreplicateRange was called in the test (?).
  5. Panic since the applied index is bigger than the last index.

@bdarnell, what would be the right fix here, if it's worth fixing? I don't think we can simply check whether a range exists in ProposeRaftCommand, since Store.RemoveRange might not yet have completed and removed the replica. Maybe change Store.RemoveRange to remove the Raft group only after the range has been removed from the range set?
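
To make the interleaving concrete, here is a minimal sketch of the race in steps 1-5 above. Every name here (store, rangeSet, raftGroups, removeRange, gossipFirstRange) is a hypothetical stand-in, not the actual CockroachDB storage code:

    // Hypothetical illustration of the interleaving in steps 1-5 above; none
    // of these names are the real storage.Store API.
    package racesketch

    import "sync"

    type replica struct{ applied uint64 }

    type store struct {
        mu         sync.Mutex
        rangeSet   map[int64]*replica // replica is still visible during step 3
        raftGroups map[int64]uint64   // Raft group ID -> committed index it knows
    }

    // removeRange mirrors steps 2 and 3: the Raft group is dropped first, and
    // only afterwards is the replica removed from the range set, leaving a
    // window in which a concurrent gossip can recreate the group.
    func (s *store) removeRange(id int64) {
        s.mu.Lock()
        delete(s.raftGroups, id) // step 2: Raft group removed
        s.mu.Unlock()

        // <-- window: gossipFirstRange can run here (step 3)

        s.mu.Lock()
        delete(s.rangeSet, id) // replica finally removed from the range set
        s.mu.Unlock()
    }

    // gossipFirstRange mirrors steps 3-5: the replica is still in the range
    // set, so the Raft group is recreated knowing only the initial log index
    // (10), while the store still reports applied index 28, which makes raft
    // panic.
    func (s *store) gossipFirstRange(id int64) (applied, committed uint64, recreated bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        r, ok := s.rangeSet[id]
        if !ok {
            return 0, 0, false // removal already finished; nothing to do
        }
        const raftInitialLogIndex = 10
        s.raftGroups[id] = raftInitialLogIndex      // group recreated (step 3)
        return r.applied, raftInitialLogIndex, true // 28 vs 10 -> step 5
    }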

bdarnell (Contributor) commented Aug 6, 2015

Yes, this is worth fixing. We need to check if the range has been removed in ProposeRaftCommand, which means we need to make RemoveRange atomic (or at least have it set a flag first to say that the removal is in progress). The solution for #1644 is going to involve some changes to the relationship between raft groups and Replica objects, so I think at the same time we can reevaluate how RemoveRange works and make it atomic.
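
Roughly, the flag-first variant could look something like the sketch below; these are hypothetical names, not the actual Store/Replica code. Removal marks the replica as destroyed before any teardown happens, and ProposeRaftCommand (here proposeRaftCommand) bails out instead of lazily recreating a Raft group:

    // Hypothetical sketch of the "set a flag first" approach; names here are
    // illustrative stand-ins, not the actual Store/Replica code.
    package removalsketch

    import (
        "errors"
        "sync"
    )

    var errRangeRemoved = errors.New("range is being removed")

    type replica struct {
        mu        sync.Mutex
        destroyed bool
    }

    type store struct {
        mu       sync.Mutex
        replicas map[int64]*replica
    }

    // removeRange marks the replica as destroyed before any teardown happens,
    // so that concurrent proposals observe the removal even though the Raft
    // group and the range-set entry have not been cleaned up yet.
    func (s *store) removeRange(id int64) {
        s.mu.Lock()
        r := s.replicas[id]
        s.mu.Unlock()
        if r == nil {
            return
        }

        r.mu.Lock()
        r.destroyed = true // flag set first; proposeRaftCommand checks it
        r.mu.Unlock()

        // ... tear down the Raft group, then drop the replica from the map ...
        s.mu.Lock()
        delete(s.replicas, id)
        s.mu.Unlock()
    }

    // proposeRaftCommand stands in for Store.ProposeRaftCommand: it refuses to
    // (re)create a Raft group for a replica that is gone or marked destroyed.
    func (s *store) proposeRaftCommand(id int64) error {
        s.mu.Lock()
        r := s.replicas[id]
        s.mu.Unlock()
        if r == nil {
            return errRangeRemoved
        }

        r.mu.Lock()
        defer r.mu.Unlock()
        if r.destroyed {
            return errRangeRemoved
        }
        // ... hand the command to the existing Raft group ...
        return nil
    }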

kkaneda (Contributor, Author) commented Aug 7, 2015

Ah, I see. Sounds good. Thanks for the explanation!
