Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: rewrite TestStoreRangeRebalance #10515

Conversation

petermattis
Copy link
Collaborator

@petermattis petermattis commented Nov 7, 2016

Rather than the somewhat complicated rebalancing scenario, use a simple
scenario that we perform up-replication of range 1 from 1 to 3 nodes. We
check that this up-replication is performed using preemptive
snapshots. The more complicated scenario was very fragile, frequently
being broken by innocuous changes.

Fixes #10497
Fixes #10193
Fixes #10156
Fixes #9395


This change is Reviewable

@petermattis
Copy link
Collaborator Author

@tamird Note that #10501 goes away with this change, but I think we should track down why that failure was occurring.

@tamird
Copy link
Contributor

tamird commented Nov 7, 2016

Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/client_raft_test.go, line 2040 at r1 (raw file):

}

// TestStoreRangeRebalance verifies that the replication queue will take

s/ take// and re-wrap this guy


pkg/storage/client_raft_test.go, line 2042 at r1 (raw file):

// TestStoreRangeRebalance verifies that the replication queue will take
// up-replicate a range when possible.
func TestStoreRangeRebalance(t *testing.T) {

this test name doesn't make sense. The test is now the equivalent of testcluster.TestBasicAutoReplication?


pkg/storage/client_raft_test.go, line 2045 at r1 (raw file):

  defer leaktest.AfterTest(t)()

  // Start multiTestContext with replica rebalancing enabled.

this is nonsense now, right? no rebalancing will take place in this test no matter what options are passed.


pkg/storage/client_raft_test.go, line 2058 at r1 (raw file):

      // Exit when all stores have a single replica.
      const expected = 3

duplicates the argument to mtc.Start


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from 0e1e31c to f334f77 Compare November 7, 2016 21:23
@tamird
Copy link
Contributor

tamird commented Nov 7, 2016

BTW, I don't think it's safe to say that this fixes #10193 and #10497 - those failures produce nonsense error messages like

client_raft_test.go:2093: range 1: replica {2 2 2} not lease holder; node_id:2 store_id:2 replica_id:2 is

which definitely are worth tracking down.


Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions.


Comments from Reviewable

@petermattis
Copy link
Collaborator Author

which definitely are worth tracking down.

Those are due to periodic gossiping of the first range for which we're also transferring the lease in unexpected ways. We can't disable that periodic gossiping in the previous incarnation of this test or the test doesn't make progress.


Review status: 2 of 3 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/client_raft_test.go, line 2040 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/ take// and re-wrap this guy

Done.

pkg/storage/client_raft_test.go, line 2042 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this test name doesn't make sense. The test is now the equivalent of testcluster.TestBasicAutoReplication?

More or less. It also verifies that we're using preemptive snapshots. Not sure if that is worth a separate test here.

pkg/storage/client_raft_test.go, line 2045 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is nonsense now, right? no rebalancing will take place in this test no matter what options are passed.

Removed and simplified.

pkg/storage/client_raft_test.go, line 2058 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

duplicates the argument to mtc.Start

Done.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from f334f77 to 978d6e7 Compare November 7, 2016 21:27
@tamird
Copy link
Contributor

tamird commented Nov 7, 2016

As mentioned in person, the troubling thing about those error messages is that they are saying "replica X is not the leaseholder, X is", which is obviously nonsense.


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/client_raft_test.go, line 2042 at r1 (raw file):

It also verifies that we're using preemptive snapshots.

Ah, right, didn't notice that there was still more stuff below the retry loop.


Comments from Reviewable

@bdarnell
Copy link
Contributor

bdarnell commented Nov 8, 2016

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/client_raft_test.go, line 2042 at r3 (raw file):

// TestStoreRangeUpreplicate verifies that the replication queue will
// up-replicate a range when possible.
func TestStoreRangeUpreplicate(t *testing.T) {

We already have the same test in TestStoreRangeUpReplicate (with a capital R), so if there's nothing to salvage from TestStoreRangeRebalance we should just delete it. But I think the old test is worth fixing and keeping - upreplication and rebalancing are separate code paths, and I don't think we have much other testing of the rebalancing path.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from efd4ada to d42cb5b Compare November 8, 2016 14:06
@petermattis
Copy link
Collaborator Author

Review status: 3 of 5 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/client_raft_test.go, line 2042 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We already have the same test in TestStoreRangeUpReplicate (with a capital R), so if there's nothing to salvage from TestStoreRangeRebalance we should just delete it. But I think the old test is worth fixing and keeping - upreplication and rebalancing are separate code paths, and I don't think we have much other testing of the rebalancing path.

Ah, didn't realize we already had `TestStoreRangeUpReplicate`. I've modified that test to verify we're performing preemptive snapshots during up-replication. I've removed `TestStoreRangeRebalance` completely and add a new `TestReplicateQueueRebalance` that uses a `TestCluster` to verify that basic rebalancing is occurring. I've also nuked `multiTestContext.transferLease()`.

PTAL.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from d42cb5b to 5b43b62 Compare November 8, 2016 14:19
@tamird
Copy link
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r3, 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/client_raft_test.go, line 2042 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ah, didn't realize we already had TestStoreRangeUpReplicate. I've modified that test to verify we're performing preemptive snapshots during up-replication. I've removed TestStoreRangeRebalance completely and add a new TestReplicateQueueRebalance that uses a TestCluster to verify that basic rebalancing is occurring. I've also nuked multiTestContext.transferLease().

PTAL.

🎉

pkg/storage/replicate_queue_test.go, line 40 at r4 (raw file):

  // Set the gossip stores interval lower to speed up rebalancing. With the
  // default of 5s we have to wait ~5s for the rebalancing to start.
  _ = os.Setenv("COCKROACH_GOSSIP_STORES_INTERVAL", "100ms")

can we check this error, please? I actually don't know under what circumstance this can return a non-nil error, but this kind of thing makes me nervous and isn't worth the two line savings. ditto below.


pkg/storage/replicate_queue_test.go, line 60 at r4 (raw file):

      for {
          if _, _, err := tc.SplitRange(splitKey); err != nil {
              // The split can conflict with rebalancing. Log the error and keep

can we do a more targeted error check here?


pkg/storage/replicate_queue_test.go, line 85 at r4 (raw file):

      counts := countReplicas()
      for _, c := range counts {
          if c == 0 {

this is not really checking that things are balanced at all, is it? it's only checking that every node has at least one replica, which is contrary to the detailed comment way above that lists the expected numbers of ranges, etc.

Can we do a "real" check of balancedness here? Also if we do, we should remove magic numbers from that comment since those numbers can and will rot.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from 5b43b62 to de3434d Compare November 8, 2016 15:42
@petermattis
Copy link
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 40 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can we check this error, please? I actually don't know under what circumstance this can return a non-nil error, but this kind of thing makes me nervous and isn't worth the two line savings. ditto below.

Done.

pkg/storage/replicate_queue_test.go, line 60 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can we do a more targeted error check here?

Done.

pkg/storage/replicate_queue_test.go, line 85 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is not really checking that things are balanced at all, is it? it's only checking that every node has at least one replica, which is contrary to the detailed comment way above that lists the expected numbers of ranges, etc.

Can we do a "real" check of balancedness here? Also if we do, we should remove magic numbers from that comment since those numbers can and will rot.

Testing "balancedness" is extremely difficult to do robustly (if you feel otherwise, do a follow-up PR where you improve this test). I think of this test as a basic sanity check that rebalancing is being performed, not that the heuristics are working correctly. To test the heuristics, we should have higher level tests powered by infrastructure like `allocsim`.

Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 51 at r5 (raw file):

  // TODO(peter): Bump this to 10 nodes. Doing so is flaky (we encounter a
  // situation where 1 node never receives a replica). Need to figure out why.

I'm investigating this before merging. The allocator thinks that everything is balanced while countReplicas does not. That discrepancy is disturbing. Also disturbing is that the discrepancy persists given that we're supposed to be gossiping the store capacity every 100ms.


Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/replicate_queue_test.go, line 85 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Testing "balancedness" is extremely difficult to do robustly (if you feel otherwise, do a follow-up PR where you improve this test). I think of this test as a basic sanity check that rebalancing is being performed, not that the heuristics are working correctly. To test the heuristics, we should have higher level tests powered by infrastructure like allocsim.

That's fine, but then let's not lie about it in the top level comment that implies we're going to actually check range counts per store.

Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 51 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm investigating this before merging. The allocator thinks that everything is balanced while countReplicas does not. That discrepancy is disturbing. Also disturbing is that the discrepancy persists given that we're supposed to be gossiping the store capacity every 100ms.

Figured this out. Without lease rebalancing we can get into a situation where 1 store has 0 replicas. I could increase the number of replicas in the test, but seems better to wait for lease rebalancing to land.

Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 85 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

That's fine, but then let's not lie about it in the top level comment that implies we're going to actually check range counts per store.

LGTM modulo this comment.

pkg/storage/replicate_queue_test.go, line 51 at r6 (raw file):

we can achieve the store 1 can hold on to too many

broken sentence


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from d71fa95 to 954dd90 Compare November 8, 2016 16:14
@petermattis
Copy link
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions.


pkg/storage/replicate_queue_test.go, line 85 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

LGTM modulo this comment.

TODO added and expanded the comment above.

pkg/storage/replicate_queue_test.go, line 51 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

we can achieve the store 1 can hold on to too many

broken sentence

Done.

Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

Add the check that preemptive snapshots are being used to
TestStoreRangeUpReplicate. Add TestReplicateQueueRebalance for testing
that basic rebalancing is working.

Fixes cockroachdb#10193
Fixes cockroachdb#10156
Fixes cockroachdb#9395
@petermattis petermattis force-pushed the pmattis/test-store-range-rebalance branch from 954dd90 to f920b33 Compare November 8, 2016 16:37
@petermattis petermattis merged commit 24dac05 into cockroachdb:master Nov 8, 2016
@petermattis petermattis deleted the pmattis/test-store-range-rebalance branch November 8, 2016 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants