Skip to content

Commit

Permalink
roachtest: fix replicagc-changed-peers
Browse files Browse the repository at this point in the history
The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move,
    and there may be a few ranges still on n3
n4-n6: alive

where the ranges predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on
n3, in contrast to replicas assigned via the meta ranges) on n3 drops to
zero.

However, system ranges cannot move in this configuration. The number of
cluster nodes is six (decommission{ing,ed} nodes would be excluded, but
no nodes are decommission{ing,ed} here) and so the system ranges operate
at a replication factor of five. There are only four live nodes here, so
if n3 is still a member of any system ranges, they will stay there and
the test fails.

This commit attempts to rectify that by making sure that while n3 is
down earlier in the test, all replicas are moved from it. That was
always the intent of the test, which is concerned with n3 realizing
that replicas have moved elsewhere and initiating replicaGC; however
prior to this commit it was always left to chance whether n3 would
or would not have replicas assigned to it by the time the test moved
to the stage above. The reason the test wasn't previously waiting
for all replicas to be moved off n3 while it was down was that it
required checking the meta ranges, which wasn't necessary for the
other two nodes.

This commit passed all five runs of
replicagc-changed-peers/restart=false, so I think it reliably addresses
the problem.

There is still the lingering question of why this is failing only now
(note that both flavors of the test failed on master last night, so
I doubt it is rare). We just merged
cockroachdb#67319 which is likely
somehow related.

Fixes cockroachdb#67910.
Fixes cockroachdb#67914.

Release note: None
  • Loading branch information
tbg committed Sep 20, 2021
1 parent 9dccf8e commit 8b159d8
Showing 1 changed file with 40 additions and 1 deletion.
41 changes: 40 additions & 1 deletion pkg/cmd/roachtest/replicagc.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,9 @@ import (
"strconv"
"time"

"github.com/cockroachdb/cockroach/pkg/util/retry"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
"github.com/cockroachdb/errors"
)

func registerReplicaGC(r *testRegistry) {
Expand Down Expand Up @@ -72,7 +74,7 @@ func runReplicaGCChangedPeers(ctx context.Context, t *test, c *cluster, withRest
// Start three new nodes that will take over all data.
c.Start(ctx, t, args, c.Range(4, 6))

// Recommission n1-3, with n3 in absentia, moving the replicas to n4-6.
// Decommission n1-3, with n3 in absentia, moving the replicas to n4-6.
if err := h.decommission(ctx, c.Range(1, 3), 2, "--wait=none"); err != nil {
t.Fatal(err)
}
Expand All @@ -83,6 +85,16 @@ func runReplicaGCChangedPeers(ctx context.Context, t *test, c *cluster, withRest
t.Status("waiting for zero replicas on n2")
h.waitForZeroReplicas(ctx, 2)

// Wait for the replica count on n3 to also drop to zero. This makes the test
// "test more" but also it prevents the test from failing spuriously, as later
// in the test any system ranges still on n3 would have a replication factor
// of five applied to them, and they would be unable to move off n3 as n1 and
// n2 will be down at that point. For details, see:
//
// https://github.com/cockroachdb/cockroach/issues/67910#issuecomment-884856356
t.Status("waiting for zero replicas on n3")
waitForZeroReplicasOnN3(ctx, t, c.Conn(ctx, 1))

// Stop the remaining two old nodes, no replicas remaining there.
c.Stop(ctx, c.Range(1, 2))

Expand Down Expand Up @@ -238,3 +250,30 @@ func (h *replicagcTestHelper) isolateDeadNodes(ctx context.Context, runNode int)
}
}
}

func waitForZeroReplicasOnN3(ctx context.Context, t *test, db *gosql.DB) {
if err := retry.ForDuration(5*time.Minute, func() error {
const q = `select range_id, replicas from crdb_internal.ranges_no_leases where replicas @> ARRAY[3];`
rows, err := db.QueryContext(ctx, q)
if err != nil {
return err
}
m := make(map[int64]string)
for rows.Next() {
var rangeID int64
var replicas string
if err := rows.Scan(&rangeID, replicas); err != nil {
return err
}
}
if err := rows.Err(); err != nil {
return err
}
if len(m) == 0 {
return nil
}
return errors.Errorf("ranges remained on n3 (according to meta2): %+v", m)
}); err != nil {
t.Fatal(err)
}
}

0 comments on commit 8b159d8

Please sign in to comment.