roachtest: tpcc/mixed-headroom/n5cpu16 failed #39460
SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1460982&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461635&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1462518&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog
I'm seeing two major failure modes here.

The first is unexpected because we never expect a raw

19.2: cockroach/pkg/storage/replica_command.go, lines 361 to 372 in 53e6089
19.1: cockroach/pkg/storage/replica_command.go, lines 319 to 333 in 6d189c5

The second failure is that this assertion gets hit when a preemptive snapshot gets sent by a 19.1 node to a 19.2 node:

That seems relatively straightforward to fix, if there isn't some unexpected snag.
I have a theory for the Lease crash. I think that when a preemptive snapshot is sent from a 19.2 node to a 19.1 node, the snapshot's disk data contains a Lease whose ReplicaDescriptor likely has the replica type VOTER_FULL (instead of the equivalent, deprecated nil). Now if that old node sends a snapshot to a new node, it will send it with a descriptor that has the lease replica's type unspecified, i.e. nil, but the corresponding disk state does have a value that's non-nil: boom. Still have to verify that this is the case. I'm more mystified by the split error, which I can't explain yet.
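Roughly, the suspected mismatch is the following. This is a self-contained Go sketch using made-up stand-in types; the real roachpb.ReplicaDescriptor and the failing assertion differ in detail:

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the relevant descriptor shapes; these are not the
// real roachpb definitions, just enough structure to show the mismatch.
type ReplicaType int32

const VOTER_FULL ReplicaType = 0

type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int32
	// 19.2 added an optional Type field; nil is defined to mean "voter",
	// i.e. it is semantically equivalent to an explicit VOTER_FULL.
	Type *ReplicaType
}

func main() {
	t := VOTER_FULL

	// What ends up on disk after a 19.2 node writes the lease: the replica
	// type is spelled out explicitly.
	onDisk := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1, Type: &t}

	// What a 19.1 node puts in the snapshot it sends: it has no notion of
	// the field, so the type arrives unset, i.e. nil.
	fromSnapshot := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1}

	// A strict equality assertion treats the two as different even though
	// they describe the same replica; that mismatch is the suspected cause
	// of the crash.
	fmt.Println("strictly equal:", reflect.DeepEqual(onDisk, fromSnapshot)) // false

	// Normalizing nil to VOTER_FULL before comparing makes them agree.
	norm := func(t *ReplicaType) ReplicaType {
		if t == nil {
			return VOTER_FULL
		}
		return *t
	}
	fmt.Println("equivalent:", norm(onDisk.Type) == norm(fromSnapshot.Type)) // true
}
```

In other words, the two descriptors describe the same replica, but a byte-for-byte or DeepEqual-style check flags them as divergent.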
I think this fixes the lease crash:

```diff
diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index bf3a7bf687..f9bd4ee50d 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -740,6 +740,7 @@ func (r *Replica) applySnapshot(
 	if s.Desc.RangeID != r.RangeID {
 		log.Fatalf(ctx, "unexpected range ID %d", s.Desc.RangeID)
 	}
+	inSnap.State = nil // prevent accidental reuse below
 
 	// Strip the DeprecatedTxnSpanGCThreshold. We don't care about it.
 	// TODO(nvanbenschoten): Remove in 20.1.
@@ -913,6 +914,25 @@ func (r *Replica) applySnapshot(
 	// has not yet been updated. Any errors past this point must therefore be
 	// treated as fatal.
 
+	// Re-load the ReplicaState from disk. The state so far was handed to us by
+	// some remote node that may not have been able to correctly represent the
+	// on-disk state. For an example of this, consider the ReplicaState.Type
+	// field which was added in 19.2. A 19.1 node receiving a 19.2 snapshot
+	// could have an on-disk ReplicaDescriptor with a non-nil type, but in
+	// memory this would not be represented, and thus encoded on the wire as
+	// the nil value. If that 19.1 node later sent a snapshot to a 19.2 node,
+	// it would send the node a nil type in inSnap.State, but the on-disk data
+	// would represent the corresponding value, which would trip assertions.
+	// See https://github.com/cockroachdb/cockroach/issues/39460 for details.
+	{
+		sl := stateloader.Make(s.Desc.RangeID)
+		var err error
+		s, err = sl.Load(ctx, r.store.Engine(), s.Desc)
+		if err != nil {
+			return errors.Wrap(err, "loading state")
+		}
+	}
+
 	if err := r.clearSubsumedReplicaInMemoryData(ctx, subsumedRepls, subsumedNextReplicaID); err != nil {
 		log.Fatalf(ctx, "failed to clear in-memory data of subsumed replicas while applying snapshot: %+v", err)
 	}
```
Oh, regarding the failed splits -- this is unfortunately very likely just the whole CPut mess that wasn't fixed on the 19.x branch: #38302. The problem here is that once a descriptor has a 19.2-specific field, 19.1 nodes will always fail their CPut:

```go
func TestFoo(t *testing.T) {
	s := []byte("\035\244\226 \003\010\205\023\022\t\300\211\367\001I\367\2205\210\032\010\300\211\367\001`\367\\\250\"\006\010\003\020\003\030\001\"\010\010\001\020\001\030\002 \000\"\006\010\004\020\004\030\003(\0040\025")
	v := Value{RawBytes: s}
	var desc RangeDescriptor
	require.NoError(t, v.GetProto(&desc))
	t.Error(pretty.Sprint(&desc))
}
```

The 19.1 node sees a descriptor without any replica type, but 19.2 reveals that the proto really has a replica with its type set.
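Why that guarantees a CPut failure: roughly, the 19.1 node builds the CPut's expected value by re-marshaling its own lossy in-memory copy of the descriptor, and that re-encoding can never reproduce bytes carrying a field it doesn't know about. Here is a toy Go sketch of the mechanism; the real code uses protobufs and the KV ConditionalPut, and none of the names below exist in the code base:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// A toy model: the "encoding" is just a byte slice, and the 19.2-only
// replica type is an optional trailing byte.
type descriptor struct {
	replicaID byte
	typ       *byte // nil on 19.1, which does not know the field exists
}

func marshal(d descriptor) []byte {
	b := []byte{d.replicaID}
	if d.typ != nil {
		b = append(b, *d.typ) // the 19.2-only field
	}
	return b
}

// decode191 models a 19.1 node parsing the stored bytes: it silently drops
// the field it does not understand.
func decode191(b []byte) descriptor { return descriptor{replicaID: b[0]} }

// cput succeeds only if the caller's expected bytes match what is stored.
func cput(stored, exp, newVal []byte) ([]byte, error) {
	if !bytes.Equal(stored, exp) {
		return stored, errors.New("ConditionFailedError: unexpected value")
	}
	return newVal, nil
}

func main() {
	typ := byte(1) // say, a replica type added by a 19.2 node
	stored := marshal(descriptor{replicaID: 7, typ: &typ})

	// The 19.1 node reads the descriptor, loses the unknown field, and then
	// re-marshals its lossy in-memory copy to use as the CPut expectation.
	exp := marshal(decode191(stored))

	// The expectation can never match the stored bytes, so every descriptor
	// update attempted by the 19.1 node fails.
	if _, err := cput(stored, exp, marshal(descriptor{replicaID: 8})); err != nil {
		fmt.Println(err)
	}
}
```

This is also why the failure is persistent: every retry re-derives the same wrong expectation.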
On the plus side, the diff in my last post seems to have resolved the assertion crashes.
Going to run 10 instances in which I modified the

They're all passing the import now, but then immediately going bust with errors such as

paging @jordanlewis
I think this would be caused by the server returning 4 bytes for a 4-byte integer instead of 8 like we used to. Does this use fixtures import or fixtures load? What's the data source? My guess is that we just need to update the test to use
Hm, though it looks like all of the schemas say integer/int8. I'm not quite sure how you're reproducing this @tbg - let me know please, I can't tell if I need some non-master build.
Looking at the second one through a 19.1 UI (n1): things look pretty grim. This node somehow can't even figure out the replicas in the descriptor? If I open the same page through n2, it still looks unhappy, but not as broken. Predictably, the logs are full of messages related to the fact that the range descriptor has turned to garbage.
./cockroach debug range-descriptors seems to give out reasonable data on both the 19.1 and 19.2 nodes: https://gist.github.com/tbg/bd89db53e7b1e852ffca27f758800567

I can see from http://tobias-1567451338-04-n5cpu16-0002.roachprod.crdb.io:26258/_status/raft that the 19.2 nodes have this populated properly in memory, but the 19.1 nodes return it empty. Phew, whatever is going on here, things are majorly broken.
@jordan here's a repro; I extended that cluster to 24h.

Note that this one works:

This is weird to me because I initially thought that I had probably messed up the workload binary, but that doesn't seem to be it; I replaced it with one built fresh from my branch (i.e. master).
I'd hazard a guess that this explains something:
n2 (19.2) runs a change replicas, and it looks like it's actually proposed on n3 (which probably has the lease). n3 is a 19.1 node, and it looks like it botches things pretty badly, as the "updated=[] next=0" indicates (we're supposed to see the new replica set and the NextReplicaID there). I'm seeing this in the other broken cluster too (the -03 one):

Now to figure out what this is. Pretty sure I must have broken this; I refactored a number of things about the change replicas trigger recently.
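The log line hints that, in a mixed-version cluster, the proposer and the consumer of the change-replicas trigger disagree about which fields carry the data. Here is a toy sketch of that kind of mismatch, with made-up types and field names; the real ChangeReplicasTrigger proto is different:

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-in for a trigger that carries both
// an old (19.1) and a new (19.2) representation of the same information.
type changeTrigger struct {
	deprecatedReplicas []int // what a 19.1 proposer fills in
	deprecatedNextID   int
	updatedReplicas    []int // what a 19.2 proposer fills in
	nextReplicaID      int
}

// A consumer that only reads the new fields prints "updated=[] next=0" when
// the proposal came from an old-style proposer, which is roughly the symptom
// in the log line above.
func format(t changeTrigger) string {
	return fmt.Sprintf("updated=%v next=%d", t.updatedReplicas, t.nextReplicaID)
}

// A mixed-version-safe consumer falls back to the deprecated fields when the
// new ones are empty.
func replicas(t changeTrigger) ([]int, int) {
	if len(t.updatedReplicas) == 0 && t.nextReplicaID == 0 {
		return t.deprecatedReplicas, t.deprecatedNextID
	}
	return t.updatedReplicas, t.nextReplicaID
}

func main() {
	from191 := changeTrigger{deprecatedReplicas: []int{1, 2, 3}, deprecatedNextID: 4}
	from192 := changeTrigger{updatedReplicas: []int{1, 2, 3}, nextReplicaID: 4}

	fmt.Println(format(from191)) // updated=[] next=0
	fmt.Println(format(from192)) // updated=[1 2 3] next=4

	fmt.Println(replicas(from191)) // [1 2 3] 4, via the fallback
}
```

If the 19.1 proposer only fills the deprecated fields and a newer reader only looks at the new ones, you get exactly an empty replica set and a zero next ID in the output.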
^- ok, I think the problem is actually not of my own doing, but it's 52333d1. That commit "migrated" the

Fix should be easy, I'll add something to my branch.
@jordanlewis it seems to be working now (i.e. no more int4/int8 errors). It's relatively likely that I used some old workload binary initially to restore the stuff, so I'll blame it on that.

Even better, the imports all went through without a hitch. I might get lucky and actually get all these tests to pass, who knows?

They all passed.
SHA: https://github.com/cockroachdb/cockroach/commits/1051f376924f18c5782887ed824ab5ede5027493
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1429376&tab=buildLog