roachtest: clearrange/checks=true failed #34860
This is odd. We start the cluster, successfully import the large fixture (which takes slightly north of 3h), and stop the cluster. Then we restart the cluster; it takes the node ~4s to connect to Gossip,
but then it takes another 11s for node startup to complete:
The log in between has lots of messages of this type
and looking at n10 I can see that it connected to Gossip a little later:
Now we're already clocking in at 13s past
It's only around that time that it gets its feet back on the ground:
I will say that I was convinced that these tests would be flaky when they were written. They basically run an update-heavy workload with a tight GC TTL and expect to make do with 200MB of disk space. I'm only planning to make sure that there isn't anything going catastrophically wrong here. That said, they became flaky suddenly, so it seems worth figuring out what caused it. PS: the logs of all nodes that I looked at had this sort of log spam:
Not sure what that's about, perhaps an artifact of the situation or the test, but seems better to avoid. The warning originates in #18364 as far as I can tell. |
I uploaded the artifacts here. |
SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d7c56dcb87c8f187e50303c8e32a64836c42515c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1139797&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/2e16b7357d5a15d87cd284d5f2c12e424ed29ca1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1146277&tab=buildLog
|
@ajkr do you see anything in the stats that hints at a problem? This is our main range deletion tombstone test (it drops a large table, which lays down lots of individual range deletions), though it also exercises range merges as the now-empty ranges are all merged away. The node crashed because it took >10s to commit a virtually empty batch (it might've gotten lumped in with something else in our Go-side commit pipeline, but there shouldn't be any large batches at that stage of the test). |
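Roughly what that looks like on the storage side, as a sketch (the `engine` interface and key strings below are illustrative stand-ins, not CockroachDB's actual API): dropping the table issues one range deletion per Raft range that the table spans, and each of those shows up in RocksDB as a range tombstone.

```go
package main

import "fmt"

// engine stands in for the storage engine; ClearRange is assumed to lay down
// a single range deletion tombstone covering [start, end).
type engine interface {
	ClearRange(start, end string) error
}

type loggingEngine struct{}

func (loggingEngine) ClearRange(start, end string) error {
	fmt.Printf("range tombstone over [%s, %s)\n", start, end)
	return nil
}

// dropTable sketches how dropping a large table turns into many individual
// range deletions: one per Raft range the table spans. The now-empty ranges
// then become candidates for merging.
func dropTable(eng engine, rangeBounds []string) error {
	for i := 0; i+1 < len(rangeBounds); i++ {
		if err := eng.ClearRange(rangeBounds[i], rangeBounds[i+1]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = dropTable(loggingEngine{}, []string{
		"/Table/53", "/Table/53/1/1000", "/Table/53/1/2000", "/Table/54",
	})
}
```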
SHA: https://github.com/cockroachdb/cockroach/commits/eaad50f0ea64985a0d7da05abb00cc9f321c5fa9 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149743&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155904&tab=buildLog
|
My guess would be that range merging sometimes doesn't respect the ContainsEstimates flag. That is, we're merging two ranges (one of which contains estimates) but then the merged range pretends that it doesn't. |
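To make that suspicion concrete, a minimal sketch of the invariant (my own illustration; the real MVCCStats type is more involved than this): when the two ranges' stats are combined during a merge, ContainsEstimates has to survive if it was set on either side, otherwise the consistency checker ends up treating estimated numbers as exact.

```go
package main

import "fmt"

// stats is a cut-down stand-in for per-range MVCC stats.
type stats struct {
	LiveBytes         int64
	ContainsEstimates bool
}

// merge combines the stats of the left- and right-hand ranges of a merge.
// If either side only has estimated stats, the merged range must keep
// ContainsEstimates set; silently dropping the flag is the kind of bug
// suspected above.
func merge(lhs, rhs stats) stats {
	return stats{
		LiveBytes:         lhs.LiveBytes + rhs.LiveBytes,
		ContainsEstimates: lhs.ContainsEstimates || rhs.ContainsEstimates,
	}
}

func main() {
	fmt.Printf("%+v\n", merge(
		stats{LiveBytes: 100, ContainsEstimates: true},
		stats{LiveBytes: 200},
	))
}
```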
SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155867&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog
|
The stats do look like we're putting great strain on the device: write-amp is 238.1. Another notable thing is that 1,315 bottom-level files were being compacted at the time those stats were printed. Since cockroach runs only a few compactions in parallel, that means something (usually range tombstones) is causing individual compactions to be extremely wide. I will try repro'ing to investigate more. |
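For context on the write-amp figure: write amplification is roughly the total bytes the engine writes to disk (flushes plus compaction output) divided by the user bytes it ingests, so 238.1 means each logical byte is being rewritten on the order of 238 times. A tiny sketch of that arithmetic, with invented numbers (not taken from the logs):

```go
package main

import "fmt"

func main() {
	// Write amplification ~= total bytes written to disk by the engine
	// divided by user bytes written. The numbers below are made up purely
	// to show how a figure like 238.1 comes about.
	const userGB = 10.0      // user data flushed into the LSM
	const writtenGB = 2381.0 // total bytes written by flushes + compactions
	fmt.Printf("write-amp ≈ %.1f\n", writtenGB/userGB)
}
```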
Reproducing and understanding this would be super helpful. I filed #26693 last summer when we were investigating problems with range tombstones. See also facebook/rocksdb#3977 (which I know you're aware of). |
Oh, that's what that means! Yikes, that does sound pretty bad. Thank you for taking a look. |
(btw, in case this isn't already clear, |
The import is hitting an interesting edge case. I'm not sure whether it's the root cause, as it hasn't failed yet. But this is what's happening:
|
In order to make sure that range deletions are processed in a timely fashion, we mark any sstable containing at least one range tombstone for compaction (roughly the heuristic sketched below). Perhaps this is having a bad effect and should be disabled. See
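The heuristic amounts to roughly the following (a conceptual Go sketch; the real hook is a table-properties collector on the RocksDB/C++ side, and the names here are made up):

```go
package main

import "fmt"

// tableProps is a stand-in for the per-SSTable properties a table-properties
// collector gets to inspect when the file is written.
type tableProps struct {
	NumEntries        uint64
	NumRangeDeletions uint64
}

// needCompaction mirrors the heuristic described above: any SSTable containing
// at least one range tombstone is flagged so the compaction picker rewrites it
// promptly, which is what actually drops the data the tombstone covers.
func needCompaction(p tableProps) bool {
	return p.NumRangeDeletions > 0
}

func main() {
	fmt.Println(needCompaction(tableProps{NumEntries: 1000, NumRangeDeletions: 1})) // true
	fmt.Println(needCompaction(tableProps{NumEntries: 1000, NumRangeDeletions: 0})) // false
}
```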
I would expect the range tombstone to actually cover a large amount of data. That's why we're trying to push them through compactions. Are you seeing evidence that the range tombstones cover few keys? |
35936: server,log: increase the max sync durations r=bdarnell a=tbg

We know that the engine check fires during some tests ([clearrange], for example). This puts us in an awkward position: on the one hand, not being able to sync an engine in 10s is certainly going to cause lots of terrible snowball effects which then eat up troubleshooting time, but on the other hand we're not likely to fix all of the problems in 19.1. For now, up the limit significantly. Also up the corresponding log partition time limit, though we've seen that fire only in rare cases that likely really highlighted some I/O problem (or a severe case of being CPU bound).

[clearrange]: #34860 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog
|
Node 9 had a larger log file than any other node, so I took a quick glance. It has all sorts of gRPC connection issues in the logs, though I don't immediately see why.
|
Wait, connection refused? Let me double-check that the nodes really didn't die... |
Pretty sure n4 OOMed or something along those lines. |
Yep, here we go:
|
Hmm. No
These nodes have 15GiB of memory, so this isn't even close. Something must've allocated quite suddenly. |
SHA: https://github.com/cockroachdb/cockroach/commits/5921cf0dcc76548931cc85500c0fa2186a82142f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212185&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/668162cc99e4f3198b663b1abfa51858eeb3ccb8 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212251&tab=buildLog
|
F190401 09:57:08.220353 152 storage/replica_raft.go:923 [n6,s6,r19943/3:/Table/53/1/5506{6922-8668}] while committing batch: while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/026458.log: No space left on device

Before that, pages and pages of slow commits, some 20s in duration. The last compaction stats in the logs say
cc @ajkr |
SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237002&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog
|
37132: storage: remove spammy log about failed RemoveTarget simulation r=nvanbenschoten a=nvanbenschoten

We repeatedly see in the logs of clusters undergoing chaos:

```
W190425 06:17:19.438001 251 storage/allocator.go:639 [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639 [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639 [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639 [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

For instance, it dominates the logs in #36720. This was first picked up in #34860 (comment). This seems relatively new, so it might have been impacted by 3c76d77.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
I still feel like there's more to be done for optimizing temp store compaction. I'll close these and open a separate focused issue for that. That should help import finish faster and make this kind of test less likely to time out. |
SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog