
roachtest: clearrange/checks=true failed #34860

Closed
cockroach-teamcity opened this issue Feb 13, 2019 · 49 comments
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:86,clearrange.go:35,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1135549-clearrange-checks-true:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank" returned:
		stderr:
		
		stdout:
		Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.
		
		read tcp 127.0.0.1:43548->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Feb 13, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Feb 13, 2019
@tbg

tbg commented Feb 13, 2019

This is odd. We start the cluster, successfully import the large fixture (which takes slightly north of 3h), and stop the cluster. Then we restart the cluster; it takes the node ~4s to connect to gossip:

I190213 10:56:47.665954 1 cli/start.go:1075 CockroachDB CCL v2.2.0-alpha.20190211-100-g10f8010 (x86_64-unknown-linux-gnu, built 2019/02/13 06:49:41, go1.11.4)
...
I190213 10:56:51.592563 23 server/node.go:719 [n1] node connected via gossip and verified as part of cluster "33ed2e0c-c974-43aa-9319-5dded06246e5"

but then it takes another 11s for node startup to complete:

I190213 10:57:02.795050 23 cli/start.go:690 node startup completed

The log in between has lots of messages of this type:

W190213 10:56:53.519644 644 storage/raft_transport.go:583 [n1] while processing outgoing Raft queue to node 10: rpc error: code = Unavailable desc = node waiting for init; /cockroach.storage.MultiRaft/RaftMessageBatch not available:

and looking at n10 I can see that it connected to Gossip a little later:

I190213 10:57:00.349315 39 gossip/gossip.go:1510 [n10] node has connected to cluster via gossip

Now we're already clocking in at 13s past roachprod start, so it's understandable that the cli command gave up. Still, 22s later, n10 still isn't serving traffic:

W190213 10:57:22.236098 3559 storage/raft_transport.go:583 [n1] while processing outgoing Raft queue to node 10: rpc error: code = Unavailable desc = node waiting for init; /cockroach.storage.MultiRaft/RaftMessageBatch not available:

It's only around that time that it gets its feet back on the ground:

I190213 10:57:00.349315 39 gossip/gossip.go:1510  [n10] node has connected to cluster via gossip
W190213 10:57:18.473011 2886 cli/start.go:541  The server appears to be unable to contact the other nodes in the cluster. Please try:

- starting the other nodes, if you haven't already;
- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
- running the 'cockroach init' command if you are trying to initialize a new cluster.

If problems persist, please see https://www.cockroachlabs.com/docs/v2.2/cluster-setup-troubleshooting.html.
I190213 10:57:22.265203 16 server/node.go:461  [n10] initialized store [n10,s10]: disk (capacity=354 GiB, available=177 GiB, used=179 GiB, logicalBytes=199 GiB), ranges=6983, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=18894960.00 p25=33600168.00 p50=33600168.00 p75=33600168.00 p90=33607164.00 pMax=46199121.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I190213 10:57:22.266798 16 storage/stores.go:243  [n10] read 9 node addresses from persistent storage
I190213 10:57:22.267594 16 server/node.go:699  [n10] connecting to gossip network to verify cluster ID...
I190213 10:57:22.267634 16 server/node.go:719  [n10] node connected via gossip and verified as part of cluster "33ed2e0c-c974-43aa-9319-5dded06246e5"
I190213 10:57:22.268273 16 server/node.go:542  [n10] node=10: started with [<no-attributes>=/mnt/data1/cockroach] engine(s) and attributes []
I190213 10:57:22.268529 16 server/status/recorder.go:610  [n10] available memory from cgroups (8.0 EiB) exceeds system memory 15 GiB, using system memory
I190213 10:57:22.269733 16 server/server.go:1575  [n10] starting http server at [::]:26258 (use: teamcity-1135549-clearrange-checks-true-0009:26258)
I190213 10:57:22.269756 16 server/server.go:1577  [n10] starting grpc/postgres server at [::]:26257
I190213 10:57:22.269775 16 server/server.go:1578  [n10] advertising CockroachDB node at teamcity-1135549-clearrange-checks-true-0009:26257
I190213 10:57:23.060088 39 storage/stores.go:262  [n10] wrote 9 node addresses to persistent storage

I will say that I was convinced that these tests would be flaky when they were written. They basically run an update-heavy workload with a tight GC TTL and expect to make do with 200 MB of disk space. I'm only planning to make sure that nothing is going wrong catastrophically here. That said, they became flaky suddenly, so it seems worth figuring out what caused it.

PS the logs of all nodes that I looked at had this sort of log spam:

W190213 10:57:41.870345 3332 storage/allocator.go:637 [n10,replicate,s10,r5146/2:/Table/53/1/1444{0550-3823}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190213 10:57:41.870804 3336 storage/allocator.go:637 [n10,replicate,s10,r4424/4:/Table/53/1/1197{7001-9214}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190213 10:57:41.871499 3323 storage/allocator.go:637 [n10,replicate,s10,r11232/3:/Table/53/1/2987{7784-8955}] simulating RemoveTarget failed: must supply at least one candidate replica to

Not sure what that's about, perhaps an artifact of the situation or the test, but seems better to avoid. The warning originates in #18364 as far as I can tell.

@tbg

tbg commented Feb 13, 2019

I uploaded the artifacts here.
@petermattis, could you take a look?

@tbg tbg assigned petermattis and unassigned tbg Feb 13, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog

The test failed on master:
	test.go:1200: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:192,clearrange.go:37,test.go:1212: context canceled

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/d7c56dcb87c8f187e50303c8e32a64836c42515c


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1139797&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:88,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1139797-clearrange-checks-true:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank" returned:
		stderr:
		
		stdout:
		Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.
		
		read tcp 127.0.0.1:58106->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:88,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1142461-clearrange-checks-true:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank" returned:
		stderr:
		
		stdout:
		Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.
		
		read tcp 127.0.0.1:60282->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:57,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1143393-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/2e16b7357d5a15d87cd284d5f2c12e424ed29ca1


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1146277&tab=buildLog

The test failed on master:
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1212: unexpected node event: 1: dead

@tbg

tbg commented Feb 21, 2019

W190221 11:22:45.988927 332 storage/engine/rocksdb.go:1943  batch [2/350/0] commit took 16.1705381s (>500ms):
goroutine 332 [running]:
runtime/debug.Stack(0x30c60cfa, 0xed40080f5, 0x0)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc012544240, 0x0, 0x1, 0xc00863a308)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1944 +0x189
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc012544240, 0xed4008001, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1860 +0x810
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc005a37680, 0x390a2e0, 0xc010bd0450, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:688 +0x5ae
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x390a2e0, 0xc010bd0450, 0xc005a37680, 0x390a2e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3562 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000776000, 0x390a2e0, 0xc010bd0450, 0xc007b34fc0, 0xc00f8abed0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3209 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc000776000, 0x390a2e0, 0xc0066ddce0, 0x407f)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3550 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc00024b680, 0x390a2e0, 0xc0066ddce0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x390a2e0, 0xc0066ddce0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc007320450, 0xc0001c5e60, 0xc007320440)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8
F190221 11:22:45.990221 272113 server/server_engine_health.go:54  [n1] disk stall detected: unable to write to <no-attributes>=/mnt/data1/cockroach within 10s 

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/3   130.54 MB   0.0      0.2     0.0      0.2       0.4      0.3       0.0   1.4      5.4     13.6        32        20    1.623    133K   1601
  L2      5/5   47.41 MB   0.0      0.4     0.3      0.1       0.4      0.3       0.0   1.3     23.5     22.9        19         6    3.148   2582K    68K
  L3      3/0    5.01 MB   0.6      0.3     0.1      0.1       0.3      0.1       0.1   2.3     26.3     26.2        10         5    1.990    850K    11K
  L4     26/0   174.01 MB   0.9      0.5     0.2      0.3       0.5      0.2       0.1   2.0     19.3     19.2        26         9    2.914    812K    86K
  L5    631/89   3.73 GB   0.2      4.9     0.6      4.2       3.9     -0.4       0.0   6.2     61.2     48.8        81        11    7.406    103K   7143
  L6   6059/1315 165.55 GB   0.0     72.4     0.9     71.5      67.7     -3.8       0.0  75.9     33.2     31.0      2234        32   69.825    215K    35K
 Sum   6727/1412 169.62 GB   0.0     78.6     2.2     76.4      73.2     -3.3       0.1 238.1     33.5     31.2      2403        83   28.957   4699K   211K
 Int      0/0    0.00 KB   0.0     19.5     0.3     19.2      18.6     -0.6       0.0 146.3     30.3     29.0       658        11   59.840   1582K    81K
Uptime(secs): 2353.7 total, 536.2 interval
Flush(GB): cumulative 0.307, interval 0.127
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 73.16 GB write, 31.83 MB/s write, 78.63 GB read, 34.21 MB/s read, 2403.4 seconds
Interval compaction: 18.63 GB write, 35.58 MB/s write, 19.48 GB read, 37.21 MB/s read, 658.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
goroutine 272113 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x51d8100, 0x1e)

@ajkr do you see anything in the stats that hints at a problem? This is our main range deletion tombstone test (it drops a large table, which lays down lots of individual range deletions), though it also exercises range merges as the now-empty ranges are all merged away. The node crashed because it took >10s to commit a virtually empty batch (the batch might have gotten lumped in with something else in our Go-side commit pipeline, but there shouldn't be any large batches at that stage of the test).

@tbg tbg assigned ajkr Feb 22, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/eaad50f0ea64985a0d7da05abb00cc9f321c5fa9


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149743&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:58,clearrange.go:37,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149743-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155904&tab=buildLog

The test failed on master:
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: unexpected node event: 4: dead

@tbg

tbg commented Feb 28, 2019

		if !delta.ContainsEstimates && testingFatalOnStatsMismatch {
			// ContainsEstimates is true if the replica's persisted MVCCStats had ContainsEstimates set.
			// If this was *not* the case, the replica believed it had accurate stats. But we just found
			// out that this isn't true.
			log.Fatalf(ctx, "found a delta of %+v", log.Safe(delta))
		}
F190228 19:33:00.255975 28459 storage/replica_consistency.go:132  [n6,replicate,s6,r3/9:/System/{NodeLive…-tsd}] found a delta of {false 0 0 309042 1 0 36 0 24 3 0 0 0 0 {} 0}
goroutine 28459 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x5358e00, 0x1e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5af8d20, 0xc000000004, 0x5358e0a, 0x1e, 0x84, 0xc01a495400, 0x6f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a0e500, 0xc0089ad740, 0x4, 0x2, 0x326ac4a, 0x14, 0xc0065f9750, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a0e500, 0xc0089ad740, 0x1, 0x4, 0x326ac4a, 0x14, 0xc0065f9750, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3a0e500, 0xc0089ad740, 0x326ac4a, 0x14, 0xc0065f9750, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc0007c4900, 0x3a0e500, 0xc0089ad740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:132 +0x75f
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc000342500, 0x3a0e500, 0xc0089ad740, 0xc0007c4900, 0xc005b247d0, 0xc0199b9c38, 0x3fc3333333333300)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:117 +0x1d8
github.com/cockroachdb/cockroach/pkg/storage.(*replicateQueue).process(0xc00031c310, 0x3a0e500, 0xc0089ad740, 0xc0007c4900, 0xc005b247d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replicate_queue.go:273 +0x313
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica.func1(0x3a0e500, 0xc0089ad740, 0xdf8475800, 0x3a0e500)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:753 +0x24e
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout(0x3a0e500, 0xc0089ad740, 0xc0097255f0, 0x21, 0xdf8475800, 0xc00ce87e90, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:85 +0xac
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica(0xc000bec400, 0x3a0e540, 0xc00b972240, 0xc0007c4900, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:719 +0x215
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processLoop.func1.2(0x3a0e540, 0xc00b972210)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:646 +0xb8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc000b78000, 0x3a0e540, 0xc00b972210, 0xc0097255c0, 0x2d, 0x0, 0x0, 0xc00b53d1a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x134

@tbg

tbg commented Feb 28, 2019

My guess would be that range merging sometimes doesn't respect the ContainsEstimates flag. That is, we're merging two ranges (one of which contains estimates) but then the merged range pretends that it doesn't.
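If that guess is right, the invariant is that the merged range contains estimates whenever either input did, i.e. the flag must be combined disjunctively. A toy sketch of that invariant (`MVCCStatsLite` and `mergeStats` are made-up stand-ins for illustration, not CockroachDB code):

```go
package main

import "fmt"

// MVCCStatsLite is a pared-down stand-in for enginepb.MVCCStats; only the
// fields relevant to the hypothesis are kept.
type MVCCStatsLite struct {
	ContainsEstimates bool
	LiveBytes         int64
}

// mergeStats combines the stats of two ranges being merged. The key point:
// the merged range contains estimates if EITHER side did, so the flag must
// be OR'd rather than dropped or taken from one side only.
func mergeStats(lhs, rhs MVCCStatsLite) MVCCStatsLite {
	return MVCCStatsLite{
		ContainsEstimates: lhs.ContainsEstimates || rhs.ContainsEstimates,
		LiveBytes:         lhs.LiveBytes + rhs.LiveBytes,
	}
}

func main() {
	merged := mergeStats(
		MVCCStatsLite{ContainsEstimates: false, LiveBytes: 100},
		MVCCStatsLite{ContainsEstimates: true, LiveBytes: 200},
	)
	fmt.Println(merged.ContainsEstimates, merged.LiveBytes) // true 300
}
```

A bug that silently clears the flag on merge would produce exactly the observed failure: the consistency check trusts the "accurate" stats and fatals on the first delta it finds.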

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155867&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: context canceled

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: context canceled

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0


Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:58,clearrange.go:37,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1160394-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190304 07:48:15.947807 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 1 tables
		I190304 09:49:19.352543 24 ccl/workloadccl/fixture.go:498  imported bank (2h1m3s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@ajkr

ajkr commented Mar 4, 2019

The stats do look like we're putting great strain on the device: write-amp is 238.1. Another notable thing is that 1315 bottom-level files were being compacted at the time those stats were printed. Since cockroach runs only a few parallel compactions, that means something (usually range tombstones) is causing individual compactions to be extremely wide. I will try repro'ing to investigate more.
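For reference, the W-Amp figure can be roughly sanity-checked from the cumulative numbers in the dump: total bytes written by the LSM divided by bytes flushed from memtables. A quick sketch (`writeAmp` is a made-up helper, and RocksDB's exact accounting differs slightly, so this lands near, not exactly at, the reported 238.1):

```go
package main

import "fmt"

// writeAmp approximates RocksDB's Sum-level W-Amp: total GB written to the
// LSM divided by GB ingested via memtable flushes. Every byte the user
// writes once is rewritten writeAmp times on its way down the levels.
func writeAmp(totalWriteGB, flushGB float64) float64 {
	if flushGB == 0 {
		return 0
	}
	return totalWriteGB / flushGB
}

func main() {
	// Cumulative figures from the stats dump above: ~73.2 GB written by
	// compactions against only ~0.307 GB flushed.
	fmt.Printf("approx write-amp: %.1f\n", writeAmp(73.2, 0.307))
}
```

A healthy leveled LSM typically sits at a write-amp in the tens; a value north of 200 means compactions are rewriting enormous amounts of data per byte of new input, which is consistent with very wide range-tombstone-driven compactions.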

@petermattis

Since cockroach has only a few parallel compactions, that means something (usually range tombstones) causes individual compactions to be extremely wide. I will try repro'ing to investigate more.

Reproducing and understanding this would be super helpful. I filed #26693 last summer when we were doing investigations about problems with range tombstones. See also facebook/rocksdb#3977 (which I know you're aware of).

@tbg

tbg commented Mar 5, 2019

1315 bottom-level files were being compacted

Oh, that's what that means! Yikes, that does sound pretty bad. Thank you for taking a look.

@tbg

tbg commented Mar 5, 2019

(btw, in case this isn't already clear, clearrange is our primary range deletion tombstone stress test: it clears a large table, which causes individual range tombstones to be laid down on every range, i.e. you get $DATA/64mb range tombstones thrown into the LSM at the same time.)

@ajkr

ajkr commented Mar 5, 2019

The import is hitting an interesting edge case. I'm not sure whether it's the root cause, as it hasn't failed yet. But this is what's happening:

  1. The import process writes tombstones that are spread out in the key-space.
  2. When compaction happens, the max_compaction_bytes setting splits the tombstones among many files. There are a lot of single-tombstone files:
I190305 16:36:15.000192 2850 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1338] [default] [JOB 6279] Generated table #21189: 1 keys, 1051 bytes (need compaction)
I190305 16:36:15.000233 2850 storage/engine/rocksdb.go:93  [rocksdb] EVENT_LOG_v1 {"time_micros": 1551803775000206, "cf_name": "default", "job": 6279, "event": "table_file_creation", "file_number": 21189, "file_size": 1051, "table_properties": {"data_size": 32, "index_size": 34, "filter_size": 69, "raw_key_size": 48, "raw_average_key_size": 48, "raw_value_size": 16, "raw_average_value_size": 16, "num_data_blocks": 1, "num_entries": 1, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
I190305 16:36:15.004345 2850 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1338] [default] [JOB 6279] Generated table #21190: 1 keys, 1051 bytes (need compaction)
I190305 16:36:15.004384 2850 storage/engine/rocksdb.go:93  [rocksdb] EVENT_LOG_v1 {"time_micros": 1551803775004360, "cf_name": "default", "job": 6279, "event": "table_file_creation", "file_number": 21190, "file_size": 1051, "table_properties": {"data_size": 32, "index_size": 34, "filter_size": 69, "raw_key_size": 48, "raw_average_key_size": 48, "raw_value_size": 16, "raw_average_value_size": 16, "num_data_blocks": 1, "num_entries": 1, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
I190305 16:36:15.008383 2850 storage/engine/rocksdb.go:93  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1338] [default] [JOB 6279] Generated table #21191: 1 keys, 1020 bytes (need compaction)
I190305 16:36:15.008420 2850 storage/engine/rocksdb.go:93  [rocksdb] EVENT_LOG_v1 {"time_micros": 1551803775008396, "cf_name": "default", "job": 6279, "event": "table_file_creation", "file_number": 21191, "file_size": 1020, "table_properties": {"data_size": 32, "index_size": 34, "filter_size": 69, "raw_key_size": 32, "raw_average_key_size": 32, "raw_value_size": 8, "raw_average_value_size": 8, "num_data_blocks": 1, "num_entries": 1, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "1", "kMergeOperands": "0"}}
...
  3. Each file generated above is "marked" for compaction since at least half its keys are deletions. Actually 100% of the keys are deletions since there's just that one.
  4. The marked files get compacted together with a full file in the next level (32MB in L6). Then the write-amp is enormous as we're rewriting 32MB to move down a single tombstone L5->L6.
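Back-of-the-envelope, the write amplification in the last step can be illustrated with the numbers quoted above (a rough sketch; the ~1 KB single-tombstone file size comes from the log lines and the 32 MB L6 target from the comment):

```python
# Rough write-amp estimate for compacting one single-tombstone L5 file
# (~1051 bytes, per the "Generated table ... 1 keys, 1051 bytes" log lines)
# together with a full 32 MB file in L6.
TOMBSTONE_FILE_BYTES = 1051          # single-tombstone sst size (from logs)
L6_FILE_BYTES = 32 * 1024 * 1024     # L6 target file size mentioned above

bytes_moved_down = TOMBSTONE_FILE_BYTES
bytes_written = L6_FILE_BYTES        # the whole L6 file is rewritten
write_amp = bytes_written / bytes_moved_down
print(f"write amplification ~ {write_amp:,.0f}x")
```

So each marked single-tombstone file costs on the order of tens of thousands of bytes written per byte moved down, which matches the "enormous" write-amp described above.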

@petermattis
Collaborator

  3. Each file generated above is "marked" for compaction since at least half its keys are deletions. Actually 100% of the keys are deletions since there's just that one.

In order to make sure that range deletions are processed in a timely fashion, we mark any sstable containing at least one range tombstone for compaction. Perhaps this is having a bad effect and should be disabled. See c-deps/libroach/table_props.cc:DBMakeDeleteRangeCollector.
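For context, the marking policy being discussed can be approximated like this (a simplified Python sketch of the heuristic, not the actual C++ collector in table_props.cc; the function and parameter names are illustrative):

```python
def need_compact(num_entries: int, num_deletions: int,
                 num_range_deletions: int) -> bool:
    """Approximation of the sstable-marking policy discussed above:
    mark a file if it contains any range tombstone (so range deletions
    get pushed through compactions promptly), or if at least half of
    its point keys are deletions."""
    if num_range_deletions > 0:
        return True
    return num_entries > 0 and num_deletions * 2 >= num_entries

# The single-tombstone files from the logs (1 entry, 1 deletion) are marked:
print(need_compact(num_entries=1, num_deletions=1, num_range_deletions=0))  # True
```

Under this policy a file that is 100% deletions, like the 1-key files above, is always marked, which is what feeds the pathological L5->L6 compactions.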

  4. The marked files get compacted together with a full file in the next level (32MB in L6). Then the write-amp is enormous as we're rewriting 32MB to move down a single tombstone L5->L6.

I would expect the range tombstones to actually cover a large amount of data. That's why we're trying to push them through compactions. Are you seeing evidence that the range tombstones cover few keys?

tbg added a commit to tbg/cockroach that referenced this issue Mar 19, 2019
We know that the engine check fires during some tests ([clearrange], for
example). This puts us in an awkward position: on the one hand, not
being able to sync an engine in 10s is certainly going to cause lots
of terrible snowball effects which then eat up troubleshooting time,
but on the other hand we're not likely to fix all of the problems in
19.1.

For now, up the limit significantly. Also up the corresponding log
partition time limit, though we've seen that fire only in rare cases
that likely really highlighted some I/O problem (or a severe case of
being CPU bound).

[clearrange]: cockroachdb#34860 (comment)

Release note: None
craig bot pushed a commit that referenced this issue Mar 19, 2019
35936: server,log: increase the max sync durations r=bdarnell a=tbg

We know that the engine check fires during some tests ([clearrange], for
example). This puts us in an awkward position: on the one hand, not
being able to sync an engine in 10s is certainly going to cause lots
of terrible snowball effects which then eat up troubleshooting time,
but on the other hand we're not likely to fix all of the problems in
19.1.

For now, up the limit significantly. Also up the corresponding log
partition time limit, though we've seen that fire only in rare cases
that likely really highlighted some I/O problem (or a severe case of
being CPU bound).

[clearrange]: #34860 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Mar 21, 2019
We know that the engine check fires during some tests ([clearrange], for
example). This puts us in an awkward position: on the one hand, not
being able to sync an engine in 10s is certainly going to cause lots
of terrible snowball effects which then eat up troubleshooting time,
but on the other hand we're not likely to fix all of the problems in
19.1.

For now, up the limit significantly. Also up the corresponding log
partition time limit, though we've seen that fire only in rare cases
that likely really highlighted some I/O problem (or a severe case of
being CPU bound).

[clearrange]: cockroachdb#34860 (comment)

Release note: None
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog

The test failed on release-19.1:
	cluster.go:1626,clearrange.go:189,clearrange.go:37,test.go:1214: unexpected node event: 3: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,clearrange.go:58,clearrange.go:37,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1194326-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190323 06:52:57.297628 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190323 09:22:05.450159 41 ccl/workloadccl/fixture.go:516  imported bank (2h29m8s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog

The test failed on release-19.1:
	cluster.go:1267,clearrange.go:58,clearrange.go:37,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1195732-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190324 06:48:45.831461 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190324 08:55:25.611991 27 ccl/workloadccl/fixture.go:516  imported bank (2h6m40s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@danhhz
Contributor

danhhz commented Mar 26, 2019

Node 9 had a larger log file than anything else, so I took a quick glance. It has all sorts of grpc connection issues in the logs, though I don't immediately see why.

W190324 08:55:08.937449 10472 storage/raft_transport.go:583  [n9] while processing outgoing Raft queue to node 4: rpc error: code = Unavailable desc = transport is closing:
W190324 08:55:08.938767 140 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {teamcity-1195732-clearrange-checks-true-0004:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190324 08:55:08.938802 140 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W190324 08:55:08.942600 169174 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {teamcity-1195732-clearrange-checks-true-0004:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.142.0.190:26257: connect: connection refused". Reconnecting...

@tbg
Member

tbg commented Mar 26, 2019

Wait, connection refused? Let me double check that nodes really didn't die..

@tbg
Member

tbg commented Mar 26, 2019

pretty sure n4 oomed or something along those lines.

@tbg
Member

tbg commented Mar 26, 2019

Yep here we go

[ 7677.211199] Out of memory: Kill process 12304 (cockroach) score 543 or sacrifice child
[ 7677.219463] Killed process 12304 (cockroach) total-vm:20716080kB, anon-rss:8478640kB, file-rss:29716kB, shmem-rss:0kB
[ 7677.795495] oom_reaper: reaped process 12304 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@tbg
Member

tbg commented Mar 26, 2019

Hmm. No heap_profiler directories in the logs. Why not? The last runtime stats line is

I190324 08:54:59.109162 165 server/status/runtime.go:500 [n4] runtime stats: 7.7 GiB RSS, 357 goroutines, 962 MiB/2.0 GiB/3.2 GiB GO alloc/idle/total, 4.0 GiB/4.8 GiB CGO alloc/total, 4805.7 CGO/sec, 81.5/53.3 %(u/s)time, 0.4 %gc (7x), 942 MiB/286 MiB (r/w)net

These nodes have 15GiB of memory, so this isn't even close. Something must've allocated quite suddenly.
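Summing the components of that runtime-stats line against the machine size (simple arithmetic on the figures quoted above):

```python
# Figures from the runtime stats line above, in GiB.
go_total = 3.2       # "GO alloc/idle/total" -> total
cgo_total = 4.8      # "CGO alloc/total" -> total
rss = 7.7            # process RSS at the last report
machine_mem = 15.0   # node memory per the comment above

accounted = go_total + cgo_total
headroom = machine_mem - rss
print(f"Go+CGO accounted: {accounted:.1f} GiB (RSS {rss} GiB)")
print(f"headroom before OOM at last report: {headroom:.1f} GiB")
```

About half the machine's memory was still free at the last report, so roughly 7 GiB would have had to be allocated between that line and the OOM kill, consistent with a sudden allocation spike.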

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5921cf0dcc76548931cc85500c0fa2186a82142f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212185&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,clearrange.go:58,clearrange.go:37,test.go:1223: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212185-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190401 04:14:13.581787 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190401 06:21:20.480424 66 ccl/workloadccl/fixture.go:516  imported bank (2h7m7s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/668162cc99e4f3198b663b1abfa51858eeb3ccb8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212251&tab=buildLog

The test failed on master:
	cluster.go:1293,clearrange.go:58,clearrange.go:37,test.go:1223: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1212251-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190401 06:31:17.835966 1 ccl/workloadccl/cliccl/fixtures.go:325  starting import of 1 tables
		I190401 09:57:54.082021 26 ccl/workloadccl/fixture.go:516  imported bank (3h26m36s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1293,clearrange.go:58,clearrange.go:37,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1212251-clearrange-checks-true --oneshot --ignore-empty-nodes: exit status 1 6: dead
		8: 13316
		7: 12419
		10: 13245
		4: 12847
		9: 12412
		1: 13449
		5: 13128
		2: 13483
		3: 13181
		Error:  6: dead

@tbg
Member

tbg commented Apr 1, 2019

F190401 09:57:08.220353 152 storage/replica_raft.go:923 [n6,s6,r19943/3:/Table/53/1/5506{6922-8668}] while committing batch: while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/026458.log: No space left on device

Before that, pages and pages of slow commits, some 20s in duration. Last compaction stats in the logs say

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   64.16 MB   1.0      0.0     0.0      0.0       5.6      5.6       0.4   0.0      0.0     47.1       122       137    0.889       0      0
  L2      7/0   84.27 MB   1.3      5.0     3.4      1.6       5.0      3.4       0.3   1.4     26.5     26.2       194        55    3.530   3012K    38K
  L3     50/0   197.72 MB   0.9      6.0     3.8      2.2       5.9      3.7       2.2   1.6     31.6     31.1       194       110    1.762   3338K    25K
  L4    156/0   144.44 MB   0.8      5.4     2.2      3.2       5.0      1.8       7.2   2.3     26.1     24.3       213       308    0.692   3290K    26K
  L5   1290/53  29.82 GB   1.3     15.5     1.4     14.1      13.0     -1.1      57.7   9.5      7.6      6.4      2093      1915    1.093   3339K    31K
  L6   8064/901 223.82 GB   0.0    245.4    25.9    219.5     226.5      7.0     216.8   8.7     31.2     28.8      8042      8561    0.939     22M   516K
 Sum   9569/954 254.12 GB   0.0    277.3    36.7    240.6     261.0     20.4     284.6   0.9     26.2     24.6     10858     11086    0.979     35M   637K
 Int      0/0    0.00 KB   0.0     41.0    12.5     28.5      39.4     10.9      18.0   2.2     13.0     12.5      3219       969    3.322   4227K    16K
Uptime(secs): 11841.2 total, 890.5 interval
Flush(GB): cumulative 5.600, interval 0.077
AddFile(GB): cumulative 270.408, interval 17.754
AddFile(Total Files): cumulative 9467, interval 618
AddFile(L0 Files): cumulative 13, interval 9
AddFile(Keys): cumulative 28244437, interval 1854392
Cumulative compaction: 261.03 GB write, 22.57 MB/s write, 277.32 GB read, 23.98 MB/s read, 10858.1 seconds
Interval compaction: 39.43 GB write, 45.34 MB/s write, 40.96 GB read, 47.10 MB/s read, 3218.8 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 457 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 457 total count
estimated_pending_compaction_bytes: 64 GiB

cc @ajkr

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237002&tab=buildLog

The test failed on release-19.1:
	test.go:1225: test timed out (6h30m0s)
	cluster.go:1688,clearrange.go:189,clearrange.go:37,test.go:1237: context canceled

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog

The test failed on release-19.1:
	test.go:1231: test timed out (6h30m0s)
	cluster.go:1782,clearrange.go:189,clearrange.go:37,test.go:1245: context canceled

@tbg
Member

tbg commented Apr 25, 2019

@ajkr could you take a look at these last two failures (ignoring the minimized ones, those look like infra flakes)? The checks=false test has passed (#34897) so maybe we're "just" overloading the cluster with these checks, but I'm curious if you'll see any indications of additional badness.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 25, 2019
We repeatedly see in the logs of clusters undergoing chaos:
```
W190425 06:17:19.438001 251 storage/allocator.go:639  [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639  [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639  [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639  [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

This was first picked up in
cockroachdb#34860 (comment).
This seems relatively new, so it might have been impacted by
3c76d77.

Release note: None
craig bot pushed a commit that referenced this issue Apr 30, 2019
37132: storage: remove spammy log about failed RemoveTarget simulation r=nvanbenschoten a=nvanbenschoten

We repeatedly see in the logs of clusters undergoing chaos:
```
W190425 06:17:19.438001 251 storage/allocator.go:639  [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639  [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639  [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639  [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

For instance, it dominates the logs in #36720.

This was first picked up in #34860 (comment). This seems relatively new, so it might have been impacted by 3c76d77.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@tbg
Member

tbg commented May 8, 2019

@ajkr what's the status of this test? You found the issues with the temp engine, think there is more brewing here or should we try closing this and #34897?

@ajkr
Contributor

ajkr commented May 8, 2019

I still feel like there's more to be done to optimize temp store compaction. I'll close these and open a separate, focused issue for that. That should help imports finish faster and make this kind of test less likely to time out.

@ajkr ajkr closed this as completed May 8, 2019
ajwerner pushed a commit to ajwerner/cockroach that referenced this issue May 15, 2019
We repeatedly see in the logs of clusters undergoing chaos:
```
W190425 06:17:19.438001 251 storage/allocator.go:639  [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639  [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639  [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639  [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

This was first picked up in
cockroachdb#34860 (comment).
This seems relatively new, so it might have been impacted by
3c76d77.

Release note: None
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.