Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: clearrange/checks=true failed #38131

Closed
cockroach-teamcity opened this issue Jun 11, 2019 · 15 comments · Fixed by #38433 or #38632
Closed

roachtest: clearrange/checks=true failed #38131

cockroach-teamcity opened this issue Jun 11, 2019 · 15 comments · Fixed by #38433 or #38632
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/a581c20147539c74a8c72f8e6b4a7a4206a50853

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1334794&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1872,clearrange.go:190,clearrange.go:34,test.go:1248: unexpected node event: 9: dead
	cluster.go:1035,context.go:87,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,clearrange.go:190,clearrange.go:34,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1334794-clearrange-checks-true --oneshot --ignore-empty-nodes: exit status 1 9: dead
		8: 28075
		3: 2198
		2: 24065
		7: 26724
		1: 18844
		4: 15068
		5: 1798
		6: 26271
		10: 18627
		Error:  9: dead

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jun 11, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jun 11, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cbd571c7bf2ffad3514334a410fa7a728b1f5bf0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1335643&tab=buildLog

The test failed on branch=provisional_201906071718_v19.1.2, cloud=gce:
	cluster.go:1513,clearrange.go:59,clearrange.go:34,test.go:1248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1335643-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=0 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190611 20:00:44.092705 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190612 00:40:35.424289 66 ccl/workloadccl/fixture.go:516  imported bank (4h39m51s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.224.24.192_2019-06-11T20:00:43Z: exit status 1
		: exit status 1

@tbg
Copy link
Member

tbg commented Jun 19, 2019

First failure is

F190611 09:46:18.628906 2254 storage/replica_raft.go:930 [n9,s9,r5363/1:/Table/53/1/1433{0713-7257}] while committing batch: while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/027150.log: No space left on device
goroutine 2254 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000059f01, 0xc000059f20, 0x0, 0x8a)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x608fcc0, 0xc000000004, 0x58dda84, 0x17, 0x3a2, 0xc008724c30, 0xce)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:876 +0x97a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3d6bdc0, 0xc00a0c5890, 0xc000000004, 0x2, 0x357ce6f, 0x6, 0xc0129fbc40, 0x2, 0x2)

cc @ajkr in your experience looking at this test in the past, is it reasonable that it would run out of disk on the GCE machines?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/02056538e63bd33adbc24efee05ff94385c19fc8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1351907&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1872,clearrange.go:190,clearrange.go:34,test.go:1251: unexpected node event: 2: dead
	cluster.go:1035,context.go:127,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,clearrange.go:190,clearrange.go:34,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1351907-clearrange-checks-true --oneshot --ignore-empty-nodes: exit status 1 1: 19193
		8: 27125
		2: dead
		3: 31098
		4: 20057
		6: 21782
		5: 31934
		9: 10230
		7: 23803
		10: 30488
		Error:  2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2811e0fb4deded2a459594f0f3065650ca92cf6d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1359691&tab=buildLog

The test failed on branch=provisional_201906242300_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1359691-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190625 20:15:08.781800 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.239.58.172_2019-06-25T20:15:07Z: exit status 1
		: exit status 1
	cluster.go:1033,context.go:122,cluster.go:1022,asm_amd64.s:522,panic.go:397,test.go:783,test.go:769,cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1359691-clearrange-checks-true --oneshot --ignore-empty-nodes: exit status 1 8: dead
		3: 12546
		2: 12620
		9: 12629
		7: 12576
		6: 12711
		5: 12628
		1: 12766
		4: 13128
		10: 12686
		Error:  8: dead

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f358ed804af05f8c4b404efc4d8a282d8e0916c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1361643&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1361643-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190626 19:12:23.138472 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.237.117.235_2019-06-26T19:12:22Z: exit status 1
		: exit status 1

craig bot pushed a commit that referenced this issue Jun 26, 2019
38433: sql: fix swallowing Unsplit error r=jeffrey-xiao a=jeffrey-xiao

Realized this condition should be negated.

Fixes #38131, #38095.

Release note: None

Co-authored-by: Jeffrey Xiao <[email protected]>
@craig craig bot closed this as completed in #38433 Jun 26, 2019
@jeffrey-xiao jeffrey-xiao reopened this Jun 26, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190627 22:51:32.649249 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190628 06:54:47.381098 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5968fa55d48966866eac1cb744c61d0aa6284ae7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367354&tab=buildLog

The test failed on branch=provisional_201906292257_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1870,clearrange.go:184,clearrange.go:32,test.go:1249: context canceled
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,clearrange.go:184,clearrange.go:32,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1367354-clearrange-checks-true --oneshot --ignore-empty-nodes: exit status 1 1: error: exit status 255
		8: error: exit status 255
		6: error: exit status 255
		7: error: exit status 255
		2: error: exit status 255
		10: error: exit status 255
		4: error: exit status 255
		5: error: exit status 255
		3: error: exit status 255
		9: error: exit status 255
		Error:  1: error: exit status 255, 8: error: exit status 255, 6: error: exit status 255, 7: error: exit status 255, 2: error: exit status 255, 10: error: exit status 255, 4: error: exit status 255, 5: error: exit status 255, 3: error: exit status 255, 9: error: exit status 255

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190630 06:55:47.472867 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1511,clearrange.go:53,clearrange.go:32,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-clearrange-checks-true:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190701 06:55:39.416751 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@tbg
Copy link
Member

tbg commented Jul 1, 2019

Unfortunately the logs evaporate when the test times out, but I fished this from the main output of the failure on the alpha branch above

clearrange/checks=true: dropping bank table (1h1m23s)

@ajkr could you take a look? Note that this unfortunately is in the critical path for the alpha though there are two pretty serious release blockers on top of that too

@tbg
Copy link
Member

tbg commented Jul 1, 2019

Oh, looks like you already took a look: #38095 (comment)

Were you also seeing the DROP TABLE itself take a long time (or maybe that's a red herring) or was it the merges itself? I don't think the instance I looked at ever got to the merging. This t.Status() output is also notoriously misleading unfortunately.

@ajkr
Copy link
Contributor

ajkr commented Jul 2, 2019

Unfortunately the logs evaporate when the test times out, but I fished this from the main output of the failure on the alpha branch above

clearrange/checks=true: dropping bank table (1h1m23s)

@ajkr could you take a look? Note that this unfortunately is in the critical path for the alpha though there are two pretty serious release blockers on top of that too

It looks like both DROP TABLE and getting range count both happen before updating the worker status from "dropping bank table" to "waiting for ~X merges to complete". So it could be either one. But I agree it might be a red herring as I'm seeing other clearranges tests pass even though the worker status stayed in "dropping bank table" for over an hour.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A for of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequests is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2c865eeb3e3b244468ffc509a62778bd1f46740f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1370685&tab=buildLog

The test failed on branch=provisional_201907021644_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (6h30m0s)
	cluster.go:1870,clearrange.go:184,clearrange.go:32,test.go:1249: context canceled
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1870,clearrange.go:184,clearrange.go:32,test.go:1249: exit status 1

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequests is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #38632 Jul 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
4 participants