roachtest: scrub/index-only/tpcc/w=100 failed #37551

cockroach-teamcity · 2019-05-16T16:19:28Z

SHA: https://github.com/cockroachdb/cockroach/commits/c8bda1de440cfe90cf23a433119d77795cfa0047

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1292152&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1292152-scrub-index-only-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		  17m36s    13800            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  17m36s    13800            1.0            0.9 103079.2 103079.2 103079.2 103079.2 newOrder
		  17m36s    13800            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  17m36s    13800            2.0            0.8  15569.3  90194.3  90194.3  90194.3 payment
		  17m36s    13800            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  17m37s    13800            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  17m37s    13800            1.0            0.9 103079.2 103079.2 103079.2 103079.2 newOrder
		  17m37s    13800            1.0            0.1   9126.8   9126.8   9126.8   9126.8 orderStatus
		  17m37s    13800            0.0            0.8      0.0      0.0      0.0      0.0 payment
		  17m37s    13800            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1833,tpcc.go:178,scrub.go:58,test.go:1251: unexpected node event: 2: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1833,tpcc.go:178,scrub.go:58,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1292152-scrub-index-only-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		4: 3818
		1: 4338
		3: 3892
		Error:  2: dead


nvanbenschoten · 2019-05-21T20:16:43Z

Previous two issues addressed by #37701.

nvanbenschoten · 2019-05-21T20:28:19Z

First four issues addressed by #36854.

So the only real thing here is the initial failure, which was an OOM.

@dt assigning this to you for triage within bulk-io.

cockroach-teamcity · 2019-05-22T15:27:46Z

SHA: https://github.com/cockroachdb/cockroach/commits/1810a4eaa07b412b2d0899d25bb16a28a2746d48

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300948&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1510,tpcc.go:168,cluster.go:1848,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1300948-scrub-index-only-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  20m37s    13774            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  20m37s    13774            1.0            1.1 103079.2 103079.2 103079.2 103079.2 newOrder
		  20m37s    13774            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  20m37s    13774            0.0            1.0      0.0      0.0      0.0      0.0 payment
		  20m37s    13774            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  20m38s    13774            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  20m38s    13774            1.0            1.1  98784.2  98784.2  98784.2  98784.2 newOrder
		  20m38s    13774            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  20m38s    13774            0.0            1.0      0.0      0.0      0.0      0.0 payment
		  20m38s    13774            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1869,tpcc.go:178,scrub.go:58,test.go:1251: unexpected node event: 4: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1869,tpcc.go:178,scrub.go:58,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1300948-scrub-index-only-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		1: 4197
		2: 3959
		3: 4331
		Error:  4: dead

cockroach-teamcity · 2019-06-19T14:52:37Z

SHA: https://github.com/cockroachdb/cockroach/commits/e6366f3ac39652a763f38948fccf4b2dab363034

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1347608&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	scrub.go:80,cluster.go:1851,errgroup.go:57: dial tcp 35.196.93.247:26257: connect: connection refused
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1347608-scrub-index-only-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		    0.1      0.0      0.0      0.0      0.0 delivery
		  14m59s    24087            1.0            0.5  51539.6  51539.6  51539.6  51539.6 newOrder
		  14m59s    24087            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  14m59s    24087            0.0            0.3      0.0      0.0      0.0      0.0 payment
		  14m59s    24087            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190619 14:48:55.836356 1 workload/cli/run.go:426  error in payment: dial tcp 10.142.0.120:26257: connect: connection refused
		   15m0s    26683            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		   15m0s    26683            0.0            0.5      0.0      0.0      0.0      0.0 newOrder
		   15m0s    26683            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		   15m0s    26683            0.0            0.3      0.0      0.0      0.0      0.0 payment
		   15m0s    26683            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1251: Goexit() was called
	cluster.go:1035,context.go:127,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1347608-scrub-index-only-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: dead
		2: 4015
		4: 4234
		3: 4032
		Error:  1: dead

cockroach-teamcity · 2019-06-28T17:42:10Z

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-scrub-index-only-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190628 10:37:25.879196 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190628 10:37:38.794082 93 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 12.914461048s (100000 rows, 0 index entries, 616 KiB)
		I190628 10:37:40.520134 87 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 14.640410587s (100 rows, 0 index entries, 359 B)
		I190628 10:37:51.255802 91 ccl/workloadccl/fixture.go:476  loaded 126 MiB table order in 25.376393606s (3000000 rows, 3000000 index entries, 5.0 MiB)
		I190628 10:37:52.255231 92 ccl/workloadccl/fixture.go:476  loaded 11 MiB table new_order in 26.375827425s (900000 rows, 0 index entries, 433 KiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

cockroach-teamcity · 2019-06-29T01:46:28Z

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-scrub-index-only-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190628 19:36:58.661418 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190628 19:36:59.153370 87 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 491.458686ms (100 rows, 0 index entries, 10 KiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

cockroach-teamcity · 2019-07-01T01:47:52Z

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-scrub-index-only-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190630 20:05:12.333709 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190630 20:05:26.736569 54 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 14.402165309s (100 rows, 0 index entries, 365 B)
		I190630 20:05:35.368919 59 ccl/workloadccl/fixture.go:476  loaded 11 MiB table new_order in 23.034686115s (900000 rows, 0 index entries, 496 KiB)
		I190630 20:05:36.629845 55 ccl/workloadccl/fixture.go:476  loaded 99 KiB table district in 24.295512778s (1000 rows, 0 index entries, 4.1 KiB)
		I190630 20:05:56.845953 60 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 44.51191752s (100000 rows, 0 index entries, 179 KiB)
		: signal: killed
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

cockroach-teamcity · 2019-07-02T01:48:56Z

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/index-only/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-scrub-index-only-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190701 20:54:58.141470 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190701 20:54:59.215282 79 ccl/workloadccl/fixture.go:476  loaded 99 KiB table district in 1.073598142s (1000 rows, 0 index entries, 92 KiB)
		I190701 20:56:10.631669 84 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 1m12.489939302s (100000 rows, 0 index entries, 110 KiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180. Fixes #35493. Fixes #36983. Fixes #37108. Fixes #37371. Fixes #37384. Fixes #37551. Fixes #37879. Fixes #38095. Fixes #38131. Fixes #38136. Fixes #38549. Fixes #38552. Fixes #38555. Fixes #38560. Fixes #38562. Fixes #38563. Fixes #38569. Fixes #38578. Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led me to find the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued.

Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:

```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```

Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
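The leak described above can be illustrated with a small, self-contained Go sketch. This is not CockroachDB's quota pool or the code changed in #38632. The names `quotaPool`, `tryAcquire`, `propose`, and `simulate` are hypothetical, the pool is non-blocking (it reports "would block" instead of waiting), and quota is returned right after a successful submit rather than at whatever later point the real system returns it. The only behavior it tries to show is the one from the description: an oversized request needs the whole pool, so a small amount of quota leaked on an error path is enough to starve it.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaPool is a simplified, non-blocking stand-in for a per-range
// proposal quota pool (hypothetical type, for illustration only).
type quotaPool struct {
	capacity  int64
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// tryAcquire caps the request at the pool capacity, so an oversized
// request asks for the entire pool. It returns the amount acquired and
// false if the request would have to wait.
func (p *quotaPool) tryAcquire(size int64) (int64, bool) {
	if size > p.capacity {
		size = p.capacity
	}
	if size > p.available {
		return 0, false
	}
	p.available -= size
	return size, true
}

// release returns quota to the pool, never exceeding its capacity.
func (p *quotaPool) release(n int64) {
	p.available += n
	if p.available > p.capacity {
		p.available = p.capacity
	}
}

var errProposalRejected = errors.New("proposal rejected")
var errWouldBlock = errors.New("would block: insufficient quota")

// propose acquires quota and then calls submit. With releaseOnError
// set to false it models the bug: a failed submit keeps its quota.
func propose(p *quotaPool, size int64, submit func() error, releaseOnError bool) error {
	acquired, ok := p.tryAcquire(size)
	if !ok {
		return errWouldBlock
	}
	if err := submit(); err != nil {
		if releaseOnError {
			p.release(acquired)
		}
		return err
	}
	// Simplification: the sketch returns quota immediately on success.
	p.release(acquired)
	return nil
}

// simulate fails one small proposal, then reports whether a request
// larger than the whole pool can still be admitted.
func simulate(releaseOnError bool) bool {
	pool := newQuotaPool(1 << 20)
	_ = propose(pool, 64, func() error { return errProposalRejected }, releaseOnError)
	err := propose(pool, 4<<20, func() error { return nil }, releaseOnError)
	return err == nil
}

func main() {
	fmt.Println("without release on error, large request admitted:", simulate(false))
	fmt.Println("with release on error, large request admitted:", simulate(true))
}
```

In the sketch, the 64 bytes held by the failed proposal are enough to keep the 4 MiB request out for good when the error path does not release, which mirrors the stalled imports reported in this issue.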

cockroach-teamcity added this to the 19.2 milestone May 16, 2019

cockroach-teamcity added the C-test-failure, O-roachtest, and O-robot labels May 16, 2019

nvanbenschoten assigned dt May 21, 2019

nvanbenschoten mentioned this issue Jul 3, 2019

storage: release quota on failed Raft proposals #38632

Merged

craig bot closed this as completed in #38632 Jul 3, 2019
