roachtest: acceptance/bank/cluster-recovery failed #38785

Closed
cockroach-teamcity opened this issue Jul 10, 2019 · 11 comments · Fixed by #40997
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 19.2

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190710-1380786/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:68,test_runner.go:678: stall detected at round 2, no forward progress for 30s
	cluster.go:1654,bank.go:252: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1562739066-01-n4cpu4:1 returned:
		stderr:
		
		stdout:
		teamcity-1562739066-01-n4cpu4: starting
		teamcity-1562739066-01-n4cpu4: initializing cluster settings................: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 10, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 10, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:69,test_runner.go:691: stall detected at round 2, no forward progress for 30s

@tbg tbg assigned nvanbenschoten and unassigned andreimatei Jul 23, 2019
@tbg
Member

tbg commented Jul 23, 2019

@nvanbenschoten you mentioned you were stressing this one, right?

@nvanbenschoten
Member

I got distracted by #39018. Could be the same root cause. I'll stress with and without the fix tomorrow.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1446993/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447014&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1447014/acceptance/bank/cluster-recovery/run_1
	cluster.go:1664,bank.go:252: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1566364342-07-n4cpu4:2 returned:
		stderr:
		
		stdout:
		g;  export ROACHPROD=2 && GOTRACEBACK=crash COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-central1,zone=us-central1-b --join=130.211.227.73:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:256
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1535
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I190821 05:14:56.987294 1 cluster_synced.go:1617  command failed
		: exit status 1

@nvanbenschoten
Member

@irfansharif do you mind taking a look at this? The most recent failure looks like a flake, but the two "stall detected" failures look concerning.

@irfansharif
Contributor

Taking a look at this now.

@nvanbenschoten
Member

The recent failures in #35326 look similar. They probably have the same root cause.

@irfansharif
Contributor

irfansharif commented Sep 16, 2019

This is possibly related to #39841. We trip circuit breakers around node startup, failing with the following:

I190912 23:11:58.290645 471 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n3] circuitbreaker: gossip [::]:26257->irfansharif-flake-0001:26257 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.138.0.41:26257: connect: connection refused"
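
For context, the sketch below is a deliberately simplified, hypothetical stand-in for the circuit-breaker pattern named in that log line (it is not the cockroachdb/circuitbreaker API): after enough consecutive dial failures, e.g. "connection refused" while a peer node is still down, the breaker trips and further dials fail fast until a cooldown elapses.

```go
package gossipbreaker

import (
	"errors"
	"net"
	"sync"
	"time"
)

var errBreakerOpen = errors.New("circuit breaker tripped")

// breaker is a simplified illustration of the circuit-breaker pattern, NOT the
// cockroachdb/circuitbreaker API. After maxFailures consecutive dial errors it
// trips, and further dials fail fast until cooldown has elapsed.
type breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	trippedAt   time.Time
}

func (b *breaker) dial(addr string) (net.Conn, error) {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.trippedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return nil, errBreakerOpen // fail fast while the breaker is open
	}

	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		// e.g. "connect: connection refused" while the peer node is still down.
		b.failures++
		if b.failures >= b.maxFailures {
			b.trippedAt = time.Now()
		}
		return nil, err
	}
	b.failures = 0 // a successful connection heartbeat resets the breaker
	return conn, nil
}
```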

@irfansharif
Contributor

irfansharif commented Sep 17, 2019

The gRPC failures were a red herring: comparing the time each gossip connection's circuit breaker tripped with the time each node was killed, they line up as we'd expect.

@irfansharif
Contributor

Ok, so here's what's going on (TL;DR: this is a roachtest bug, not a CRDB bug).
The "stalls" detected boil down to the time between chaos monkey iterations.
Within each chaos monkey iteration we do the following:

  1. Lock all clients, in sequence
  2. Restart each node
  3. Unlock all clients
  4. Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as not to be
interrupted. The problem is that every now and then these UPDATE queries take
a lot longer than a few milliseconds. This is expected behaviour: it's mostly
due to txnwait procedures, i.e. having to wait for an extant contending txn to
expire. More importantly, it's not what we're testing here, as the clients are
still making progress. Because the chaos monkey first locks each client, it
has to drain these in-flight requests, which eats into the 30s or so we have
for each chaos monkey iteration. This is made worse by the fact that we lock
the clients in sequence. When we're unlucky, we run into this convoy
situation, the round doesn't finish in time, and a "stall" is detected.

We should really only be interested in how long it takes the chaos monkey to
restart a set of nodes, and in ensuring that, after it does, clients are still
making progress. We already have statement timeouts on the UPDATE queries that
fail them if they take "too long". Removing the stopClients apparatus gives us
what we need.
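
As a rough illustration, here's a minimal Go sketch of the intended loop shape
once stopClients is gone. It assumes a hypothetical Cluster interface and
progress counter and is not the actual bank.go code: restart the nodes while
the clients keep running, then wait up to the stall timeout for at least one
client to make forward progress.

```go
package chaos

import (
	"context"
	"errors"
	"time"
)

// Cluster abstracts the node operations the chaos monkey needs. This is a
// hypothetical interface for illustration; the real test goes through the
// roachtest cluster helpers.
type Cluster interface {
	Stop(ctx context.Context, node int)
	Start(ctx context.Context, node int)
}

// chaosRound restarts the given nodes without pausing the bank clients, then
// waits up to stallTimeout for at least one client to make forward progress.
// progress returns a monotonically increasing count of successful transfers.
func chaosRound(
	ctx context.Context, c Cluster, nodes []int, progress func() int, stallTimeout time.Duration,
) error {
	before := progress()
	for _, n := range nodes {
		c.Stop(ctx, n)  // kill the node
		c.Start(ctx, n) // restart it; clients keep running throughout
	}
	deadline := time.Now().Add(stallTimeout)
	for progress() <= before {
		if time.Now().After(deadline) {
			return errors.New("stall detected: no forward progress after restart round")
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}
```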

All that being said, this failure mode is awfully similar to the recent
failures seen in #35326, where we're not stopping clients anyway. When trying
to reproduce both, it was much easier to reproduce this one than node-restart
(doubly so when reducing the number of bank accounts or increasing the number
of clients to increase contention). Note also that the cluster-recovery and
node-restart failures posted above and over at #35326 happened in quick
succession (Jul 10, 02:14 and 02:28; Aug 21, 01:14 and 01:14). This suggests
there are other factors at play, but I'll post about those over at #35326
instead.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 23, 2019
Informs cockroachdb#38785.
The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 23, 2019
Informs cockroachdb#38785.
The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Sep 23, 2019
40874: delegate: Fix index resolution hack in show partitions r=rohany a=rohany

This fixes a hack that was introduced when writing SHOW PARTITIONS in
order to produce a better error message when the user provides an
invalid index.

Release justification: Low risk improvement to functionality.

Release note: None

40976: roachtest: fix bank/cluster-recovery r=irfansharif a=irfansharif

Informs #38785.

The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None

Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
craig bot pushed a commit that referenced this issue Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does over SSH, we're particularly
susceptible to network delays, packet drops, etc. We've seen this before, or
at least pointed to it being a problem, over at #37001. Setting timeouts
around our calls to roachprod helps surface these kinds of errors.

The underlying issue in #38785 and #35326 is that we're running roachprod
commands that may (reasonably) fail due to connection issues, and we're unable
to retry them safely (the underlying commands are non-idempotent). Presently
we simply fail the entire test, when really we should be able to retry the
commands. That is left unaddressed here.
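
Below is a minimal sketch of the timeout idea, assuming a hypothetical
runWithTimeout helper rather than the actual roachtest change: bound each
roachprod invocation with a context deadline so that an SSH stall surfaces as
a clear error instead of hanging the test.

```go
package roachtestutil

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a roachprod command but bounds how long we are willing
// to wait on it, so that SSH stalls surface as explicit timeout errors rather
// than hanging the test. Hypothetical helper, not the actual roachtest code.
func runWithTimeout(parent context.Context, timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "roachprod", args...)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("roachprod %v timed out after %s: %s", args, timeout, out)
	}
	if err != nil {
		return fmt.Errorf("roachprod %v failed: %v: %s", args, err, out)
	}
	return nil
}
```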

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was never written to in
  that control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the setup of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user could
  expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL; this is used in tests.
- setting up the example database, partitioning, and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported earlier.

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in 5296a27 Sep 24, 2019