roachtest: acceptance/bank/cluster-recovery failed #38785

Closed
cockroach-teamcity opened this issue Jul 10, 2019 · 11 comments · Fixed by #40997
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 19.2

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190710-1380786/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:68,test_runner.go:678: stall detected at round 2, no forward progress for 30s
	cluster.go:1654,bank.go:252: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1562739066-01-n4cpu4:1 returned:
		stderr:
		
		stdout:
		teamcity-1562739066-01-n4cpu4: starting
		teamcity-1562739066-01-n4cpu4: initializing cluster settings................: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 10, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 10, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:69,test_runner.go:691: stall detected at round 2, no forward progress for 30s

@tbg tbg assigned nvanbenschoten and unassigned andreimatei Jul 23, 2019
@tbg
Member

tbg commented Jul 23, 2019

@nvanbenschoten you mentioned you were stressing this one, right?

@nvanbenschoten
Member

I got distracted by #39018. Could be the same root cause. I'll stress with and without the fix tomorrow.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1446993/acceptance/bank/cluster-recovery/run_1
	bank.go:398,bank.go:460,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/cluster-recovery PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1447014&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1447014/acceptance/bank/cluster-recovery/run_1
	cluster.go:1664,bank.go:252: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1566364342-07-n4cpu4:2 returned:
		stderr:
		
		stdout:
		g;  export ROACHPROD=2 && GOTRACEBACK=crash COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-central1,zone=us-central1-b --join=130.211.227.73:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:256
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1535
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I190821 05:14:56.987294 1 cluster_synced.go:1617  command failed
		: exit status 1

@nvanbenschoten
Member

@irfansharif do you mind taking a look at this? The most recent failure looks like a flake, but the two "stall detected" failures look concerning.

@irfansharif
Contributor

Taking a look at this now.

@nvanbenschoten
Member

The recent failures in #35326 look similar. They probably have the same root cause.

@irfansharif
Contributor

irfansharif commented Sep 16, 2019

This is possibly related to #39841. We trip circuit breakers around node startup, failing with the following:

I190912 23:11:58.290645 471 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n3] circuitbreaker: gossip [::]:26257->irfansharif-flake-0001:26257 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.138.0.41:26257: connect: connection refused"
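
For context, the sketch below is a deliberately simplified, hypothetical stand-in for the circuit-breaker pattern named in that log line (it is not the cockroachdb/circuitbreaker API): after enough consecutive dial failures, e.g. "connection refused" while a peer node is still down, the breaker trips and further dials fail fast until a cooldown elapses.

```go
package gossipbreaker

import (
	"errors"
	"net"
	"sync"
	"time"
)

var errBreakerOpen = errors.New("circuit breaker tripped")

// breaker is a simplified illustration of the circuit-breaker pattern, NOT the
// cockroachdb/circuitbreaker API. After maxFailures consecutive dial errors it
// trips, and further dials fail fast until cooldown has elapsed.
type breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	trippedAt   time.Time
}

func (b *breaker) dial(addr string) (net.Conn, error) {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.trippedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return nil, errBreakerOpen // fail fast while the breaker is open
	}

	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		// e.g. "connect: connection refused" while the peer node is still down.
		b.failures++
		if b.failures >= b.maxFailures {
			b.trippedAt = time.Now()
		}
		return nil, err
	}
	b.failures = 0 // a successful connection heartbeat resets the breaker
	return conn, nil
}
```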

@irfansharif
Contributor

irfansharif commented Sep 17, 2019

The gRPC failures were a red herring: comparing the time each gossip connection's circuit breaker tripped with the time each node was killed, they line up as we'd expect.

@irfansharif
Contributor

Ok, so here's what's going on (TL;DR: this is a roachtest bug, not a CRDB bug).
The "stalls" detected boil down to the time between chaos monkey iterations.
Within each chaos monkey iteration we do the following:

  1. Lock all clients, in sequence
  2. Restart each node
  3. Unlock all clients
  4. Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as not to be
interrupted. The problem is that every now and then these UPDATE queries take
a lot longer than a few milliseconds. This is expected behaviour: it's mostly
due to txnwait procedures, i.e. having to wait for an extant contending txn to
expire. More importantly, it's not what we're testing here, as the clients are
still making progress. Because the chaos monkey first locks each client, it
has to drain these in-flight requests, which eats into the 30s or so we have
for each chaos monkey iteration. This is made worse by the fact that we lock
the clients in sequence. When we're unlucky, we run into this convoy
situation, the round doesn't finish in time, and a "stall" is detected.

We should really only be interested in how long it takes the chaos monkey to
restart a set of nodes, and in ensuring that, after it does, clients are still
making progress. We already have statement timeouts on the UPDATE queries that
fail them if they take "too long". Removing the stopClients apparatus gives us
what we need.
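
As a rough illustration, here's a minimal Go sketch of the intended loop shape
once stopClients is gone. It assumes a hypothetical Cluster interface and
progress counter and is not the actual bank.go code: restart the nodes while
the clients keep running, then wait up to the stall timeout for at least one
client to make forward progress.

```go
package chaos

import (
	"context"
	"errors"
	"time"
)

// Cluster abstracts the node operations the chaos monkey needs. This is a
// hypothetical interface for illustration; the real test goes through the
// roachtest cluster helpers.
type Cluster interface {
	Stop(ctx context.Context, node int)
	Start(ctx context.Context, node int)
}

// chaosRound restarts the given nodes without pausing the bank clients, then
// waits up to stallTimeout for at least one client to make forward progress.
// progress returns a monotonically increasing count of successful transfers.
func chaosRound(
	ctx context.Context, c Cluster, nodes []int, progress func() int, stallTimeout time.Duration,
) error {
	before := progress()
	for _, n := range nodes {
		c.Stop(ctx, n)  // kill the node
		c.Start(ctx, n) // restart it; clients keep running throughout
	}
	deadline := time.Now().Add(stallTimeout)
	for progress() <= before {
		if time.Now().After(deadline) {
			return errors.New("stall detected: no forward progress after restart round")
		}
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}
```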

All that being said, this failure mode is awfully similar to the recent
failures seen in #35326, where we're not stopping clients anyway. When trying
to reproduce both, it was much easier to reproduce this one than node-restart
(doubly so when reducing the number of bank accounts or increasing the number
of clients to increase contention). Note also that the cluster-recovery and
node-restart failures posted above and over at #35326 happened in quick
succession (Jul 10, 02:14 and 02:28; Aug 21, 01:14 and 01:14). This suggests
there are other factors at play, but I'll post about those over at #35326
instead.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 23, 2019
Informs cockroachdb#38785.
The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 23, 2019
Informs cockroachdb#38785.
The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Sep 23, 2019
40874: delegate: Fix index resolution hack in show partitions r=rohany a=rohany

This fixes a hack that was introduced when writing SHOW PARTITIONS in
order to produce a better error message when the user provides an
invalid index.

Release justification: Low risk improvement to functionality.

Release note: None

40976: roachtest: fix bank/cluster-recovery r=irfansharif a=irfansharif

Informs #38785.

The "stalls" detected boil down to the time between chaos monkey
iterations. Within each chaos monkey iteration we do the following:

  - Lock all clients, in sequence
  - Restart each node
  - Unlock all clients
  - Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all
of the above. In each client we lock around the UPDATE query so as to not be
interrupted. The problem is that every now and then these UPDATE queries take a
lot longer than a few milliseconds. This is expected behaviour: this is
primarily due to txnwait procedures and having to wait for the expiration of
an extant contending txn. More importantly, it's not what we're testing here as
the clients are still making progress. Given the chaos monkey first locks each
client, it has to drain out these requests, which eats out of the 30s or so we
have for each chaos monkey iteration. This is made worse by the fact that we do
this in sequence for each client. When we're unlucky, we run into this
particular convoy situation and we're unable to finish the round in time, and a
"stall" is detected.

We should really only be interested in how long it takes for the chaos monkey
to restart a set of nodes, and ensuring that after it does, that clients are
still making progress. We already have statement timeouts for the UPDATE
queries that fail if we take "too long". Removing the stopClients apparatus
gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None

Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
craig bot pushed a commit that referenced this issue Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does over SSH, we're particularly
susceptible to network delays, packet drops, etc. We've seen this before, or
at least pointed to it being a problem, over at #37001. Setting timeouts
around our calls to roachprod helps surface these kinds of errors.

The underlying issue in #38785 and #35326 is that we're running roachprod
commands that may (reasonably) fail due to connection issues, and we're unable
to retry them safely (the underlying commands are non-idempotent). Presently
we simply fail the entire test, when really we should be able to retry the
commands. That is left unaddressed here.
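
Below is a minimal sketch of the timeout idea, assuming a hypothetical
runWithTimeout helper rather than the actual roachtest change: bound each
roachprod invocation with a context deadline so that an SSH stall surfaces as
a clear error instead of hanging the test.

```go
package roachtestutil

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a roachprod command but bounds how long we are willing
// to wait on it, so that SSH stalls surface as explicit timeout errors rather
// than hanging the test. Hypothetical helper, not the actual roachtest code.
func runWithTimeout(parent context.Context, timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "roachprod", args...)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("roachprod %v timed out after %s: %s", args, timeout, out)
	}
	if err != nil {
		return fmt.Errorf("roachprod %v failed: %v: %s", args, err, out)
	}
	return nil
}
```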

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was never written to in
  that control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the setup of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user could
  expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL; this is used in tests.
- setting up the example database, partitioning, and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported earlier.

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in 5296a27 Sep 24, 2019