
roachtest: clearrange/checks=false failed #34897

Closed
cockroach-teamcity opened this issue Feb 14, 2019 · 83 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 2.2

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:57,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1137872-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Feb 14, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Feb 14, 2019
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d7c56dcb87c8f187e50303c8e32a64836c42515c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1139797&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:57,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1139797-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/4fc4d63ddddeb507564bfe53c6cd79549ff2fd27

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1141650&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:88,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1141650-clearrange-checks-false:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank" returned:
		stderr:
		
		stdout:
		Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.
		
		read tcp 127.0.0.1:39356->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog

The test failed on master:
	test.go:1200: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:192,clearrange.go:37,test.go:1212: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:88,clearrange.go:37,test.go:1212: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1143393-clearrange-checks-false:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank" returned:
		stderr:
		
		stdout:
		Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.
		
		read tcp 127.0.0.1:59582->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@tbg (Member) commented Feb 19, 2019

This second-to-last one looks like it was really a test timeout: the test, ironically, got stuck importing a tiny table:

11:28:31 cluster.go:253: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1142461-clearrange-checks-false:1 -- ./cockroach sql --insecure -e "DROP DATABASE IF EXISTS tinybank"
DROP DATABASE
11:28:36 cluster.go:253: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1142461-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --db=tinybank --payload-bytes=100 --ranges=10 --rows=800 --seed=1
7: 31137
5: 7054
1: 19076
10: 20811
8: 15433
9: 17108
2: 26324
4: 11799
3: 5286
6: 18016
14:13:59 cluster.go:908: fetching debug zip

14:13:59 is very close to the test timeout of 6h30m (the test run started at 7:42:56).

@tbg (Member) commented Feb 19, 2019

The last one is probably similar, except that in this case the ./cockroach sql invocation didn't get a response from the server quickly enough and failed the test right there.

(Note that this is all before even dropping a table)

@tbg (Member) commented Feb 19, 2019

In the last one, n10 took ~15s to start. It's unclear where all that time went; for example, there's a 5s gap very early (:27-:32) that I can't pin on anything in the code (perhaps it's the reporting, which falls in the gap between the log messages, among other things that shouldn't take that long?).

Starting the Store also seems to eat 5s. Anyway, the bottom line here is that the node takes a while to start. I wonder why c.Start already returns while that's still going on. We use --background, perhaps that ought to return a little later. @bdarnell, do you have any opinions here? We currently signal readiness relatively early:

cockroach/pkg/server/server.go, lines 1375 to 1380 (at 475958d):

    if len(bootstrappedEngines) > 0 {
        // The cluster was already initialized.
        doBootstrap = false
        if s.cfg.ReadyFn != nil {
            s.cfg.ReadyFn(false /*waitForInit*/)
        }

but that's probably for a good reason. Intuitively I would've thought readiness would be signaled with this message

teamcity-1143393-clearrange-checks-false-0004> I190219 10:19:43.065628 25 cli/start.go:695 node startup completed:
CockroachDB node starting at 2019-02-19 10:19:43.065405278 +0000 UTC (took 15.7s)
build: CCL v2.2.0-alpha.20190211-218-gbd80a74 @ 2019/02/19 06:46:22 (go1.11.4)
webui: http://teamcity-1143393-clearrange-checks-false-0004:26258
...

Anyway, the more interesting failure is the one in which the RESTORE just gets stuck basically forever.

I190219 10:19:27.353927 1 util/log/clog.go:1197  [config] file created at: 2019/02/19 10:19:27
I190219 10:19:27.353927 1 util/log/clog.go:1197  [config] running on machine: teamcity-1143393-clearrange-checks-false-0004
I190219 10:19:27.353927 1 util/log/clog.go:1197  [config] binary: CockroachDB CCL v2.2.0-alpha.20190211-218-gbd80a74 (x86_64-unknown-linux-gnu, built 2019/02/19 06:46:22, go1.11.4)
I190219 10:19:27.353927 1 util/log/clog.go:1197  [config] arguments: [./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=/root/logs --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce
,region=us-east1,zone=us-east1-b --join=35.196.17.241:26257]
I190219 10:19:27.353927 1 util/log/clog.go:1197  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓
I190219 10:19:27.353926 1 cli/start.go:1031  logging to directory /root/logs
W190219 10:19:27.355571 1 cli/start.go:1066  RUNNING IN INSECURE MODE!

- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.

Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.2/secure-a-cluster.html
I190219 10:19:27.355730 1 server/status/recorder.go:610  available memory from cgroups (8.0 EiB) exceeds system memory 15 GiB, using system memory
I190219 10:19:27.355742 1 cli/start.go:1080  CockroachDB CCL v2.2.0-alpha.20190211-218-gbd80a74 (x86_64-unknown-linux-gnu, built 2019/02/19 06:46:22, go1.11.4)
I190219 10:19:32.448960 1 server/status/recorder.go:610  available memory from cgroups (8.0 EiB) exceeds system memory 15 GiB, using system memory
I190219 10:19:32.448980 1 server/config.go:382  system total memory: 15 GiB
I190219 10:19:32.449021 1 server/config.go:384  server configuration:
max offset             500000000
cache size             3.7 GiB
SQL memory pool size   3.7 GiB
scan interval          10m0s
scan min idle time     10ms
scan max idle time     1s
event log enabled      true
I190219 10:19:32.449040 1 cli/start.go:920  using local environment variables: COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1, COCKROACH_ENABLE_RPC_COMPRESSION=false, COCKROACH_BACKGROUND_RESTART=1
I190219 10:19:32.449051 1 cli/start.go:927  process identity: uid 0 euid 0 gid 0 egid 0
I190219 10:19:32.449060 1 cli/start.go:552  starting cockroach node
I190219 10:19:32.454990 25 storage/engine/rocksdb.go:573  opening rocksdb instance at "/mnt/data1/cockroach/cockroach-temp154990427"
I190219 10:19:32.495350 25 server/server.go:871  [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I190219 10:19:32.499004 25 storage/engine/rocksdb.go:573  opening rocksdb instance at "/mnt/data1/cockroach"
I190219 10:19:32.727661 25 server/config.go:490  [n?] 1 storage engine initialized
I190219 10:19:32.727678 25 server/config.go:493  [n?] RocksDB cache size: 3.7 GiB
I190219 10:19:32.727691 25 server/config.go:493  [n?] store 0: RocksDB, max size 0 B, max open file limit 60536
W190219 10:19:32.739146 25 gossip/gossip.go:1496  [n?] no incoming or outgoing connections
W190219 10:19:32.739273 25 cli/start.go:894  neither --listen-addr nor --advertise-addr was specified.
The server will advertise "teamcity-1143393-clearrange-checks-false-0004" to other nodes, is this routable?

Consider using:
- for local-only servers:  --listen-addr=localhost
- for multi-node clusters: --advertise-addr=<host/IP addr>
I190219 10:19:32.741693 25 server/server.go:921  [n?] Sleeping till wall time 1550571572741660520 to catches up to 1550571572995315950 to ensure monotonicity. Delta: 253.65543ms
I190219 10:19:32.743194 30 gossip/client.go:128  [n?] started gossip client to 35.196.17.241:26257
I190219 10:19:32.997672 25 gossip/gossip.go:392  [n10] NodeDescriptor set to node_id:10 address:<network_field:"tcp" address_field:"teamcity-1143393-clearrange-checks-false-0004:26257" > attrs:<> locality:<tiers:<key:"cloud" value:"gce" > tiers:<key:"region" value:"us-east1" > tiers:<key:"zone" value:"us-east1-b" > > ServerVersion:<major_val:2 minor_val:1 patch:0 unstable:8 > build_tag:"v2.2.0-alpha.20190211-218-gbd80a74" started_at:1550571572996001692
I190219 10:19:33.010052 30 gossip/client.go:133  [n10] closing client to n1 (35.196.17.241:26257): received forward from n1 to 2 (teamcity-1143393-clearrange-checks-false-0006:26257)
I190219 10:19:33.013728 121 gossip/client.go:128  [n10] started gossip client to teamcity-1143393-clearrange-checks-false-0006:26257
I190219 10:19:33.740412 110 gossip/client.go:128  [n10] started gossip client to teamcity-1143393-clearrange-checks-false-0006:26257
I190219 10:19:33.741262 110 gossip/client.go:133  [n10] closing client to n2 (teamcity-1143393-clearrange-checks-false-0006:26257): rpc error: code = Unknown desc = duplicate connection from node at teamcity-1143393-clearrange-checks-false-0004:26257
I190219 10:19:34.756232 236 gossip/client.go:128  [n10] started gossip client to teamcity-1143393-clearrange-checks-false-0008:26257
I190219 10:19:37.391596 57 gossip/gossip.go:1510  [n10] node has connected to cluster via gossip
W190219 10:19:38.998130 47 storage/closedts/provider/provider.go:144  [ct-closer] unable to move closed timestamp forward: not live
I190219 10:19:43.033326 25 server/node.go:461  [n10] initialized store [n10,s10]: disk (capacity=314 GiB, available=143 GiB, used=173 GiB, logicalBytes=195 GiB), ranges=6842, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=18708296.00 p25=33600168.00 p50=33600168.00 p75=33600168.00 p90=33607164.00 pMax=33607164.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I190219 10:19:43.033688 25 storage/stores.go:243  [n10] read 9 node addresses from persistent storage
I190219 10:19:43.033990 25 server/node.go:699  [n10] connecting to gossip network to verify cluster ID...
I190219 10:19:43.034012 25 server/node.go:719  [n10] node connected via gossip and verified as part of cluster "63dac28d-1f60-4c52-88d9-a4285e33e98e"
I190219 10:19:43.034233 25 server/node.go:542  [n10] node=10: started with [<no-attributes>=/mnt/data1/cockroach] engine(s) and attributes []
I190219 10:19:43.034498 25 server/status/recorder.go:610  [n10] available memory from cgroups (8.0 EiB) exceeds system memory 15 GiB, using system memory
I190219 10:19:43.035901 25 server/server.go:1577  [n10] starting http server at [::]:26258 (use: teamcity-1143393-clearrange-checks-false-0004:26258)
I190219 10:19:43.035924 25 server/server.go:1579  [n10] starting grpc/postgres server at [::]:26257
I190219 10:19:43.035941 25 server/server.go:1580  [n10] advertising CockroachDB node at teamcity-1143393-clearrange-checks-false-0004:26257
I190219 10:19:43.065268 25 server/server.go:1647  [n10] done ensuring all necessary migrations have run
I190219 10:19:43.065296 25 server/server.go:1650  [n10] serving sql connections
I190219 10:19:43.065585 25 cli/start.go:687  [config] clusterID: 63dac28d-1f60-4c52-88d9-a4285e33e98e
I190219 10:19:43.065628 25 cli/start.go:695  node startup completed:
CockroachDB node starting at 2019-02-19 10:19:43.065405278 +0000 UTC (took 15.7s)

tbg added a commit to tbg/cockroach that referenced this issue Feb 19, 2019
@bdarnell (Contributor)

> We use --background, perhaps that ought to return a little later. @bdarnell, do you have any opinions here? We currently signal readiness relatively early:

Yeah, this should probably be later. It should at least be moved below the ensureClockMonotonicity call later in the same block, and probably even later. I think one reason this is where it is is that we need (needed?) two different paths for initialized and uninitialized clusters, but I think we just stuffed it in the one if alreadyBootstrapped block we already had instead of thinking about where it really belongs.
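
For illustration only, here is a tiny self-contained Go sketch of the ordering being suggested (the types and names are made up, not the actual server.go code): fire the readiness callback that lets --background return only after startup has fully completed.

package main

import (
    "fmt"
    "time"
)

// server is a stand-in for the real server; readyFn plays the role of cfg.ReadyFn.
type server struct {
    readyFn func(waitForInit bool)
}

func (s *server) start() error {
    // Stand-in for opening engines, ensuring clock monotonicity, starting the
    // node, and running migrations (the ~15s observed in the logs above).
    time.Sleep(2 * time.Second)

    // Signal readiness last, so --background does not detach while the node is
    // still initializing its stores and replicas.
    if s.readyFn != nil {
        s.readyFn(false /* waitForInit */)
    }
    return nil
}

func main() {
    s := &server{readyFn: func(bool) { fmt.Println("node startup completed; safe to detach") }}
    if err := s.start(); err != nil {
        panic(err)
    }
}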

@tbg (Member) commented Feb 19, 2019 via email

tbg added a commit to tbg/cockroach that referenced this issue Feb 20, 2019
craig bot pushed a commit that referenced this issue Feb 20, 2019
35050: roachtest: misc clearrange improvements r=petermattis a=tbg

Touches #34897.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/46d1ab4c06edfd37d875114972c55b44105acd83

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147903&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1226,clearrange.go:58,clearrange.go:37,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1147903-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		: signal: killed

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/eaad50f0ea64985a0d7da05abb00cc9f321c5fa9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149743&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	test.go:1199: test timed out (6h30m0s)
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d888b76df319571afe4d5816f1a1f0f53905653f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog

The test failed on master:
	cluster.go:1226,clearrange.go:58,clearrange.go:37,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1159641-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190303 07:40:23.870359 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 1 tables
		I190303 09:31:58.723951 68 ccl/workloadccl/fixture.go:498  imported bank (1h51m35s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	cluster.go:1585,clearrange.go:189,clearrange.go:37,test.go:1211: pq: count-leases: AS OF SYSTEM TIME: cannot specify timestamp in the future

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/959dcf7de0f94cfcfa0062387b109adebd1f11da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1163702&tab=buildLog

The test failed on master:
	cluster.go:1603,clearrange.go:189,clearrange.go:37,test.go:1214: pq: count-leases: AS OF SYSTEM TIME: cannot specify timestamp in the future

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog

The test failed on master:
	test.go:1202: test timed out (6h30m0s)
	cluster.go:1603,clearrange.go:189,clearrange.go:37,test.go:1214: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/5f7de3c35348e847dc0d4ab1ba8d76574e1706c2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169980&tab=buildLog

The test failed on master:
	test.go:1202: test timed out (6h30m0s)
	cluster.go:1603,clearrange.go:189,clearrange.go:37,test.go:1214: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/9d058d53c8a82fceb2205f1827c26f1bf36c32ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog

The test failed on master:
	cluster.go:1603,clearrange.go:189,clearrange.go:37,test.go:1214: unexpected node event: 7: dead

@tbg (Member) commented Apr 11, 2019

> Does COCKROACH_ENGINE_MAX_SYNC_DURATION also apply to writes to the temp storage engine?

No, it does not. This only applies to the engines that are used as stores. This means that if you're seeing the disk checker crash because of activity on the temp engines, then it's because the temp engines starve IO for the main engine.

> What difference should it make to the regular store, which is barely taking any writes in the early stages of import, if the temp store is flushing a lot?

I think @petermattis had a past investigation in which CRDB performed poorly because some other process was filling up the OS' write buffer and so even small syncs done by RocksDB could take a long time. Perhaps something similar is going on here.

> EDIT: one other thing to try is adding more flush threads. It looks like right now the temp store and real store are sharing a single flush thread.

Interesting, how can that be? They're entirely different RocksDB instances. How does that work?

> with an unrelated transient issue after 3h11m.

Just curious, what was the transient issue? We're really trying to get rid of transient issues.

@petermattis (Collaborator)

> I think @petermattis had a past investigation in which CRDB performed poorly because some other process was filling up the OS' write buffer and so even small syncs done by RocksDB could take a long time. Perhaps something similar is going on here.

@tbg's memory is good. Linux's default behavior is to allow too large an amount of dirty file data to accumulate before flushing. When the flushing does occur it can cause significant disk IO starvation for everything on the system. I can't recall off hand what to look for, but vmstat or some file in /proc will indicate the "dirty" data size. In normal CockroachDB usage, the amount stays low because we're frequently syncing the WAL and sstables (as they are written). I'd be mildly surprised if this is coming up here due to the temp-store, but definitely worth checking.
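
For reference, here is a minimal Go sketch (not from the test or CockroachDB) that polls the Dirty and Writeback counters in /proc/meminfo once a second; running it next to the import should show whether dirty data piles up before the big flushes:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
    "time"
)

// Print the kernel's Dirty and Writeback counters from /proc/meminfo once a second.
func main() {
    for {
        f, err := os.Open("/proc/meminfo")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if strings.HasPrefix(line, "Dirty:") || strings.HasPrefix(line, "Writeback:") {
                fmt.Println(line)
            }
        }
        f.Close()
        fmt.Println("---")
        time.Sleep(time.Second)
    }
}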

> EDIT: one other thing to try is adding more flush threads. It looks like right now the temp store and real store are sharing a single flush thread.

> Interesting, how can that be? They're entirely different RocksDB instances. How does that work?

The thread pool could be shared between RocksDB instances, but I'm pretty sure we don't share it. We do share the block cache.

@tbg (Member) commented Apr 11, 2019

> We do share the block cache.

Where? Am I looking at the wrong place?

// NewTempEngine creates a new engine for DistSQL processors to use when the
// working set is larger than can be stored in memory.
func NewTempEngine(
    tempStorage base.TempStorageConfig, storeSpec base.StoreSpec,
) (MapProvidingEngine, error) {
    if tempStorage.InMemory {
        // TODO(arjun): Limit the size of the store once #16750 is addressed.
        // Technically we do not pass any attributes to temporary store.
        return NewInMem(roachpb.Attributes{} /* attrs */, 0 /* cacheSize */), nil
    }
    rocksDBCfg := RocksDBConfig{
        Attrs: roachpb.Attributes{},
        Dir:   tempStorage.Path,
        // MaxSizeBytes doesn't matter for temp storage - it's not
        // enforced in any way.
        MaxSizeBytes:    0,
        MaxOpenFiles:    128, // TODO(arjun): Revisit this.
        UseFileRegistry: storeSpec.UseFileRegistry,
        ExtraOptions:    storeSpec.ExtraOptions,
    }
    rocksDBCache := NewRocksDBCache(0)
    rocksdb, err := NewRocksDB(rocksDBCfg, rocksDBCache)
    if err != nil {
        return nil, err
    }
    return rocksdb, nil
}

@petermattis (Collaborator)

> Where? Am I looking at the wrong place?

I'm a lying liar and you caught me! More accurately, I misremembered. We share the block cache between all of the stores on a node. I wonder if we should use that same shared block cache for the temp store. The current 0-size block cache might make random access DiskMap usage significantly slower.

@ajkr (Contributor) commented Apr 11, 2019

> @tbg's memory is good. Linux's default behavior is to allow too large an amount of dirty file data to accumulate before flushing. When the flushing does occur it can cause significant disk IO starvation for everything on the system. I can't recall off hand what to look for, but vmstat or some file in /proc will indicate the "dirty" data size. In normal CockroachDB usage, the amount stays low because we're frequently syncing the WAL and sstables (as they are written). I'd be mildly surprised if this is coming up here due to the temp-store, but definitely worth checking.

I have some low level strace data showing the tiny WAL write syncs getting stuck behind huge temp store sstable syncs. For example:

[pid 12888] 05:29:15 fdatasync(37</mnt/data1/cockroach/cockroach-temp128692840/001118.sst> <unfinished ...>
--
[pid 12892] 05:29:15 fdatasync(150</mnt/data1/cockroach/cockroach-temp128692840/001119.sst> <unfinished ...>
[pid 12881] 05:29:15 fdatasync(17</mnt/data1/cockroach/000003.log> <unfinished ...>
[pid 12888] 05:29:16 <... fdatasync resumed> ) = 0 <1.638572>
[pid 12892] 05:29:16 <... fdatasync resumed> ) = 0 <1.468135>
[pid 12881] 05:29:16 <... fdatasync resumed> ) = 0 <1.164935>

> In normal CockroachDB usage, the amount stays low because we're frequently syncing the WAL and sstables (as they are written). I'd be mildly surprised if this is coming up here due to the temp-store, but definitely worth checking.

bytes_per_sync behaves differently from how I'd expect -- it only requests asynchronous writeback each time bytes_per_sync bytes are written to an sstable. It does this using sync_file_range with the SYNC_FILE_RANGE_WRITE flag. So it's possible that no writeback has happened by the time the sstable is finished and fdatasync is called, and then it all happens at once.

That option might make more sense together with a rate limiter on background writes. That'd cause us to pause occasionally when generating large files, allowing some of the async writebacks to finish. I don't know what rate limit to pick, though. If we had some estimate of the device's max throughput, that'd be enough info for us to use the auto-tuned rate limiter.
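
For context, a small Linux-only Go sketch (my own illustration, not CockroachDB code) of the two sync_file_range modes being discussed, using golang.org/x/sys/unix:

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.OpenFile("/tmp/sync-range-demo", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    buf := make([]byte, 1<<20) // stand-in for 1 MiB of sstable data
    if _, err := f.Write(buf); err != nil {
        log.Fatal(err)
    }
    fd := int(f.Fd())

    // What bytes_per_sync does today: only *request* asynchronous writeback of
    // the range and return immediately; the pages may still be dirty when a
    // later fdatasync runs.
    if err := unix.SyncFileRange(fd, 0, int64(len(buf)), unix.SYNC_FILE_RANGE_WRITE); err != nil {
        log.Fatal(err)
    }

    // The stricter alternative: also wait for writeback of the range to finish,
    // which spreads the I/O out instead of leaving it all for the final fdatasync.
    flags := unix.SYNC_FILE_RANGE_WAIT_BEFORE | unix.SYNC_FILE_RANGE_WRITE | unix.SYNC_FILE_RANGE_WAIT_AFTER
    if err := unix.SyncFileRange(fd, 0, int64(len(buf)), flags); err != nil {
        log.Fatal(err)
    }
}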

@ajkr (Contributor) commented Apr 11, 2019

> Interesting, how can that be? They're entirely different RocksDB instances. How does that work?

In RocksDB, the Env object owns the thread pools. The Env we use is Env::Default(), which is a static object, so it is the same for everyone in the same process.

> Just curious, what was the transient issue? We're really trying to get rid of transient issues.

I think this one was caused by my home Internet connection.

@petermattis (Collaborator)

> bytes_per_sync behaves differently from how I'd expect -- it only requests asynchronous writeback each time bytes_per_sync bytes are written to an sstable. It does this using sync_file_range with the SYNC_FILE_RANGE_WRITE flag. So it's possible that no writeback has happened by the time the sstable is finished and fdatasync is called, and then it all happens at once.

Oh, interesting. I had assumed the file syncing was synchronous. I wonder if we should be passing SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER.

@petermattis (Collaborator)

> In RocksDB, the Env object owns the thread pools. The Env we use is Env::Default(), which is a static object, so it is the same for everyone in the same process.

Globals!!! I'm pretty sure that sharing is unintentional. I think the temp-store should have a separate thread pool with 1 high priority and 1 low priority thread. And sharing the thread pool across all stores on a node is definitely not what was intended. Yikes!

@tbg (Member) commented Apr 11, 2019

Yikes indeed. I'm excited by the better place we'll be in once this investigation is resolved, though.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/134478e4dde16919eb86efb81fb22d8cce8a9701

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1236263&tab=buildLog

The test failed on release-19.1:
	cluster.go:1329,clearrange.go:58,clearrange.go:37,test.go:1237: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1236263-clearrange-checks-false:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190411 18:37:18.507963 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190411 19:09:55.396960 52 ccl/workloadccl/fixture.go:516  imported bank (32m37s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@ajkr (Contributor) commented Apr 12, 2019

OK finished experimenting with these three changes:

(1) facebook/rocksdb#4460
(2) wal_bytes_per_sync = 0
(3) launching minimum of two flush threads and two compaction threads

All experiments were using COCKROACH_ENGINE_MAX_SYNC_DURATION_FATAL=true COCKROACH_ENGINE_MAX_SYNC_DURATION=3s.

With the three changes combined, the import succeeded in 3h30m11s. I suspected, though, that (3) might be the most important, so I tried a few more experiments.

  • Stripped out (2) since that looked least likely to make a difference. The import failed with an unrelated transient issue after 3h11m.
  • Stripped out both (1) and (2). The import succeeded in 2h53m45s.
  • Stripped out (1), (2), and (3) (for sanity). The import failed as expected after 10m55s.

So (3) appears to be the major differentiator. But why? What difference should it make to the regular store, which is barely taking any writes in the early stages of import, if the temp store is flushing a lot? Also, why does the import finish slower when we do fewer WAL writes to the temp store (this seems to agree with @petermattis's benchmark result)? Lots of unknowns here.

Following up on this series, we tried a fourth thing today.

(4) cockroachdb/rocksdb#29

I only ran one experiment, which contained changes (1), (2), and (4). As usual it was run with COCKROACH_ENGINE_MAX_SYNC_DURATION_FATAL=true COCKROACH_ENGINE_MAX_SYNC_DURATION=3s. The import succeeded in 2h35m4s. That's the best we've seen yet.

@petermattis (Collaborator)

Was the (1), (2), and (4) experiment run on ZFS? That's an excellent improvement. I think it would be worthwhile to make sure the result is stable and run the test multiple times (you can do this in parallel using roachtest run -c and specify a manually created cluster).

I believe you mentioned elsewhere that ext4 is able to complete the import portion of this test significantly faster. Is that result reproducible?

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237002&tab=buildLog

The test failed on release-19.1:
	test.go:1225: test timed out (6h30m0s)
	cluster.go:1688,clearrange.go:189,clearrange.go:37,test.go:1237: context canceled

@ajkr (Contributor) commented Apr 12, 2019

> Was the (1), (2), and (4) experiment run on ZFS? That's an excellent improvement. I think it would be worthwhile to make sure the result is stable and run the test multiple times (you can do this in parallel using roachtest run -c and specify a manually created cluster).

> I believe you mentioned elsewhere that ext4 is able to complete the import portion of this test significantly faster. Is that result reproducible?

That one was on ZFS. OK, I ran a few more experiments to answer your questions. Summary of all our results from yesterday and today:

  • ZFS + (1) + (2) + (4): 2h35m4s, 2h32m43s, 2h27m51s
  • ZFS + (4): 2h38m38s
  • ext4 + (1) + (2) + (4): 1h40m34s, 1h41m39s
  • ext4 + (4): 1h40m59s

So the results are pretty stable, and it looks like (1) and (2) don't make a big difference, but they also don't hurt.

I realized also we don't really have a baseline since these are all run with COCKROACH_ENGINE_MAX_SYNC_DURATION_FATAL=true COCKROACH_ENGINE_MAX_SYNC_DURATION=3s which always fails on master. Will get one.

@petermattis (Collaborator)

Glad to see that the import times are stable. This is great progress and I appreciate your diligence in testing here. I'm surprised at how large the difference is between ZFS and ext4. Looks like (4) is the important change, though I'm totally fine with (1) and (2) as well.

@ajkr (Contributor) commented Apr 12, 2019

> This is great progress and I appreciate your diligence in testing here.

Thanks, hope I didn't run up our GCE bill too much :).

> I realized also we don't really have a baseline since these are all run with COCKROACH_ENGINE_MAX_SYNC_DURATION_FATAL=true COCKROACH_ENGINE_MAX_SYNC_DURATION=3s which always fails on master. Will get one.

Got a baseline using COCKROACH_ENGINE_MAX_SYNC_DURATION_FATAL=false COCKROACH_ENGINE_MAX_SYNC_DURATION=3s on master branch. It was 3h8m15s on ZFS.

@cockroach-teamcity (Member Author): [comment minimized]

ajkr added a commit to ajkr/cockroach that referenced this issue Apr 27, 2019
Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix cockroachdb#3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in cockroachdb#34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None
craig bot pushed a commit that referenced this issue Apr 27, 2019
37172: c-deps: bump rocksdb for multiple backported PRs r=ajkr a=ajkr

Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix #3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in #34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>
@ajkr (Contributor) commented May 8, 2019

Failing on zfs snapshot, that's strange. Well, I plan to play with these tests over the next week or so, and will investigate it then if I happen to see it.

@ajkr ajkr closed this as completed May 8, 2019
ajwerner pushed a commit to ajwerner/cockroach that referenced this issue May 15, 2019
Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix cockroachdb#3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in cockroachdb#34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None