roachtest: import/tpcc/warehouses=1000/nodes=32 failed #34180

Closed
cockroach-teamcity opened this issue Jan 23, 2019 · 25 comments · Fixed by #38632
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/db8ee1384c46bcbece589dd60288dd151ad4bbb4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1105113&tab=buildLog

The test failed on release-2.1:
	test.go:743,cluster.go:1195,import.go:53,cluster.go:1533,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1105113-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1105113-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		Error: creating backup for table order: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1
	test.go:743,cluster.go:1554,import.go:56,import.go:66: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 23, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 23, 2019
@tbg
Member

tbg commented Jan 23, 2019

cc @mjibson

@maddyblue
Contributor

The error message "creating backup for table" appears in `MakeFixture` in fixture.go, which is invoked before IMPORT. Any ideas about what causes that error?

`./workload fixtures make tpcc --warehouses=%d --csv-server='http://localhost:8081' `+

@tbg
Member

tbg commented Jan 24, 2019

It comes from this IMPORT:

fmt.Fprintf(&buf, `IMPORT TABLE "%s"."%s" %s CSV DATA (`, dbName, table.Name, table.Schema)
// Generate $1,...,$N-1, where N is the number of csv paths.
for _, path := range paths {
	params = append(params, path)
	if len(params) != 1 {
		buf.WriteString(`,`)
	}
	fmt.Fprintf(&buf, `$%d`, len(params))
}
buf.WriteString(`) WITH nullif='NULL'`)
if len(output) > 0 {
	params = append(params, output)
	fmt.Fprintf(&buf, `, transform=$%d`, len(params))
}
_, err := sqlDB.Exec(buf.String(), params...)
return err
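The snippet relies on state defined elsewhere in fixture.go (`sqlDB`, `table`, `paths`, `output`). As a rough illustration only, here is a standalone Go sketch that lifts just the statement-building part into a hypothetical `buildImportStmt` helper; the database name, table, schema, and CSV paths in `main` are made-up placeholders. It shows that each CSV path gets its own positional placeholder and that a non-empty `output` adds one more placeholder for the `transform` option.

```go
package main

import (
	"fmt"
	"strings"
)

// buildImportStmt is a hypothetical, self-contained restatement of the
// snippet above. It numbers one placeholder per CSV path and, if a
// transform output location is given, appends one more for it.
func buildImportStmt(dbName, tableName, schema string, paths []string, output string) (string, []interface{}) {
	var buf strings.Builder
	var params []interface{}
	fmt.Fprintf(&buf, `IMPORT TABLE "%s"."%s" %s CSV DATA (`, dbName, tableName, schema)
	for _, path := range paths {
		params = append(params, path)
		if len(params) != 1 {
			buf.WriteString(`,`) // separate placeholders after the first
		}
		fmt.Fprintf(&buf, `$%d`, len(params))
	}
	buf.WriteString(`) WITH nullif='NULL'`)
	if len(output) > 0 {
		// The transform destination takes the next free placeholder.
		params = append(params, output)
		fmt.Fprintf(&buf, `, transform=$%d`, len(params))
	}
	return buf.String(), params
}

func main() {
	// Placeholder inputs for illustration only.
	stmt, params := buildImportStmt("tpcc", "warehouse", "(w_id INT PRIMARY KEY)",
		[]string{"a.csv", "b.csv"}, "")
	fmt.Println(stmt)
	fmt.Println(len(params))
}
```

With two paths and no `output`, the statement ends in `CSV DATA ($1,$2) WITH nullif='NULL'` and two parameters are passed to `Exec`.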

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8179cd9efec890f1ba063488c7a502a96b8241dc

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1119877&tab=buildLog

The test failed on release-2.1:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1119877-import-tpcc-warehouses-1000-nodes-32 -n 32 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		-ephemeral/zones/us-central1-b/instances/teamcity-1119877-import-tpcc-warehouses-1000-nodes-32-0011].
		Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1119877-import-tpcc-warehouses-1000-nodes-32-0016].
		Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1119877-import-tpcc-warehouses-1000-nodes-32-0020].
		Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1119877-import-tpcc-warehouses-1000-nodes-32-0021].
		Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1119877-import-tpcc-warehouses-1000-nodes-32-0026].
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  '(resource type:pd-ssd)'.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	cluster.go:1226,import.go:53,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1160394-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1160394-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		4s, 9000000 rows, 0 index entries, 126 MiB)
		I190304 09:30:35.176584 31 ccl/workloadccl/fixture.go:498  imported history (4m19s, 30000000 rows, 60000000 index entries, 3.8 GiB)
		I190304 09:31:37.163904 33 ccl/workloadccl/fixture.go:498  imported order (5m21s, 30000000 rows, 60000000 index entries, 1.8 GiB)
		I190304 09:33:47.067293 29 ccl/workloadccl/fixture.go:498  imported customer (7m31s, 30000000 rows, 30000000 index entries, 17 GiB)
		I190304 09:37:35.560967 89 ccl/workloadccl/fixture.go:498  imported order_line (11m19s, 300013709 rows, 300013709 index entries, 23 GiB)
		I190304 09:39:14.437086 87 ccl/workloadccl/fixture.go:498  imported stock (12m58s, 0 rows, 0 index entries, 0 B)
		Error: creating backup for table stock: pq: write to google cloud: Post https://www.googleapis.com/upload/storage/v1/b/cockroach-tmp/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable: status code 500 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
		Error:  exit status 1
		: exit status 1
	cluster.go:1585,import.go:56,import.go:66,test.go:1211: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/23f9707873abbd2de91a42055535529d7ff296ce

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1209900&tab=buildLog

The test failed on release-19.1:
	cluster.go:1293,import.go:53,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1209900-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1209900-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		q: internal error: uncaught error: Post https://www.googleapis.com/upload/storage/v1/b/cockroach-tmp/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable: status code 500 trying to fetch http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
		write to google cloud
		github.com/cockroachdb/cockroach/pkg/ccl/storageccl.(*gcsStorage).WriteFile
			/go/src/github.com/cockroachdb/cockroach/pkg/ccl/storageccl/export_storage.go:797
		github.com/cockroachdb/cockroach/pkg/ccl/importccl.(*sstWriter).Run.func1.2
			/go/src/github.com/cockroachdb/cockroach/pkg/ccl/importccl/sst_writer_proc.go:208
		github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
			/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:170
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333
		Error:  exit status 1
		: exit status 1
	cluster.go:1652,import.go:56,import.go:66,test.go:1223: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d3f704f839ccaef7f10c3af48c78a26d390ae1dc

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1241436&tab=buildLog

The test failed on master:
	cluster.go:1107,import.go:34,import.go:66,test.go:1237: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1241436-import-tpcc-warehouses-1000-nodes-32 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload returned:
		stderr:
		
		stdout:
		teamcity-1241436-import-tpcc-warehouses-1000-nodes-32: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload
		.............................................................................................................................
		   1: done
		   2: done
		   3: done
		   4: done
		   5: done
		   6: done
		   7: done
		   8: done
		   9: done
		  10: done
		  11: done
		  12: done
		  13: done
		  14: done
		  15: done
		  16: done
		  17: done
		  18: done
		  19: done
		  20: done
		  21: done
		  22: done
		  23: done
		  24: done
		  25: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine [email protected]:./workload [email protected]:./workload
		Connection to 35.243.147.97 closed by remote host.
		: exit status 1
		  26: done
		  27: done
		  28: done
		  29: done
		  30: done
		  31: done
		  32: done
		I190415 07:28:04.201198 1 cluster_synced.go:962  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog

The test failed on release-19.1:
	cluster.go:1349,import.go:37,import.go:93,test.go:1245: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1259219-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		log;  export ROACHPROD=3 && GOTRACEBACK=crash COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-central1,zone=us-central1-b --join=35.222.51.68:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1441
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190424 18:55:17.539639 1 cluster_synced.go:1523  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279548&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279548-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1279548-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		Error: importing table customer: pq: unsupported storage scheme: "experimental-workload"
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@maddyblue maddyblue assigned dt and unassigned maddyblue May 8, 2019
@dt
Member

dt commented May 8, 2019

Huh, the change I just merged shouldn't have changed whether or not the scheme is experimental-workload, only what it does under the hood when it is (skipping CSV), so I don't think it caused this? Did 2.1 not have the workload export storage?
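For context on the release-2.1 error: the message suggests that an external-storage URL is dispatched on its scheme and that a build which does not know a scheme rejects it before doing any work. The sketch below is hypothetical and does not reproduce CockroachDB's actual storage registry; `schemeSet`, `checkScheme`, the scheme lists, and the URL path are all assumptions. It only models the idea that an older binary lacking `experimental-workload` would fail with the same message seen in the logs, regardless of how newer code handles that scheme internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// schemeSet is a hypothetical stand-in for the storage schemes a given
// server build recognizes. It is NOT CockroachDB's real registry.
type schemeSet map[string]bool

// checkScheme parses rawURL and rejects any scheme missing from supported.
func checkScheme(supported schemeSet, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	if !supported[u.Scheme] {
		return fmt.Errorf("unsupported storage scheme: %q", u.Scheme)
	}
	return nil
}

func main() {
	// Assumed scheme lists for an "older" and a "newer" build.
	older := schemeSet{"gs": true, "nodelocal": true}
	newer := schemeSet{"gs": true, "nodelocal": true, "experimental-workload": true}
	const u = "experimental-workload:///hypothetical/path"
	fmt.Println(checkScheme(older, u)) // rejected by the older build
	fmt.Println(checkScheme(newer, u)) // accepted by the newer build
}
```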

@tbg
Member

tbg commented May 8, 2019

The code is this:

runImportTPCC := func(ctx context.Context, t *test, c *cluster, warehouses int) {
	c.Put(ctx, cockroach, "./cockroach")
	c.Put(ctx, workload, "./workload")
	t.Status("starting csv servers")
	c.Start(ctx, t)
	c.Run(ctx, c.All(), `./workload csv-server --port=8081 &> logs/workload-csv-server.log < /dev/null &`)
	t.Status("running workload")
	m := newMonitor(ctx, c)
	dul := NewDiskUsageLogger(c)
	m.Go(dul.Runner)
	hc := NewHealthChecker(c, c.All())
	m.Go(hc.Runner)
	m.Go(func(ctx context.Context) error {
		defer dul.Done()
		defer hc.Done()
		cmd := fmt.Sprintf(
			`./workload fixtures make tpcc --warehouses=%d --csv-server='http://localhost:8081' `+
				`--gcs-bucket-override=%s --gcs-prefix-override=%s`,
			warehouses, gcsTestBucket, c.name)
		c.Run(ctx, c.Node(1), cmd)
		return nil
	})
	m.Wait()
}

This looks like it should continue to work with 2.1. Something must have gotten broken.
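One way to check the reading that the harness itself is version-independent is to run the `fmt.Sprintf` call from `runImportTPCC` on its own. The helper name `fixturesMakeCmd` is hypothetical; the bucket `cockroach-tmp` and the cluster name are taken from the release-2.1 failure log for build 1279548, on the assumption that they correspond to `gcsTestBucket` and `c.name`.

```go
package main

import "fmt"

// fixturesMakeCmd (hypothetical name) reproduces the fmt.Sprintf call in
// runImportTPCC and returns the command roachtest runs on node 1.
func fixturesMakeCmd(warehouses int, bucket, prefix string) string {
	return fmt.Sprintf(
		`./workload fixtures make tpcc --warehouses=%d --csv-server='http://localhost:8081' `+
			`--gcs-bucket-override=%s --gcs-prefix-override=%s`,
		warehouses, bucket, prefix)
}

func main() {
	// Values assumed from the release-2.1 failure log.
	fmt.Println(fixturesMakeCmd(1000, "cockroach-tmp",
		"teamcity-1279548-import-tpcc-warehouses-1000-nodes-32"))
}
```

The printed string matches the command line in that failure character for character, which is consistent with the invocation being identical on every branch and the breakage lying in how the `workload` and `cockroach` binaries handle it.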

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281453&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281453-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281453-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		Error: importing table district: pq: unsupported storage scheme: "experimental-workload"
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/979b47cb3c6cd55d0d4c142bd97cb569a1813c2a

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281674&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281674-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281674-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		/cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 07:44:00.000999 101 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 07:44:00.001024 104 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 07:44:00.001317 100 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		Error: fixture table not found: teamcity-1281674-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/warehouse
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/048bdc163fcb470d4e749fcad482cf2671c29fb1

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281656&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281656-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281656-import-tpcc-warehouses-1000-nodes-32 returned:
		stderr:
		
		stdout:
		2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:02:23.139300 87 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:02:23.139306 86 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:02:23.139312 49 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		Error: pq: gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-32/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called
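In this log, three goroutines (87, 86 and 49) report backing up `order_line` to the same `gs://` destination within microseconds of each other, just before the `BACKUP-CHECKPOINT` error; the release-19.1 run above shows the same pattern (goroutines 101, 104 and 100). One plausible reading, not confirmed anywhere in this thread, is that more than one backup was started for the same table and a destination-level checkpoint check let only the first proceed. The Go sketch below models only that hypothesis with a hypothetical in-memory `memStore`; it is not how BACKUP is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is a hypothetical in-memory stand-in for external storage. It
// only models the "refuse if a checkpoint already exists" check named in
// the error message; it is not CockroachDB's BACKUP implementation.
type memStore struct {
	mu    sync.Mutex
	files map[string]bool
}

// claimCheckpoint atomically creates dest/BACKUP-CHECKPOINT, failing if it exists.
func (s *memStore) claimCheckpoint(dest string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := dest + "/BACKUP-CHECKPOINT"
	if s.files[key] {
		return fmt.Errorf("%s already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)", dest)
	}
	s.files[key] = true
	return nil
}

// raceBackups starts n goroutines that all target the same destination
// and returns how many of them failed the checkpoint check.
func raceBackups(n int, dest string) (failures int) {
	s := &memStore{files: map[string]bool{}}
	var wg sync.WaitGroup
	var mu sync.Mutex // guards failures
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := s.claimCheckpoint(dest); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failures
}

func main() {
	// Hypothetical destination; three concurrent writers, as in the log.
	fmt.Println(raceBackups(3, "gs://bucket/order_line"))
}
```

Racing three writers against one destination yields exactly one success and two failures, which matches the shape of the failure above if the hypothesis holds.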

@yuzefovich yuzefovich mentioned this issue May 9, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/699f675c73f8420802f92e46f65e6dce52abc12f

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306268&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1306268-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		t 1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.92.72_2019-05-24T07:37:02Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/db98d5fb943e0a45b3878bdf042838408e9aee40

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308281-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		 1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.227.95.41_2019-05-25T07:20:19Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c280de40c2bcab93c41fe82bef8353a5ecd95ac4

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1311970&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1592,cluster.go:1611,cluster.go:1725,restore.go:107,import.go:41,cluster.go:1854,errgroup.go:57: exit status 1
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1311970-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		9.0981779s, 0.07 MiB/s)
		I190528 08:39:10.964461 70 ccl/workloadccl/fixture.go:395  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m21.817564994s, 0.89 MiB/s)
		I190528 08:42:01.469835 68 ccl/workloadccl/fixture.go:395  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 5m12.323261875s, 14.01 MiB/s)
		I190528 08:42:55.871589 69 ccl/workloadccl/fixture.go:395  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 6m6.724840385s, 3.69 MiB/s)
		I190528 08:43:15.490479 67 ccl/workloadccl/fixture.go:395  imported 17 GiB in customer table (30000000 rows, 30000000 index entries, took 6m26.344144922s, 45.71 MiB/s)
		I190528 08:46:07.392951 72 ccl/workloadccl/fixture.go:395  imported 31 GiB in stock table (100000000 rows, 100000000 index entries, took 9m18.245761775s, 57.46 MiB/s)
		I190528 08:47:43.348621 73 ccl/workloadccl/fixture.go:395  imported 23 GiB in order_line table (300011520 rows, 300011520 index entries, took 10m54.20158647s, 35.72 MiB/s)
		: signal: killed
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/61715f0f96f519d599eec6541bbee7394d63209a

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1312952&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1312952-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		t 1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.73.23.17_2019-05-29T07:54:34Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f49f211f8fb2c2aa51182054192ebfcb9c0355f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1315180&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1315180-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		 1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.231.64.19_2019-05-30T08:10:41Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/83e62d69214aaa0f7b976f764b97b0e21a41cde3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1318703&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1318703-import-tpcc-warehouses-1000-nodes-32:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.231.106.90_2019-06-01T07:46:49Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8892e379d84a36b29003420189edd1e10db41d71

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1329974&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1513,import.go:46,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1329974-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		54862s, 37.79 MiB/s)
		I190607 21:33:09.396466 1 ccl/workloadccl/cliccl/fixtures.go:338  imported 77 GiB in 9 tables (took 10m44.646812414s, 122.50 MiB/s)
		I190607 21:33:09.397122 1 ccl/workloadccl/cliccl/fixtures.go:343  fixture is imported; now running consistency checks (ctrl-c to abort)
		I190607 21:33:09.479484 1 workload/tpcc/tpcc.go:287  check 3.3.2.1 took 82.239447ms
		I190607 21:33:38.459935 1 workload/tpcc/tpcc.go:287  check 3.3.2.2 took 28.980355809s
		I190607 21:33:41.241334 1 workload/tpcc/tpcc.go:287  check 3.3.2.3 took 2.781340641s
		I190607 21:34:59.246072 1 workload/tpcc/tpcc.go:287  check 3.3.2.4 took 1m18.004666894s
		I190607 21:35:36.085751 1 workload/tpcc/tpcc.go:287  check 3.3.2.5 took 36.839621299s
		I190607 21:37:42.499452 1 workload/tpcc/tpcc.go:287  check 3.3.2.7 took 2m6.413641811s
		Error: check failed: 3.3.2.7: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.226.115.52_2019-06-07T21:22:23Z: exit status 1
		: exit status 1
	cluster.go:1872,import.go:49,import.go:85,test.go:1248: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190628 00:07:12.775399 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190628 00:07:38.912957 50 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 26.124700558s, 0.30 MiB/s)
		I190628 00:07:39.383554 26 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 26.606459208s, 0.00 MiB/s)
		I190628 00:07:46.884389 27 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 34.107143072s, 0.03 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190628 08:08:45.684376 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190628 08:09:11.008352 42 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 25.32204039s, 0.31 MiB/s)
		I190628 08:09:11.657524 36 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 25.972097847s, 0.00 MiB/s)
		I190628 08:09:12.516488 37 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 26.830950172s, 0.04 MiB/s)
		I190628 08:10:25.696596 41 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m40.0104791s, 1.26 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190630 08:09:44.506331 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190630 08:10:16.275876 16 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 31.764823542s, 0.00 MiB/s)
		I190630 08:10:16.688401 66 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 32.176984713s, 0.03 MiB/s)
		I190630 08:10:17.204672 35 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 32.688643811s, 0.24 MiB/s)
		I190630 08:12:19.028756 70 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m34.51864673s, 0.81 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-import-tpcc-warehouses-1000-nodes-32:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190701 08:22:54.828627 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190701 08:23:36.695313 24 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 41.86140184s, 0.02 MiB/s)
		I190701 08:23:36.777307 23 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 41.944047055s, 0.00 MiB/s)
		I190701 08:23:52.707727 29 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 57.874609137s, 0.13 MiB/s)
		I190701 08:24:34.839997 28 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m40.006870127s, 1.26 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led me to find the lost release in this
error case.
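The reasoning above boils down to an invariant: quota is only supposed to be held by in-flight proposals, so when a leaseholder has no pending commands, every byte of proposal quota should be back in the pool. A minimal Go sketch of that check, assuming the two numbers from the range debug page are available; `rangeQuotaState` and `leakedQuota` are hypothetical names for illustration, not CockroachDB types or functions.

```go
package main

import "fmt"

// rangeQuotaState is a hypothetical snapshot of what the range debug page
// shows for a leaseholder: pending Raft commands and proposal quota.
type rangeQuotaState struct {
	PendingCommands int
	QuotaCapacity   int64
	QuotaAvailable  int64
}

// leakedQuota reports how many bytes of quota are unaccounted for.
// While proposals are in flight they may legitimately hold quota, so this
// sketch only reports a leak when there are no pending commands.
func leakedQuota(s rangeQuotaState) int64 {
	if s.PendingCommands > 0 {
		return 0
	}
	return s.QuotaCapacity - s.QuotaAvailable
}

func main() {
	healthy := rangeQuotaState{PendingCommands: 0, QuotaCapacity: 1 << 20, QuotaAvailable: 1 << 20}
	stuck := rangeQuotaState{PendingCommands: 0, QuotaCapacity: 1 << 20, QuotaAvailable: 0}
	fmt.Println(leakedQuota(healthy)) // 0
	fmt.Println(leakedQuota(stuck))   // 1048576
}
```

The capacity values here are illustrative only; the point is that "no pending commands" plus "less than full quota available" cannot happen unless some code path acquired quota and never released it.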

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led me to find the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.
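To make the "a single leaked byte stalls the range" argument concrete, here is a minimal Go sketch of a quota pool that caps oversized requests at its capacity, as the text describes for AddSSTRequests. The `quotaPool` type, `acquisitionSize`, `tryAcquire`, and the 1 MiB capacity are all hypothetical; this is not CockroachDB's quota pool, and a real pool would block rather than return `false`.

```go
package main

import (
	"fmt"
	"sync"
)

// quotaPool is a hypothetical, simplified model of a range's proposal
// quota pool, used only to illustrate the stall.
type quotaPool struct {
	mu        sync.Mutex
	capacity  int64 // immutable after construction
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// acquisitionSize caps a request at the pool's capacity, so a request
// larger than the pool (such as a big AddSSTRequest) asks for every byte.
func (p *quotaPool) acquisitionSize(requested int64) int64 {
	if requested > p.capacity {
		return p.capacity
	}
	return requested
}

// tryAcquire takes quota without blocking and reports whether it succeeded.
// Returning false stands in for "would wait", which with a leak is forever.
func (p *quotaPool) tryAcquire(requested int64) (int64, bool) {
	n := p.acquisitionSize(requested)
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.available < n {
		return 0, false
	}
	p.available -= n
	return n, true
}

// release returns previously acquired quota to the pool.
func (p *quotaPool) release(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.available += n
}

func main() {
	p := newQuotaPool(1 << 20) // 1 MiB, an illustrative capacity

	// Leak one byte: acquire it and never release it.
	p.tryAcquire(1)

	// Small requests still fit and are released normally.
	if n, ok := p.tryAcquire(4 << 10); ok {
		fmt.Println("4 KiB request acquired", n, "bytes") // 4 KiB request acquired 4096 bytes
		p.release(n)
	}

	// A 32 MiB request is capped at 1 MiB, but only 1 MiB - 1 byte is
	// available, so it can never be satisfied.
	_, ok := p.tryAcquire(32 << 20)
	fmt.Println("large request acquired:", ok) // large request acquired: false
}
```

Small writes keep succeeding, which is why a leak like this can go unnoticed until an import issues its first full-capacity request.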

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led me to find the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #38632 Jul 3, 2019