roachtest: import/tpcc/warehouses=1000/nodes=4 failed #37371

Closed
cockroach-teamcity opened this issue May 8, 2019 · 15 comments · Fixed by #38632

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279548&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279548-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1279548-import-tpcc-warehouses-1000-nodes-4 returned:
		stderr:
		
		stdout:
		Error: importing table stock: pq: unsupported storage scheme: "experimental-workload"
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone May 8, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels May 8, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8abb47a1c9795c1463183bc44e776b054bece682

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279683&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279683-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1279683-import-tpcc-warehouses-1000-nodes-4 returned:
		stderr:
		
		stdout:
		to "gs://cockroach-tmp/teamcity-1279683-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190508 09:14:27.348877 70 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1279683-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190508 09:14:27.348882 75 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1279683-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190508 09:14:27.348885 74 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1279683-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		Error: fixture table not found: teamcity-1279683-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/warehouse
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@danhhz
Contributor

danhhz commented May 8, 2019

unsupported storage scheme: "experimental-workload"

This looks like it may have been caused by #37343. Assigning @dt.

@danhhz danhhz assigned dt May 8, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281453&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281453-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281453-import-tpcc-warehouses-1000-nodes-4 returned:
		stderr:
		
		stdout:
		Error: importing table stock: pq: unsupported storage scheme: "experimental-workload"
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/048bdc163fcb470d4e749fcad482cf2671c29fb1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281656&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281656-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281656-import-tpcc-warehouses-1000-nodes-4 returned:
		stderr:
		
		stdout:
		ion=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 09:12:19.550044 41 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 09:12:19.550052 44 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 09:12:19.550057 47 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		Error: pq: gs://cockroach-tmp/teamcity-1281656-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line already contains a BACKUP-CHECKPOINT file (is another operation already in progress?)
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/979b47cb3c6cd55d0d4c142bd97cb569a1813c2a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1281674&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1474,import.go:54,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1281674-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures make tpcc --warehouses=1000 --csv-server='http://localhost:8081' --gcs-bucket-override=cockroach-tmp --gcs-prefix-override=teamcity-1281674-import-tpcc-warehouses-1000-nodes-4 returned:
		stderr:
		
		stdout:
		to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:41:34.970244 49 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:41:34.970251 48 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		I190509 08:41:34.970255 44 ccl/workloadccl/fixture.go:271  Backing order_line up to "gs://cockroach-tmp/teamcity-1281674-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/order_line"...
		Error: fixture table not found: teamcity-1281674-import-tpcc-warehouses-1000-nodes-4/tpcc/version=2.0.1,interleaved=false,seed=1,warehouses=1000/warehouse
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,import.go:57,import.go:93,test.go:1251: Goexit() was called

@yuzefovich yuzefovich mentioned this issue May 9, 2019
14 tasks
@danhhz
Contributor

danhhz commented May 9, 2019

The `unsupported storage scheme: "experimental-workload"` error was fixed by #37410, but it seems there are some other failures here as well.
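
The reports so far show three distinct error strings, so it can help to sort a saved build log by signature before digging in. The following is a minimal sketch only: `classify_failure` and its labels are hypothetical (not part of roachtest), and the patterns are just the error strings quoted in the comments above.

```shell
# Hypothetical helper: read a build log on stdin and print a label for the
# first known error signature found. The labels are invented for illustration;
# the patterns are the error strings quoted earlier in this thread.
classify_failure() {
  # Slurp stdin so every pattern is checked against the same text.
  log=$(cat)
  case "$log" in
    *'unsupported storage scheme'*) echo storage-scheme ;;
    *'fixture table not found'*) echo fixture-missing ;;
    *'already contains a BACKUP-CHECKPOINT file'*) echo backup-checkpoint ;;
    *) echo unknown ;;
  esac
}
```

For example, `classify_failure < /tmp/stress.log` would label the log written by the repro command above.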

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/699f675c73f8420802f92e46f65e6dce52abc12f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1306268&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1306268-import-tpcc-warehouses-1000-nodes-4:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_104.196.60.26_2019-05-24T07:41:03Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/db98d5fb943e0a45b3878bdf042838408e9aee40

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1308281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1308281-import-tpcc-warehouses-1000-nodes-4:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_104.196.203.131_2019-05-25T07:24:04Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/61715f0f96f519d599eec6541bbee7394d63209a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1312952&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1312952-import-tpcc-warehouses-1000-nodes-4:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.196.145.94_2019-05-29T07:58:31Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

danhhz added a commit to danhhz/cockroach that referenced this issue May 29, 2019
In cockroachdb#37726, I switched a few places to use the `workload fixtures import` in
the cockroach cli so the version string would match between the
`fixtures import` command and whatever was producing the data, but I
should have left this one alone. It's using `--csv-server` so, in
contrast to the other places I fixed in that PR, in this case the
workload binary is on both ends. (`--csv-server` causes cockroach to read
csv data from an http server inside the standalone workload command.)

This broke because `fixtures import` works with the `--csv-server` flag
on master, but that hasn't been backported to anything else, so the
workload built into the cockroach cli doesn't know about that flag for
fixtures import. My smoke tests didn't catch this because I was using a
cockroach binary built from master.

Hopefully this is the last of the fallout.

Touches cockroachdb#37371

Release note: None
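
The mismatch described in this commit message (a flag that exists in one binary's `fixtures import` but not in another's) could in principle be caught by checking the help text before choosing a binary. This is a minimal sketch under stated assumptions: `has_flag` is a hypothetical helper, and it assumes flags are listed one per line after leading whitespace, as in the "Global Flags" output quoted in the failures above.

```shell
# Hypothetical helper: succeed (exit 0) if the help text on stdin lists the
# flag given as $1 at the start of a line, after optional indentation.
# The trailing group keeps a longer flag with the same prefix from matching.
has_flag() {
  grep -qE -- "^[[:space:]]*$1([[:space:]]|=|\$)"
}
```

A possible use, assuming the subcommand accepts `--help`: `./cockroach workload fixtures import tpcc --help 2>&1 | has_flag --csv-server || echo "flag not supported by this binary"`.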
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f49f211f8fb2c2aa51182054192ebfcb9c0355f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1315180&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1315180-import-tpcc-warehouses-1000-nodes-4:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_104.196.178.130_2019-05-30T08:12:22Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/83e62d69214aaa0f7b976f764b97b0e21a41cde3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1318703&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1516,import.go:51,cluster.go:1854,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1318703-import-tpcc-warehouses-1000-nodes-4:1 -- ./cockroach workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		1)
		
		Global Flags:
		      --experimental-direct-ingestion    Use the faster, but limited and still quite experimental, IMPORT without a distributed sort
		      --files-per-node int               number of file URLs to generate per node (default 1)
		      --gcs-billing-project string       Google Cloud project to use for storage billing; required to be non-empty if the bucket is requestor pays
		      --inject-stats                     Inject pre-calculated statistics if they are available (default true)
		      --logtostderr Severity[=DEFAULT]   logs at or above this threshold go to stderr (default NONE)
		      --no-color                         disable standard error log colorization
		      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging (significantly hurts performance)
		
		Error: unknown flag: --csv-server
		Failed running "workload fixtures import tpcc"
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.243.180.44_2019-06-01T07:50:33Z: exit status 1
		: exit status 1
	cluster.go:1875,import.go:54,import.go:90,test.go:1251: Goexit() was called

craig bot pushed a commit that referenced this issue Jun 3, 2019
37915: roachtest: fix import/tpcc/warehouses=1000/nodes=4 r=tbg a=danhhz

In #37726, I switched a few places to use the `workload fixtures import` in
the cockroach cli so the version string would match between the
`fixtures import` command and whatever was producing the data, but I
should have left this one alone. It's using `--csv-server` so, in
contrast to the other places I fixed in that PR, in this case the
workload binary is on both ends. (`--csv-server` causes cockroach to read
csv data from an http server inside the standalone workload command.)

This broke because `fixtures import` works with the `--csv-server` flag
on master, but that hasn't been backported to anything else, so the
workload built into the cockroach cli doesn't know about that flag for
fixtures import. My smoke tests didn't catch this because I was using a
cockroach binary built from master.

Hopefully this is the last of the fallout.

Touches #37371

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190628 00:15:41.621304 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190628 00:15:42.477547 28 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 854.765853ms, 1.15 MiB/s)
		I190628 00:15:42.478474 27 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 855.764748ms, 0.06 MiB/s)
		I190628 00:15:47.628123 33 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 6.004216s, 1.29 MiB/s)
		I190628 00:16:47.986707 32 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m6.363093322s, 1.89 MiB/s)
		I190628 00:27:50.373140 31 ccl/workloadccl/fixture.go:396  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 12m8.749891419s, 1.86 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled
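
In the timed-out run above, the log announces an import of 9 tables but only five report completion before the process is killed. A small sketch for pulling the completed tables and their reported rates out of such a log follows. `completed_imports` is a hypothetical helper, and the pattern assumes the `imported ... in <name> table (..., <rate> MiB/s)` line shape shown above.

```shell
# Hypothetical helper: read a workload log on stdin and print
# "<table> <rate in MiB/s>" for every table that reported a finished import.
# Lines that do not match the "imported ... table (...)" shape are skipped.
completed_imports() {
  sed -n -E 's/.*imported .* in ([a-z_]+) table \(.*, ([0-9.]+) MiB\/s\).*/\1 \2/p'
}
```

Piping a saved log through `completed_imports | wc -l` gives the number of finished tables, which can be compared with the count in the `starting import of 9 tables` line.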

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190628 08:18:06.200990 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190628 08:18:07.165822 39 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 963.666799ms, 0.05 MiB/s)
		I190628 08:18:07.573902 40 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 1.371762103s, 0.72 MiB/s)
		I190628 08:18:08.183511 45 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 1.981203813s, 3.92 MiB/s)
		I190628 08:20:08.148838 44 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 2m1.946552468s, 1.03 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190630 08:12:55.760105 54 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 1.290755364s, 0.76 MiB/s)
		I190630 08:12:55.837159 53 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 1.367847201s, 0.04 MiB/s)
		I190630 08:12:56.374912 59 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 1.904229868s, 4.08 MiB/s)
		I190630 08:14:44.088575 58 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m49.619092672s, 1.15 MiB/s)
		I190630 08:34:54.733750 56 ccl/workloadccl/fixture.go:396  imported 4.3 GiB in history table (30000000 rows, 60000000 index entries, took 22m0.264379563s, 3.32 MiB/s)
		I190630 08:36:51.616125 57 ccl/workloadccl/fixture.go:396  imported 1.3 GiB in order table (30000000 rows, 30000000 index entries, took 23m57.145969724s, 0.94 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpcc/warehouses=1000/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (5h0m0s)
	cluster.go:1511,import.go:44,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-import-tpcc-warehouses-1000-nodes-4:1 -- ./workload fixtures import tpcc --warehouses=1000 --csv-server='http://localhost:8081' returned:
		stderr:
		
		stdout:
		I190701 09:55:43.121909 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 9 tables
		I190701 09:55:44.193321 51 ccl/workloadccl/fixture.go:396  imported 53 KiB in warehouse table (1000 rows, 0 index entries, took 1.070020451s, 0.05 MiB/s)
		I190701 09:55:44.662478 52 ccl/workloadccl/fixture.go:396  imported 1006 KiB in district table (10000 rows, 0 index entries, took 1.539161212s, 0.64 MiB/s)
		I190701 09:55:45.189058 57 ccl/workloadccl/fixture.go:396  imported 7.8 MiB in item table (100000 rows, 0 index entries, took 2.061658819s, 3.77 MiB/s)
		I190701 09:57:24.370456 56 ccl/workloadccl/fixture.go:396  imported 126 MiB in new_order table (9000000 rows, 0 index entries, took 1m41.246614424s, 1.24 MiB/s)
		: signal: killed
	cluster.go:1870,import.go:47,import.go:83,test.go:1249: context canceled

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
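
The leak described in the commit message above can be illustrated with a toy model. This is a minimal, single-threaded sketch and not CockroachDB's quota pool: the `quotaPool` type, the `propose` helper, the `releaseOnError` flag, and the 1 MiB capacity are all hypothetical names and values chosen for illustration. The one behavior it takes from the description is that a request larger than the pool asks for the pool's entire capacity, so a single leaked byte is enough to block it.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaPool is a toy stand-in for a per-range proposal quota pool.
// Hypothetical type for illustration only.
type quotaPool struct {
	capacity  int64 // total quota when nothing is in flight
	available int64 // quota not currently held by a proposal
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// tryAcquire clamps oversized requests to the pool's capacity, so such a
// request can only proceed when the pool is completely full. It returns the
// amount acquired, or false if a real caller would have to block.
func (p *quotaPool) tryAcquire(size int64) (int64, bool) {
	if size > p.capacity {
		size = p.capacity
	}
	if size > p.available {
		return 0, false
	}
	p.available -= size
	return size, true
}

func (p *quotaPool) release(n int64) { p.available += n }

var errInvalidTrigger = errors.New("invalid ChangeReplicasTrigger")

// propose acquires quota and then always fails, mimicking the rejected
// ChangeReplicas request. releaseOnError toggles the error-path release
// that the commit message says was lost.
func propose(p *quotaPool, size int64, releaseOnError bool) error {
	got, ok := p.tryAcquire(size)
	if !ok {
		return errors.New("would block waiting for quota")
	}
	if releaseOnError {
		p.release(got)
	}
	return errInvalidTrigger
}

func main() {
	for _, releaseOnError := range []bool{false, true} {
		p := newQuotaPool(1 << 20)        // 1 MiB, an arbitrary illustrative size
		_ = propose(p, 1, releaseOnError) // a 1-byte proposal fails
		_, ok := p.tryAcquire(32 << 20)   // oversized request, clamped to capacity
		fmt.Printf("releaseOnError=%t oversizedRequestProceeds=%t\n", releaseOnError, ok)
	}
}
```

Without the release on the error path, the pool is permanently one byte short of full, so the clamped oversized request can never be satisfied. With the release restored, the same request proceeds.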
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten
craig bot closed this as completed in #38632 on Jul 3, 2019