Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: schemachange/index/tpcc/w=100 failed #36024

Closed
cockroach-teamcity opened this issue Mar 21, 2019 · 25 comments · Fixed by #38579
Closed

roachtest: schemachange/index/tpcc/w=100 failed #36024

cockroach-teamcity opened this issue Mar 21, 2019 · 25 comments · Fixed by #38579
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
	test.go:1202: test timed out (30m0s)
	cluster.go:1267,tpcc.go:142,schemachange.go:310,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1189954-schemachange-index-tpcc-w-100:5 -- ./workload check tpcc --warehouses=100 {pgurl:1} returned:
		stderr:
		
		stdout:
		I190321 13:44:27.112085 1 workload/tpcc/tpcc.go:292  check 3.3.2.1 took 9.626722644s
		I190321 13:45:00.489455 1 workload/tpcc/tpcc.go:292  check 3.3.2.2 took 33.377279093s
		I190321 13:45:11.828237 1 workload/tpcc/tpcc.go:292  check 3.3.2.3 took 11.338718794s
		I190321 13:47:44.111289 1 workload/tpcc/tpcc.go:292  check 3.3.2.4 took 2m32.282972385s
		I190321 13:47:50.255716 1 workload/tpcc/tpcc.go:292  check 3.3.2.5 took 6.144344078s
		I190321 13:48:12.868170 1 workload/tpcc/tpcc.go:292  check 3.3.2.7 took 22.612378595s
		I190321 13:48:15.409456 1 workload/tpcc/tpcc.go:292  check 3.3.2.8 took 2.541209195s
		: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 21, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 21, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog

The test failed on release-19.1:
	schemachange.go:446,schemachange.go:314,cluster.go:1605,errgroup.go:57: pq: internal error: table: 58 has lease: node_id:4 expiration_time:1553259670816931282 , expected: {1 1553259366599558815 {} 0}: the schema change lease has expired
	cluster.go:1626,tpcc.go:140,schemachange.go:310,test.go:1214: Goexit() was called

@vivekmenezes vivekmenezes assigned knz and unassigned andreimatei Mar 22, 2019
@vivekmenezes
Copy link
Contributor

#36011 needs to get into 19.1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d03a34e92d2ee558fb6aedb0709b733a1fab97f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1207666&tab=buildLog

The test failed on master:
	schemachange.go:450,schemachange.go:314,cluster.go:1631,errgroup.go:57: dial tcp 35.196.111.127:26257: connect: connection refused
	cluster.go:1652,tpcc.go:140,schemachange.go:310,test.go:1223: Goexit() was called
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1652,tpcc.go:140,schemachange.go:310,test.go:1223: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1207666-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: 3049
		2: 3885
		3: 3970
		1: dead
		Error:  1: dead

@vivekmenezes
Copy link
Contributor

an OOO failure on node 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1a5eabad4511a3371a6b2809d2bfc29e8aff66a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1224702&tab=buildLog

The test failed on master:
	schemachange.go:450,schemachange.go:314,cluster.go:1667,errgroup.go:57: pq: internal error: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "unnamed" id=613ab53c key=/Table/SystemConfigSpan/Start rw=true pri=0.00469061 stat=ABORTED epo=0 ts=1554556132.303611831,0 orig=1554556120.934602346,0 max=1554556121.434602346,0 wto=false seq=1
	cluster.go:1329,tpcc.go:130,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1224702-schemachange-index-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=15m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		  11m20s        0            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  11m20s        0            0.0            0.6      0.0      0.0      0.0      0.0 newOrder
		  11m20s        0            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  11m20s        0            1.0            0.3  45097.2  45097.2  45097.2  45097.2 payment
		  11m20s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  11m21s        0            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  11m21s        0            0.0            0.6      0.0      0.0      0.0      0.0 newOrder
		  11m21s        0            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  11m21s        0            0.0            0.3      0.0      0.0      0.0      0.0 payment
		  11m21s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1688,tpcc.go:140,schemachange.go:310,test.go:1228: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c03c307b1a4100a8a1edd2804bdfaf3903097756

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1229036&tab=buildLog

The test failed on release-19.1:
	cluster.go:1255,tpcc.go:157,tpcc.go:160,schemachange.go:310,test.go:1237: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1229036-schemachange-index-tpcc-w-100:1-4 --encrypt returned:
		stderr:
		
		stdout:
		teamcity-1229036-schemachange-index-tpcc-w-100: starting...........................................................................................................................
		3: exit status 255
		~ ./cockroach version
		Connection to 35.243.186.212 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:95
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:289
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1417
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190409 12:43:56.200632 1 cluster_synced.go:1499  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dd7c697e986fc528da7b12c6c10dcce7f64a486c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1252804&tab=buildLog

The test failed on release-19.1:
	cluster.go:1329,tpcc.go:168,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1252804-schemachange-index-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=15m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  13m41s    26287            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		  13m41s    26287            0.0            0.5      0.0      0.0      0.0      0.0 newOrder
		  13m41s    26287            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  13m41s    26287            1.0            0.4  77309.4  77309.4  77309.4  77309.4 payment
		  13m41s    26287            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  13m42s    26287            0.0            0.0      0.0      0.0      0.0      0.0 delivery
		  13m42s    26287            0.0            0.5      0.0      0.0      0.0      0.0 newOrder
		  13m42s    26287            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  13m42s    26287            0.0            0.4      0.0      0.0      0.0      0.0 payment
		  13m42s    26287            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	schemachange.go:450,schemachange.go:314,cluster.go:1667,errgroup.go:57: dial tcp 34.73.225.29:26257: connect: connection refused
	cluster.go:1688,tpcc.go:178,schemachange.go:310,test.go:1237: unexpected node event: 4: dead
	cluster.go:953,context.go:89,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:178,schemachange.go:310,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1252804-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		2: 4075
		3: 4104
		1: dead
		Error:  4: dead, 1: dead

@vivekmenezes
Copy link
Contributor

profile001.pdf

heap profile on the failure

@vivekmenezes
Copy link
Contributor

@jordanlewis any chance you can take a look at this. It looks like the kv fetcher is using a lot of memory and I do see a large number of distsql flows. Is this to be expected?

@thoszhang
Copy link
Contributor

The most recent failure looks similar to #37108 (comment), where the cluster is already in a bad state in the tpcc ramp period before any schema changes have run. The extremely high latency for the tpcc queries is a symptom:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   4m57s        0            0.0            0.5      0.0      0.0      0.0      0.0 delivery
   4m57s        0            4.0            5.0  26843.5 103079.2 103079.2 103079.2 newOrder
   4m57s        0            0.0            0.6      0.0      0.0      0.0      0.0 orderStatus
   4m57s        0            1.0            4.6  47244.6  47244.6  47244.6  47244.6 payment
   4m57s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
   4m58s        0            2.0            0.5  32212.3  36507.2  36507.2  36507.2 delivery
   4m58s        0            5.0            5.0  53687.1 103079.2 103079.2 103079.2 newOrder
   4m58s        0            0.0            0.6      0.0      0.0      0.0      0.0 orderStatus
   4m58s        0            0.0            4.6      0.0      0.0      0.0      0.0 payment
   4m58s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
   4m59s        0            0.0            0.5      0.0      0.0      0.0      0.0 delivery
   4m59s        0            7.0            5.0  47244.6 103079.2 103079.2 103079.2 newOrder
   4m59s        0            3.0            0.6   5368.7   7516.2   7516.2   7516.2 orderStatus
   4m59s        0            0.0            4.6      0.0      0.0      0.0      0.0 payment
   4m59s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel
    5m0s        0            1.0            0.5  66572.0  66572.0  66572.0  66572.0 delivery
    5m0s        0            3.0            5.0  73014.4 103079.2 103079.2 103079.2 newOrder
    5m0s        0            2.0            0.6   5905.6   7247.8   7247.8   7247.8 orderStatus
    5m0s        0            1.0            4.6  10737.4  10737.4  10737.4  10737.4 payment
    5m0s        0            0.0            0.1      0.0      0.0      0.0      0.0 stockLevel

There are also lots of warnings in the logs about everything being slow: heartbeats, raft processing, etc.

@thoszhang
Copy link
Contributor

Also, the heap profile Vivek posted looks extremely similar to the heap profile posted for that same issue, in #37108 (comment).

craig bot pushed a commit that referenced this issue Apr 29, 2019
36744: storage/rangefeed: use fine-grained locking around Processor, add metrics r=nvanbenschoten a=nvanbenschoten

It is critical for correctness that operations like providing the processor
with logical ops or informing the processor of closed timestamp updates
be properly synchronized via the raftMu. This ensures that events published
on the processors eventC accurately reflect the order of events in the Raft
log.

However, it is not critical that lifecycle events around starting a
Rangefeed processor, stopping a Rangefeed processor, and registering
with a Rangefeed processor be synchronized with Raft application. This
change exploits this to break up the locking around rangefeed.Processor.

Using more fine-grained locking opens up the opportunity to interact
with the rangefeed Processor without needing to synchronize with the
Range's raft application, which can be very expensive. The next commit
uses this improvement to add a new `rangefeed_registrations` metric to
`RangeInfo`.

36999: roachtest: add schemachange/bulkingest test r=lucy-zhang a=lucy-zhang

Add a test to index the random `payload` column for the `bulkingest` workload,
to test the index backfiller for an index on randomly ordered values relative
to the primary key.

Release note: None

37169: opt: don't use delegateQuery for ShowZoneConfig r=RaduBerinde a=RaduBerinde

Part of a larger work item to remove `delegateQuery`, as it doesn't
work with the optimizer.

I could not use the new delegate infrastructure because it doesn't support hiding columns returned by the delegated query.

Release note: None

37182: roachtest: provision more cpu for schemachange tpcc tests r=vivekmenezes a=vivekmenezes

related to 
#36094 
#36024 
#36321 

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Vivek Menezes <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dcd4cc5e37ebbcebbbf0f01670811b176a58bf89

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1833,tpcc.go:178,schemachange.go:305,test.go:1251: unexpected node event: 3: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1833,tpcc.go:178,schemachange.go:305,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1288281-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		3: dead
		4: 3324
		2: 4197
		1: 4113
		Error:  3: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7b2651400b2003d0a381cba9dbfc0b7bc0dfee00

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1293898&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:157,tpcc.go:159,schemachange.go:305,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1293898-schemachange-index-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		t-status reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3272, received 4668 bytes, in 350.6 seconds
		Bytes per second: sent 9.3, received 13.3
		debug1: Exit status 1
		: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/923a3b2a6f4a6492883141092280d1041de1381a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1295056&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:157,tpcc.go:159,schemachange.go:305,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1295056-schemachange-index-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		t-status reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3344, received 4700 bytes, in 409.5 seconds
		Bytes per second: sent 8.2, received 11.5
		debug1: Exit status 1
		: exit status 1
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cab299a0ef983f8b4ffe5d724e44587d9665d3a3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1295811&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:157,tpcc.go:159,schemachange.go:305,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1295811-schemachange-index-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		-status reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3272, received 4676 bytes, in 312.9 seconds
		Bytes per second: sent 10.5, received 14.9
		debug1: Exit status 1
		: exit status 1
		: exit status 1

@knz knz removed their assignment May 20, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/58c567a325056033b326cb9c4ed9ba490e8956da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1296592&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:157,tpcc.go:159,schemachange.go:305,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1296592-schemachange-index-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		 open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 98
		debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3236, received 4648 bytes, in 323.4 seconds
		Bytes per second: sent 10.0, received 14.4
		debug1: Exit status 1
		: exit status 1
		: exit status 1

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Copy link
Member

Previous two issues addressed by #37701.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5f358ed804af05f8c4b404efc4d8a282d8e0916c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1361643&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1361643-schemachange-index-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=15m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		    8.3      0.0      0.0      0.0      0.0 delivery
		  12m47s   177934            0.0           82.6      0.0      0.0      0.0      0.0 newOrder
		  12m47s   177934            0.0            8.3      0.0      0.0      0.0      0.0 orderStatus
		  12m47s   177934            0.0           82.2      0.0      0.0      0.0      0.0 payment
		  12m47s   177934            0.0            8.3      0.0      0.0      0.0      0.0 stockLevel
		E190627 02:18:30.729951 1 workload/cli/run.go:427  error in newOrder: dial tcp 10.142.0.55:26257: connect: connection refused
		  12m48s   233955            0.0            8.3      0.0      0.0      0.0      0.0 delivery
		  12m48s   233955            0.0           82.5      0.0      0.0      0.0      0.0 newOrder
		  12m48s   233955            0.0            8.3      0.0      0.0      0.0      0.0 orderStatus
		  12m48s   233955            0.0           82.1      0.0      0.0      0.0      0.0 payment
		  12m48s   233955            0.0            8.3      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: unexpected node event: 3: dead
	cluster.go:1033,context.go:122,cluster.go:1022,asm_amd64.s:522,panic.go:397,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1361643-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 2: 3561
		4: 3328
		3: dead
		1: 3420
		5: skipped
		Error:  3: dead

@nvanbenschoten
Copy link
Member

Duplicate of #36094 (comment).

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-schemachange-index-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=15m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		un.go:427  error in newOrder: dial tcp 10.128.0.58:26257: connect: connection refused
		   1m10s    27330            1.0            7.9    104.9    104.9    104.9    104.9 delivery
		   1m10s    27330            0.0           75.0      0.0      0.0      0.0      0.0 newOrder
		   1m10s    27330            0.0            7.7      0.0      0.0      0.0      0.0 orderStatus
		   1m10s    27330            6.0           72.7   1811.9  11811.2  11811.2  11811.2 payment
		   1m10s    27330            0.0            7.9      0.0      0.0      0.0      0.0 stockLevel
		   1m11s    27330            0.0            7.8      0.0      0.0      0.0      0.0 delivery
		   1m11s    27330            0.0           73.9      0.0      0.0      0.0      0.0 newOrder
		   1m11s    27330            0.0            7.6      0.0      0.0      0.0      0.0 orderStatus
		   1m11s    27330            0.0           71.7      0.0      0.0      0.0      0.0 payment
		   1m11s    27330            0.0            7.8      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: unexpected node event: 2: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1364443-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: 3407
		1: 3509
		2: dead
		3: 3877
		Error:  2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-schemachange-index-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=15m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		0      0.0      0.0 orderStatus
		  10m52s   113290            0.0           84.3      0.0      0.0      0.0      0.0 payment
		  10m52s   113290            0.0            8.5      0.0      0.0      0.0      0.0 stockLevel
		E190628 19:33:35.310787 1 workload/cli/run.go:427  error in newOrder: dial tcp 10.142.0.58:26257: connect: connection refused
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  10m53s   153374            0.0            8.5      0.0      0.0      0.0      0.0 delivery
		  10m53s   153374            0.0           84.5      0.0      0.0      0.0      0.0 newOrder
		  10m53s   153374            0.0            8.5      0.0      0.0      0.0      0.0 orderStatus
		  10m53s   153374            0.0           84.1      0.0      0.0      0.0      0.0 payment
		  10m53s   153374            0.0            8.4      0.0      0.0      0.0      0.0 stockLevel
		E190628 19:33:36.310829 1 workload/cli/run.go:427  error in newOrder: dial tcp 10.142.0.58:26257: connect: connection refused
		: signal: killed
	cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: unexpected node event: 3: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,schemachange.go:300,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1364821-schemachange-index-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3376
		3: dead
		4: 3412
		1: 4102
		Error:  3: dead

@nvanbenschoten
Copy link
Member

F190628 10:22:27.042386 1387476 kv/txn_interceptor_heartbeater.go:360  [n2,txn-hb=00000000] txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" id=6e173f63 key=/Table/54/1/21/1/-2144/0 rw=true pri=0.02183043 stat=ABORTED epo=1 ts=1561717344.000511142,2 orig=1561717344.000511142,2 max=1561717342.603306298,0 wto=false seq=0

For both of the above.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=schemachange/index/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (45m0s)
	cluster.go:1511,tpcc.go:156,tpcc.go:158,schemachange.go:300,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-schemachange-index-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190630 19:12:48.596110 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190630 19:12:54.966404 33 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 6.369998204s (100000 rows, 0 index entries, 1.2 MiB)
		I190630 19:13:07.121366 27 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 18.524838618s (100 rows, 0 index entries, 284 B)
		I190630 19:13:22.298783 28 ccl/workloadccl/fixture.go:476  loaded 99 KiB table district in 33.702466893s (1000 rows, 0 index entries, 2.9 KiB)
		: signal: killed

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 30, 2019
Fixes cockroachdb#36024.
Fixes cockroachdb#36094.

8b5bafb ensured that all transaction state was propagated by DistSender on
errors. In doing so, it touched that fact that DistSender drops all but the
first error that it sees. It ensured that even though this was the case, the
error metadata from these dropped errors would still be propagated (see
`pErr.UpdateTxn(resp.pErr.GetTxn())`).

This has an unintended consequence where it was now possible for a non-aborting
transaction retry error to be updated with an ABORTED transaction proto. This
caused confusion in the TxnCoordSender, triggering panics like we see in cockroachdb#36024
and cockroachdb#36094.

This change fixes this by being smarter about which errors get dropped when
concurrent partial batches each hit an error in DistSender. It does this by
prioritizing the most severe errors and merging transaction state into those.
In a lot of ways, this is the DistSender equivalent of 574e805, which is why
they now share code.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 1, 2019
Fixes cockroachdb#36024.
Fixes cockroachdb#36094.

8b5bafb ensured that all transaction state was propagated by DistSender on
errors. In doing so, it touched that fact that DistSender drops all but the
first error that it sees. It ensured that even though this was the case, the
error metadata from these dropped errors would still be propagated (see
`pErr.UpdateTxn(resp.pErr.GetTxn())`).

This has an unintended consequence where it was now possible for a non-aborting
transaction retry error to be updated with an ABORTED transaction proto. This
caused confusion in the TxnCoordSender, triggering panics like we see in cockroachdb#36024
and cockroachdb#36094.

This change fixes this by being smarter about which errors get dropped when
concurrent partial batches each hit an error in DistSender. It does this by
prioritizing the most severe errors and merging transaction state into those.
In a lot of ways, this is the DistSender equivalent of 574e805, which is why
they now share code.

Release note: None
craig bot pushed a commit that referenced this issue Jul 1, 2019
38579: kv: prioritize severe errors when merging partial batches in DistSender r=andreimatei a=nvanbenschoten

Fixes #36024.
Fixes #36094.

8b5bafb ensured that all transaction state was propagated by `DistSender` on errors. In doing so, it touched that fact that `DistSender` drops all but the first error that it sees. It ensured that even though this was the case, the error metadata from these dropped errors would still be propagated (see `pErr.UpdateTxn(resp.pErr.GetTxn())`).

This has an unintended consequence where it was now possible for a non-aborting transaction retry error to be updated with an ABORTED transaction proto. This caused confusion in the `TxnCoordSender`, triggering panics like the ones we see in #36024 and #36094.

This change fixes this by being smarter about which errors get dropped when concurrent partial batches each hit an error in `DistSender`. It does this by prioritizing the most severe errors and merging transaction state into those. In a lot of ways, this is the `DistSender` equivalent of 574e805, which is why they now share code.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #38579 Jul 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants