roachtest: perturbation/full/backfill failed #137093
@andrewbaptist I took a stab at this based on the debug guide you wrote. This failure is of the type where the latency during the perturbation period has exceeded the test passing criteria (significantly so), e.g.:
From the logs, there are clear signs that
From Grafana, the only thing that stands out to me is elevated pmax latency on the non-perturbed nodes during the perturbation interval, up to 10s.
From the statement bundles, there are plenty of requests that took over 40ms, but only 2 were in the multiple-seconds range. Both of them seem to involve
From some of the other bundles, there are other slow parts, also involving
But this is expected given the perturbation, right? What am I missing?
@miraradeva Thanks for digging into this. The goal of this test is to demonstrate that backfills do not cause overload on the cluster, so in this case it is correctly failing. The thing to track down is why those 2s requests took so long. I took a look at one of them and I see this:
Which is a very long time to be waiting between those two steps, since they are in-memory. Potentially this request was waiting on a latch, or the scheduler was overloaded. Looking at the graph around the time of the failure (14:38:40), it seems like only one node, n12, is slow. It also has >2s log commit latency. Since this is running with the default AC mode (apply to elastic), we expected it to prevent this. I checked the CPU and goroutine load on this node and didn't see anything unusually high, which made me suspect a lock was held for 2s on this replica (r6268). I also checked if there was anything unusual on this node:
This suggests that n12 was the correct leaseholder and there were not any recent range operations (lease transfers or split/merges). Looking at this time in the n12 logs, we see a lot of messages related to the node being overloaded from 14:38:02 until 14:38:40, such as these:
I looked at the disk metrics and see very high IOPS and write bandwidth on this node. I'm going to try to dig into what happens between those lines that might touch disk (specifically either a read or a syncing write). I don't think there are any writes.
The system has relatively low CPU load (30%) and low P99.9 Go scheduler latency (~1ms). The disk IO is high for both reads and writes, but that normally shouldn't impact raft scheduler latency. The part that is confusing to me is why the P99.9 raft scheduler latency (raft_scheduler_latency) would be 2s while the P99 is only 3-4ms. If the scheduler was overloaded and had a queue, then I'd expect the P99 and P99.9 to move more in sync and be closer to each other (a small illustration of this gap is sketched at the end of this comment). It makes me think one of two things is happening:
Looking through the code, I don't see any replica-level locks, and I think we have 128 shards for a 16 vCPU system, which implies that it is most likely one shard that is slow. However, looking at the slow range IDs, they don't correspond cleanly with a multiple of 16 (or 15), which is what I'd expect if it was a slow shard (a rough way to check this is sketched below). I'll keep this open to see if this recurs.
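As a rough way to sanity-check the single-slow-shard hypothesis, here is a small standalone Go sketch. It assumes the scheduler assigns a range to a shard via `rangeID % numShards`, which is an assumption about the implementation rather than something confirmed here, and the slow range IDs below are hypothetical placeholders:

```go
// shard_check.go: a rough sketch (assuming shard = rangeID % numShards, which
// is an assumption about the scheduler, not verified here) for checking
// whether a set of slow ranges would all land on the same scheduler shard.
package main

import "fmt"

func main() {
	const numShards = 128 // per the comment above: 128 shards on a 16 vCPU node

	// Hypothetical slow range IDs; replace with the ones pulled from the logs.
	slowRanges := []int{6268, 6412, 7031, 9156}

	byShard := make(map[int][]int)
	for _, r := range slowRanges {
		byShard[r%numShards] = append(byShard[r%numShards], r)
	}
	for shard, ranges := range byShard {
		fmt.Printf("shard %3d: ranges %v\n", shard, ranges)
	}
	// If a single shard were the bottleneck, the slow ranges should collapse
	// into one bucket here; a spread across many buckets suggests the
	// slowness is not confined to one scheduler shard.
}
```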
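And to make the earlier P99 vs P99.9 point concrete: a handful of multi-second stalls in an otherwise fast latency distribution moves the P99.9 without moving the P99. A minimal, self-contained Go sketch (not CockroachDB code; the sample counts are invented purely for illustration):

```go
// percentile_gap.go: illustrative only. Shows how ~20 stalled events of ~2s
// among 10,000 fast events push P99.9 to seconds while P99 stays at a few ms.
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[int(p*float64(len(sorted)-1))]
}

func main() {
	rng := rand.New(rand.NewSource(1))
	var samples []time.Duration
	// 10,000 "normal" events: a few milliseconds of scheduler latency each.
	for i := 0; i < 10000; i++ {
		samples = append(samples, time.Duration(1+rng.Intn(4))*time.Millisecond)
	}
	// A handful of stalled events, e.g. stuck ~2s behind a blocked shard or latch.
	for i := 0; i < 20; i++ {
		samples = append(samples, 2*time.Second)
	}
	fmt.Println("p99  :", percentile(samples, 0.99))  // stays in the ms range
	fmt.Println("p99.9:", percentile(samples, 0.999)) // jumps to ~2s
}
```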
@ajwerner I was thinking the same thing. Having the ability to look at this situation with side-eye could really help in a case like this. Depending on what the root issue turns out to be, some changes that could help are:
roachtest.perturbation/full/backfill failed with artifacts on master @ a653b4e4e6483cec7d65808ba4d55d8c63747a6e:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 01f9e61532862cbbdbc64180013ca1cb57f4a017:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ d58f071217505b226a097d39f87072f4e1bd8e06:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ a0faa4e779ac9e54d17e8400141b0aa472887974:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 6e405b0552362bad6f0df5261376193fbf8b98cd:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 57aab736c34ce5dc7988bd53e0604fde48cef441:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 3ce8f44d1e033036783687e3c7ccb125d8de100b:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:
Parameters:
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes, such as backfill failing to complete and node OOMs. Once those are closed and the test is running more stably, this threshold can be dropped.
Fixes: cockroachdb#137093
Fixes: cockroachdb#137392
Informs: cockroachdb#133114
Release note: None
138688: server: fix admin server Settings RPC redaction logic r=kyle-a-wong a=kyle-a-wong
Previously, admin.Settings only allowed admins to view all cluster settings without redaction. If the requester was not an admin, it would use the isReportable field on settings to determine whether the setting should be redacted. This API also had outdated logic, as users with MODIFYCLUSTERSETTINGS should also be able to view all cluster settings (see #115356 for more discussion on this). This patch respects this new role and no longer uses the `isReportable` setting flag to determine if a setting should be redacted. This is implemented by querying `crdb_internal.cluster_settings` directly, allowing the SQL layer to do the permission check. This commit also removes `unredacted_values` from the request entity, since it is no longer necessary. Ultimately, this commit updates the Settings RPC to have the same redaction logic as querying `crdb_internal.cluster_settings` or using `SHOW CLUSTER SETTINGS`.
Epic: None
Fixes: #137698
Release note (general change): The /_admin/v1/settings API now returns cluster settings using the same redaction logic as querying `SHOW CLUSTER SETTINGS` and `crdb_internal.cluster_settings`. This means that only settings flagged as "sensitive" will be redacted; all other settings will be visible. The same authorization is required for this endpoint, meaning the user must be an admin or have the MODIFYCLUSTERSETTINGS or VIEWCLUSTERSETTINGS role to hit this API. The exception is that if the user has VIEWACTIVITY or VIEWACTIVITYREDACTED, they will see console-only settings.

138967: crosscluster/physical: return job id in SHOW TENANT WITH REPLICATION STATUS r=dt a=msbutler
Fixes #138548
Release note (sql change): SHOW TENANT WITH REPLICATION STATUS will now display the `ingestion_job_id` column after the `name` column.

139043: crosscluster/logical: ensure offline scan procs shut down before next phase r=dt a=msbutler
This patch adds a check that attempts to wait for the offline scan processors to spin down before transitioning to steady state ingestion or OnFailOrCancel during an offline scan.
Epic: none
Release note: none

139219: roachtest: disable backfill success check r=stevendanna a=andrewbaptist
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes, such as backfill failing to complete and node OOMs. Once those are closed and the test is running more stably, this threshold can be dropped.
Fixes: #137093
Fixes: #137392
Informs: #133114
Release note: None

139259: sql: deflake TestIndexBackfillFractionTracking r=rafiss a=rafiss
Recent changes added some concurrency to index backfills, so the testing hook needs a mutex to prevent concurrent access.
Fixes #139213
Release note: None

Co-authored-by: Kyle Wong <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
roachtest.perturbation/full/backfill failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:
Parameters:
acMode=defaultOption
arch=amd64
blockSize=4096
cloud=gce
coverageBuild=false
cpu=16
diskBandwidthLimit=0
disks=2
encrypted=false
fillDuration=10m0s
fs=ext4
leaseType=epoch
localSSD=true
mem=standard
numNodes=12
numWorkloadNodes=1
perturbationDuration=10m0s
ratioOfMax=0.5
runtimeAssertionsBuild=false
seed=0
splits=10000
ssd=2
validationDuration=5m0s
vcpu=16
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Jira issue: CRDB-45381