
roachtest: perturbation/full/backfill failed #137093

Closed
cockroach-teamcity opened this issue Dec 10, 2024 · 17 comments · Fixed by #139219
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 10, 2024

roachtest.perturbation/full/backfill failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:671
	            				pkg/cmd/roachtest/test_runner.go:1322
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 529429.2047 > 40.0000 BASE: 6.799776ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 599969.5016 > 40.0000 BASE: 6.000305ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 193048.0812 > 40.0000 BASE: 18.648204ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-45381

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Dec 10, 2024
@miraradeva
Contributor

@andrewbaptist I took a stab at this based on the debug guide you wrote.

This failure is of the type where the latency during the perturbation period exceeded the test's passing criteria (significantly so, e.g. 599969.5016 > 40.0000).

From the logs, there are clear signs that n12 is overloaded (e.g. raft ready handling: 0.61s, slow non-blocking raft commit, raft receive queue for r128 is full). On other nodes, there are some messages about latches held for over 3-4s. I don't know what the backed-up data here looks like, but there are some really huge keys being logged as part of the slow-latch logs. There are some failed splits due to changed descriptors as well.

From Grafana, the only thing that stands out to me is elevated pmax latency on the non-perturbed nodes during the perturbation interval, up to 10s.

From the statement bundles, there are plenty of requests that took over 40ms, but only two were in the multiple-second range. Both of them seem to involve n12 in the slow parts. E.g.:

  2004.700ms   2002.464ms                                    event:kv/kvserver/replica_proposal_buf.go:575 [n12,s24,r6268/8:/Table/109/1/919{0171…-2015…}] flushing proposal to Raft

From some of the other bundles, there are other slow parts, also involving n12:

370.409ms    369.193ms                            event:kv/kvserver/replica_evaluate.go:511 [n12,s24,r2865/7:/Table/109/1/8{78899…-80743…}] evaluated Get command header:<kvnemesis_seq:<> key:"\365\211\375\0148`\204@\342\221\256\210" > , txn="sql txn" meta={id=e05e18ec key=/Min iso=Serializable pri=0.02753469 epo=0 ts=1733841188.128916000,0 min=1733841188.128916000,0 seq=0} lock=false stat=PENDING rts=1733841188.128916000,0 wto=false gul=1733841188.128916000,0 : resp=num_keys:1 num_bytes:4104 , err=<nil>

But this is expected given the perturbation, right? What am I missing?

@andrewbaptist
Collaborator

@miraradeva Thanks for digging into this. The goal of this test is to demonstrate that backfills do not cause overload on the cluster, so in this case it is correctly failing.

The thing to track down is why those 2s requests took so long. I took a look at one of them and I see this:

     2.237ms      0.016ms                                    event:kv/kvserver/replica_raft.go:439 [n12,s24,r6268/8:/Table/109/1/919{0171…-2015…}] submitting proposal to proposal buffer
  2004.700ms   2002.464ms                                    event:kv/kvserver/replica_proposal_buf.go:575 [n12,s24,r6268/8:/Table/109/1/919{0171…-2015…}] flushing proposal to Raft

That is a very long time to wait between those two steps, since they are in-memory operations. Potentially this request was waiting on a latch, or the scheduler was overloaded.

Looking at the graph around the time of the failure (14:38:40), it seems like only one node, n12, is slow:
[screenshot: per-node latency graph]

It also has >2s log commit latency.

Since this is running with the default AC mode (apply to elastic), we expected admission control to prevent this.

I checked the CPU and goroutine load on this node and didn't see anything unusually high, which made me suspect a lock was held for 2s on this replica (r6268).

I also checked if there was anything unusual on this node:

logs % grep -r -h r6268 . | sort | uniq
...
I241210 14:13:27.520898 633 2@kv/kvserver/replica_proposal.go:624 ⋮ [T1,Vsystem,n12,s24,r6268/8:‹/Table/109/1/919{0171…-2015…}›,raft] 439  new range lease repl=(n12,s24):8VOTER_INCOMING seq=27 start=1733840007.518633723,0 epo=1 min-exp=1733840013.518598700,0 pro=1733840007.519516598,0 acq=Request promoted from repl=(n12,s24):8VOTER_INCOMING seq=27 start=1733840007.518633723,0 exp=1733840013.518598700,0 pro=1733840007.518598700,0 acq=Transfer
I241210 14:13:27.526741 377394 13@kv/kvserver/replica_raft.go:379 ⋮ [T1,Vsystem,n12,s24,r6268/8:‹/Table/109/1/919{0171…-2015…}›] 1247  proposing LEAVE_JOINT: after=[(n8,s16):5 (n4,s7):7 (n10,s20):6LEARNER (n12,s24):8] next=9
I241210 14:13:27.528213 482395 13@kv/kvserver/replica_command.go:2503 ⋮ [T1,Vsystem,n10,replicate,s20,r6268/6:‹/Table/109/1/919{0171…-2015…}›] 39927  change replicas (add [] remove [(n10,s20):6LEARNER]): existing descriptor r6268:‹/Table/109/1/919{0171217604007220-2015707562382336}› [(n8,s16):5, (n4,s7):7, (n10,s20):6LEARNER, (n12,s24):8, next=9, gen=126, sticky=9223372036.854775807,2147483647]
I241210 14:13:27.535222 376289 13@kv/kvserver/replica_raft.go:379 ⋮ [T1,Vsystem,n12,s24,r6268/8:‹/Table/109/1/919{0171…-2015…}›] 1248  proposing SIMPLE(r6) [(n10,s20):6LEARNER]: after=[(n8,s16):5 (n4,s7):7 (n12,s24):8] next=9
I241210 14:13:27.536279 594 kv/kvserver/store_remove_replica.go:152 ⋮ [T1,Vsystem,n10,s20,r6268/6:‹/Table/109/1/919{0171…-2015…}›,raft] 7271  removing replica r6268/6
I241210 14:13:27.575502 377264 kv/kvserver/replica_command.go:2433 ⋮ [T1,Vsystem,n12,replicate,s24,r6268/8:‹/Table/109/1/919{0171…-2015…}›] 9695  we were trying to exit a joint config but found that we are no longer in one; skipping
I241210 14:13:27.575638 377264 kv/kvserver/replica_command.go:2449 ⋮ [T1,Vsystem,n12,replicate,s24,r6268/8:‹/Table/109/1/919{0171…-2015…}›] 9696  skipping learner removal because it was already removed

This suggests that n12 was the correct leaseholder and there were not any recent range operations (lease transfers or split/merges).

Looking at this time in the n12 logs, we see a lot of messages related to the node being overloaded from 14:38:02 until 14:38:40, such as these:

I241210 14:38:04.302551 714 kv/kvserver/replica_raft.go:1782 ⋮ [T1,Vsystem,n12,s24,r7333/7:‹/Table/109/1/-24{6147…-5962…}›,raft] 15378  slow non-blocking raft commit: commit-wait 581.860666ms sem 231ns
I241210 14:38:04.344539 636 kv/kvserver/queue.go:616 ⋮ [T1,Vsystem,n12,s24,r4400/7:‹/Table/109/1/691{2226…-4070…}›,raft] 15379  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
I241210 14:38:04.370956 638 kv/kvserver/store_raft.go:692 ⋮ [T1,Vsystem,n12,s24,r10256/3:‹/Table/106/2/-{55971…-34681…}›,raft] 15380  raft ready handling: 1.70s [append=1.70s, apply=0.00s, non-blocking-sync=0.00s, other=0.00s], wrote [append-batch=20 KiB, append-sst=16 MiB (1), ]; node might be overloaded

I looked at the disk metrics and see very high IOPS and write bandwidth on this node:
[screenshot: disk IOPS and write bandwidth on n12]

I'm going to try to dig into what happens between those lines that might touch disk (specifically either a read or a syncing write). I don't think there are any writes.

@andrewbaptist
Collaborator

There was a spike in the raft scheduler latency around this time:

[screenshot: raft scheduler latency spike]

Although it doesn't exactly line up with the 14:38:02 - 14:38:40 spike (it's both before and after it).

@andrewbaptist
Collaborator

From what I can tell, I don't think this test has failed since #131768 on Oct 2nd, and looking at the graphs, it's been pretty well behaved for the past couple of months. I'm going to assign myself to this for now and watch whether other perturbation tests fail in similar ways.

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 11, 2024
@andrewbaptist
Collaborator

The system has relatively low CPU load (30%) and low P99.9 Go scheduler latency (~1ms). The disk IO is high for both reads and writes, but that normally shouldn't impact raft scheduler latency.

The part that is confusing to me is why the P99.9 raft scheduler latency (raft_scheduler_latency) would be 2s while the P99 is only 3-4ms. If the scheduler were overloaded and had a queue, I'd expect the P99 and P99.9 to move more in sync and be closer to each other.

It makes me think one of two things is happening:

  • One of the scheduler workers is way behind during this interval and can’t process these replicas.
  • Some ranges have a replica level lock held and can’t be scheduled.

Looking through the code, I don't see any replica-level locks, and I think we have 128 shards for a 16 vCPU system, which implies that it is most likely one shard that is slow. However, looking at the slow range IDs, they don't correspond cleanly to a multiple of 16 (or 15), which is what I'd expect if it were a slow shard. I'll keep this open to see if it re-occurs.
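One way to sanity-check the single-slow-shard hypothesis: assuming (this mapping is an assumption for illustration, not verified against the CockroachDB source) that the scheduler assigns a range to a shard via rangeID modulo the shard count, all ranges stuck behind one slow shard would share the same residue mod 128. A quick sketch using the range IDs that showed up slow in this thread:

```go
package main

import "fmt"

// numShards is the assumed raft scheduler shard count for this node
// size; the rangeID % numShards mapping below is an illustrative
// assumption, not the verified CockroachDB implementation.
const numShards = 128

func shardFor(rangeID int) int {
	return rangeID % numShards
}

func main() {
	// Range IDs from the slow-request logs in this thread.
	for _, r := range []int{6268, 2865, 7333, 4400, 10256} {
		fmt.Printf("r%d -> shard %d\n", r, shardFor(r))
	}
	// Under this model the ranges land on shards 124, 49, 37, 48, 16:
	// no single shared shard, consistent with the observation above.
}
```

Under that assumed mapping, the slow ranges scatter across distinct shards, which would argue against a single stuck shard worker.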

@andrewbaptist andrewbaptist self-assigned this Dec 11, 2024
@andrewbaptist
Collaborator

@ajwerner I was thinking the same thing. Having the ability to look at this situation with side-eye could really help in a case like this. Depending on what the root issue turns out to be, there are a few changes that could help.

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ a653b4e4e6483cec7d65808ba4d55d8c63747a6e:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:671
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 580283.2330 > 40.0000 BASE: 6.203867ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 541453.7763 > 40.0000 BASE: 6.648767ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 360258.5936 > 40.0000 BASE: 9.992822ms SCORE: 1h0m0s
	            	
	            	FAILURE: follower-read  : Increase 580283.2330 > 40.0000 BASE: 6.203867ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 541453.7763 > 40.0000 BASE: 6.648767ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 360258.5936 > 40.0000 BASE: 9.992822ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: write          : Increase 44.3583 > 40.0000 BASE: 9.107319ms SCORE: 403.98543ms
	            	
	            	FAILURE: write          : Increase 54.7614 > 40.0000 BASE: 9.107319ms SCORE: 498.729869ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 01f9e61532862cbbdbc64180013ca1cb57f4a017:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 629635.0321 > 40.0000 BASE: 5.717598ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 627756.1328 > 40.0000 BASE: 5.734711ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 427745.8282 > 40.0000 BASE: 8.416213ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ d58f071217505b226a097d39f87072f4e1bd8e06:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 556567.7702 > 40.0000 BASE: 6.468215ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 572014.8412 > 40.0000 BASE: 6.293543ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 212246.7682 > 40.0000 BASE: 16.961389ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ a0faa4e779ac9e54d17e8400141b0aa472887974:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 63.1325 > 40.0000 BASE: 7.701883ms SCORE: 486.238941ms
	            	
	            	FAILURE: read           : Increase 69.1431 > 40.0000 BASE: 7.531747ms SCORE: 520.768658ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 602662.8659 > 40.0000 BASE: 5.973489ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 606072.4418 > 40.0000 BASE: 5.939884ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 406699.5619 > 40.0000 BASE: 8.851743ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 6e405b0552362bad6f0df5261376193fbf8b98cd:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 583147.5226 > 40.0000 BASE: 6.173395ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 584894.7994 > 40.0000 BASE: 6.154953ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 153885.4341 > 40.0000 BASE: 23.394027ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 57aab736c34ce5dc7988bd53e0604fde48cef441:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 562463.7035 > 40.0000 BASE: 6.400413ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 560943.2947 > 40.0000 BASE: 6.417761ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 287851.0313 > 40.0000 BASE: 12.506469ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 454613.0151 > 40.0000 BASE: 7.918823ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 469180.7544 > 40.0000 BASE: 7.672949ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 215059.4675 > 40.0000 BASE: 16.739556ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ 3ce8f44d1e033036783687e3c7ccb125d8de100b:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1315
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: write          : Increase 43.8486 > 40.0000 BASE: 10.273903ms SCORE: 450.496644ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

@cockroach-teamcity
Member Author

roachtest.perturbation/full/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1315
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/backfill
	Messages:   	FAILURE: follower-read  : Increase 511091.0303 > 40.0000 BASE: 7.043755ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 559763.3694 > 40.0000 BASE: 6.431289ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 415671.6526 > 40.0000 BASE: 8.660682ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/backfill/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Jan 16, 2025
The perturbation/*/backfill tests are flaky and are failing at least
once a week with the default configuration. This change temporarily
disables the check to allow easier investigation of the other failure
modes such as backfill failing to complete and node OOMs. Once those are
closed, and the test is running more stably, this threshold can be
dropped.

Fixes: cockroachdb#137093
Fixes: cockroachdb#137392

Informs: cockroachdb#133114

Release note: None
craig bot pushed a commit that referenced this issue Jan 16, 2025
138688: server: fix admin server Settings RPC redaction logic r=kyle-a-wong a=kyle-a-wong

Previously, admin.Settings only allowed admins to view all cluster settings without redaction. If the requester was not an admin, it would use the isReportable field on settings to determine whether a setting should be redacted. This API also had outdated logic, as users with the MODIFYCLUSTERSETTINGS privilege should also be able to view all cluster settings (see #115356 for more discussion on this).

This patch respects this new role and no longer uses the `isReportable` setting flag to determine whether a setting should be redacted. This is implemented by querying `crdb_internal.cluster_settings` directly, allowing the SQL layer to perform the permission check.

This commit also removes the `unredacted_values` field from the request entity, since it is no longer necessary.

Ultimately, this commit updates the Settings RPC to have the same redaction logic as querying `crdb_internal.cluster_settings` or using `SHOW CLUSTER SETTINGS`.

Epic: None
Fixes: #137698
Release note (general change): The /_admin/v1/settings API now returns cluster settings using the same redaction logic as querying `SHOW CLUSTER SETTINGS` and `crdb_internal.cluster_settings`. This means that only settings flagged as "sensitive" will be redacted; all other settings will be visible. The same authorization is required for this endpoint, meaning the user must be an admin or have the MODIFYCLUSTERSETTINGS or VIEWCLUSTERSETTINGS privilege to hit this API. The exception is that if the user has VIEWACTIVITY or VIEWACTIVITYREDACTED, they will see console-only settings.

138967: crosscluster/physical: return job id in SHOW TENANT WITH REPLICATION STATUS r=dt a=msbutler

Fixes #138548

Release note (sql change): SHOW TENANT WITH REPLICATION STATUS will now display the `ingestion_job_id` column after the `name` column.

139043: crosscluster/logical: ensure offline scan procs shut down before next phase r=dt a=msbutler

This patch adds a check that attempts to wait for the offline scan processors to spin down before transitioning to steady state ingestion or OnFailOrCancel during an offline scan.

Epic: none

Release note: none

139219: roachtest: disable backfill success check r=stevendanna a=andrewbaptist

The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes such as backfill failing to complete and node OOMs. Once those are closed, and the test is running more stably, this threshold can be dropped.

Fixes: #137093
Fixes: #137392

Informs: #133114

Release note: None

139259: sql: deflake TestIndexBackfillFractionTracking r=rafiss a=rafiss

Recent changes added some concurrency to index backfills, so the testing hook needs a mutex to prevent concurrent access.

fixes #139213
Release note: None

Co-authored-by: Kyle Wong <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 1a6bbf3 Jan 16, 2025