roachtest: clearrange/checks=false failed #78408
This one is a new failure mode. Node 1 died after being OOM-killed. This happened on the very first step of the test setup, the fixture import.
[ 742.764301] Tasks state (memory values in pages):
[ 742.764301] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 742.764304] [ 12532] 1000 12532 2154 487 53248 0 0 bash
[ 742.764306] [ 12539] 1000 12539 5685019 3471454 43855872 0 0 cockroach
[ 742.764307] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/cockroach.service,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=12539,uid=1000
[ 742.764475] Memory cgroup out of memory: Killed process 12539 (cockroach) total-vm:22740076kB, anon-rss:13885816kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:42828kB oom_score_adj:0
[ 743.536731] oom_reaper: reaped process 12539 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB |
The workers are all n1-standard-4's, which have 15GB of RAM (source). Pulling the metrics, node 1 stands out as having different behavior: |
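As a sanity check on the OOM killer output above, the task table reports memory in pages while the kill line reports kB. A minimal sketch of the conversion, assuming 4 KiB pages (the usual size on x86-64 Linux; the log itself does not state the page size):

```go
package main

import "fmt"

// pageSizeKB is an assumption: 4 KiB pages, typical for x86-64 Linux.
const pageSizeKB = 4

// pagesToKB converts a page count from the OOM killer's task table to kB.
func pagesToKB(pages int64) int64 {
	return pages * pageSizeKB
}

func main() {
	rssKB := pagesToKB(3471454)     // rss column for the cockroach process
	totalVMKB := pagesToKB(5685019) // total_vm column for the cockroach process

	// Both values can be compared against anon-rss and total-vm in the kill line.
	fmt.Printf("rss:      %d kB (%.1f GiB)\n", rssKB, float64(rssKB)/(1<<20))
	fmt.Printf("total_vm: %d kB\n", totalVMKB)
}
```

Under that assumption the page counts line up with the `anon-rss:13885816kB` and `total-vm:22740076kB` figures in the kill message, which puts the process at roughly 13.2 GiB resident on a 15GB machine.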
I see a lot of warnings about node 1 being overloaded, starting right around when memory started increasing on that node:
I220324 07:21:26.707657 309 kv/kvserver/pkg/kv/kvserver/store_raft.go:515 ⋮ [n1,s1,r226/5:‹/Table/106/1/1{32725…-95350…}›,raft] 1551 handle raft ready: 1.6s [applied=4, batches=4, state_assertions=0]; node might be overloaded
I found the following error in the logs for node 1:
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 unable to relocate range to [n9,s9 n1,s1 n7,s7]: while carrying out changes [{ADD_VOTER n9,s9} {REMOVE_VOTER n4,s4}]: change replicas of r407 failed: descriptor changed: [expected] r407:‹/Table/106/1/3{3784918-9061000}› [(n4,s4):1, (n1,s1):2, (n7,s7):3, next=4, gen=94, sticky=1648110326.133963771,0] != [actual] r407:‹/Table/106/1/33{784918-835414}› [(n4,s4):1, (n1,s1):2, (n7,s7):3, next=4, gen=95, sticky=1648110326.133963771,0]
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 +(1)
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | (opaque error wrapper)
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | reportable 0:
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + |
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).relocateReplicas
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:2851
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).AdminRelocateRange
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:2769
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:951
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithoutRangeID
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:177
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:100
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:197
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:191
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server/node.go:1006
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server/node.go:989
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/server/node.go:1058
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:554
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:209
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:191
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2060
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1608
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1210
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:831
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:222
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:968
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).send
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:951
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.sendAndFill
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:830
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:853
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv.(*DB).AdminRelocateRange
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/db.go:677
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore.func2
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:370
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:369
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).Start.func1
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:225
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | runtime.goexit
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 + | GOROOT/src/runtime/asm_amd64.s:1581
E220324 07:25:46.413081 380 kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:378 ⋮ [n1,s1,store-rebalancer] 2137 +Wraps: (2) while carrying out changes [{ADD_VOTER n9,s9} {REMOVE_VOTER n4,s4}]
Full error preserved here. |
cc: @nvanbenschoten (current KV oncall) - reckon you'd be able to help me triage this - specifically the rebalancer error? |
roachtest.clearrange/checks=false failed with artifacts on master @ 32b45c4bcf1ab41f0ba3abd36cb670eea7f450fd:
|
This reproduces pretty easily. Here's a heap profile dumped shortly before a node OOM'd in my own run.
|
cc @cockroachdb/bulk-io |
https://share.polarsignals.com/a1940fb/ from #78408 (comment) This isn't as obvious yet. @jbowens could you post some heap profiles from your local run since you probably captured at a better time. You can use |
(just trying to follow along)
|
I took a brief look at the workload side of this. The workload synthesizer uses the same common infra as all the other IMPORT "frontends" (e.g. csv, mysql, etc.), namely a rowconverter, which can be handed rows (datums) to convert to KVs that it sends on a channel to the "backend" buffer/sort/ingest part of an IMPORT. This rowconverter helper accumulates "small" batches of KVs to send on this channel rather than sending each KV individually, to reduce channel locking overhead. "Small" is currently defined as 5000 KVs. These batches are then pulled off the channel on the ingest side, where they go into a large but memory-monitored buffer (which appears to be well behaved) before being sorted, divvied into SSTables and sent.
The channel connecting the frontend and backend sides is buffered with a capacity of 10 batches. Workload starts runtime.GOMAXPROCS workers to synthesize rows, each of which has its own converter with its own batch. Thus on a 16 vCPU machine, we could have up to 16 full batches in converters plus 10 batches in the channel, so we could have (16+10)*5000 = as many as 130k KVs across various batches on their way to the ingest side.
This specific roachtest configures the workload synthesizer to make 10240 byte payload fields in its rows. So if all those fixed-size batches happen to be full at once, with entries that have this relatively large row size, we'd expect to see 130k * 10kb = ~1.3gb. Add in the key size and col-id / family encoding on top of that 10kb per pk, along with the currently draining batch and maybe a couple of drained batches that haven't been freed yet, and I can see getting to 1.5gb. That said, I would not expect this to spike much above that; it is big, and not hooked into accounting, but not unbounded that I can see.
While the long-term answer might be to plumb full memory monitoring throughout the IMPORT frontend, in the meantime I'll whip up a patch to add a second limit on batch size, a cap on aggregate key/value bytes, in addition to the 5k count, so we're less susceptible to "big" row input blowing up the batch footprint. |
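To illustrate the idea of a second, byte-based limit on batch size, here is a rough sketch. It is not the actual rowconverter code or the patch itself: the `kv` and `batcher` types, the `countBatches` helper, and the 4 MiB cap are all hypothetical names and values chosen for illustration.

```go
package main

import "fmt"

// kv is a hypothetical stand-in for an encoded key/value pair.
type kv struct {
	key, value []byte
}

// batcher accumulates KVs and flushes when either the count cap or the
// aggregate byte cap is reached, whichever comes first.
type batcher struct {
	maxCount int
	maxBytes int
	bytes    int
	buf      []kv
	out      chan<- []kv
}

func (b *batcher) add(e kv) {
	b.buf = append(b.buf, e)
	b.bytes += len(e.key) + len(e.value)
	if len(b.buf) >= b.maxCount || b.bytes >= b.maxBytes {
		b.flush()
	}
}

// flush sends the current batch, if any, and resets the accumulator.
func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.out <- b.buf
	b.buf = nil
	b.bytes = 0
}

// countBatches feeds n KVs with an 8-byte key and valueSize-byte value
// through a batcher and reports how many batches were emitted.
func countBatches(n, valueSize, maxCount, maxBytes int) int {
	out := make(chan []kv, n+1) // large enough that sends never block
	b := &batcher{maxCount: maxCount, maxBytes: maxBytes, out: out}
	for i := 0; i < n; i++ {
		b.add(kv{key: make([]byte, 8), value: make([]byte, valueSize)})
	}
	b.flush()
	close(out)
	batches := 0
	for range out {
		batches++
	}
	return batches
}

func main() {
	// Small rows: only the 5000-count cap applies.
	fmt.Println("small rows:", countBatches(5000, 16, 5000, 4<<20))
	// 10240-byte payloads: the (hypothetical) 4 MiB byte cap splits the batch.
	fmt.Println("large rows:", countBatches(5000, 10240, 5000, 4<<20))
}
```

With a byte cap in place, the per-batch footprint is bounded by roughly the cap plus one row, regardless of how large individual rows are, while small-row inputs keep the existing count-based behaviour.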
I don't think the batch size issue above explains the spike, though, or that #78945 will fix this: I suspect 1.3-1.5gb is about the biggest this would grow, and that doesn't explain what spikes and OOMs the node.
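The 1.3gb estimate can be reproduced with a few lines of arithmetic. This is a sketch using the figures quoted in the thread (16 workers, channel capacity 10, 5000 KVs per batch, 10240-byte payloads); the function names are made up for illustration, and key and encoding overhead is ignored.

```go
package main

import "fmt"

// maxInFlightKVs is the worst case where every worker's batch and every
// channel slot is full at the same time.
func maxInFlightKVs(workers, chanCap, batchSize int) int {
	return (workers + chanCap) * batchSize
}

// payloadBytes is the payload-only footprint of that many KVs.
func payloadBytes(kvs int, payloadSize int64) int64 {
	return int64(kvs) * payloadSize
}

func main() {
	kvs := maxInFlightKVs(16, 10, 5000)
	total := payloadBytes(kvs, 10240)
	fmt.Printf("%d KVs in flight, %.2f GB of payload\n", kvs, float64(total)/1e9)
}
```

Note that the worker and channel counts are summed before multiplying by the batch size; that grouping is what gives 130k KVs.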
I ran the final heap profile from each node in the failed run through pprof (via polarsignals): Node 1 is clearly different in that it has additional in-use memory tied up in a I see the same 1.5G / 70% of in-use space from the workload reader across all the nodes 2-10. Tagging onto what @sumeerbhola had mentioned, should we be looking at what's happening with |
We understand this already; start reading here: #71805 (comment). I am hoping to look into a mitigation for this "soon" (original plan was for this week, but not sure there is enough week left now; I will try to get started at least). However, even once this is mitigated, it will likely still indicate an unhealthy node. We are falling behind on applying raft commands (the node is receiving commands and queueing them, but not pulling them from the queue to hand to raft in a timely fashion), and I would guess this happens due to slowness of the storage layer. This might be worth confirming in these runs if you can, @nicktrav, to get ahead of the "next stage" of the problem. |
Perfect. I was unaware of this until now. Thanks for linking it.
I think we see the same thing in these runs. The store on the node that crashes is in poor health relative to the others. Digging into the storage metrics for the last failed run, it's clear that n1 is different (it's the timeseries that stands out :) ). Here are the same graphs for the first failed run in this ticket: The causality is tricky. Was the disk latency increased due to something else, or is the disk latency causing issues elsewhere? Read amp looked fine in both cases: |
@nicktrav officially assigned you for actuarial purposes. |
roachtest.clearrange/checks=false failed with artifacts on master @ 771432d1099e516dbc11827c5458886c176e73e3:
Same failure on other branches
|
ℹ️ Hello! I am a human and not at all a robot! Look at my very human username! 🤖 🎶 |
Sounds good, we also don't see these as often any more right now, and I'm working on #79215 (comment) to guide solutions. |
roachtest.clearrange/checks=false failed with artifacts on master @ 8b367174769c89c0fcfe50986ed68d4650be7750:
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-14116