roachtest: import/tpcc/warehouses=4000/geo failed #71050
Comments
Node 4 was OOM-killed. What's interesting is that prior to being killed there were messages from store_send saying that SST ingestions had been delayed by 35 minutes, and for almost an hour before that the logs show intermittent warnings from KV about unavailable ranges. Something appears to have gone wrong at the KV/storage level, and enough requests queued up that the node was OOM-killed?
This cluster looks sad almost from the get-go: just two minutes into the run, we start seeing replica_write.go's log line about …
Going to look at this one tomorrow morning, moving to GA while I do.
Looking at this with
Memory usage is interesting; it looks like it's blowing up only towards the end. Whatever is going on on n4 is also happening to n6 (though n6 survives). There's also a spike in disk read IOPS. The overload dashboard doesn't show anything I would consider of interest. The import is still going on at that point (all the results are …), so I'm not sure what would've triggered this behavior change. KV is indeed unhappy throughout the import, but I suspect we will see that every time we run this test (?).
Well well well, this is interesting. Thanks to the heap profiler (🎖️ to @knz) we have heap profiles leading right up to the crash, inspected via variations on … (**edit: this is on n6, not n4 - my bad, still interesting**)
We have a burst of profiles at 11:06 and a previous burst at 11:04. They all show the same thing and look basically identical to the last one (11:06:44).
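For context on how such profiles can be produced at all: Go's standard library can dump a heap profile on demand. The sketch below is only the generic runtime/pprof equivalent (with an arbitrary file path), not CockroachDB's own automatic heap profiler.

```go
package sketch

import (
	"os"
	"runtime/pprof"
)

// dumpHeapProfile writes the current heap profile to the given path.
// CockroachDB's built-in heap profiler captures profiles automatically when
// memory usage spikes; this is only the standard-library equivalent, shown
// for illustration.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=0 writes the binary pprof format, which can be inspected with
	// `go tool pprof <binary> <profile>`.
	return pprof.Lookup("heap").WriteTo(f, 0)
}
```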
Scouring the logs for interesting things happening just after 11:05:30 (right before the spike). This is interesting:
This is very very interesting and obviously relevant:
Uhm, 2493464442 bytes is >2.3GiB? First of all, that's crazy, and second of all, how did this get past the cluster setting …
Ok, I have an idea: cockroach/pkg/kv/kvserver/raft_transport.go, lines 489 to 506 (at 4304289).
This code definitely lets us build batches of any size. What's worse, once it does that, it also doesn't properly release the memory (e.g. line 506 doesn't zero out the slice). That definitely seems pretty bad. I'll whip up a PR.
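To make the suspected failure mode concrete, here is a simplified, hypothetical sketch of the pattern (stand-in types and names, not the actual raft_transport.go code): messages are appended to a reused batch slice with no size cap, and the slice is only truncated, never zeroed, before reuse, so the backing array pins every payload it has ever held.

```go
package sketch

// Simplified stand-ins for the real types; this sketches the pattern only,
// not the code at the raft_transport.go lines linked above.
type raftMessage struct {
	payload []byte // a single message can be tens of MiB
}

type msgBatch struct {
	requests []raftMessage
}

// drainQueueBuggy mirrors the problematic shape described above: every queued
// message is folded into one batch regardless of cumulative size, and after
// sending, the slice is truncated with [:0] but its elements are not cleared.
// The backing array therefore keeps references to every payload it has ever
// held, so none of that memory can be garbage collected.
func drainQueueBuggy(queue <-chan raftMessage, send func(*msgBatch)) {
	b := &msgBatch{}
	for msg := range queue {
		b.requests = append(b.requests, msg) // no cap on total batch size
		if len(queue) == 0 {                 // queue momentarily drained: flush
			send(b)
			b.requests = b.requests[:0] // bug: old elements stay referenced
		}
	}
}
```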
Another few choice quotes from the logs:
34 minutes spent queuing on the semaphore behind this cluster setting: cockroach/pkg/kv/kvserver/store.go, lines 126 to 132 (at 5b38e1e).
Makes you wonder.
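For illustration only, a hypothetical sketch of how a semaphore sized by a cluster setting can end up queueing ingestions for a very long time. The names and shape below are assumptions and do not reproduce the store.go code linked above.

```go
package sketch

import (
	"context"
	"time"
)

// ingestLimiter is a hypothetical concurrency gate; its capacity would come
// from the relevant cluster setting in the real code.
type ingestLimiter struct {
	slots chan struct{}
}

func newIngestLimiter(maxConcurrent int) *ingestLimiter {
	return &ingestLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// acquire blocks until a slot frees up. With a long backlog of SST
// ingestions, a single request can wait here for a long time, which is
// consistent with the "delayed by 34 minutes" log lines quoted above.
func (l *ingestLimiter) acquire(ctx context.Context) (release func(), waited time.Duration, err error) {
	start := time.Now()
	select {
	case l.slots <- struct{}{}:
		return func() { <-l.slots }, time.Since(start), nil
	case <-ctx.Done():
		return nil, time.Since(start), ctx.Err()
	}
}
```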
I was wondering why n6's metrics didn't recover, but now I am realizing it also went belly-up. However, it did so while the debug.zip was being pulled. The last log message is dated 11:06:55, so it really didn't last much longer than n4. Had it managed to handle the large raft command batch, my next question would've been why its RSS didn't come down. But it crashed, so the question is moot; on the recv side, the large allocation would've gone out of scope and been GC'ed.
Standard Go error - we were trying to avoid allocations by recycling a slice but weren't zeroing it out before. The occasional long slice that would reference a ton of memory would then effectively keep that large amount of memory alive forever. Touches cockroachdb#71050. Release note: None
In cockroachdb#71050, we saw evidence of very large (2.3+GiB) Raft messages being sent on the stream, which overwhelmed both the sender and the receiver. Raft messages are batched up before sending and so what must have happened here is that a large number of reasonably-sized messages (up to 64MiB in this case due to the max_size setting) were merged into a giant blob. As of this commit, we apply the max_size chunking on the batching step before sending messages as well. Closes cockroachdb#71050. Release note: None
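As a hedged sketch of what such a fix looks like (reusing the stand-in raftMessage/msgBatch types from the earlier sketch; the 64 MiB constant is an assumption standing in for the setting's value, not the actual patch), the batching loop flushes at a target size and zeroes the slice before reuse:

```go
package sketch

// targetBatchBytes is an assumed value standing in for the relevant max-size
// cluster setting; the real code derives the limit from a setting, which this
// sketch does not reproduce.
const targetBatchBytes = 64 << 20 // 64 MiB

// drainQueueChunked shows the shape of the fix described in the commit
// message above: flush once the accumulated payload size reaches the target,
// so many 64MiB-class messages can no longer be merged into a multi-GiB blob,
// and clear the slice elements after each send so the payloads become
// eligible for garbage collection.
func drainQueueChunked(queue <-chan raftMessage, send func(*msgBatch)) {
	b := &msgBatch{}
	size := 0
	flush := func() {
		if len(b.requests) == 0 {
			return
		}
		send(b)
		for i := range b.requests {
			b.requests[i] = raftMessage{} // drop references so the GC can reclaim payloads
		}
		b.requests = b.requests[:0]
		size = 0
	}
	for msg := range queue {
		b.requests = append(b.requests, msg)
		size += len(msg.payload)
		if size >= targetBatchBytes || len(queue) == 0 {
			flush()
		}
	}
	flush() // flush any remainder once the queue is closed
}
```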
roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ 82e8782453b1cf14460d93b6bf8328a7b2964575:
Same failure on other branches
Haven't investigated the above failure (about to head out on vacation), but it could well be an instance of the same issue, which would then be fixed by #71132 as well. However, a question remains whether we're now more likely than before to create lots of large proposals, and whether that's something that should be looked into.
roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ 3cf3b0ea3b08cb24e4d6b84c6c237b856ce6b411:
Same failure on other branches
71132: kvserver: apply a limit to outgoing raft msg batching r=erikgrinaker a=tbg In #71050, we saw evidence of very large (2.3+GiB) Raft messages being sent on the stream, which overwhelmed both the sender and the receiver. Raft messages are batched up before sending and so what must have happened here is that a large number of reasonably-sized messages (up to 64MiB in this case due to the `max_size` setting) were merged into a giant blob. As of this commit, we apply the `max_size` chunking on the batching step before sending messages as well. Closes #71050. Release note: None Co-authored-by: Tobias Grieger <[email protected]>
This was closed prematurely; the backport is still open.
Backport merged, closing.
roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ d1231cff60125b397ccce6c79c9aeea771cdcca4:
Reproduce
See: roachtest README
Same failure on other branches