
roachtest: import/tpcc/warehouses=4000/geo failed #71050

Closed
cockroach-teamcity opened this issue Oct 3, 2021 · 14 comments · Fixed by #71132
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery T-kv KV Team

Comments

@cockroach-teamcity
Member

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ d1231cff60125b397ccce6c79c9aeea771cdcca4:


	monitor.go:128,import.go:134,import.go:159,test_runner.go:777: monitor failure: unexpected node event: 4: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:134
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:159
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 4: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3531708-1633241850-46-n8cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: 8546
		8: 8530
		1: 8432
		3: 8357
		2: 8542
		4: dead (exit status 137)
		5: 8112
		6: 8112
		Error: UNCLASSIFIED_PROBLEM: 4: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 4: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-release-21.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 3, 2021
@dt
Member

dt commented Oct 4, 2021

node 4 was OOM-killed. What's interesting is that prior to being killed there were messages from store_send saying that an ingested SST had been delayed by 35 minutes, and for almost an hour before that the logs show intermittent warnings from KV about unavailable ranges. Something appears to have gone wrong at the kv/storage level, and enough requests queued up that the node was OOM-killed?

@blathers-crl blathers-crl bot added the T-kv KV Team label Oct 4, 2021
@dt
Member

dt commented Oct 4, 2021

This cluster looks sad almost from the get-go: just two minutes into the run, we start seeing replica_write.go's log line about an unavailable range: have been waiting 15.00s for proposing command GC, and slow command GC [‹/Table/57/1/3073/9/-1740›,‹/Table/57/1/3073/9/-1740/NULL›) finished after 18.70s with error ‹result is ambiguous (context canceled)›. We see more intermittent unavailable-range log lines for the next hour, before n4 is OOM-killed.

@tbg tbg added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 4, 2021
@tbg
Member

tbg commented Oct 4, 2021

Going to look at this one tomorrow morning, moving to GA while I do.

@tbg tbg self-assigned this Oct 5, 2021
@tbg
Member

tbg commented Oct 5, 2021

Looking at this with COCKROACH_DEBUG_TS_IMPORT_FILE=tsdump.gob cockroach start --insecure and this tsdump.gob.yaml (21.2 doesn't have this auto-generated yet):

1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8

Memory usage is interesting, looks like it's blowing up only towards the end:

[image: memory usage timeseries]

Whatever is going on on n4 is also happening to n6 (though n6 survives).

There's also a spike in disk read iops:

[image: disk read IOPS timeseries]

The overload dashboard doesn't show anything I would consider of interest.

The import is still going on at that point (the jobs below are all still running):

grep -v 'succeeded' crdb_internal.jobs.txt | grep -Eo '"IMPORT TABLE [^ ]+'
"IMPORT TABLE tpcc.public.stock
"IMPORT TABLE tpcc.public.order_line
"IMPORT TABLE tpcc.public.customer

so I'm not sure what would've triggered this behavior change. KV is indeed unhappy throughout the import, but I suspect we will see that every time we run this test (?).

@tbg
Member

tbg commented Oct 5, 2021

Well well well, this is interesting. Thanks to the heap profiler (🎖️ to @knz) we have heap profiles leading up right to the crash, inspected via variations on

**edit: this is on n6, not n4 - my bad, still interesting**

go tool pprof -http :6060 memprof.2021-10-03T11_06_44.153.8378880272.pprof 

We have a burst of profiles at 11:06 and a previous burst at 11:04. They all show the same thing and look basically identical to the last one (11:06:44):

[image: heap profile (n6)]

@tbg
Member

tbg commented Oct 5, 2021

Oh wait, this is actually n6 (which had high RSS, but ended up not crashing). Nevertheless, interesting. Here's n4:

[image: heap profile (n4)]

These do very much seem like two sides of the same coin. n4 is definitely sending large Raft messages, n6 is definitely receiving some. Chances are n4 is sending to n6.

@tbg
Member

tbg commented Oct 5, 2021

Scouring the logs for interesting things happening just after 11:05:30 (right before the spike). This is interesting:

teamcity-3531708-1633241850-46-n8cpu16-geo-0006> W211003 11:06:24.542748 289 kv/kvserver/split_trigger_helper.go:146 ⋮ [n6,s6,r2361/2:{-}] 7425 would have dropped incoming MsgApp to wait for split trigger, but allowing due to 10655 (>100) ticks
teamcity-3531708-1633241850-46-n8cpu16-geo-0006> W211003 11:06:24.544537 289 kv/kvserver/split_trigger_helper.go:146 ⋮ [n6,s6,r2361/2:{-}] 7426 would have dropped incoming MsgApp to wait for split trigger, but allowing due to 10655 (>100) ticks
teamcity-3531708-1633241850-46-n8cpu16-geo-0006> W211003 11:06:24.544758 289 kv/kvserver/split_trigger_helper.go:146 ⋮ [n6,s6,r2361/2:{-}] 7427 would have dropped incoming MsgApp to wait for split trigger, but allowing due to 10655 (>100) ticks
teamcity-3531708-1633241850-46-n8cpu16-geo-0006> W211003 11:06:24.544900 289 kv/kvserver/split_trigger_helper.go:146 ⋮ [n6,s6,r2361/2:{-}] 7428 would have dropped incoming MsgApp to wait for split trigger, but allowing due to 10655 (>100) ticks

This is very very interesting and obviously relevant:

teamcity-3531708-1633241850-46-n8cpu16-geo-0004> W211003 11:06:25.824117 56998 kv/kvserver/raft_transport.go:637 ⋮ [n4] 8649 while processing outgoing Raft queue to node 6: ‹rpc error: code = ResourceExhausted desc = trying to send message larger than max (2493464442 vs. 2147483647)›:

Uhm, 2493464442 bytes is >2.3GiB? First of all that's crazy and second of all, how did this get past the cluster setting

kv.raft.command.max_size        64 MiB  z       false   maximum size of a raft command

Ok, I have an idea:

case req := <-ch:
	batch.Requests = append(batch.Requests, *req)
	req.release()
	// Pull off as many queued requests as possible.
	//
	// TODO(peter): Think about limiting the size of the batch we send.
	for done := false; !done; {
		select {
		case req = <-ch:
			batch.Requests = append(batch.Requests, *req)
			req.release()
		default:
			done = true
		}
	}
	err := stream.Send(batch)
	batch.Requests = batch.Requests[:0]

This code definitely lets us build batches of any size. What's worse, once it does that, it also doesn't properly release the memory: the truncation on the last line (raft_transport.go:506) doesn't zero out the slice, so the recycled backing array keeps referencing the old requests.

That definitely seems pretty bad. I'll whip up a PR.
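
For illustration, here is a minimal sketch of the two fixes discussed above, with hypothetical types and names (this is not the actual raft_transport.go code): stop draining the channel once the batch reaches a target byte size, and zero out the recycled slice before truncating it so a huge batch can't keep its memory alive.

```go
// Hypothetical sketch only; request and the size accounting stand in for the
// real RaftMessageRequest plumbing.
package raftbatchsketch

// request stands in for one queued outgoing Raft message.
type request struct {
	payload []byte
}

// drainBatch pulls queued requests off ch into batch, but stops once the
// accumulated payload size reaches targetBytes (or the channel is empty),
// rather than letting the batch grow without bound.
func drainBatch(ch <-chan *request, batch []*request, targetBytes int) []*request {
	var size int
	for {
		select {
		case req := <-ch:
			batch = append(batch, req)
			size += len(req.payload)
			if size >= targetBytes {
				return batch
			}
		default:
			return batch
		}
	}
}

// resetBatch zeroes the elements before truncating, so the reused backing
// array no longer references the (possibly huge) requests of the last batch.
func resetBatch(batch []*request) []*request {
	for i := range batch {
		batch[i] = nil
	}
	return batch[:0]
}
```

The actual change (#71132) reuses the existing max_size limit as the chunking threshold at this batching step; the sketch above only shows the shape of the idea.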

@tbg
Member

tbg commented Oct 5, 2021

Another few choice quotes from the logs:

teamcity-3531708-1633241850-46-n8cpu16-geo-0004> I211003 11:06:30.325056 83183 kv/kvserver/store_send.go:297 ⋮ [n4,s4] 8678 SST ingestion was delayed by 34m17.53404451s (47.431µs for storage engine back-pressure)

34 minutes spent queuing on the semaphore behind this cluster setting:

// addSSTableRequestLimit limits concurrent AddSSTable requests.
var addSSTableRequestLimit = settings.RegisterIntSetting(
	"kv.bulk_io_write.concurrent_addsstable_requests",
	"number of AddSSTable requests a store will handle concurrently before queuing",
	1,
	settings.PositiveInt,
)

Makes you wonder.
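
As a toy illustration (not CockroachDB code) of how a concurrency limit of 1 turns into half-hour waits: every AddSSTable request has to acquire the semaphore before it can be ingested, so once requests arrive faster than they complete, each new arrival queues behind the entire backlog.

```go
// Toy model of the AddSSTable concurrency limit; the limiter here is a plain
// buffered channel, not the real limiter implementation behind the setting.
package addsstsketch

import "sync"

// ingestAll runs ingest for each of n requests, allowing at most `limit` of
// them in flight at once; the rest block on the semaphore, which is where
// delays like the 34m one above accumulate.
func ingestAll(n, limit int, ingest func(i int)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire; blocks while `limit` requests are in flight
			defer func() { <-sem }() // release
			ingest(i)
		}(i)
	}
	wg.Wait()
}
```

With limit=1 (the default above) and individual ingestions taking a long time under load, even a modest queue is enough to produce delays on the order of what the log line reports.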

@tbg
Member

tbg commented Oct 5, 2021

I was wondering why n6's metrics didn't recover, but now I am realizing it also went belly-up. However, it did so while the debug.zip was being pulled. The last log message is dated 11:06:55, so it really didn't last much longer than n4. Had it managed to handle the large raft command batch, my next question would've been why its RSS didn't come down. But it crashed, so the question is moot; in the recv-side code the large allocation would have gone out of scope and been GC'ed.

tbg added a commit to tbg/cockroach that referenced this issue Oct 5, 2021
Standard Go error - we were trying to avoid allocations by recycling a
slice but weren't zeroing it out before. The occasional long slice that
would reference a ton of memory would then effectively keep that large
amount of memory alive forever.

Touches cockroachdb#71050.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 5, 2021
In cockroachdb#71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the max_size chunking on the batching
step before sending messages as well.

Closes cockroachdb#71050.

Release note: None
@tbg tbg removed the GA-blocker label Oct 5, 2021
@cockroach-teamcity
Member Author

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ 82e8782453b1cf14460d93b6bf8328a7b2964575:


	monitor.go:128,import.go:134,import.go:159,test_runner.go:777: monitor failure: unexpected node event: 3: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:134
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:159
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3550382-1633588489-56-n8cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 8: 9367
		7: 9344
		1: 9932
		2: 9211
		4: 9129
		3: dead (exit status 137)
		5: 9408
		6: 8927
		Error: UNCLASSIFIED_PROBLEM: 3: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 3: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Oct 7, 2021

Haven't investigated the above failure (about to head out on vacation), but it could well be an instance of the same issue, which would then be fixed by #71132 as well. However, a question remains whether we're now more likely to create lots of large proposals than previously, and if that's something that should be looked into.

@cockroach-teamcity
Member Author

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-21.2 @ 3cf3b0ea3b08cb24e4d6b84c6c237b856ce6b411:


	monitor.go:128,import.go:134,import.go:159,test_runner.go:777: monitor failure: unexpected node event: 3: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:134
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:159
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3555401-1633674802-51-n8cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 8: 9465
		2: 9186
		7: 9494
		4: 9128
		1: 10097
		3: dead (exit status 137)
		5: 8984
		6: 8929
		Error: UNCLASSIFIED_PROBLEM: 3: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 3: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

tbg added a commit to tbg/cockroach that referenced this issue Oct 19, 2021
Standard Go error - we were trying to avoid allocations by recycling a
slice but weren't zeroing it out before. The occasional long slice that
would reference a ton of memory would then effectively keep that large
amount of memory alive forever.

Touches cockroachdb#71050.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 19, 2021
…g batching

In cockroachdb#71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the max-size chunking on the batching
step before sending messages as well.

Closes cockroachdb#71050.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 19, 2021
… msg batching

In cockroachdb#71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the target-size chunking on the batching
step before sending messages as well.

Closes cockroachdb#71050.

Release note: None
craig bot pushed a commit that referenced this issue Oct 20, 2021
71132: kvserver: apply a limit to outgoing raft msg batching r=erikgrinaker a=tbg

In #71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the `max_size` setting) were merged into a giant
blob. As of this commit, we apply the `max_size` chunking on the batching
step before sending messages as well.

Closes #71050.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 1d959cb Oct 20, 2021
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2021
Standard Go error - we were trying to avoid allocations by recycling a
slice but weren't zeroing it out before. The occasional long slice that
would reference a ton of memory would then effectively keep that large
amount of memory alive forever.

Touches #71050.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2021
… msg batching

In #71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the target-size chunking on the batching
step before sending messages as well.

Closes #71050.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2021
Standard Go error - we were trying to avoid allocations by recycling a
slice but weren't zeroing it out before. The occasional long slice that
would reference a ton of memory would then effectively keep that large
amount of memory alive forever.

Touches #71050.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2021
… msg batching

In #71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the target-size chunking on the batching
step before sending messages as well.

Closes #71050.

Release note: None
@tbg
Member

tbg commented Oct 22, 2021

This was closed prematurely, backport is still open:

#71748

@tbg tbg reopened this Oct 25, 2021
cameronnunez pushed a commit that referenced this issue Oct 27, 2021
Standard Go error - we were trying to avoid allocations by recycling a
slice but weren't zeroing it out before. The occasional long slice that
would reference a ton of memory would then effectively keep that large
amount of memory alive forever.

Touches #71050.

Release note: None
cameronnunez pushed a commit that referenced this issue Oct 27, 2021
… msg batching

In #71050, we saw evidence of very large (2.3+GiB) Raft messages being
sent on the stream, which overwhelmed both the sender and the receiver.
Raft messages are batched up before sending and so what must have
happened here is that a large number of reasonably-sized messages (up to
64MiB in this case due to the max_size setting) were merged into a giant
blob. As of this commit, we apply the target-size chunking on the batching
step before sending messages as well.

Closes #71050.

Release note: None
@lunevalex
Collaborator

backport merged, closing.
