roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #75071
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:
Same failure on other branches
Looking at the second failure.
The latest heap profile isn't that useful. It shows around 2GB allocated, and while heap profiles regularly undercount, the RSS at the time of the OOM kill was around 13GB, so it's likely that the profile just didn't catch whatever was consuming all this memory. The process was killed at
The profile is from 13_56_18, so a solid 30 minutes prior. Looking at the graphs for n6, we see memory use grow linearly until OOM, starting sometime after 14:10. It's not surprising that things would change around that time, because that's when we hit the cluster with this:
This isn't supposed to kill nodes; it's the "warmup" load that we apply during rebalancing. That memory usage is definitely not intentional, though. I think we'll need to get a heap dump at the right time to figure this out. Will look at the first occurrence next.
First occurrence: Same thing, a node (n2) OOMs during warm-up:
https://share.polarsignals.com/822d7d3 (15_26_53)
I think we should run these tests with COCKROACH_MEMPROF_INTERVAL=15s, so that we'll get a heap dump every 15s. This should be plenty to detect what's causing the linear memory growth. Will send a PR.
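To make the timing argument concrete, here is a minimal Python sketch (not part of roachtest) that picks the newest heap profile taken at or before some event and reports how stale it is. It assumes profile timestamps in the `HH_MM_SS` form quoted in this thread and that all times fall on the same day; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp: str) -> datetime:
    """Parse 'HH_MM_SS' or 'HH:MM:SS' into a datetime on an arbitrary fixed day.

    Assumption: all timestamps belong to the same day (no midnight rollover).
    """
    return datetime.strptime(stamp.replace("_", ":"), "%H:%M:%S")


def latest_profile_before(profile_stamps, event_time):
    """Return (stamp, gap) for the newest profile taken at or before event_time.

    Returns None if every profile was taken after the event.
    """
    event = parse_stamp(event_time)
    candidates = [s for s in profile_stamps if parse_stamp(s) <= event]
    if not candidates:
        return None
    best = max(candidates, key=parse_stamp)
    return best, event - parse_stamp(best)


if __name__ == "__main__":
    # The 13_56_18 profile from the second failure, compared against 14:10:00,
    # the approximate point after which memory started to grow.
    stamp, gap = latest_profile_before(["13_56_18"], "14:10:00")
    print(f"profile {stamp} predates the event by {gap}")
```

For the second failure this shows that the only profile was taken more than 13 minutes before the growth began, so it could not have captured the allocation. With a 15s interval, the newest profile would be at most about 15s older than the OOM kill.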
Can hopefully help understand cockroachdb#75071. Release note: None
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ dc07599dc9db1acd5afa3a6537297815f25c1fca:
#75204 is now merging, so future failures (perhaps not the next one, since it may not have picked up the PR yet) should have plenty of heap profiles.
75056: sql: avoid CREATE INDEX failure on retry r=postamar a=stevendanna

Previously, if a transaction including a CREATE INDEX statement that used expressions in the list of included columns encountered a TransactionRetryWithProtoRefreshError, the retry would fail with an error such as

```
(42703) column "crdb_internal_idx_expr_6" does not exist
column_resolver.go:196: in NewUndefinedColumnError()
```

This was the result of makeIndexDescriptor substituting the expressions with the names of the newly added columns on the CreateIndexNode itself. When the transaction is retried, the generated column names do not yet exist.

Here, we resolve this issue by only modifying a copy of the IndexElemList when generating an index descriptor.

Release note (bug fix): CREATE INDEX statements using expressions previously failed in some cases if they encountered an internal retry.

75204: roachtest: heap profile tpccbench every 15s r=erikgrinaker a=tbg

Can hopefully help understand #75071.

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ c4c5ca2fdd5a641433a85a28d4dfd3bd4443015d:
Node died at 16:36:06. We have lots of heap dumps; the most recent is 16_35_51: https://share.polarsignals.com/fc00722
This looks fine, honestly. The memory blowup, at least as captured in the timeseries, isn't abrupt, but there is some goroutine blowup. There is a goroutine dump. It's partial (the node must have died while writing it), but after some searching through less interesting things, it shows 531 goroutines belonging to replica checksum computations. Fixed by #75448.
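For reference, this kind of count can be reproduced mechanically. Below is a minimal Python sketch, assuming the dump uses the standard Go text format in which each goroutine's stack starts with a `goroutine N [state]:` header line. The helper name and the sample dump (including `example.computeChecksum`) are made up for illustration and are not the actual CockroachDB frames. A truncated final stanza, as in a partial dump, is still counted.

```python
import re

# Header that starts each goroutine stanza in a Go text dump,
# e.g. "goroutine 42 [select, 3 minutes]:".
HEADER = re.compile(r"^goroutine \d+ \[", re.MULTILINE)


def count_goroutines_matching(dump: str, needle: str) -> int:
    """Count goroutine stanzas whose text contains `needle`.

    The last stanza runs to the end of the input, so a dump that was cut off
    mid-write still contributes its final (partial) stack.
    """
    starts = [m.start() for m in HEADER.finditer(dump)]
    ends = starts[1:] + [len(dump)]
    stanzas = [dump[s:e] for s, e in zip(starts, ends)]
    return sum(1 for stanza in stanzas if needle in stanza)


if __name__ == "__main__":
    # Invented sample; the final stanza is deliberately truncated.
    sample = (
        "goroutine 1 [running]:\n"
        "example.main()\n"
        "\t/src/main.go:10 +0x20\n"
        "\n"
        "goroutine 7 [select]:\n"
        "example.computeChecksum(0x1)\n"
        "\t/src/checksum.go:5 +0x11\n"
        "\n"
        "goroutine 8 [select]:\n"
        "example.computeChecksum(0x2)\n"
        "\t/src/check"
    )
    print(count_goroutines_matching(sample, "computeChecksum"))
```

Matching on a substring of the frame text keeps the sketch independent of exact package paths; a stricter check could match only the first frame of each stanza.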
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 365b4da8bd02c06ee59d2130a56dec74ffc9ce21: