roachtest: splits/largerange/size=32GiB,nodes=6 failed [OOM] #94371
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 250238cd29102391dddbc8cc71380090c49ce509:
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 0725273ac7f789ba8ed78aacaf73cc953ca47fe8:
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:
n1 was OOM-killed. Pretty sure this is fallout from #92858, which also caused OOMs in #93579, and is possibly fixed by #93823. Reassigning to @kvoli to confirm.
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ d1233022fc2f67a3901c0be08ec71abb13b1e8fe:
I ran this locally and wasn't able to reproduce it with a recent commit off master. Looking at the logs, the OOM occurs right after the last node joins the cluster. However, RSS for the OOM'd node (n1) was already at 13 GB before that point, so the problem appears to predate the join and is merely pushed over the edge by the additional metadata from the other nodes joining the cluster.
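To make that kind of drift easier to spot when rerunning locally, here's a minimal sketch (hypothetical, not part of CockroachDB or roachtest) that samples the Go runtime's memory metrics for comparison against the RSS values in the health log:

```go
// memwatch.go: hypothetical standalone helper, not CockroachDB code.
// Samples Go runtime memory metrics so heap growth can be lined up
// against the RSS readings reported in the health log.
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},        // all memory mapped by the Go runtime
		{Name: "/memory/classes/heap/objects:bytes"}, // live + unswept heap objects
	}
	for {
		metrics.Read(samples)
		fmt.Printf("%s runtime_total=%dMiB heap_objects=%dMiB\n",
			time.Now().Format(time.RFC3339),
			samples[0].Value.Uint64()>>20,
			samples[1].Value.Uint64()>>20)
		time.Sleep(10 * time.Second)
	}
}
```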
Rough timeline, where the stats are from n1, which OOM'd:
The last mem profile captured was from 10:56:55, when the most recent health log shows an RSS only 1 GB below the RSS reported just before the failure. This makes me lean towards a SQL-related issue, given the profile below:
I think that corresponds to this line: cockroach/pkg/sql/exec_util.go, line 2038 at 5a46213.
Does the above profile look normal? cc @cockroachdb/sql-queries. @DrewKimball, if you have a moment, could you take a look at the heap pprof above and let me know whether that alloc looks expected?
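For readers without the profile in front of them, here's a minimal sketch of the suspected failure mode, assuming an unbounded cache; the type and field names are hypothetical and this is not the actual exec_util.go code:

```go
// Hypothetical illustration, not the actual CockroachDB implementation.
package main

import (
	"sync"
	"time"
)

// recentStatement stands in for whatever per-statement record the cache keeps.
type recentStatement struct {
	fingerprint string // pins the statement string in the heap
	sessionID   string
	startedAt   time.Time
}

// recentStatementsCache appends every executed statement with no size bound.
type recentStatementsCache struct {
	mu      sync.Mutex
	entries []recentStatement
}

func (c *recentStatementsCache) record(stmt recentStatement) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// No eviction and no memory accounting: every recorded statement stays
	// reachable forever, so a long-running, statement-heavy workload grows
	// the heap monotonically, consistent with the profile discussed above.
	c.entries = append(c.entries, stmt)
}
```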
This looks like the recently introduced active statements cache. cc @matthewtodd
x-ref #94205, which also crashes due to OOM.
I've just sent out #94793 to revert the offending commit. We'll get it tightened up before reintroducing it.
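For what it's worth, one shape the tightening could take (my assumption, not the actual follow-up to #94793) is a fixed-capacity ring buffer, reusing the hypothetical recentStatement type from the sketch above, so memory use is capped regardless of statement volume:

```go
// Hypothetical bounded variant of the cache sketched earlier.
type boundedRecentStatements struct {
	mu   sync.Mutex
	buf  []recentStatement // fixed capacity, allocated once
	next int               // index of the slot to overwrite next
}

func newBoundedRecentStatements(capacity int) *boundedRecentStatements {
	return &boundedRecentStatements{buf: make([]recentStatement, 0, capacity)}
}

func (c *boundedRecentStatements) record(stmt recentStatement) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.buf) < cap(c.buf) {
		c.buf = append(c.buf, stmt) // still filling up to capacity
		return
	}
	c.buf[c.next] = stmt // overwrite the oldest entry
	c.next = (c.next + 1) % len(c.buf)
}
```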
94793: sql, server: revert new recent statements cache r=matthewtodd a=matthewtodd
This reverts #93270. We've seen a number of roachtest and roachperf failures as a result of this work, so let's revert until we can address them.
Part of #86955
Addresses #94205
Addresses #94371
Addresses #94676
Release note: None
Co-authored-by: Matthew Todd <[email protected]>
We have marked this test failure issue as stale because it has been inactive.
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 6723e00a46aaa3ea575093bd82a02b7d6f6b131b:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Jira issue: CRDB-22859
Epic: CRDB-18656