roachtest: restore2TB/nodes=32 failed [raft recv oom, reproducible] #80155
Comments
n10 was OOM-killed at 09:09:30. Profile from 15s prior (09_09_15): https://share.polarsignals.com/e12c582/ and from 5s prior (09_09_25): https://share.polarsignals.com/926274a/ |
@tbg this looks like another one? |
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 7ea67fd5d4966508877fc88ba21bdd498ff162e8:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 8fe42c18e5e2a82494fa6a34e068f4ab801f87c4:
|
Confirming that the last failure hit the same issue. The memprofiles look the same. |
@dt for posterity: how did you find that node 10 OOMed? I saw other crashes where it was obvious in the issue, but here it's not really clear. I guess grepping would do the work, but I'm wondering whether you have a better trick. |
This one is less clear from the issue poster. Usually the monitor says which node died, but here it just has the IP that couldn't connect, so that just added a hop: I opened the roachprod-state JSON file in the test's artifacts dir and searched for that IP, which revealed it was n10. From there it was back to the artifacts to check the end of the dmesg file for n10 to see if there was an oomkiller message; in this case there was. If there isn't one, I go over to the unredacted logs for n10 and scroll to the bottom looking for a stack trace, then up to the top of that to see who crashed. |
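For anyone who wants to script the two lookups described above, here is a rough sketch rather than an existing roachtest tool. It assumes nothing about the real layout of the roachprod-state JSON: it just walks whatever structure is there and reports where the IP appears. The function names, the sample state, and the sample IPs are all made up for illustration. The dmesg check looks for the two message fragments the Linux kernel normally prints when the OOM killer fires.

```python
import json
import re
from typing import Any, Iterator, List, Tuple

# Fragments the Linux kernel logs when the OOM killer fires.
# "Kill" matches both the older "Kill process" and newer "Killed process" wording.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill", re.IGNORECASE)


def _contains_ip(text: str, ip: str) -> bool:
    """True if `ip` appears in `text` and is not part of a longer address."""
    # The lookarounds stop 10.0.0.1 from matching inside 10.0.0.10.
    return re.search(rf"(?<![\d.]){re.escape(ip)}(?!\d)", text) is not None


def find_value_paths(node: Any, ip: str, path: Tuple = ()) -> Iterator[Tuple]:
    """Yield the key path to every string in a parsed JSON tree that holds `ip`."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from find_value_paths(child, ip, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_value_paths(child, ip, path + (index,))
    elif isinstance(node, str) and _contains_ip(node, ip):
        yield path


def oom_lines(dmesg_text: str, tail: int = 200) -> List[str]:
    """Return OOM-killer lines found in the last `tail` lines of a dmesg dump."""
    lines = dmesg_text.splitlines()[-tail:]
    return [line for line in lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Hypothetical state layout and addresses, for illustration only.
    state = json.loads(
        '{"vms": [{"name": "n9", "ip": "10.0.0.9"},'
        ' {"name": "n10", "ip": "10.0.0.10"}]}'
    )
    for found in find_value_paths(state, "10.0.0.10"):
        print("IP found at", found)

    sample_dmesg = (
        "[100.0] cockroach invoked oom-killer: gfp_mask=0x0\n"
        "[100.1] Out of memory: Killed process 1234 (cockroach)\n"
    )
    for line in oom_lines(sample_dmesg):
        print(line)
```

The path that comes back tells you which entry in the state file owns the IP, and from there you can read off the node number by eye. If `oom_lines` comes back empty, the fallback is still the manual one: scroll to the end of the unredacted logs for that node and look for a stack trace.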
I ran some tests with different values for I found that with 3 workers, the failure rate was right around 10%, with 5/50 iterations hitting an OOM. With 2 workers ( With 1 worker ( |
Additional relevant PR: #80494. So we have one more escape hatch should we see this kind of OOM in prod. |
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 72e485705aa859ef850ebc67568bc29fe9b7cfd6:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ bc1ee7c7c276984fce8ff5ba4fcfcdff335dde50:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 7dd1c51f6b5802e32bafd82e46747f349836592f:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 12198a51408e7333cd4f96b221e6734239479765:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 662cc5c3070e6d64d155a9cc9f33253ee5d99ee9:
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 54b4fcaa231abd16260fb09722ed291dbd6a4900:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ e4cafeb8b1d586d091fb98e3e570650d7eeea294:
Parameters: |
roachtest.restore2TB/nodes=32 failed with artifacts on master @ eabdc49383a19d5731b765dbb0d8b45bd9e24404:
Parameters: |
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 8db26695114981eec5c12317f3dc5d770b522a3f:
Parameters: |
roachtest.restore2TB/nodes=32 failed with artifacts on master @ f4042d47fa8062a612c38d4696eb6bee9cee7c21:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ b173a16715e71e94115820374da1eb350b3b459d:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 5c2c62ecf1bea60c807edc6b4da22d900ad4ae03:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ cb55144cdec54d2a70f074ad64b4eca5e6c6891a:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 80c274877a917580af62be6eb0cd48c8c7ae9c08:
Parameters: Same failure on other branches
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 003c0360de8b64319b5f0f127b99be91dbdca8a3:
Parameters: Same failure on other branches
|
Until we can fix this properly in #71805, the test OOMs are mitigated by using highmem instances in #88444. The later failures here, listed below, were fixed by #85405.
|
roachtest.restore2TB/nodes=32 failed with artifacts on master @ 1fdbb16fa206c5fd77c3aa111ec40916edfb55df:
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-15827