roachtest: nightly suite runner is extremely overloaded, causes test failures #138904
Labels
A-testing
Testing tools and infrastructure
branch-master
Failures and bugs on the master branch.
branch-release-24.1
Used to mark GA and release blockers, technical advisories, and bugs for 24.1
branch-release-24.2
Used to mark GA and release blockers, technical advisories, and bugs for 24.2
branch-release-24.3
Used to mark GA and release blockers, technical advisories, and bugs for 24.3
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-testeng
TestEng Team
For a long time, we've had issues with tests in the roachtest nightly suite failing in a mysterious way that suggested that a time.Sleep(time.Second) was sometimes taking much (>>20x) longer. I finally got to the bottom of it and the answer appears to be that the roachtest runner machine is severely overloaded: CPU utilization is at 100%, the machine is almost out of memory, and heap GC runs take north of a minute.
It's surprising that anything works at all under these circumstances, but I'm sure it causes all kinds of spurious failures across many tests that we likely shrug off as "infra flakes".
We need to find out why this machine is so overloaded, fix that, and set up monitoring for the health of the roachtest machine.
#138864 has the in-depth analysis. The execution trace referred to in that analysis is in this Slack thread (in case the artifacts are already gone).
Jira issue: CRDB-46415