roachtest: nightly suite runner is extremely overloaded, causes test failures #138904

Open
tbg opened this issue Jan 13, 2025 · 2 comments
Labels
A-testing: Testing tools and infrastructure
branch-master: Failures and bugs on the master branch.
branch-release-24.1: Used to mark GA and release blockers, technical advisories, and bugs for 24.1
branch-release-24.2: Used to mark GA and release blockers, technical advisories, and bugs for 24.2
branch-release-24.3: Used to mark GA and release blockers, technical advisories, and bugs for 24.3
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-testeng: TestEng Team

Comments

@tbg
Member

tbg commented Jan 13, 2025

For a long time, we've had tests in the roachtest nightly suite fail in a mysterious way that suggested a time.Sleep(time.Second) was sometimes taking much longer (>>20x) than requested.

I finally got to the bottom of it and the answer appears to be that the roachtest runner machine is severely overloaded: CPU utilization is at 100%, the machine is almost out of memory, and heap GC runs take north of a minute.

It's surprising that anything works at all under these circumstances, but I'm sure it causes all kinds of spurious failures across many tests that we likely shrug off as "infra flakes".

We need to find out why this machine is so overloaded, fix that, and set up monitoring for the health of the roachtest machine.

#138864 has the in-depth analysis. The execution trace referred to in that analysis is in this Slack thread (in case the artifacts are already gone).

Jira issue: CRDB-46415

tbg added the C-bug, A-testing, branch-master, T-testeng, branch-release-24.1, branch-release-24.2, and branch-release-24.3 labels on Jan 13, 2025

blathers-crl bot commented Jan 13, 2025

cc @cockroachdb/test-eng

@RaduBerinde
Member

Once we address this, we could set up a background goroutine in roachtest that keeps scheduling a timer and panics if it wakes up more than, e.g., 10s later than expected. This would at least save us from having to debug spurious test failures.
