This repo is used for tracking flaky tests on the Node.js CI and fixing them.
Current status: work in progress. Please go to the issue tracker to discuss!
Updates should be merged as soon as possible. We can revert or modify afterwards. This repo is mostly for coordination so we need to move fast and reduce the noise.
Make the CI green again.
- Taking actual failures from PRs into account, at least 80% of the node-test-pull-request (or node-test-commit) CI runs should be green.
- At least 90% of the node-daily-master CI should be green.
- A green CI run is a run with a SUCCESS status. UNSTABLE does not count as green.
- Taking the last 100 runs, the green rate at any given time is calculated as `SUCCESS / (100 - RUNNING - ABORTED)`.
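As a minimal sketch of the green rate calculation, the function below works from the status counts of the last 100 runs. The name `greenRate` and the shape of the `counts` object are illustrative assumptions, not part of ncu-ci or any existing tool.

```javascript
// Hypothetical helper: computes the green rate for a window of CI runs.
// `counts` holds the number of runs per status within that window.
function greenRate(counts, total = 100) {
  const { running = 0, success = 0, aborted = 0 } = counts;
  // RUNNING and ABORTED runs are excluded from the denominator.
  // UNSTABLE and FAILURE runs stay in it and count as not green.
  const finished = total - running - aborted;
  if (finished <= 0) return NaN; // no completed runs to judge
  return success / finished;
}

// Counts from the 2018-06-04 15:00 entry of the history table:
// 0 running, 9 success, 10 aborted.
const rate = greenRate({ running: 0, success: 9, aborted: 10 });
console.log(`${(rate * 100).toFixed(2)}%`); // prints "10.00%"
```

UNSTABLE and FAILURE counts are not needed as inputs because they only affect the result by remaining in the denominator.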
See https://nodejs-ci-health.mmarchini.me/#/job-summary
UTC Time | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
---|---|---|---|---|---|---|
2018-06-01 20:00 | 1 | 1 | 15 | 11 | 72 | 1.13% |
2018-06-03 11:36 | 3 | 6 | 21 | 10 | 60 | 6.89% |
2018-06-04 15:00 | 0 | 9 | 26 | 10 | 55 | 10.00% |
2018-06-15 17:42 | 1 | 27 | 4 | 17 | 51 | 32.93% |
2018-06-24 18:11 | 0 | 27 | 2 | 8 | 63 | 29.35% |
2018-07-08 19:40 | 1 | 35 | 2 | 4 | 58 | 36.84% |
2018-07-18 20:46 | 2 | 38 | 4 | 5 | 51 | 40.86% |
2018-07-24 22:30 | 2 | 46 | 3 | 4 | 45 | 48.94% |
2018-08-01 19:11 | 4 | 17 | 2 | 2 | 75 | 18.09% |
2018-08-14 15:42 | 5 | 22 | 0 | 14 | 59 | 27.16% |
2018-08-22 13:22 | 2 | 29 | 4 | 9 | 56 | 32.58% |
2018-10-31 13:28 | 0 | 40 | 13 | 4 | 43 | 41.67% |
2018-11-19 10:32 | 0 | 48 | 8 | 5 | 39 | 50.53% |
2018-12-08 20:37 | 2 | 18 | 4 | 3 | 73 | 18.95% |
TODO: automate all of this in ncu-ci
When checking the CI results of a PR, if one or more tests failed (with `not ok` as the TAP result):
1. If the failed test is not related to the PR (does not touch the modified code path), search for the test name in the issue tracker of this repo. If there is an existing issue, add a reply there using the reproduction template, and open a pull request updating `flakes.json`.
2. If there are no existing issues about the test, run the CI again. If the failure disappears in the next run, it is a potential flake; follow the steps for a newly discovered flake, starting with opening an issue that uses the flake issue template.
3. If the failure reproduces in the next run, the failure is likely related to the PR. Do not re-run CI without code changes in the next 24 hours; try to debug the failure instead.
4. If the cause of the failure still cannot be identified 24 hours later, and the code has not been changed, start a CI run and see if the failure disappears. Go back to step 3 if the failure still reproduces, and go to step 2 if the failure disappears.
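The triage steps above can be read as a small decision function. The sketch below is only an illustration: the field names (`existingIssue`, `rerunResult`, `hoursSinceRerun`, `codeChanged`) and the returned action strings are hypothetical, and it assumes the failed test has already been judged unrelated to the PR.

```javascript
// Hypothetical sketch: maps the state of a failed test to the next step.
// Field names and returned action strings are illustrative assumptions.
function nextAction({ existingIssue = false, rerunResult, hoursSinceRerun = 0, codeChanged = false }) {
  // A known flake already has an issue in the tracker.
  if (existingIssue) return 'reply-and-update-flakes-json';
  // No issue yet, so the first move is to run the CI again.
  if (rerunResult === undefined) return 'rerun-ci';
  // The failure went away on the re-run: treat it as a potential flake.
  if (rerunResult === 'passed') return 'report-potential-flake';
  // The failure reproduced. Re-running is only reasonable after a code
  // change or once 24 hours have passed without finding the cause.
  if (codeChanged || hoursSinceRerun >= 24) return 'rerun-ci';
  return 'debug-failure';
}

console.log(nextAction({ rerunResult: 'failed', hoursSinceRerun: 3 })); // prints "debug-failure"
```

Calling the function again after each new CI run, with the updated `rerunResult`, mirrors the loop between the last two steps.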
When discovering a potential flake on the CI:
- Open an issue in this repo using the flake issue template. The title should be `Investigate path/under/the/test/directory/without/extension`, for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.
- Add the `Flaky Test` label and relevant subsystem labels (TODO: create useful labels).
- Open a pull request updating `flakes.json`.
- Notify the subsystem team related to the flake.
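The issue title convention can be derived mechanically from the test file path. The helper below is a minimal sketch: the name `flakeIssueTitle` is hypothetical, and it assumes the input is a path relative to the repository root that starts with `test/`.

```javascript
// Hypothetical helper: builds the issue title for a flaky test.
// Assumes `testFile` is relative to the repository root, for example
// "test/async-hooks/test-zlib.zlib-binding.deflate.js".
function flakeIssueTitle(testFile) {
  // Keep only the path under the test directory.
  const relative = testFile.replace(/^test\//, '');
  // Drop only the final extension; earlier dots belong to the test name.
  const withoutExtension = relative.replace(/\.[^./]+$/, '');
  return `Investigate ${withoutExtension}`;
}

console.log(flakeIssueTitle('test/async-hooks/test-zlib.zlib-binding.deflate.js'));
// prints "Investigate async-hooks/test-zlib.zlib-binding.deflate"
```

Stripping only the last extension matters because test names such as the one in the example contain dots themselves.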
When the CI run fails because:
- There are network connection issues
- Tests fail with `ENOSPC` (No space left on device)
- The CI machine has trouble pulling source code from the repository
Do the following:
- Search in this repo with the error message and see if there is any open issue about this.
- If there is an existing issue, wait until the problem gets fixed.
- If there are no similar issues, open a new one with the build infra issue template.
- Add the `Build Infra` label.
- Notify the `@nodejs/build-infra` team in the issue.
When the CI run of a PR that does not touch the build files ends with build failures (e.g. the run ends before the test runner has a chance to run):
- Search this repo for the error message that contains keywords like `fatal`, `error`, etc.
- If there is a similar issue, add a reply there using the reproduction template.
- If there are no similar issues, open a new one with the build file issue template.
- Add the `Build Files` label.
- Notify the `@nodejs/build-files` team in the issue.
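To find an error message worth searching for, one option is to filter the console log for lines containing the keywords mentioned above. The following is a rough sketch under assumptions: the function name `findSearchableErrors` is hypothetical, the keyword list only mirrors the two examples given here, and the sample log lines are made up for illustration.

```javascript
// Hypothetical helper: picks distinct log lines that contain a keyword,
// so they can be used as search terms in the issue tracker.
// Assumes keywords are plain words without regular expression metacharacters.
function findSearchableErrors(logText, keywords = ['fatal', 'error']) {
  const pattern = new RegExp(`\\b(${keywords.join('|')})\\b`, 'i');
  const seen = new Set();
  for (const line of logText.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed && pattern.test(trimmed)) seen.add(trimmed);
  }
  return [...seen]; // insertion order, duplicates removed
}

// Made-up sample log, not taken from a real CI run.
const sampleLog = [
  'make[1]: Entering directory',
  'fatal error: example.h: No such file or directory',
  'make[1]: *** [target] Error 1',
  'fatal error: example.h: No such file or directory',
].join('\n');

console.log(findSearchableErrors(sampleLog).length); // prints 2
```

Deduplicating the lines keeps the list short when the same compiler error repeats across many targets.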
- Settle on the flake database schema
- Read the flake database in ncu-ci so people can quickly tell if a failure is a flake
- Automate the report process in ncu-ci
- Migrate existing issues in nodejs/node and nodejs/build, close outdated ones.
- Automate CI health history tracking