-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
This should help provide context on how CI works and what processes to follow to maintain our tests. Signed-off-by: alejandrox1 <[email protected]>
- Loading branch information
1 parent
b1c6840
commit 48bae65
Showing
6 changed files
with
212 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,212 @@ | ||
# Monitoring Kubernetes Health | ||
|
||
**Table of Contents** | ||
|
||
- [Monitoring Kubernetes Health](#monitoring-kubernetes-health) | ||
- [Monitoring the health of Kubernetes with TestGrid](#monitoring-the-health-of-kubernetes-with-testgrid) | ||
- [What dashboards should I monitor?](#what-dashboards-should-i-monitor) | ||
- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert) | ||
- [Communicate your findings](#communicate-your-findings) | ||
- [Fill out the issue](#fill-out-the-issue) | ||
- [Iterate](#iterate) | ||
|
||
|
||
## Monitoring the health of Kubernetes with TestGrid | ||
|
||
TestGrid is a highly-configurable, interactive dashboard for viewing your test | ||
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid. | ||
|
||
The Kubernetes community has its own instance of TestGrid, https://testgrid.k8s.io/, | ||
which we use to monitor and observe the health of the project. | ||
|
||
Each SIG has its own set of dashboards, and each dashboard is composed of | ||
different end-to-end (e2e) jobs. | ||
E2E jobs are in turn made up of test stages (i.e., bootstraping a Kubernetes | ||
cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., Kubectl client | ||
Kubectl logs should be able to retrieve and filter logs). | ||
These views allow different teams to monitor and understand how their areas | ||
are doing. | ||
|
||
We highly encourage anyone to periodically monitor these dashboards. | ||
If you see that a job or test has been failing, please raise an issue with the | ||
corresponding SIG in either their mailing list of in slack. | ||
|
||
Help maintaining tests, fixing broken tests, improving test success rates, and | ||
overall test improvements are always highly needed throughout the project. | ||
|
||
**Note**: It is important that all SIGs periodically monitor their jobs and | ||
tests. These are used to figure out when to release Kubernetes. | ||
Furthermore, if jobs or tests are failing or flaking, then pull requests will | ||
take a lot longer to be merged. | ||
|
||
|
||
### What dashboards should I monitor? | ||
|
||
This depends on what areas of Kubernetes you want to contribute to. | ||
You should monitor the dashboards owned by the SIG you are working with. | ||
Additionally, you should check | ||
|
||
* https://testgrid.k8s.io/sig-release-master-blocking and | ||
* https://testgrid.k8s.io/sig-release-master-informing | ||
|
||
Since these jobs run tests owned by other SIGs. | ||
Also make sure to periodically check on the "blocking" and "informing" | ||
dashboards for past releases. | ||
|
||
--- | ||
|
||
## What do I do when I see a TestGrid alert? | ||
|
||
If you are part of a SIG's mailing list, occasionally you may see email from | ||
TestGrid reporting that a job or a test has recently failed. | ||
If you are casually browsing through TestGrid, you may also see jobs labeled as | ||
"flaky" (in purple) or as "failing" (in red). | ||
This section is to help guide you on what to do in these occasions. | ||
|
||
### Communicate your findings | ||
|
||
The number one thing to do is to communicate your finding (a test or job has | ||
been flaking or failing). | ||
If you saw a TestGrid alert on a mailing list, please reply to the thread and | ||
mentioned that you are looking into it. | ||
It is important to communicate to prevent duplicate work and to ensure CI | ||
problems get attention. | ||
|
||
In order to communicate with the rest of the community and to drive the work, | ||
please open up an issue on Kubernetes, | ||
https://github.com/kubernetes/kubernetes/issues/new/choose , chose the appropriate issue | ||
template. | ||
|
||
### Fill out the issue | ||
|
||
1. **Which job(s) are failing or flaking** | ||
|
||
The job is the tab in TestGrid that you are looking at. | ||
|
||
<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="40%"> | ||
|
||
The above example was taken from the SIG Release dashboard and we can see that | ||
* `conformance-ga-only` https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only, | ||
* `skew-cluster-latest-kubectl-stable1-gce` https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce, | ||
* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress, | ||
* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel | ||
|
||
are flaky (we should have some issues opened up for these to investigate why | ||
:smile:). | ||
|
||
2. **Which tests are failing or flaking** | ||
|
||
Let's grab an example from the SIG release dashboards and look at the | ||
`node-kubelet-features-master` job in | ||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master. | ||
|
||
<img src="./testgrid-images/failed-tests.png" height="70%" width="100%"> | ||
|
||
Here we see that at 16.07 EDT and 15:07 EDT the job | ||
``` | ||
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [ubuntu] | ||
``` | ||
Failed for Kubernetes commit `9af86e8db` (this value is a row below the time - | ||
the value above it is the run ID). | ||
The corresponding test-infra commit was `11cb57d36` (the value below the commit | ||
for Kubernetes). | ||
|
||
At 15:07 EDT, the job | ||
``` | ||
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [cos-stable2] | ||
``` | ||
|
||
failed as well. | ||
|
||
If one or both of these jobs continue failing, or if they failed frently | ||
enough, we should open an issue an investigate. | ||
|
||
3. **Since when has it been failing or flaking** | ||
|
||
This information you can get from the header of the page showing you all the | ||
tests. | ||
Going from top to bottom, you will see | ||
* date | ||
* time | ||
* job run ID | ||
* Kubernetes commit that was tested | ||
* (Most often) test-infra commit | ||
|
||
4. **Reason for failure** | ||
|
||
The aim for this issues is to begin investigating - you don't have to find the | ||
reason for failure right away (nor the solution). | ||
However, do post any information you find usesful. | ||
|
||
One way of getting useful information is to click on the failed runs (the red | ||
rectablges). | ||
This will send you to a page called **spyglass**. | ||
|
||
If we do this for the above test failures in `node-kubelet-features-master`, we | ||
will see the following | ||
|
||
<img src="./testgrid-images/spyglass-summary.png" height="60%" width="100%"> | ||
|
||
Right away it will show you what tests failed. | ||
Here we see that 2 test failed (both related to the node problem detector) and | ||
the `e2e.go: Node Tests` stage was marked as failed (because the node problem | ||
detector tests failed). | ||
|
||
You will often see "stages" (steps in an e2e job) as mixed with the tests | ||
themselves. | ||
The statges imform you of what is going on in the e2e job when an error | ||
occurred. | ||
|
||
If we click on the first test error we will see logs that will (hopefully) help | ||
us figure out why the test failed. | ||
|
||
<img src="./testgrid-images/spyglass-result.png" height="60%" width="100%"> | ||
|
||
Further down the page you will see all the logs for the entire test run. | ||
Please copy any information you think may be useful from here into the issue. | ||
|
||
5. **Anything else we need to know** | ||
|
||
There is this wonderful page built by SIG testing that often comes in handy: | ||
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1 | ||
This page is called **Triage**. | ||
We can use it to see if test we see failing in a given job has been failing in | ||
others and, in general, to understand how jobs are behaving. | ||
|
||
For example, we can see how the job we have been looking at has been behaving | ||
recently. | ||
|
||
There is one important detail we have to mention at this point, the job names | ||
you see on TestGrid are often aliases. | ||
For example, when we clicked on a test run for | ||
`node-kubelet-features-master` | ||
( | ||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master | ||
), at the top left corner of spyglass the page tells us the real job name, | ||
`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix). | ||
Then we can use this full job name in Triage | ||
|
||
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=ci-kubernetes-node-kubelet-features | ||
|
||
At the time of this writing we saw the following | ||
|
||
<img src="./testgrid-images/triage.png" height="50%" width="100%"> | ||
|
||
**Note**: notice that you can also improve your query by filtering or excluding | ||
results based on test name or failure text. | ||
|
||
Sometime Triage will help you find patterns to figure out whats wrong. | ||
In this instance we can also see that this job has been failing rather | ||
frequently (about 2 times per hour). | ||
|
||
### Iterate | ||
|
||
Once you have filled out the issue, please mention it in the appropriate mailing | ||
list thread (if you see an email from testgrid mentioning a job or test | ||
failure) and share it with the appropriate SIG in the Kubernetes slack. | ||
|
||
Don't worry if you are not sure how to debug further or how to resole the | ||
issue! | ||
All issues are unique and require a bit of experience to figure out how to work | ||
on them. | ||
For the time being, reach out to people in slack or the mailing list. |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.