Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guide for monitoring tests run by CI - final push #5449

Merged
merged 26 commits into from
Aug 17, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
e945c56
Describe how to find and report failing tests
RobertKielty Feb 11, 2021
3b25079
updates TestGrid screen shot
RobertKielty May 28, 2021
d7d2e31
adds fixes for review comments
RobertKielty May 31, 2021
e60c96d
adds a generated toc
RobertKielty May 31, 2021
ef10326
adds generated toc
RobertKielty May 31, 2021
5351309
removes duplicated toc heading
RobertKielty May 31, 2021
8a97b14
removes toc comment
RobertKielty May 31, 2021
2b4ac85
adds capitalization feedback
RobertKielty May 31, 2021
04874ae
adds grammar and typo fixes
RobertKielty May 31, 2021
f14cb06
clarifies that this doc is about CI on Kubernetes
RobertKielty May 31, 2021
043765f
fixes a markdown issue
RobertKielty May 31, 2021
d9a6347
adds fixes for review comments
RobertKielty May 31, 2021
9725c9e
adds generated toc
RobertKielty May 31, 2021
1242317
removes toc comment
RobertKielty May 31, 2021
b56a446
adds capitalization feedback
RobertKielty May 31, 2021
4c81b45
adds grammar and typo fixes
RobertKielty May 31, 2021
53acc89
clarifies that this doc is about CI on Kubernetes
RobertKielty May 31, 2021
78a8e2f
fixes a markdown issue
RobertKielty May 31, 2021
defabde
Merge branch 'triage-tests-2' of github.com:RobertKielty/community in…
RobertKielty May 31, 2021
9e751ce
Revert "fixes a markdown issue"
RobertKielty May 31, 2021
8051f68
Revert "clarifies that this doc is about CI on Kubernetes"
RobertKielty May 31, 2021
ba0117f
Revert "adds grammar and typo fixes"
RobertKielty May 31, 2021
5a5a372
Revert "adds capitalization feedback"
RobertKielty May 31, 2021
82e1b04
Revert "removes toc comment"
RobertKielty May 31, 2021
d60f029
Revert "adds generated toc"
RobertKielty May 31, 2021
27b7d43
Revert "adds fixes for review comments"
RobertKielty May 31, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
360 changes: 360 additions & 0 deletions contributors/devel/sig-testing/monitoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,360 @@
---
title: "Monitoring Kubernetes Health"
weight: 14
description: |
Guidelines for finding and reporting failing tests in Kubernetes.
---

## Monitoring Kubernetes Health

### Table of Contents

- [Monitoring the health of Kubernetes with TestGrid](#monitoring-the-health-of-kubernetes-with-testgrid)
- [What dashboards should I monitor?](#what-dashboards-should-i-monitor)
- [Test failures that block my Pull Request](#pr-test-failures)
- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert)
- [Communicate your findings](#communicate-your-findings)
- [Fill out the issue](#fill-out-the-issue)
- [Iterate](#iterate)
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved

## Overview

This document describes the tools used to monitor CI jobs that check the
correctness of changes made to core Kubernetes.

## Monitoring the health of Kubernetes CI Jobs with TestGrid

TestGrid is a highly-configurable, interactive dashboard for viewing your test
results in a grid. TestGrid's back end components are open sourced and can be
viewed in the [TestGrid repo] The front-end code
Comment on lines +28 to +29
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unsure which is the correct way, but back end vs front-end should be consistent (i.e. use a space for both or use a hyphen for both).

that renders the dashboard is not currently open sourced.

The Kubernetes community has its own [TestGrid instance] which we use to monitor
and observe the health of the project.

Each Special Interest Group or [SIG] has its own set of dashboards. Each
dashboard is composed of different jobs (build, unit test, integration test,
end-to-end (e2e) test, etc.) These dashboards allow different teams to monitor
and understand how their areas are doing.

End-to-End test (e2e) jobs are in turn made up of test stages (e.g.,
bootstrapping a Kubernetes cluster, tearing down a Kubernetes cluster) and e2e
tests are organized hierarchically per Component and Subcategory within that
component. e.g., the [Kubectl client component tests]
have tests that describe the expected behavior of [Kubectl logs],
one of which is described as [should be able to retrieve and filter logs].

This hierarchy is not currently reflected in TestGrid so a test row will contain
a flattened name which concatenates all of these strings in to a single string.

We highly encourage SIGs to periodically monitor the dashboards related to the
sub-projects that they own. If you see that a job or test has been failing,
please raise an issue with the corresponding SIG in either their mailing list or
in Slack.

In particular, we always welcome the following contributions:

- [Triage Flaky Tests]
- [Fix Flaky Tests]

**Note**: It is important that all SIGs periodically monitor their jobs and
tests. Furthermore, if jobs or tests are failing or flaking, then pull requests
will take a lot longer to be merged. For more information on how flaking tests
disrupt PR merging and how to eliminate them see [Flaky Tests]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
disrupt PR merging and how to eliminate them see [Flaky Tests]
disrupt PR merging and how to eliminate them see [Flaky Tests].


### What dashboards should I monitor?

This depends on what areas of Kubernetes you want to contribute to. You should
monitor the dashboards owned by the SIG you are working with.

Additionally, you should check:
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved

- [sig-release-master-blocking]
- [sig-release-master-informing]

since these jobs run tests that are used by SIG Release to determine the overall
quality of Kubernetes and whether or not the commit on master can be considered
suitable for release. Failing tests on a job in sig-release-master-blocking
block a release from taking place.

If your contributions involve code for past releases of kubernetes (e.g.
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved
cherry-picks or backports), we recommend you periodically check on the
*blocking* and *informing* dashboards for [past releases]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*blocking* and *informing* dashboards for [past releases]
*blocking* and *informing* dashboards for [past releases].


---

## Pull request test failures caused by tests unrelated to your change

If a test fails on your Pull Request, and it's clearly not related to the code
your wrote, this presents an opportunity to improve the signal delivered by CI.

Find any open issues that appear related (have the name of the test in them,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Find any open issues that appear related (have the name of the test in them,
Find any open issues that appear related (e.g. have the name of the test in them,

describe a similar error, etc.). You can link the open issue in a comment you
use to retrigger jobs, either calling the job out specifically:

```markdown
./test pull-kubernetes-foo
https://github.com/kubernetes/kubernetes/issues/foo
```

or even if just invoking retest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
or even if just invoking retest
or just invoking retest:


```markdown
./retest
https://github.com/kubernetes/kubernetes/issues/foo
```

(Note the . prefixes are so you don't actually trigger Prow)

You can back-link from the issue to your PR that encountered it, to bump the
issue's last updated date.

When you do this you are adding evidence to support the need to fix the issue by
documenting the pain contributors are experiencing.

## What do I do when I see a TestGrid alert?
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved

If you are part of a SIG's mailing list, occasionally you may see emails from
TestGrid reporting that a job or a test has recently failed.

Alerts are also displayed on the Summary Page of TestGrid dashboards when you
click on the Show All Alerts button at the top of the Summary or Show Alerts
for an individual Job.

However, if you are viewing the summary page of a Testgrid dashboard alerts are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, if you are viewing the summary page of a Testgrid dashboard alerts are
However, if you are viewing the summary page of a Testgrid dashboard, alerts are

only of secondary interest as the current status of the jobs that are part of
the dashboard are displayed more prominently as follows :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the dashboard are displayed more prominently as follows :
the dashboard are displayed more prominently as follows (taken from [sig-release-master-blocking]):


- Passing jobs look like this
<img src="./testgrid-images/testgrid-summary-passing-job.png">
- Flaky jobs like this
<img src="./testgrid-images/testgrid-summary-flaking-job.png">
- Failing job with alert shown
<img src="./testgrid-images/testgrid-summary-failing-job.png">
Comment on lines +128 to +133
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason Markdown doesn't render these bullet points correctly - only the first?
image


Taken from [sig-release-master-blocking]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this up to line 125 just so that it's not awkwardly dangling on its own here (see suggested change above)


Note the metadata on the right hand side showing job run times, the commit id of
the last green (passing) job run and the time at which the summary page was
loaded (refreshing the browser updates the browser and the update time)

### Communicate your findings

The number one thing to do is to communicate your findings: a test or job has
been flaking or failing. If you saw a TestGrid alert on a mailing list, please
reply to the thread and mention that you are looking into it.

First, check GitHub to see if an issue has already been logged by checking the
following:

- [Issues logged as Flaky Tests - not triaged]
- [Issues logged as Flaky Tests - triaged]
- [CI Signal Board] flaky tests issues segmented by problem resolution workflow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [CI Signal Board] flaky tests issues segmented by problem resolution workflow.
- [CI Signal Board] flaky test issues segmented by problem resolution workflow.


If an issue has already been opened for the test, you can add any new findings
that are not already documented in the issue.

For example, if a test is flaking intermittently and you have found another
incident where the test has failed that has not been recorded in the issue, then
add the new information to the existing issue.

You can:

- Add a link to the Prow job where the latest test failure has occurred, and
- Note the error message

New evidence is especially useful if the root cause of the problem with the test
has not yet been determined and the issue still has a *needs-triage* label.

If the issue has not already been logged, please [create a new issue] in the
kubernetes/kubernetes repo, and choose the appropriate issue template.

You can jump to create either test issue type using the following links :

- [create a new issue - Failing Test]
- [create a new issue - Flaking Test]

#### Filling out an issue

Both test issue templates are reasonably self-explanatory, what follows are
guidelines and tips on filling out the templates.

When logging a Flaking or Failing test please:

- use plain text when referring to test names and job names. Inconsistent
formatting of names makes it harder to process issues via automation.
- keep an eye out for test names that contain markdown parse-able formatting.

If you are a test maintainer, refrain from including markdown in strings that
are used to name your tests and test components.

#### Fill out the issue for a Flaking Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is overlap here with https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#github-issues-for-known-flakes, ultimately we should converge on one location for instructions on how to file an issue, and link to it. I don't have strong opinions on whether that's this doc, the other doc, the issue template, or some other location. But I have strong opinions that we avoid N different guides.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should avoid having many different guides; I personally like what's stated here more than in the Flaky Tests document because I think it's more detailed (the other doc feels more like guidelines as opposed to this doc which seems more like a step-by-step walkthrough).


1 **Which jobs are flaking**

The example below was taken from the SIG Release dashboard:

<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="100%">

We can see that the following jobs were flaky at the time this screenshot was taken:

- [conformance-ga-only]
- [skew-cluster-latest-kubectl-stable1-gce]
- [gci-gce-ingress]
- [kind-master-parallel]

1. **Which tests are flaking**

Let's grab an example from the SIG release dashboards and look at the
`node-kubelet-master` job in sig-release-master [node-kubelet-master].

<img src="./testgrid-images/test-grid-job.svg" height="70%" width="100%">

Here we see that at 07.19 IST the tests

```text
E2eNode Suite.[sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api [cos-stable2]
kubetest.Node Tests [runner]
```

Failed for Kubernetes commit `d8f9e4587`
The corresponding test-infra commit was `fe9c22dc8`

3. **Since when has it been flaking**

You can get the start time of a flake from the header of the TestGrid page
showing you all the tests. The red numbers in the screen shot above annotate the
grid headings.

They are:

- 1 This row has the times each Prow job was started, each column on the grid
represents a single run of the Prow job
- 2 This row is the Prow job run id number
- 3 This is the kubernetes/kubernetes commit id that was tested
- 4 Theses are the kubernetes/test-infra commit ids that were used to build and
Comment on lines +231 to +235
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- 1 This row has the times each Prow job was started, each column on the grid
represents a single run of the Prow job
- 2 This row is the Prow job run id number
- 3 This is the kubernetes/kubernetes commit id that was tested
- 4 Theses are the kubernetes/test-infra commit ids that were used to build and
- 1: The starting times for each Prow job; each column on the grid
represents a single run of the Prow job.
- 2: The Prow job run id number.
- 3: The `kubernetes/kubernetes` commit id that was tested.
- 4: These are most commonly the `kubernetes/test-infra` commit ids that were used to build and

run the Prow Job; kubernetes/test-infra contains CI job definition yaml, builds
for container images used in CI on the Kubernetes project, and also code that
implements a lot of the components used to deliver CI, such as Prow, SpyGlass
and other components.

Click on a cell in the grid to take you to SpyGlass which displays the Prow job
results.

You can also find this data in Triage (see below).

4. **Reason for failure**

Logging an issue brings the flake or failure to the attention of the wider
community, as the issue reporter you do not have to find the reason for failure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
community, as the issue reporter you do not have to find the reason for failure
community; as the issue reporter you do not have to find the reason for failure

right away (nor the solution). You can just log the error reported by the test
when the job was run.

Click on the failed runs (the red cells in the grid) to see the results in
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved
SpyGlass.

For `node-kubelet-master`, we see the following:

![Spyglass Prow Job Results for node-kubelet-master`](./testgrid-images/spyglass-summary-node-kubelet-master.png "Spyglass Prow Job results viewer")

Here we see that 2 tests failed (both related to the node problem detector) and
the `e2e.go: Node Tests` stage was marked as failed (because the node problem
detector tests failed).

You will often see "stages" (steps in an e2e job) as mixed with the tests
themselves. The stages tell you what was going on in the e2e job when an error
occurred.

If we click on the first test error, we will see logs that will (hopefully) help
us figure out why the test failed.
![Spyglass - Prow Job Results viewer](./testgrid-images/spyglass-result.png "Spyglass Prow Job results viewer")

Further down the page you will see all the logs for the entire test run.
Please copy any information you think may be useful from here into the issue.

You can reference a specific line in the logs by click on the line number and
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved
then copying the URL which will now include an anchor to the specific line.

5. **Anything else we need to know**

It is important to review the behavior of the flaking test across a range of
jobs using [Triage].

We can use the Triage tool to see if a test we see failing in a given job has
been failing in others and to understand how jobs are behaving.

For example, we can see how the job we have been looking at has been behaving
recently.

One important detail is that the job names you see on tabs in TestGrid are often
aliases. Job definition details including the job name, the job definition
configuration file and a description of the job can be found below the tab name
in TestGrid with a URL pointing to the yaml file where the job is configured.

For example, when we clicked on a test run for [node-kubelet-master],
the job name, `ci-kubernetes-node-kubelet-features`, can be found at the top left
corner of the Spyglass page (notice the "ci-kubernetes-" prefix).

Then we can run a query on Triage using [ci-kubernetes-node-kubelet-features in
the job field] Note that the Triage query can be bookmarked and can be used as a
deep link that can be added to GitHub issues to assist test maintainers in
understanding what is wrong with a test.

At the time of this writing we saw the following:

<img src="./testgrid-images/triage.png" height="50%" width="100%">

**Note**: notice that you can also improve your query by filtering or excluding
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved
results based on test name or failure text.

Sometimes, Triage will help you find patterns to figure out the root cause of
the problem. In this instance, we can also see that this job has been failing
about 2 times per hour.

### Iterate

Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you see an email from TestGrid mentioning a job or test failure)
and share it with the appropriate SIG in the Kubernetes Slack.

Don't worry if you are not sure how to debug further or how to resolve the
issue! All issues are unique and require a bit of experience to figure out how
to work on them. For the time being, reach out to people in Slack or the mailing
list.
RobertKielty marked this conversation as resolved.
Show resolved Hide resolved

[TestGrid repo]: https://github.com/GoogleCloudPlatform/testgrid
[TestGrid instance]: https://testgrid.k8s.io/

[SIG]: https://github.com/kubernetes/community/blob/master/sig-list.md

[Kubectl client component tests]: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L229
[Kubectl logs]:https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1389
[should be able to retrieve and filter logs]:https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl/kubectl.go#L1412

[Triage Flaky Tests]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake
[Fix Flaky Tests]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+

[Flaky Tests]:https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#flaky-tests

[sig-release-master-blocking]:https://testgrid.k8s.io/sig-release-master-blocking
[sig-release-master-informing]:https://testgrid.k8s.io/sig-release-master-informing

[past releases]:https://testgrid.k8s.io/sig-release

[create a new issue]:https://github.com/kubernetes/kubernetes/issues/new/choose
[create a new issue - Failing Test]:https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffailing-test&template=failing-test.md
[create a new issue - Flaking Test]:https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Fflake&template=flaking-test.md

[Issues logged as Flaky Tests - not triaged]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake
[Issues logged as Flaky Tests - triaged]:https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+kind%2Fflake+-label%3Aneeds-triage+

[CI Signal Board]:https://github.com/orgs/kubernetes/projects/11

[conformance-ga-only]:https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only
[skew-cluster-latest-kubectl-stable1-gce]:https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce
[gci-gce-ingress]:https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress
[kind-master-parallel]:https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel
[node-kubelet-master]:https://testgrid.k8s.io/sig-release-master-blocking#node-kubelet-master

[Triage]:https://go.k8s.io/triage
[ci-kubernetes-node-kubelet-features in the job field]:https://go.k8s.io/triage?pr=1&job=ci-kubernetes-node-kubelet-features
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading