Added guide for monitoring CI #5244

alejandrox1 · 2020-10-16T22:48:38Z

This should help provide context on how CI works and what processes to
follow to maintain our tests.

Signed-off-by: alejandrox1 [email protected]

alejandrox1 · 2020-10-16T22:52:35Z

For a better read check out the version from my fork directly: https://github.com/alejandrox1/community/blob/triage-tests/contributors/devel/sig-testing/monitoring.md#fill-out-the-issue

Pinging a couple of people I mentioned this to for review
/cc @hasheddan @SergeyKanzhelev @MHBauer @Merkes @kscharm @knabben @RobertKielty
ptal

k8s-ci-robot · 2020-10-16T22:52:39Z

@alejandrox1: GitHub didn't allow me to request PR reviews from the following users: merkes, kscharm.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

thejoycekung

Thank you so much for taking the time to take that doc and really write it out!!! Most of these comments are grammar/spelling nits (because I am kinda nitpicky I am very sorry) but some are also questions/suggestions!

(also yes I am aware it is 1AM on a Friday night I am so sorry please enjoy your weekend)

thejoycekung · 2020-10-17T03:51:24Z

contributors/devel/sig-testing/monitoring.md

+## Monitoring the health of Kubernetes with TestGrid
+
+TestGrid is a highly-configurable, interactive dashboard for viewing your test
+results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.


nit: A better transition would be nice here, like 'it is partially open-sourced so you can view the source code here' or something to that effect?

thejoycekung · 2020-10-17T04:01:57Z

contributors/devel/sig-testing/monitoring.md

+Each SIG has its own set of dashboards and each dashboard is composed of
+different end-to-end (e2e) jobs.
+E2E jobs are in turn made up of test stages (i.e., bootstraping a Kubernetes
+cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client


grammar nit:

Suggested change

-cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client
+cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., Kubectl client

suggestion: Picking a different test as an example might be more readable here? For this particular test it feels a little clumsy to read - I wasn't sure whether the test was "Kubectl logs should be able to retrieve and filter logs" and it was under "Kubectl client" (but uh, no turns out "Kubectl client Kubectl logs ..." is the real name)

thejoycekung · 2020-10-17T04:56:15Z

contributors/devel/sig-testing/monitoring.md

+Once you have filled out the issue, please mention it in the appropriate mailing
+list thread (if you see an email from testgrid mentioning a job or test
+failure) and share it with the appropriate SIG in the Kubernetes slack.


grammar nit:

Suggested change

-Once you have filled out the issue, please mention it in the appropriate mailing
-list thread (if you see an email from testgrid mentioning a job or test
-failure) and share it with the appropriate SIG in the Kubernetes slack.
+Once you have filled out the issue, please mention it in the appropriate mailing
+list thread (if you see an email from TestGrid mentioning a job or test
+failure) and share it with the appropriate SIG in the Kubernetes Slack.

question: should we mention it to a mailing list also if we did not see an email from testgrid (e.g. it was a flake)?

I think it'd be good

thejoycekung · 2020-10-17T05:06:11Z

contributors/devel/sig-testing/monitoring.md

+list thread (if you see an email from testgrid mentioning a job or test
+failure) and share it with the appropriate SIG in the Kubernetes slack.
+
+Don't worry if you are not sure how to debug further or how to resole the


As a person who is very new, this line is deeply comforting to read. 😅

thejoycekung · 2020-10-17T05:08:27Z

contributors/devel/sig-testing/monitoring.md

+https://github.com/kubernetes/kubernetes/issues/new/choose , chose the appropriate issue
+template.
+
+### Fill out the issue


Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should talk a little bit about how to title it, e.g. prefix with [Failing Test] or [Flaky Test] (depending on what's happening)

thejoycekung · 2020-10-17T05:13:02Z

contributors/devel/sig-testing/monitoring.md

+Sometime Triage will help you find patterns to figure out whats wrong.
+In this instance we can also see that this job has been failing rather
+frequently (about 2 times per hour).
+


Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should also mention adding a relevant SIG through /sig <name> and/or cc'ing relevant people like /cc @kubernetes/sig-<foo>-test-failures or @kubernetes/ci-signal
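
A minimal sketch of how the title prefix and the `/sig` and `/cc` commands suggested above could be put together. The helper name `build_issue`, the `TITLE_PREFIXES` table, and the `cli` SIG value are illustrative assumptions, not part of any existing Kubernetes tooling:

```python
from typing import Tuple

# Hypothetical mapping from the kind of problem to the title prefix
# proposed in the review comments.
TITLE_PREFIXES = {
    "failing": "[Failing Test]",
    "flaky": "[Flaky Test]",
}


def build_issue(kind: str, test_name: str, sig: str) -> Tuple[str, str]:
    """Return (title, commands) for a test failure issue.

    kind is "failing" or "flaky"; sig is the short SIG name, e.g. "cli".
    """
    if kind not in TITLE_PREFIXES:
        raise ValueError(f"unknown kind: {kind!r}")
    title = f"{TITLE_PREFIXES[kind]} {test_name}"
    # Commands to paste into the issue body, one per line.
    commands = "\n".join([
        f"/sig {sig}",
        f"/cc @kubernetes/sig-{sig}-test-failures",
    ])
    return title, commands


if __name__ == "__main__":
    title, commands = build_issue(
        "flaky",
        "Kubectl client Kubectl logs should be able to retrieve and filter logs",
        "cli",  # assumed SIG, for illustration only
    )
    print(title)
    print(commands)
```

Whether the generated team handle exists for a given SIG would need to be checked before using it.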

knabben · 2020-10-17T14:18:05Z

contributors/devel/sig-testing/monitoring.md

+
+failed as well.
+
+If one or both of these jobs continue failing, or if they failed frently


Could we be more specific in the frequency of the failure? Having an agreement can make it less subjective.

knabben · 2020-10-17T14:20:05Z

contributors/devel/sig-testing/monitoring.md

+
+Further down the page you will see all the logs for the entire test run.
+Please copy any information you think may be useful from here into the issue.
+


Maybe it is worth mentioning the artifacts for more logging of other components.

justaugustus

Two filename misspellings:

failes-tests.png --> failed-tests.png
spyglass-sumary.png --> spyglass-summary.png

Ensure you update any references in the document as well.

I'll hold my content review for now, as @thejoycekung has already done an excellent review to start you off.

qiutongs · 2020-10-19T23:19:25Z

I am new to the CI infra. This doc is really informative and provides a great starting point. Thanks!

SergeyKanzhelev · 2020-10-19T23:56:00Z

contributors/devel/sig-testing/monitoring.md

+* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress,
+* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel
+
+are flaky (we should have some issues opened up for these to investigate why


It would be useful to include examples of the actual issues.

hasheddan

@alejandrox1 thank you so much for your great work here 🙂 and thanks to @thejoycekung for the thorough and helpful review! Once some of the suggestions are incorporated, I think this will be a great document to continue to iterate on. One minor nitpick on document name, it kind of makes me think of monitoring in the more traditional software engineering sense of the word, rather than monitoring testgrid and test results. Maybe something like addressing-test-failures.md or something of that nature?

hasheddan · 2020-10-20T01:51:17Z

contributors/devel/sig-testing/monitoring.md

+Furthermore, if jobs or tests are failing or flaking, then pull requests will
+take a lot longer to be merged.


Might be an opportune time to mention the difference between periodic / presubmit / postsubmit

MushuEE · 2020-10-23T21:20:07Z

contributors/devel/sig-testing/monitoring.md

+## Monitoring the health of Kubernetes with TestGrid
+
+TestGrid is a highly-configurable, interactive dashboard for viewing your test
+results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.


To disambiguate even more "the back end is open-sourced"

MushuEE · 2020-10-23T21:22:07Z

contributors/devel/sig-testing/monitoring.md

+are doing.
+
+We highly enourage any one to take periodically monitor these dashboards.
+If you see that a job or test has been failing please raise an issue with the


Is it worth mentioning that clicking a failing cell will take you to the job execution in Deck?

MushuEE · 2020-10-23T21:23:02Z

contributors/devel/sig-testing/monitoring.md

+Furthermore, if jobs or tests are failing or flaking, then pull requests will
+take a lot longer to be merged.


Also possibly mentioning metrics

alejandrox1 · 2020-12-07T16:51:08Z

Got through the syntax comments 😅 (sorry for taking so long).
Still pondering the open questions.

SergeyKanzhelev · 2020-12-07T17:56:23Z

/lgtm

hasheddan

I think this PR is at a merge-able state and further improvements can be made iteratively. Thanks for this great write up @alejandrox1!

/lgtm

k8s-ci-robot · 2020-12-17T19:13:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alejandrox1, hasheddan, qiutongs
To complete the pull request process, please assign bentheelder after the PR has been reviewed.
You can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

contributors/devel/sig-testing/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hasheddan · 2020-12-17T19:18:40Z

@BenTheElder @spiffxp can we get some eyes on this?

spiffxp

I feel like @thejoycekung raised some valid unaddressed concerns

spiffxp · 2020-11-30T19:36:05Z

contributors/devel/sig-testing/monitoring.md

+which we use to monitor and observe the health of the project.
+
+Each SIG has its own set of dashboards, and each dashboard is composed of
+different end-to-end (e2e) jobs.


there are more than e2e jobs

Suggested change

-different end-to-end (e2e) jobs.
+different jobs (build, unit test, integration test, end-to-end (e2e) test, etc.)

spiffxp · 2021-01-19T21:29:54Z

contributors/devel/sig-testing/monitoring.md

+## Monitoring the health of Kubernetes with TestGrid
+
+TestGrid is a highly-configurable, interactive dashboard for viewing your test
+results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.


+1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself

spiffxp · 2021-01-19T21:33:51Z

contributors/devel/sig-testing/monitoring.md

+These views allow different teams to monitor and understand how their areas
+are doing.
+
+We highly encourage anyone to periodically monitor these dashboards.


I don't want to encourage toil. SIGs should be periodically monitoring the dashboards related to subprojects they own.

spiffxp · 2021-01-19T21:37:11Z

contributors/devel/sig-testing/monitoring.md

+**Note**: It is important that all SIGs periodically monitor their jobs and
+tests. These are used to figure out when to release Kubernetes.


This is too broad. Not all jobs/tests are used to figure out when to release Kubernetes.

spiffxp · 2021-01-19T21:39:52Z

contributors/devel/sig-testing/monitoring.md

+The number one thing to do is to communicate your findings: a test or job has
+been flaking or failing.
+If you saw a TestGrid alert on a mailing list, please reply to the thread and
+mention that you are looking into it.
+It is important to communicate to prevent duplicate work and to ensure CI
+problems get attention.
+
+In order to communicate with the rest of the community and to drive the work,
+please open up an issue on Kubernetes,
+https://github.com/kubernetes/kubernetes/issues/new/choose, and choose the appropriate issue
+template.


I would suggest:

* look to see if there is already an open issue for the relevant repo, if not, create one
* the relevant repo for kubernetes/kubernetes release-blocking or merge-blocking jobs is kubernetes/kubernetes
* reply to alert with link to issue
* all further communication on that issue

How do I decide which kubernetes/kubernetes issue template to use?

* if the job is failing continuously, failing test
* if the job is occasionally passing and failing, flaking test
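
The template rule above can be sketched roughly as follows. This is only an illustration: the function name `choose_issue_template` and the representation of a job's history as a list of booleans are assumptions, and the sketch does not settle how many runs should be considered.

```python
from typing import Sequence


def choose_issue_template(recent_results: Sequence[bool]) -> str:
    """Pick an issue template from a job's recent runs.

    recent_results holds one entry per run: True for a pass,
    False for a failure. Hypothetical helper, not real tooling.
    """
    if not recent_results:
        raise ValueError("need at least one run to classify")
    if all(recent_results):
        # Every run passed, so there is nothing to report.
        return "none"
    if not any(recent_results):
        # Failing continuously.
        return "failing test"
    # Occasionally passing and failing.
    return "flaking test"


if __name__ == "__main__":
    print(choose_issue_template([False, False, False]))
    print(choose_issue_template([True, False, True, True]))
```

How long a window of runs to look at is left to the caller, which is the same open question raised earlier in this review about making failure frequency less subjective.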

spiffxp · 2021-01-19T21:58:15Z

contributors/devel/sig-testing/monitoring.md

+5. **Anything else we need to know**
+
+There is this wonderful page built by SIG testing that often comes in handy:
+https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1


Please use the shortlink, the bucket name is going to change eventually (ref: kubernetes/k8s.io#1305)

Suggested change

-https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1
+https://go.k8s.io/triage

spiffxp · 2021-01-19T21:59:57Z

contributors/devel/sig-testing/monitoring.md

+There is one important detail we have to mention at this point, the job names
+you see on TestGrid are often aliases.
+For example, when we clicked on a test run for
+`node-kubelet-features-master`
+(
+https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master
+), at the top left corner of spyglass the page tells us the real job name,
+`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix).


agree, I suggested earlier in the doc

spiffxp · 2021-01-19T22:00:18Z

contributors/devel/sig-testing/monitoring.md

+`node-kubelet-features-master`
+(
+https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master
+), at the top left corner of spyglass the page tells us the real job name,


job name, not tab name

spiffxp · 2021-01-19T22:04:48Z

contributors/devel/sig-testing/monitoring.md

@@ -0,0 +1,212 @@
+# Monitoring Kubernetes Health


Suggested change

-# Monitoring Kubernetes Health
+# Monitoring Kubernetes Test Health

spiffxp · 2021-01-19T22:11:20Z

contributors/devel/sig-testing/monitoring.md

+https://github.com/kubernetes/kubernetes/issues/new/choose , chose the appropriate issue
+template.
+
+### Fill out the issue


I said something similar in a different PR review that wrote instructions on how to file a flake issue #5205 (comment)

Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template.

I opened kubernetes/kubernetes#95528 to cover updating the flake template, maybe it should expand for both.

knabben · 2021-01-20T13:12:06Z

/lgtm

spiffxp · 2021-01-28T23:13:37Z

/hold
#5244 (review) remains unaddressed

justaugustus · 2021-01-29T20:01:44Z

Sounds like we've got a crew to swarm this! Thanks everyone!

@RobertKielty --

* scoop the branch
* resubmit a PR
* start addressing the feedback
* tag @thejoycekung @hasheddan @jeremyrickard as reviewers

ref: https://kubernetes.slack.com/archives/C2C40FMNF/p1611943998099400

spiffxp · 2021-01-29T20:54:16Z

Please tag me as a reviewer as well

spiffxp · 2021-02-05T21:45:24Z

/close
Since @RobertKielty has picked this up over in #5449

k8s-ci-robot · 2021-02-05T21:45:33Z

@spiffxp: Closed this PR.

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2020

k8s-ci-robot requested review from spiffxp and stevekuznetsov October 16, 2020 22:48

k8s-ci-robot added area/developer-guide Issues or PRs related to the developer guide sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 16, 2020

k8s-ci-robot requested review from hasheddan, SergeyKanzhelev, MHBauer and knabben October 16, 2020 22:52

k8s-ci-robot requested a review from RobertKielty October 16, 2020 22:52

alejandrox1 force-pushed the triage-tests branch from c8a26a5 to 10ca137 Compare October 16, 2020 22:54

thejoycekung suggested changes Oct 17, 2020

knabben reviewed Oct 17, 2020

knabben reviewed Oct 17, 2020

justaugustus suggested changes Oct 17, 2020

qiutongs approved these changes Oct 19, 2020

SergeyKanzhelev reviewed Oct 19, 2020

hasheddan reviewed Oct 20, 2020

MushuEE reviewed Oct 23, 2020

alejandrox1 force-pushed the triage-tests branch 6 times, most recently from 70db858 to 48bae65 Compare November 9, 2020 19:07

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020

Added guide for monitoring CI

9ceb4ce

This should help provide context on how CI works and what processes to follow to maintain our tests. Signed-off-by: alejandrox1 <[email protected]>

alejandrox1 force-pushed the triage-tests branch from d9bab58 to 9ceb4ce Compare December 7, 2020 16:50

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020

k8s-ci-robot assigned SergeyKanzhelev Dec 7, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020

qiutongs mentioned this pull request Dec 9, 2020

REQUEST: New membership for qiutongs kubernetes/org#2373

Closed

6 tasks

hasheddan approved these changes Dec 17, 2020

k8s-ci-robot assigned hasheddan Dec 17, 2020

spiffxp suggested changes Jan 19, 2021

k8s-ci-robot assigned knabben Jan 20, 2021

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 28, 2021

This was referenced Jan 30, 2021

Guide for monitoring tests run by CI on Kubernetes - final push #5448

Closed

Guide for monitoring tests run by CI - final push #5449

Merged

k8s-ci-robot closed this Feb 5, 2021
