-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added guide for monitoring CI #5244
Conversation
For a better read check out the version from my fork directly: https://github.com/alejandrox1/community/blob/triage-tests/contributors/devel/sig-testing/monitoring.md#fill-out-the-issue Pinging couple people I mentioned this too for review |
@alejandrox1: GitHub didn't allow me to request PR reviews from the following users: merkes, kscharm. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
c8a26a5
to
10ca137
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for taking the time to take that doc and really write it out!!! Most of these comments are grammar/spelling nits (because I am kinda nitpicky I am very sorry) but some are also questions/suggestions!
(also yes I am aware it is 1AM on a Friday night I am so sorry please enjoy your weekend)
## Monitoring the health of Kubernetes with TestGrid | ||
|
||
TestGrid is a highly-configurable, interactive dashboard for viewing your test | ||
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: A better transition would be nice here, like 'it is partially open-sourced so you can view the source code here' or something to that effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To disambiguate even more "the back end is open-sourced"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself
Each SIG has its own set of dashboards and each dashboard is composed of | ||
different end-to-end (e2e) jobs. | ||
E2E jobs are in turn made up of test stages (i.e., bootstraping a Kubernetes | ||
cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grammar nit:
cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client | |
cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., Kubectl client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Picking a different test as an example might be more readable here? For this particular test it feels a little clumsy to read - I wasn't sure whether the test was "Kubectl logs should be able to retrieve and filter logs" and it was under "Kubectl client" (but uh, no turns out "Kubectl client Kubectl logs ..." is the real name)
Once you have filled out the issue, please mention it in the appropriate mailing | ||
list thread (if you see an email from testgrid mentioning a job or test | ||
failure) and share it with the appropriate SIG in the Kubernetes slack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grammar nit:
Once you have filled out the issue, please mention it in the appropriate mailing | |
list thread (if you see an email from testgrid mentioning a job or test | |
failure) and share it with the appropriate SIG in the Kubernetes slack. | |
Once you have filled out the issue, please mention it in the appropriate mailing | |
list thread (if you see an email from TestGrid mentioning a job or test | |
failure) and share it with the appropriate SIG in the Kubernetes Slack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: should we mention it to a mailing list also if we did not see an email from testgrid (e.g. it was a flake)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good
list thread (if you see an email from testgrid mentioning a job or test | ||
failure) and share it with the appropriate SIG in the Kubernetes slack. | ||
|
||
Don't worry if you are not sure how to debug further or how to resole the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a person who is very new, this line is deeply comforting to read. 😅
https://github.com/kubernetes/kubernetes/issues/new/choose , chose the appropriate issue | ||
template. | ||
|
||
### Fill out the issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should talk a little bit about how to title it, e.g. prefix with [Failing Test]
or [Flaky Test]
(depending on what's happening)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I said something similar in a different PR review that wrote instructions on how to file a flake issue #5205 (comment)
Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template.
I opened kubernetes/kubernetes#95528 to cover updating the flake template, maybe it should expand for both.
Sometime Triage will help you find patterns to figure out whats wrong. | ||
In this instance we can also see that this job has been failing rather | ||
frequently (about 2 times per hour). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should also mention adding a relevant SIG through /sig <name>
and/or cc'ing relevant people like /cc @kubernetes/sig-<foo>-test-failures
or @kubernetes/ci-signal
|
||
failed as well. | ||
|
||
If one or both of these jobs continue failing, or if they failed frently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we be more specific in the frequency of the failure? Having an agreement can make it less subjective.
|
||
Further down the page you will see all the logs for the entire test run. | ||
Please copy any information you think may be useful from here into the issue. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe is worth mention the artifacts for more logging of other components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two filename misspellings:
failes-tests.png
-->failed-tests.png
spyglass-sumary.png
-->spyglass-summary.png
Ensure you update any references in the document as well.
I'll hold my content review for now, as @thejoycekung has already done an excellent review to start you off.
I am new to the CI infra. This doc is really informational and provides a great starting point. Thanks! |
* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress, | ||
* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel | ||
|
||
are flaky (we should have some issues opened up for these to investigate why |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will be useful to include the actual issues examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alejandrox1 thank you so much for your great work here 🙂 and thanks to @thejoycekung for the thorough and helpful review! Once some of the suggestions are incorporated, I think this will be a great document to continue to iterate on. One minor nitpick on document name, it kind of makes me think of monitoring in the more traditional software engineering sense of the word, rather than monitoring testgrid and test results. Maybe something like addressing-test-failures.md
or something of that nature?
Furthermore, if jobs or tests are failing or flaking, then pull requests will | ||
take a lot longer to be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be an opportune time to mention the difference between periodic / presubmit / postsubmit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also possibly mentioning metrics
## Monitoring the health of Kubernetes with TestGrid | ||
|
||
TestGrid is a highly-configurable, interactive dashboard for viewing your test | ||
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To disambiguate even more "the back end is open-sourced"
are doing. | ||
|
||
We highly enourage any one to take periodically monitor these dashboards. | ||
If you see that a job or test has been failing please raise an issue with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth mentioning that clicking a failing cell will take you to the job execution in Deck?
Furthermore, if jobs or tests are failing or flaking, then pull requests will | ||
take a lot longer to be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also possibly mentioning metrics
70db858
to
48bae65
Compare
This should help provide context on how CI works and what processes to follow to maintain our tests. Signed-off-by: alejandrox1 <[email protected]>
d9bab58
to
9ceb4ce
Compare
Got through the syntax comments 😅 (sorry for taking so long). |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is at a merge-able state and further improvements can be made iteratively. Thanks for this great write up @alejandrox1!
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: alejandrox1, hasheddan, qiutongs The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@BenTheElder @spiffxp can we get some eyes on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like @thejoycekung raised some valid unaddressed concerns
which we use to monitor and observe the health of the project. | ||
|
||
Each SIG has its own set of dashboards, and each dashboard is composed of | ||
different end-to-end (e2e) jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are more than e2e jobs
different end-to-end (e2e) jobs. | |
different jobs (build, unit test, integration test, end-to-end (e2e) test, etc.) |
## Monitoring the health of Kubernetes with TestGrid | ||
|
||
TestGrid is a highly-configurable, interactive dashboard for viewing your test | ||
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself
These views allow different teams to monitor and understand how their areas | ||
are doing. | ||
|
||
We highly encourage anyone to periodically monitor these dashboards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to encourage toil. SIGs should be periodically monitoring the dashboards related to subprojects they own.
**Note**: It is important that all SIGs periodically monitor their jobs and | ||
tests. These are used to figure out when to release Kubernetes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too broad. Not all jobs/tests are used to figure out when to release Kubernetes.
The number one thing to do is to communicate your findings: a test or job has | ||
been flaking or failing. | ||
If you saw a TestGrid alert on a mailing list, please reply to the thread and | ||
mention that you are looking into it. | ||
It is important to communicate to prevent duplicate work and to ensure CI | ||
problems get attention. | ||
|
||
In order to communicate with the rest of the community and to drive the work, | ||
please open up an issue on Kubernetes, | ||
https://github.com/kubernetes/kubernetes/issues/new/choose, and choose the appropriate issue | ||
template. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest:
- look to see if there is already an open issue for the relevant repo, if not, create one
- the relevant repo for kubernetes/kubernetes release-blocking or merge-blocking jobs is kubernetes/kubernetes
- reply to alert with link to issue
- all further communication on that issue
How do I decide which kubernetes/kubernetes issue template to use?
- if the job is failing continuously, failing test
- if the job is occasionally passing and failing, flaking test
5. **Anything else we need to know** | ||
|
||
There is this wonderful page built by SIG testing that often comes in handy: | ||
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the shortlink, the bucket name is going to change eventually (ref: kubernetes/k8s.io#1305)
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1 | |
https://go.k8s.io/triage |
There is one important detail we have to mention at this point, the job names | ||
you see on TestGrid are often aliases. | ||
For example, when we clicked on a test run for | ||
`node-kubelet-features-master` | ||
( | ||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master | ||
), at the top left corner of spyglass the page tells us the real job name, | ||
`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, I suggested earlier in the doc
`node-kubelet-features-master` | ||
( | ||
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master | ||
), at the top left corner of spyglass the page tells us the real job name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
job name, not tab name
@@ -0,0 +1,212 @@ | |||
# Monitoring Kubernetes Health |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Monitoring Kubernetes Health | |
# Monitoring Kubernetes Test Health |
https://github.com/kubernetes/kubernetes/issues/new/choose , chose the appropriate issue | ||
template. | ||
|
||
### Fill out the issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I said something similar in a different PR review that wrote instructions on how to file a flake issue #5205 (comment)
Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template.
I opened kubernetes/kubernetes#95528 to cover updating the flake template, maybe it should expand for both.
/lgtm |
/hold |
Sounds like we've got a crew to swarm this! Thanks everyone!
ref: https://kubernetes.slack.com/archives/C2C40FMNF/p1611943998099400 |
Please tag me as as reviewer as well |
/close |
@spiffxp: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This should help provide context on how CI works and what processes to
follow to maintain our tests.
Signed-off-by: alejandrox1 [email protected]