Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a simple e2e test #323

Merged
merged 2 commits into from
Aug 16, 2019
Merged

Add a simple e2e test #323

merged 2 commits into from
Aug 16, 2019

Conversation

xueweiz
Copy link
Contributor

@xueweiz xueweiz commented Aug 10, 2019

This PR adds an e2e test to verify that NPD will report metric host_uptime in Prometheus format on a clean VM. This is part of #296

I verified this test locally:

ZONE=us-central1-a PROJECT=xueweiz-experimental \
VM_IMAGE=cos-73-11647-217-0 IMAGE_PROJECT=cos-cloud \
SSH_USER=${USER} SSH_KEY=~/.ssh/id_rsa make e2e-test

I also have an internal Prow job running this test to verify that it works in Prow environment.
See the continuous Prow jobs here, a sample run result here, and a testgrid setup here.

I will later publish the Prow pipeline setup publicly after we have this PR merged, so that we will have a public CI job and testgrid.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 10, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @xueweiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 10, 2019
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 10, 2019

/assign @wangzhen127
/cc @Random-Liu
/cc @andyxning

@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 10, 2019

/cc @krzyzacy
And huge thanks to Sen for helping out with the test design & setup!

@k8s-ci-robot k8s-ci-robot requested a review from krzyzacy August 10, 2019 00:28
test/e2e/containers/prow-npd-e2e/Makefile Outdated Show resolved Hide resolved
test/e2e/containers/prow-npd-e2e/script.sh Outdated Show resolved Hide resolved
@andyxning
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 11, 2019
Copy link
Member

@wangzhen127 wangzhen127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding e2e tests! I left a few comments.

config/systemd/node-problem-detector-standalone.service Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
test/install.sh Outdated Show resolved Hide resolved
test/e2e/lib/gce/gce.go Show resolved Hide resolved
test/e2e/lib/npd/npd.go Outdated Show resolved Hide resolved
test/e2e/standalone/basic_metrics_test.go Outdated Show resolved Hide resolved
test/e2e/standalone/e2e_npd_test.go Outdated Show resolved Hide resolved
test/install.sh Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
test/e2e/containers/prow-npd-e2e/Dockerfile Outdated Show resolved Hide resolved
@xueweiz xueweiz force-pushed the test branch 2 times, most recently from 9b5b2ed to b7fb568 Compare August 13, 2019 00:12
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 13, 2019

Hi Zhen, thanks for the review! I just address your comments.

Makefile Show resolved Hide resolved
pkg/util/metrics/helpers.go Outdated Show resolved Hide resolved
test/e2e/lib/gce/gce.go Show resolved Hide resolved
test/e2e/lib/npd/npd.go Show resolved Hide resolved
@xueweiz xueweiz force-pushed the test branch 2 times, most recently from 6599e5d to f04baab Compare August 13, 2019 20:11
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 13, 2019

Hi Zhen, I just fixed the above problems and added some printing for debugging (useful when test fails).

@xueweiz xueweiz force-pushed the test branch 4 times, most recently from 31fc50f to 872727b Compare August 13, 2019 21:30
@xueweiz xueweiz force-pushed the test branch 4 times, most recently from 12bedf4 to 66621dd Compare August 13, 2019 23:17
@xueweiz xueweiz force-pushed the test branch 2 times, most recently from f0563f1 to 222246b Compare August 14, 2019 00:36
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 14, 2019

I just added some retry logic in the test, so that it will give NPD enough time to fully start.
Can you help take a look again? Thanks! :)

})

ginkgo.AfterEach(func() {
npdMetrics := instance.RunCommand("curl http://localhost:20257/metrics")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this and below printing journal log for debugging? Would it make the output look too verbose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these prints are for debugging. I think it's reasonable to print out the NPD reported metrics. Because after all, they are the key subject that we are asserting against (e.g. NPD has this metric and that metric...).

In my opinion, the output looks OK for now. See example at https://gubernator-internal.googleplex.com/build/gke-prow/logs/ci-npd-e2e-cos-experimental/1161702249425014784/

I printed out the journal logs, mainly to catch invalid config file problem. i.e. The case where someone changed some config / arguments and made NPD panic just when starting.

However, I totally agree that it's best if we do not print out journal logs. They are not the tested subject, and are merely used for debugging the tested subject. And most importantly, they CAN potentially be large when the test gets more complicated.

I think the best practice for uploading large debugging text in testing, is to tar these texts and upload them to some test artifact storage. Do you know whether this can be done with Prow? Can you point me to some code sample?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about putting those into artifacts? For example, current NPD k8s e2e test stores node-problem-detector.log as one of the artifact. See the following artifacts from a CI job.
https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-npd-e2e-kubernetes-gce-gci/1161681249476022272/artifacts/e2e-3e359a827e-7abb6-minion-group-7pkc/

@krzyzacy may know how to set that up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

// Wait for NPD to be ready for a maximum of 120 seconds.
err = retry.Do(verifyUptimeMetric,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some generic way of waiting for NPD coming up, put that into some function and call it after npd.SetupNPD.

If you think waiting for uptime metric is the easiest way to check NPD is up, then put this into that function, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG. Added WaitForNPD() function.

I don't want to hard code the "ready signal" (e.g. host_uptime metric) in the function. Mainly because NPD is configurable. We might test it under different configurations, which will yield different ready signals.

I'm also think about maybe we should use the systemd service property as another ready signal. But in my opinion, those are minor details, and we can improve them along the way.

@xueweiz xueweiz force-pushed the test branch 2 times, most recently from 3555124 to c143b99 Compare August 14, 2019 21:21
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 14, 2019

Hi Zhen, I just added the wait function, and uploaded these debugging data as test artifacts. Could you help take another look? Thanks!

@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 14, 2019

The above test failures (1 2) seems to be some test infra problem. Perhaps the docker-in-docker container has a bug in it.

Let's ignore it for now.

@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 15, 2019

/retest

Copy link
Member

@wangzhen127 wangzhen127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits on naming

test/e2e/lib/npd/npd.go Outdated Show resolved Hide resolved
test/e2e/lib/npd/npd.go Outdated Show resolved Hide resolved
The first test is a very simple test. It installs NPD on a VM, and then
verifies that NPD reports metric host_uptime in Prometheus format.
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 16, 2019

Hi Zhen, I just fixed the artifact names :) Could you help take another look? Thanks!

@wangzhen127
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wangzhen127, xueweiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2019
@xueweiz
Copy link
Contributor Author

xueweiz commented Aug 16, 2019

/retest

This is the above failure:

$ git fetch https://github.com/kubernetes/node-problem-detector.git master
fatal: unable to access 'https://github.com/kubernetes/node-problem-detector.git/': Could not resolve host: github.com
# Error: exit status 128

I feel like it's an infra problem. Retrying.

@k8s-ci-robot k8s-ci-robot merged commit 424b864 into kubernetes:master Aug 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants