
Add e2e test for NPD metric-only mode #296

Closed
xueweiz opened this issue Jun 22, 2019 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

xueweiz (Contributor) commented Jun 22, 2019

Today, NPD has two test pipelines: CI and presubmits.

In both cases, they essentially run unit tests (make test) and the k8s e2e test (kubernetes_e2e.py ... --ginkgo.focus=NodeProblemDetector).

We should add more NPD-focused tests to ensure NPD does not regress, and many of them are best done as e2e tests, e.g.:

  • Trigger/simulate failures (kill -9 [docker], echo c > /proc/sysrq-trigger, stress --vm 2 --vm-bytes 128M --timeout 120s, echo "OOM" > /dev/kmsg, etc.), and check that NPD behaves correctly in each situation.

  • Performance test, e.g. write a busy program that spams "OOM" to /dev/kmsg and check that NPD's CPU usage stays capped. (We could use a cgroup to limit NPD's usage to make the test pass.)

  • Soak test, e.g. let NPD run for ~5 days, then check whether its memory footprint has grown.

Obviously, these tests are best done as e2e tests rather than unit tests. I also think we should run them on standalone machines rather than in a k8s cluster, because the tests should focus on node-level behavior rather than cluster-level behavior.
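As a rough illustration (not an actual test), here is a minimal Ginkgo-style sketch of what such a node-level case could look like: it injects a fake OOM message into the kernel log and polls NPD's Prometheus endpoint. The port 20257 and the problem_counter metric name are assumptions for illustration, and writing to /dev/kmsg requires root.

package e2esketch

import (
	"io/ioutil"
	"net/http"
	"os/exec"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("NPD failure simulation (sketch)", func() {
	It("reports a simulated OOM kill via the metrics endpoint", func() {
		// Inject a fake kernel OOM message, as suggested above. Requires root.
		err := exec.Command("sh", "-c",
			`echo "Memory cgroup out of memory: Kill process 123 (foo)" > /dev/kmsg`).Run()
		Expect(err).NotTo(HaveOccurred())

		// Poll NPD's Prometheus endpoint until the problem shows up.
		// The port (20257) and metric name (problem_counter) are assumptions.
		Eventually(func() string {
			resp, err := http.Get("http://localhost:20257/metrics")
			if err != nil {
				return ""
			}
			defer resp.Body.Close()
			body, _ := ioutil.ReadAll(resp.Body)
			return string(body)
		}, 2*time.Minute, 10*time.Second).Should(ContainSubstring("problem_counter"))
	})
})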

Action items:

xueweiz (Contributor, Author) commented Jun 22, 2019

/assign @xueweiz
/cc @wangzhen127

xueweiz (Contributor, Author) commented Jun 22, 2019

/cc @kewu1992

xueweiz (Contributor, Author) commented Jun 25, 2019

I was trying to find a library to help with SSHing into the VM and running commands.
I was experimenting with k8s.io/kubernetes/test/e2e/framework/ssh, but I could not get it to work.

See the experiment code at https://github.com/xueweiz/node-problem-detector/blob/ssh-do-not-work/test/e2e_standalone/simple_test.go#L8

If I remove the k8s.io/kubernetes/test/e2e/framework/ssh import and its usage, the test runs. But if I keep it, I get this error:

xueweiz@xueweiz:~/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/test/e2e_standalone$ go test simple_test.go 
/tmp/go-build848136485/b001/books.test flag redefined: ginkgo.seed
panic: /tmp/go-build848136485/b001/books.test flag redefined: ginkgo.seed

goroutine 1 [running]:
flag.(*FlagSet).Var(0xc0000521e0, 0x14b92a0, 0x20a8100, 0xc00060e000, 0xb, 0x1374521, 0x2a)
	/usr/local/google/home/xueweiz/.gvm/gos/go1.11/src/flag/flag.go:805 +0x529
flag.(*FlagSet).Int64Var(0xc0000521e0, 0x20a8100, 0xc00060e000, 0xb, 0x5d116df3, 0x1374521, 0x2a)
	/usr/local/google/home/xueweiz/.gvm/gos/go1.11/src/flag/flag.go:630 +0x71
k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/config.Flags(0xc0000521e0, 0xc000463e78, 0x7, 0x1217801)
	/usr/local/google/home/xueweiz/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/config/config.go:68 +0x112
k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo.init.0()
	/usr/local/google/home/xueweiz/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:55 +0x59
FAIL	command-line-arguments	0.174s

I guess this is because k8s.io/kubernetes/test/e2e/framework/ssh is not designed to be a generic library and depends on its own vendored ginkgo, while my test uses my own vendored ginkgo. The two ginkgo copies are probably both registering the ginkgo.seed flag, which triggers this panic.

I think I'll stop trying to use the libraries from k8s.io/kubernetes/test/e2e/framework, as they are really meant to be used in the context of the k8s e2e tests.
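For reference, a minimal sketch of what running a command on the test VM could look like with golang.org/x/crypto/ssh directly, which avoids the framework's vendored ginkgo entirely. The host address, user, and key path are placeholders; this is not the code actually used later.

package e2esketch

import (
	"fmt"
	"io/ioutil"

	"golang.org/x/crypto/ssh"
)

// runOnVM runs a single command on the test VM over SSH and returns its combined output.
func runOnVM(addr, user, keyPath, cmd string) (string, error) {
	key, err := ioutil.ReadFile(keyPath)
	if err != nil {
		return "", err
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		return "", err
	}
	config := &ssh.ClientConfig{
		User:            user,
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for throwaway test VMs
	}
	client, err := ssh.Dial("tcp", addr, config)
	if err != nil {
		return "", fmt.Errorf("dial %s: %v", addr, err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return "", err
	}
	defer session.Close()

	out, err := session.CombinedOutput(cmd)
	return string(out), err
}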

@xueweiz mentioned this issue Jul 25, 2019
This was referenced Aug 10, 2019
wangzhen127 (Member) commented

"Standalone" is a used word already, meaning NPD is running as a native process instead of as a container. So consider use another word to cover the case where NPD runs without k8s API server. Maybe "metrics-only mode" (or something else)? After you decide, you can add that to README, so everyone knows.

xueweiz (Contributor, Author) commented Aug 12, 2019

SG. Let's call it "metrics-only" mode.

xueweiz (Contributor, Author) commented Sep 18, 2019

I just set up a CI job at kubernetes/test-infra#14369. See the test results here:
https://k8s-testgrid.appspot.com/sig-node-node-problem-detector#ci-npd-e2e-test
https://prow.k8s.io/job-history/kubernetes-jenkins/logs/ci-npd-e2e-test

I see some flakes in the test:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-npd-e2e-test/1174205959015239684

There isn't any helpful error message, and I just figured out why:
Ginkgo gives each It()/BeforeEach() a 10-minute timeout by default, and when we create a GCE VM, we also retry SSHing to it with a 10-minute timeout:

// Up to 30 attempts, 20 seconds apart: roughly a 10-minute retry window.
for i := 0; i < 30 && !instanceRunning; i++ {
	if i > 0 {
		time.Sleep(time.Second * 20)
	}
	// ... attempt to SSH into the instance and update instanceRunning ...
}

Until the retries finish (either succeeding or failing), the test code doesn't report the error. So the 10-minute spec timeout kicks in first and kills the test without giving it a chance to report the error.

I will make a PR to reduce the SSH retry timeout, which should give us an error message the next time the flake happens.
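For illustration, a rough sketch of the kind of change I have in mind (the function and parameter names are made up, not the actual code): keep the total SSH retry window well under ginkgo's 10-minute spec timeout, so the helper returns an error that gets logged instead of the spec being killed silently.

package e2esketch

import (
	"fmt"
	"time"
)

// waitForInstanceSSH retries an SSH attempt for roughly 5 minutes in total,
// well under ginkgo's 10-minute spec timeout, and returns the last error on failure.
// trySSH is a placeholder for the real per-attempt SSH check.
func waitForInstanceSSH(instance string, trySSH func(string) error) error {
	const (
		attempts = 15 // 15 attempts, 20s apart: roughly 5 minutes in total
		interval = 20 * time.Second
	)
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		if lastErr = trySSH(instance); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("instance %q not reachable over SSH after %d attempts: %v", instance, attempts, lastErr)
}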

xueweiz (Contributor, Author) commented Dec 11, 2019

The test has been running well now :) except for the occasional timeout.

We can consider adding it as part of presubmit now.

This will help catch NPD regressions and will ease the debugging of the occasional timeouts.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 10, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 9, 2020
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
