
Add e2e test for NPD metric-only mode #296

Closed
xueweiz opened this issue Jun 22, 2019 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

xueweiz (Contributor) commented Jun 22, 2019

Today, NPD has two test pipelines: CI and presubmits.

In both cases, they essentially run unit tests (make test) and the k8s e2e test (kubernetes_e2e.py ... --ginkgo.focus=NodeProblemDetector).

We should add more NPD-focused tests to ensure NPD does not regress, and many of them are best done as e2e tests, e.g.:

  • Trigger/simulate failures (kill -9 [docker], echo c > /proc/sysrq-trigger, stress --vm 2 --vm-bytes 128M --timeout 120s, echo "OOM" > /dev/kmsg, etc.), and check that NPD behaves correctly in each situation.

  • Performance test, e.g. write a busy program that spams "OOM" to /dev/kmsg and check that NPD's CPU usage stays capped. (We could use a cgroup to limit NPD's usage to make the test pass.)

  • Soak test, e.g. let NPD run for ~5 days, then check whether its memory footprint has grown.

Obviously, these tests are best done as e2e tests rather than unit tests. I also think we should run them on standalone machines rather than in a k8s cluster, because the tests should focus on node-level behavior rather than cluster-level behavior.
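As a rough illustration (not an actual test), here is a minimal Ginkgo-style sketch of what such a node-level case could look like: it injects a fake OOM message into the kernel log and polls NPD's Prometheus endpoint. The port 20257 and the problem_counter metric name are assumptions for illustration, and writing to /dev/kmsg requires root.

package e2esketch

import (
	"io/ioutil"
	"net/http"
	"os/exec"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("NPD failure simulation (sketch)", func() {
	It("reports a simulated OOM kill via the metrics endpoint", func() {
		// Inject a fake kernel OOM message, as suggested above. Requires root.
		err := exec.Command("sh", "-c",
			`echo "Memory cgroup out of memory: Kill process 123 (foo)" > /dev/kmsg`).Run()
		Expect(err).NotTo(HaveOccurred())

		// Poll NPD's Prometheus endpoint until the problem shows up.
		// The port (20257) and metric name (problem_counter) are assumptions.
		Eventually(func() string {
			resp, err := http.Get("http://localhost:20257/metrics")
			if err != nil {
				return ""
			}
			defer resp.Body.Close()
			body, _ := ioutil.ReadAll(resp.Body)
			return string(body)
		}, 2*time.Minute, 10*time.Second).Should(ContainSubstring("problem_counter"))
	})
})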

Action items:

xueweiz (Contributor, Author) commented Jun 22, 2019

/assign @xueweiz
/cc @wangzhen127

xueweiz (Contributor, Author) commented Jun 22, 2019

/cc @kewu1992

xueweiz (Contributor, Author) commented Jun 25, 2019

I was trying to find a library to help with SSHing into the VM and running commands.
I was experimenting with k8s.io/kubernetes/test/e2e/framework/ssh, but I could not get it to work.

See the experiment code at https://github.com/xueweiz/node-problem-detector/blob/ssh-do-not-work/test/e2e_standalone/simple_test.go#L8

If I remove the k8s.io/kubernetes/test/e2e/framework/ssh import and its usage, the test runs. But if I keep it, I get this error:

xueweiz@xueweiz:~/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/test/e2e_standalone$ go test simple_test.go 
/tmp/go-build848136485/b001/books.test flag redefined: ginkgo.seed
panic: /tmp/go-build848136485/b001/books.test flag redefined: ginkgo.seed

goroutine 1 [running]:
flag.(*FlagSet).Var(0xc0000521e0, 0x14b92a0, 0x20a8100, 0xc00060e000, 0xb, 0x1374521, 0x2a)
	/usr/local/google/home/xueweiz/.gvm/gos/go1.11/src/flag/flag.go:805 +0x529
flag.(*FlagSet).Int64Var(0xc0000521e0, 0x20a8100, 0xc00060e000, 0xb, 0x5d116df3, 0x1374521, 0x2a)
	/usr/local/google/home/xueweiz/.gvm/gos/go1.11/src/flag/flag.go:630 +0x71
k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/config.Flags(0xc0000521e0, 0xc000463e78, 0x7, 0x1217801)
	/usr/local/google/home/xueweiz/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/config/config.go:68 +0x112
k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo.init.0()
	/usr/local/google/home/xueweiz/.gvm/pkgsets/go1.11/global/src/k8s.io/node-problem-detector/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:55 +0x59
FAIL	command-line-arguments	0.174s

I guess this is because k8s.io/kubernetes/test/e2e/framework/ssh is not designed to be a generic library and depends on its own vendored ginkgo, while my test uses my own vendored ginkgo. The two ginkgo copies are probably both registering the ginkgo.seed flag, which triggers this panic.

I think I'll stop trying to use the libraries from k8s.io/kubernetes/test/e2e/framework, as they are really meant to be used in the context of the k8s e2e tests.
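For reference, a minimal sketch of what running a command on the test VM could look like with golang.org/x/crypto/ssh directly, which avoids the framework's vendored ginkgo entirely. The host address, user, and key path are placeholders; this is not the code actually used later.

package e2esketch

import (
	"fmt"
	"io/ioutil"

	"golang.org/x/crypto/ssh"
)

// runOnVM runs a single command on the test VM over SSH and returns its combined output.
func runOnVM(addr, user, keyPath, cmd string) (string, error) {
	key, err := ioutil.ReadFile(keyPath)
	if err != nil {
		return "", err
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		return "", err
	}
	config := &ssh.ClientConfig{
		User:            user,
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for throwaway test VMs
	}
	client, err := ssh.Dial("tcp", addr, config)
	if err != nil {
		return "", fmt.Errorf("dial %s: %v", addr, err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return "", err
	}
	defer session.Close()

	out, err := session.CombinedOutput(cmd)
	return string(out), err
}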

@xueweiz mentioned this issue Jul 25, 2019
This was referenced Aug 10, 2019
wangzhen127 (Member) commented

"Standalone" is a used word already, meaning NPD is running as a native process instead of as a container. So consider use another word to cover the case where NPD runs without k8s API server. Maybe "metrics-only mode" (or something else)? After you decide, you can add that to README, so everyone knows.

xueweiz (Contributor, Author) commented Aug 12, 2019

SG. Let's call it "metrics-only" mode.

xueweiz (Contributor, Author) commented Sep 18, 2019

I just set up a CI job at kubernetes/test-infra#14369. See the test results here:
https://k8s-testgrid.appspot.com/sig-node-node-problem-detector#ci-npd-e2e-test
https://prow.k8s.io/job-history/kubernetes-jenkins/logs/ci-npd-e2e-test

I see some flakes in the test:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-npd-e2e-test/1174205959015239684

There isn't any helpful error message, and I just figured out why:
Ginkgo gives each It()/BeforeEach() a 10-minute timeout by default, and when we create a GCE VM, we also retry SSHing to it with a 10-minute timeout:

// Up to 30 attempts, 20 seconds apart: roughly a 10-minute retry window.
for i := 0; i < 30 && !instanceRunning; i++ {
	if i > 0 {
		time.Sleep(time.Second * 20)
	}
	// ... attempt to SSH into the instance and update instanceRunning ...
}

Until the retries finish (either succeeding or failing), the test code doesn't report the error. So the 10-minute spec timeout kicks in first and kills the test without giving it a chance to report the error.

I will make a PR to reduce the SSH retry timeout, which should give us an error message the next time the flake happens.
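For illustration, a rough sketch of the kind of change I have in mind (the function and parameter names are made up, not the actual code): keep the total SSH retry window well under ginkgo's 10-minute spec timeout, so the helper returns an error that gets logged instead of the spec being killed silently.

package e2esketch

import (
	"fmt"
	"time"
)

// waitForInstanceSSH retries an SSH attempt for roughly 5 minutes in total,
// well under ginkgo's 10-minute spec timeout, and returns the last error on failure.
// trySSH is a placeholder for the real per-attempt SSH check.
func waitForInstanceSSH(instance string, trySSH func(string) error) error {
	const (
		attempts = 15 // 15 attempts, 20s apart: roughly 5 minutes in total
		interval = 20 * time.Second
	)
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		if lastErr = trySSH(instance); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("instance %q not reachable over SSH after %d attempts: %v", instance, attempts, lastErr)
}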

xueweiz (Contributor, Author) commented Dec 11, 2019

The test has been running well now :) except for the occasional timeout.

We can consider adding it as part of presubmit now.

This will help catch NPD regressions and will ease the debugging of the occasional timeouts.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 10, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 9, 2020
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
