Add e2e test for NPD metric-only mode #296
Comments
/assign @xueweiz |
/cc @kewu1992 |
I was trying to find a library to help with SSHing and running commands in the VM. See the experiment code at https://github.com/xueweiz/node-problem-detector/blob/ssh-do-not-work/test/e2e_standalone/simple_test.go#L8 If you remove the
I guess this is because the I think I'll stop trying to use the libraries from |
"Standalone" is already in use: it means NPD is running as a native process instead of as a container. So consider using another word for the case where NPD runs without the k8s API server. Maybe "metrics-only mode" (or something else)? Once you decide, you can add it to the README so everyone knows. |
SG. Let's call it "metrics-only" mode. |
I just set up a CI job at kubernetes/test-infra#14369. See the test results here: I see some flakes in the test: There isn't any helpful error message, and I just figured out why: node-problem-detector/test/e2e/lib/gce/instance.go Lines 99 to 102 in 9828ab7
The test code doesn't report the error until the retry finishes (whether it succeeds or fails). As a result, the 10-minute test timeout fires first and kills the test before it gets a chance to report the error. I will make a PR to reduce the SSH retry timeout, which should give us an error message the next time the flake happens. |
@fejta-bot: Closing this issue. |
Today, NPD has two test pipelines: CI and presubmits.
In both cases, they essentially run the unit tests (`make test`) and the k8s e2e test (`kubernetes_e2e.py ... --ginkgo.focus=NodeProblemDetector`).
We should add more NPD-focused tests to ensure it does not regress. Many of them are best done as e2e tests, e.g.:
- Trigger/simulate failures (`kill -9 [docker]`, `echo c > /proc/sysrq-trigger`, `stress --vm 2 --vm-bytes 128M --timeout 120s`, `echo "OOM" > /dev/kmsg`, etc.) and see whether NPD behaves correctly in each situation.
- Performance test, e.g. write a busy program that spams "OOM" to `/dev/kmsg`, and see whether NPD's CPU usage is capped. (We could use cgroups to limit NPD's usage to make the test pass.)
- Soak test, e.g. let NPD run for ~5 days, then see whether its memory footprint has grown.
Obviously, these are best done as e2e tests rather than unit tests. Also, I think we should run them on standalone machines rather than in a k8s cluster, because the tests should focus on node-level behavior rather than cluster-level behavior.
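For the soak test idea, one possible node-level check is to sample the resident memory of the NPD process from `/proc/<pid>/status` at the start and end of the run and compare the two. The sketch below assumes the usual Linux `VmRSS:` line format; `parseVmRSSKB`, `withinGrowth`, and the growth threshold are hypothetical and only illustrate the comparison.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseVmRSSKB (hypothetical helper) extracts the VmRSS value, in kB, from
// the contents of /proc/<pid>/status.
func parseVmRSSKB(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

// withinGrowth reports whether currentKB exceeds baselineKB by at most the
// fraction maxGrowth (e.g. 0.10 allows up to 10% growth).
func withinGrowth(baselineKB, currentKB int64, maxGrowth float64) bool {
	return float64(currentKB) <= float64(baselineKB)*(1+maxGrowth)
}
```

A soak test could read the status file over SSH shortly after NPD starts and again at the end of the run, then fail if `withinGrowth` returns false.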
Action items:
- Add basic logistic e2e test to ensure NPD reports some expected Prometheus metrics and Stackdriver metrics: Add a simple e2e test #323, Set SSH timeout to 5 minutes #352
- Add performance e2e test to ensure NPD can handle medium/large log volume:
- Set up CI jobs in Prow to run NPD e2e tests on different distros (cos-73-lts, cos-77-lts, ubuntu-1804-lts): Allow e2e test to rent project from Boskos #349, Allow e2e test to pick up test VM image using image family #353, Add ci-npd-e2e-test job to run npd e2e tests test-infra#14372, ci-npd-e2e-test: install libs required for compiling NPD test-infra#14375, ci-npd-e2e-test: set environment variables in the right way test-infra#14376, ci-npd-e2e-test: pick up test VM image using image family test-infra#14408
- Add presubmit jobs in Prow to run NPD e2e tests on cos-73-lts, cos-77-lts, ubuntu-1804-lts:
- Add soak test to ensure NPD's memory footprint stays stable after 1 day:
- Add a CI job in Prow to continuously run the soak test:
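The first action item (checking that NPD reports some expected Prometheus metrics) could be sketched roughly as follows: fetch the metrics endpoint from the test VM, then check that each expected metric name appears in the Prometheus text exposition output. This is not the test added in #323; the function names are hypothetical, and the metric names in the usage note are assumptions used only as sample input.

```go
package main

import (
	"bufio"
	"strings"
)

// metricNames (hypothetical helper) collects the metric names found in a
// Prometheus text exposition body, skipping comments and blank lines.
func metricNames(exposition string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the first '{' (labels) or whitespace (value).
		if i := strings.IndexAny(line, "{ \t"); i > 0 {
			names[line[:i]] = true
		}
	}
	return names
}

// missingMetrics returns the expected names that do not appear in the body.
func missingMetrics(exposition string, expected []string) []string {
	seen := metricNames(exposition)
	var missing []string
	for _, name := range expected {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}
```

For example, calling `missingMetrics(body, []string{"host_uptime", "problem_counter"})` (assumed metric names) on the fetched body returns an empty slice when both are present, and the test can fail with the list of missing names otherwise.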