node-problem-detector: report disk queue length in Prometheus format #275

Merged
merged 6 commits into from
Jun 13, 2019

Conversation

xueweiz
Contributor

@xueweiz xueweiz commented May 14, 2019

This change addresses the following items from #284:

  • Refactor problem daemon registration and initialization, so that they are more modular, and rely on less per-daemon initialization code duplication.
  • Refactor events/condition reporting into k8s_exporter.
  • Add a SystemStatsMonitor problem daemon, to monitor some system metrics; and add an optional Prometheus metrics exporter.
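As a rough illustration of the exporter refactoring listed above, here is a minimal sketch of problem reports being fanned out to pluggable exporters. The `Status` fields, the `logExporter` type, and the helpers `buildExporters` and `exportAll` are hypothetical names made up for this example, and the interface shape is an assumption; none of this is taken from the PR's actual code.

```go
package main

import "fmt"

// Status is a hypothetical, trimmed-down problem report.
// The real NPD type carries more fields.
type Status struct {
	Source  string
	Message string
}

// Exporter is the assumed shape of an exporter: anything that can
// consume a Status.
type Exporter interface {
	ExportProblems(s *Status)
}

// logExporter is an illustrative exporter that records formatted
// reports in memory instead of talking to a real backend.
type logExporter struct {
	name  string
	lines []string
}

func (e *logExporter) ExportProblems(s *Status) {
	e.lines = append(e.lines, fmt.Sprintf("[%s] %s: %s", e.name, s.Source, s.Message))
}

// buildExporters mirrors the idea behind --enable-k8s-exporter: the
// Kubernetes-facing exporter is only added when it is enabled.
func buildExporters(enableK8s bool) []Exporter {
	exporters := []Exporter{&logExporter{name: "local"}}
	if enableK8s {
		exporters = append(exporters, &logExporter{name: "k8s"})
	}
	return exporters
}

// exportAll fans one status out to every configured exporter.
func exportAll(exporters []Exporter, s *Status) {
	for _, e := range exporters {
		e.ExportProblems(s)
	}
}

func main() {
	exporters := buildExporters(false)
	exportAll(exporters, &Status{Source: "kernel-monitor", Message: "task hung"})
	for _, e := range exporters {
		fmt.Println(e.(*logExporter).lines)
	}
}
```

With this split, the problem detector core only needs to know about the interface, so disabling one exporter does not affect the others.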

Since this PR needs the OpenCensus and gopsutil libraries, I submitted PR #289 for them.

Deprecated (but not removed yet) flags in this PR:

  • --custom-plugin-monitors
  • --system-log-monitors

New flags introduced in this PR:

  • --config.custom-plugin-monitor
  • --config.system-log-monitor
  • --config.system-stats-monitor
  • --enable-k8s-exporter
  • --prometheus-address
  • --prometheus-port

Test steps:

  1. Verifying existing command works:

make ./bin/node-problem-detector && ./bin/node-problem-detector --system-log-monitors=config/kernel-monitor.json --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --logtostderr

  2. Testing new flags for existing behavior:

make ./bin/node-problem-detector && ./bin/node-problem-detector --config.system-log-monitor=config/kernel-monitor.json --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --logtostderr

  3. Test NPD for disk monitor with no k8s_exporter:

make ./bin/node-problem-detector && ./bin/node-problem-detector --config.system-log-monitor=config/kernel-monitor.json --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --logtostderr --config.system-stats-monitor=config/system-stats-monitor.json --enable-k8s-exporter=false

curl http://localhost:20257/metrics

Follow up:
I will later update test infra to use these new flags.

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2019
@k8s-ci-robot
Contributor

Hi @xueweiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 14, 2019
@xueweiz
Contributor Author

xueweiz commented May 14, 2019

Just signed the CLA agreement.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 14, 2019
@xueweiz
Contributor Author

xueweiz commented May 14, 2019

/assign Random-Liu@

@xueweiz
Contributor Author

xueweiz commented May 14, 2019

Sorry, wrong assignment format.
/assign @Random-Liu

Hi Lantao, could you help review this PR? Thanks!
The first 5 changes are all Godep management; only the 6th one is a code change.

@xueweiz
Contributor Author

xueweiz commented May 14, 2019

The change can be tested via:
kubectl proxy --port=8080 > /dev/null &
make ./bin/node-problem-detector && ./bin/node-problem-detector --system-log-monitors=config/kernel-monitor.json --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --logtostderr
curl http://localhost:20257/metrics

I wanted to add some unit tests, but I don't think they are very useful for this PR (sure I could mock the lsblk / OpenCensus / gopsutil dependencies, but those are ~100% interaction testing and ~0% behavior testing). I think integration tests will be far more useful.

Please let me know if you feel some part should be unit-tested. I could try to reorganize the code a bit to test it with mocks. Thanks! 😃

@Random-Liu
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2019
@andyxning
Member

@xueweiz Could you please give us (maybe @Random-Liu already knows about this) some background on adding this feature? Maybe a tiny doc is OK. This makes features trackable, just like when we added custom plugin support.

@andyxning
Member

Seems flaky.

/test pull-npd-test

@xueweiz
Contributor Author

xueweiz commented May 14, 2019

@andyxning Hi Ning, thanks for taking a look at the PR!
Yes, I plan to send out a formal design doc similar to the custom plugin one. I still need a bit of time to polish the doc, sorry for the delay.

Do you mind if I first send out a short version of the design doc, and populate it with more details later on? Thanks!

@xueweiz xueweiz force-pushed the exp branch 2 times, most recently from d5a7c66 to 023a0a2 Compare May 14, 2019 19:15
@xueweiz
Contributor Author

xueweiz commented May 14, 2019

Seems flaky.

/test pull-npd-test

Hmm, interesting. In both test runs, the failure is in TestGoroutineLeak in log_monitor_test.go. The PR did not touch that code path (maybe the vendor/ changes caused that?).
What's even weirder is that the goroutine is not leaking, it is disappearing... At the beginning of the test we have 5 goroutines; at the end of the test we only have 4.

Either way, I'll add some stacktrace printing on this testcase, to help current/future debugging.

I just force-pushed the PR with changes to print stack traces in this test case.

/test pull-npd-test

@xueweiz
Contributor Author

xueweiz commented May 14, 2019

Interesting. The last test result claims that originally there were 5 goroutines, although pprof.Lookup("goroutine") only showed 4 goroutines.

It seems one goroutine finished right in the gap (after runtime.NumGoroutine() and before pprof.Lookup("goroutine")). So let's stop using two calls to retrieve goroutines and use only one.
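One way to get the count and the stack traces from a single call is sketched below: dump the goroutine profile once and derive the count from that same dump, so the two can never disagree. The helper names `goroutineSnapshot` and `countGoroutines` are made up for this illustration, and this is not necessarily how the test in this PR was changed.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineSnapshot captures all goroutine stacks with a single
// profile dump and derives the count from that same dump, so the
// number and the stack traces always agree.
func goroutineSnapshot() (int, string, error) {
	var buf bytes.Buffer
	// debug=2 prints one "goroutine N [state]:" header per goroutine.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return 0, "", err
	}
	stacks := buf.String()
	return countGoroutines(stacks), stacks, nil
}

// countGoroutines counts the per-goroutine header lines in a debug=2 dump.
func countGoroutines(dump string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, ":") {
			n++
		}
	}
	return n
}

func main() {
	before, _, err := goroutineSnapshot()
	if err != nil {
		panic(err)
	}

	// ... the code under test would run here ...

	after, stacks, err := goroutineSnapshot()
	if err != nil {
		panic(err)
	}
	if after > before {
		fmt.Printf("possible leak: %d -> %d goroutines\n%s", before, after, stacks)
		return
	}
	fmt.Printf("no leak detected: %d -> %d goroutines\n", before, after)
}
```

This removes the gap inside one measurement, but background goroutines can still start or finish between the "before" and "after" snapshots, so a count-based check like this can remain flaky.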

Let's see how it goes :)

/test pull-npd-test

@andyxning
Member

@xueweiz

Do you mind if I first send out a short version of the design doc, and populate it with more details later on? Thanks!

Yes. A short version is helpful for us to review the motivation.

@xueweiz
Contributor Author

xueweiz commented May 17, 2019

@andyxning
Hi Ning, I just finished writing up the design doc: https://docs.google.com/document/d/1SeaUz6kBavI283Dq8GBpoEUDrHA2a795xtw0OvjM568/edit?usp=sharing

Sorry for the delay. It'd be great if you can take a look and provide some feedback :) Thanks a lot!

@andyxning
Member

Good. Will take a look at it asap.

Contributor Author

@xueweiz xueweiz left a comment

Done.

@xueweiz
Contributor Author

xueweiz commented Jun 12, 2019

@wangzhen127 Thanks for the review, Zhen! I just addressed the comments. Could you take another look? Thanks :)

@Random-Liu
Member

/test pull-npd-test

// application options

// NodeName is the node name used to communicate with Kubernetes ApiServer.
NodeName string
}

func NewNodeProblemDetectorOptions() *NodeProblemDetectorOptions {
-return &NodeProblemDetectorOptions{}
+return &NodeProblemDetectorOptions{MonitorConfigPaths: types.ProblemDaemonConfigPathMap{}}
}

// AddFlags adds node problem detector command line options to pflag.
func (npdo *NodeProblemDetectorOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringSliceVar(&npdo.SystemLogMonitorConfigPaths, "system-log-monitors",
Member

MarkDeprecated?

Contributor Author

Sorry, what do you mean by "MarkDeprecated"? Do you mean adding a comment, or is there an established procedure we could follow?
Currently I'm documenting that it's deprecated in the help text: This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.

Contributor Author

Changed to use MarkDeprecated to handle this logic.
Now, when a user uses the deprecated option, a warning is printed like this:

Flag --system-log-monitors has been deprecated, replaced by --config.system-log-monitor. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.

}

// AddFlags adds node problem detector command line options to pflag.
func (npdo *NodeProblemDetectorOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringSliceVar(&npdo.SystemLogMonitorConfigPaths, "system-log-monitors",
-[]string{}, "List of paths to system log monitor config files, comma separated.")
+[]string{}, "List of paths to system log monitor config files, comma separated. This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.")
fs.StringSliceVar(&npdo.CustomPluginMonitorConfigPaths, "custom-plugin-monitors",
Member

MarkDeprecated?

Contributor Author

Done.

}
}()
}
_ "k8s.io/node-problem-detector/pkg/systemstatsmonitor"
Member

@Random-Liu Random-Liu Jun 13, 2019

Can we create a plugins.go in the main package that just imports plugins?

Like this https://github.com/containerd/containerd/blob/master/cmd/containerd/builtins_linux.go

It makes it much easier to track what plugins are imported, and add/remove plugins.

Contributor Author

Sounds good :) Done.

@@ -24,20 +24,17 @@ import (
"net/url"

"github.com/spf13/pflag"

"k8s.io/node-problem-detector/pkg/custompluginmonitor"
Member

This makes it impossible to compile out custompluginmonitor and systemlogmonitor.

I think to handle the deprecated flag, it is fine to hard code the plugin name in this file to get rid of this unnecessary dependency.

Contributor Author

After this PR is merged, I plan to change the Prow tests that are still using the old NPD flags.

Once the testing changes are done, I plan to remove the deprecated flags and all the logic around them (including here). Then we will be able to compile out custompluginmonitor and systemlogmonitor.

By "remove", I mean that when you do ./node_problem_detector --help, those flags won't show up at all. It's like they never existed.

Member

You can use MarkDeprecated to achieve that, and we should keep those flags for backward compatibility. See https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli

Contributor Author

Sounds good. Thanks!

Member

We should really remove this. :p

It is unnecessary to break the compile-in plugin system just because of a constant.

I'll send a PR to address this. :)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2019
Xuewei Zhang added 2 commits June 12, 2019 18:29
Added CLI option "enable-k8s-exporter" (default to true). Users can use
this option to enable/disable exporting to Kubernetes control plane.

This commit also removes all the apiserver-specific logic from package
problemdetector.

Future exporters (e.g. to local journald, Prometheus, other control
planes) should implement types.Exporter interface.

Added package problemdaemon. All future problem daemons should be
registered by calling problemdaemon.register().

CLI interfaces will be automatically generated for all registered
problem daemons in the form of "--config.DAEMON_NAME"
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2019
@xueweiz
Contributor Author

xueweiz commented Jun 13, 2019

Hi Lantao, thanks for the review! Just addressed the comments and rebased on top of #290 . However I didn't quite understand the "MarkDeprecated" part. Could you help elaborate a little bit? Thanks!

@wangzhen127
Member

@xueweiz
Contributor Author

xueweiz commented Jun 13, 2019

Thanks for the tips on the MarkDeprecated function! Just made the adjustments.
@wangzhen127 @Random-Liu , could you help take a look again? Thanks!

@wangzhen127
Member

/lgtm
/hold
Wait for @Random-Liu's approval.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 13, 2019
@Random-Liu
Member

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu, wangzhen127, xueweiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Random-Liu,wangzhen127]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 975dc71 into kubernetes:master Jun 13, 2019
@xueweiz xueweiz mentioned this pull request Jun 27, 2019
rphillips pushed a commit to rphillips/node-problem-detector that referenced this pull request Mar 11, 2021
We are seeing some flakes on these tests because of some goroutine
fluctuation:
kubernetes#275 (comment)

Removing the tests, as it's more robust to test leakage in a soak/stress
test, rather than in a unit test.