-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node-problem-detector: report disk queue length in Prometheus format #275
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Hi @xueweiz. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Just signed the CLA agreement. |
/assign Random-Liu@ |
Sorry, wrong assigning format. Hi Lantao, could you help review this PR? Thanks! |
The change can be tested via: I wanted to add some unit tests, but I don't think they are very useful for this PR (sure I could mock the lsblk / OpenCensus / gopsutil dependencies, but those are ~100% interaction testing and ~0% behavior testing). I think integration tests will be far more useful. Please let me know if you feel some part should be unit-tested. I could try to re-organize the code a bit to test them in mocking. Thanks! 😃 |
/ok-to-test |
@xueweiz Could you please give us(maybe @Random-Liu has known about this) some background about adding this feature? Maybe a tiny doc is ok. This makes features trackable just like when we add custom plugin support. |
Seems flakes. /test pull-npd-test |
@andyxning Hi Ning, thanks for taking a look of the PR! Do you mind if I first send out a short version of the design doc, and populate it with more details later on? Thanks! |
d5a7c66
to
023a0a2
Compare
Hum interesting. In both test runs, the failure is at TestGoroutineLeak at log_monitor_test.go. The PR did not touch that code path (maybe the vendor/ changes caused that?) Either way, I'll add some stacktrace printing on this testcase, to help current/future debugging. I just force pushed the PR with changes to print of stacktraces in this testcase. /test pull-npd-test |
Interesting. From the last test result Seems one goroutine finished right in the gap (after Let's see how it goes :) /test pull-npd-test |
Yes. Short version is helpful for us to review the motivation. |
@andyxning Sorry for the delay. It'd be great if you can take a look and provide some feedback :) Thanks a lot! |
Good. Will take a look at it asap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@wangzhen127 Thanks for the review Zhen! Just address them. Could you help take a look again? Thanks :) |
/test pull-npd-test |
// application options | ||
|
||
// NodeName is the node name used to communicate with Kubernetes ApiServer. | ||
NodeName string | ||
} | ||
|
||
func NewNodeProblemDetectorOptions() *NodeProblemDetectorOptions { | ||
return &NodeProblemDetectorOptions{} | ||
return &NodeProblemDetectorOptions{MonitorConfigPaths: types.ProblemDaemonConfigPathMap{}} | ||
} | ||
|
||
// AddFlags adds node problem detector command line options to pflag. | ||
func (npdo *NodeProblemDetectorOptions) AddFlags(fs *pflag.FlagSet) { | ||
fs.StringSliceVar(&npdo.SystemLogMonitorConfigPaths, "system-log-monitors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MarkDeprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, what do you mean by "MarkDeprecated?" i.e. add a comment? Or is there some good procedure we could follow?
Currently I'm documenting that it's deprecated in the help text: This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to use MarkDeprecated to handle this logic.
Now, when user use the deprecated option, a warning will be printed as this:
Flag --system-log-monitors has been deprecated, replaced by --config.system-log-monitor. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.
} | ||
|
||
// AddFlags adds node problem detector command line options to pflag. | ||
func (npdo *NodeProblemDetectorOptions) AddFlags(fs *pflag.FlagSet) { | ||
fs.StringSliceVar(&npdo.SystemLogMonitorConfigPaths, "system-log-monitors", | ||
[]string{}, "List of paths to system log monitor config files, comma separated.") | ||
[]string{}, "List of paths to system log monitor config files, comma separated. This option is deprecated, replaced by --config.system-log-monitor, and will be removed. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.") | ||
fs.StringSliceVar(&npdo.CustomPluginMonitorConfigPaths, "custom-plugin-monitors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MarkDeprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cmd/node_problem_detector.go
Outdated
} | ||
}() | ||
} | ||
_ "k8s.io/node-problem-detector/pkg/systemstatsmonitor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create a plugins.go in the main package, which just import plugins?
Like this https://github.com/containerd/containerd/blob/master/cmd/containerd/builtins_linux.go
It makes it much easier to track what plugins are imported, and add/remove plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good :) Done.
@@ -24,20 +24,17 @@ import ( | |||
"net/url" | |||
|
|||
"github.com/spf13/pflag" | |||
|
|||
"k8s.io/node-problem-detector/pkg/custompluginmonitor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it impossible to compile out custompluginmonitor and systemlogmonitor.
I think to handle the deprecated flag, it is fine to hard code the plugin name in this file to get rid of this unnecessary dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this PR is merged, I plan to change the Prow tests that's still using old NPD flags.
Once the testing changes are done, I plan to remove the deprecated flags and all the logic around it (including here). And then we will have the ability to compile out custompluginmonitor and systemlogmonitor.
By "remove", I mean that when you do ./node_problem_detector --help
, those flags won't show up at all. It's like they never existed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use MarkDeprecated
to achieve that, and we should keep those flags for backward compatibility. See https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really remove this. :p
It is unnecessary to break the compile-in plugin system just because of a constant.
I'll send a PR to address this. :)
Added CLI option "enable-k8s-exporter" (default to true). Users can use this option to enable/disable exporting to Kubernetes control plane. This commit also removes all the apiserver-specific logic from package problemdetector. Future exporters (e.g. to local journald, Prometheus, other control planes) should implement types.Exporter interface.
Added package problemdaemon. All future problem daemons should be registered by calling problemdaemon.register(). CLI interfaces will be automatically generated for all registered problem daemons in the form of "--config.DAEMON_NAME"
Hi Lantao, thanks for the review! Just addressed the comments and rebased on top of #290 . However I didn't quite understand the "MarkDeprecated" part. Could you help elaborate a little bit? Thanks! |
Thanks for the tips on the MarkDeprecated function! Just made the adjustments. |
/lgtm |
/hold cancel |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Random-Liu, wangzhen127, xueweiz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
We are seeing some flakes on these tests because some goroutine fluctuation: kubernetes#275 (comment) Removing the tests, as it's robust to test leakage in a soak/stress test, rather than unit test.
This change is to address below items from #284 :
Since this PR will need OpenCensus and gopsutil library, I submitted PR #289 for it.
Deprecated (but not removed yet) flags in this PR:
New flags introduced in this PR:
Test steps:
Follow up:
I will later update test infra to use these new flags.