Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logwatchers: add new kmsg-based kernel log watcher #41

Merged
merged 3 commits into from
May 30, 2017

Conversation

euank
Copy link
Contributor

@euank euank commented Nov 14, 2016

All the other loggers (afaik), like rsyslog and journald and what have you, simply read /dev/kmsg.

It's probably easier for everyone if we just read that directly too!

I put together a bit of code doing so over here, and this commit leverages that code for the node-problem-detector.

I plan to replace cadvisor's similar hacky log parsing with the same code as well.

So, what are the implications?
Well, one thing this can't do as well is read far into the past. This can't read further back than the ring buffer extends, so on startup it's quite possible it won't get as many old messages. That means really large lookback values will behave differently now because they'll read less.

That could have already happened depending on logrotation policy, but now it's a lot harder for the user to control (kernel dmesg size is much harder to change :).

However, I think the significant simplicity and cross-distro gains by doing this are worth it.

I also preemptively removed the old integrations as I think this makes them fully redundant, as well as the translators thingy since why would you want to translate when you're reading the one and only true source?

Edit: I added this as an additional plugin, but didn't change any defaults or such.

Fixes #14 #39

cc @Random-Liu @adohe @derekwaynecarr


This change is Reviewable

@euank euank mentioned this pull request Nov 14, 2016
@euank
Copy link
Contributor Author

euank commented Nov 14, 2016

Testing done:

I manually built and published an image (available at euank/node-problem-detector:v0.6). I then edited the daemonset to reference that image, deployed it, and ssh'd into a node.

I ran echo "BUG: unable to handle kernel NULL pointer dereference at some-location" > /dev/kmsg as root, verified it showed up in dmesg, and then verified kubectl describe no had an event of 1m 1m 1 {kernel-monitor k8s1} Warning KernelPanic BUG: unable to handle kernel NULL pointer dereference at some-location.

@Random-Liu
Copy link
Member

Random-Liu commented Nov 14, 2016

It's probably easier for everyone if we just read that directly too!

Agreed! Thanks a lot for helping on this! @euank

Well, one thing this can't do as well is read far into the past.

Yeah, this is the reason why I didn't use it before. We want to know whether the node panicked last time and the panic reason.

For example, for kernel panic (assuming the node is configured as reboot on kernel panic), we have to look back to figure out what happened. I think this is a useful feature that we want to keep, given that we already encountered some kernel panics:

Since it supports more os distro, but has less look back support. What about we make /dev/kmsg reader as the fall back log readers in #39, kernel monitor will use it once the other ones not work.

@euank Are you ok with this? :)

@euank
Copy link
Contributor Author

euank commented Nov 14, 2016

@Random-Liu That makes some sense, but an alternate solution would be checkpointing. In general, we don't care about lookbehind on the initial boot/setup, only on reboots where the daemonset starts up again.
In those cases, if we checkpoint then we can be our own super limited logdaemon for the last $lookBehind duration of logs.

That might actually be simpler than integrating journald and syslog parsers.

The only benefit those loggers have over us checkpointing is that journald at some point was gonna integrate with pstore (https://www.kernel.org/doc/Documentation/ABI/testing/pstore) and, because it plans to delete after consuming, it'll be the only source of truth for it..... but that's not being done yet so that's a future theoretical issue/feature, not a realistic current concern


Really though, integrating with journald/syslog's kern.log is almost certainly the right way to go now that I think about the lookback bit more. Should I wait for your journald code to go in and rebase this on top of it as a fallback if that's the path we're taking?

@Random-Liu
Copy link
Member

Random-Liu commented Jan 6, 2017

For example, for kernel panic (assuming the node is configured as reboot on kernel panic), we have to look back to figure out what happened. I think this is a useful feature that we want to keep, given that we already encountered some kernel panics:

I recently found out that there is no easy way to log kernel panics into logs, because mostly for safety kernel won't write anything when it's already panicked. So, my argument is wrong.

However, I still think look back is needed when NPD is firstly started or restarted. :)

@dchen1107
Copy link
Member

@euank, #39 was merged. Can you please rebase this one, so that we can continue reviewing? Thanks!

euank added 2 commits March 9, 2017 20:40
This adds a logwatcher which is able to parse kernel messages directly
from the /dev/kmsg interface. This supports any modern linux distro,
while also avoiding any dependency on libraries (e.g. as journald
needs).
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2017
@euank
Copy link
Contributor Author

euank commented Mar 10, 2017

Rebased, manually tested that it worked as I expected.

@Random-Liu
Copy link
Member

@euank Thanks a lot! Will try to review it soon.

@euank euank changed the title update kernel log parser to a kmsg based parser logwatchers: add new kmsg-based kernel log watcher Mar 10, 2017
@Random-Liu
Copy link
Member

Random-Liu commented May 22, 2017

@euank Will review this PR this week. Sorry for the delay, and thanks a lot for helping!

arbitrary file based log.
* [journald](https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald): Log watcher for
journald.
* [journald](.//logwatchers/journald): Log watcher for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for journald

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -66,6 +67,7 @@ Log watcher specific configurations are configured in `pluginConfig`.
* timestampFormat: The format of the timestamp. The format string is the time
`2006-01-02T15:04:05Z07:00` in the expected format. (See
[golang timestamp format](https://golang.org/pkg/time/#pkg-constants))
* **kmsg**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kmsg: no configurations for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cfg types.WatcherConfig
logCh chan *logtypes.Log
tomb *util.Tomb
reader *bufio.Reader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// NewKmsgWatcher creates a watcher which will read messages from /dev/kmsg
func NewKmsgWatcher(cfg types.WatcherConfig) types.LogWatcher {
kmsgparser.NewParser()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused code, remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// Discard too old messages
if k.clock.Since(msg.Timestamp) > lookback {
glog.V(5).Infof("throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/throwing/Throwing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

select {
case <-k.tomb.Stopping():
glog.Infof("Stop watching kernel log")
k.kmsgParser.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: log error of Close.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

type fakeKmsgReader struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Random-Liu
Copy link
Member

@euank Do you have time to address the comments in this PR?

@Random-Liu
Copy link
Member

Random-Liu commented May 30, 2017

The PR LGTM overall, only several small cleanups are needed.

Offline discussed with @dchen1107, since we want this PR badly for 1.7, I'll merge this first, and send another PR to address the comments. :)

@euank Thanks for working on this, this is very useful for us!

@Random-Liu Random-Liu closed this May 30, 2017
@Random-Liu Random-Liu reopened this May 30, 2017
@Random-Liu Random-Liu merged commit be6c516 into kubernetes:master May 30, 2017
@euank euank deleted the kmsg-parser branch May 30, 2017 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants