-
Notifications
You must be signed in to change notification settings - Fork 640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logwatchers: add new kmsg-based kernel log watcher #41
Conversation
Testing done: I manually built and published an image (available at I ran |
Agreed! Thanks a lot for helping on this! @euank
Yeah, this is the reason why I didn't use it before. We want to know whether the node panicked last time and the panic reason. For example, for kernel panic (assuming the node is configured as reboot on kernel panic), we have to look back to figure out what happened. I think this is a useful feature that we want to keep, given that we already encountered some kernel panics:
Since it supports more os distro, but has less look back support. What about we make @euank Are you ok with this? :) |
@Random-Liu That makes some sense, but an alternate solution would be checkpointing. In general, we don't care about lookbehind on the initial boot/setup, only on reboots where the daemonset starts up again. That might actually be simpler than integrating journald and syslog parsers. The only benefit those loggers have over us checkpointing is that journald at some point was gonna integrate with pstore (https://www.kernel.org/doc/Documentation/ABI/testing/pstore) and, because it plans to delete after consuming, it'll be the only source of truth for it..... but that's not being done yet so that's a future theoretical issue/feature, not a realistic current concern Really though, integrating with journald/syslog's kern.log is almost certainly the right way to go now that I think about the lookback bit more. Should I wait for your journald code to go in and rebase this on top of it as a fallback if that's the path we're taking? |
I recently found out that there is no easy way to log kernel panics into logs, because mostly for safety kernel won't write anything when it's already panicked. So, my argument is wrong. However, I still think look back is needed when NPD is firstly started or restarted. :) |
This adds a logwatcher which is able to parse kernel messages directly from the /dev/kmsg interface. This supports any modern linux distro, while also avoiding any dependency on libraries (e.g. as journald needs).
Rebased, manually tested that it worked as I expected. |
@euank Thanks a lot! Will try to review it soon. |
@euank Will review this PR this week. Sorry for the delay, and thanks a lot for helping! |
arbitrary file based log. | ||
* [journald](https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald): Log watcher for | ||
journald. | ||
* [journald](.//logwatchers/journald): Log watcher for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for journald
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -66,6 +67,7 @@ Log watcher specific configurations are configured in `pluginConfig`. | |||
* timestampFormat: The format of the timestamp. The format string is the time | |||
`2006-01-02T15:04:05Z07:00` in the expected format. (See | |||
[golang timestamp format](https://golang.org/pkg/time/#pkg-constants)) | |||
* **kmsg** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kmsg: no configurations for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cfg types.WatcherConfig | ||
logCh chan *logtypes.Log | ||
tomb *util.Tomb | ||
reader *bufio.Reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// NewKmsgWatcher creates a watcher which will read messages from /dev/kmsg | ||
func NewKmsgWatcher(cfg types.WatcherConfig) types.LogWatcher { | ||
kmsgparser.NewParser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused code, remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
// Discard too old messages | ||
if k.clock.Since(msg.Timestamp) > lookback { | ||
glog.V(5).Infof("throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/throwing/Throwing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
select { | ||
case <-k.tomb.Stopping(): | ||
glog.Infof("Stop watching kernel log") | ||
k.kmsgParser.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: log error of Close
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
|
||
type fakeKmsgReader struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@euank Do you have time to address the comments in this PR? |
The PR LGTM overall, only several small cleanups are needed. Offline discussed with @dchen1107, since we want this PR badly for 1.7, I'll merge this first, and send another PR to address the comments. :) @euank Thanks for working on this, this is very useful for us! |
All the other loggers (afaik), like rsyslog and journald and what have you, simply read
/dev/kmsg
.It's probably easier for everyone if we just read that directly too!
I put together a bit of code doing so over here, and this commit leverages that code for the node-problem-detector.
I plan to replace cadvisor's similar hacky log parsing with the same code as well.
So, what are the implications?
Well, one thing this can't do as well is read far into the past. This can't read further back than the ring buffer extends, so on startup it's quite possible it won't get as many old messages. That means really large lookback values will behave differently now because they'll read less.
That could have already happened depending on logrotation policy, but now it's a lot harder for the user to control (kernel dmesg size is much harder to change :).
However, I think the significant simplicity and cross-distro gains by doing this are worth it.
I also preemptively removed the old integrations as I think this makes them fully redundant, as well as the translators thingy since why would you want to translate when you're reading the one and only true source?Edit: I added this as an additional plugin, but didn't change any defaults or such.
Fixes #14 #39
cc @Random-Liu @adohe @derekwaynecarr
This change is