-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect kubelet and container runtime frequent restarts #223
Conversation
/assign @Random-Liu |
b1cd976
to
b398d5f
Compare
ea26527
to
aa53d4c
Compare
@Random-Liu, I have updated this PR with delay option. PTAL |
cmd/logcounter/options/options.go
Outdated
} | ||
|
||
// AddFlags adds log counter command line options to pflag. | ||
func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) { | ||
fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/docker/dockerd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cmd/logcounter/options/options.go
Outdated
fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up") | ||
fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a bit more explanation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
glog.Fatalf("Failed to get system info: %v", err) | ||
startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
if err != nil { | ||
glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the error log more informative: "failed to get xxx"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
// NewJournaldWatcher is the create function of journald watcher. | ||
func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher { | ||
startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
if err != nil { | ||
glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if lookback > uptime { | ||
lookback = uptime | ||
glog.Infof("Lookback changed to system uptime: %v", lookback) | ||
} | ||
// Seek journal client based on the lookback duration. | ||
start := time.Now().Add(-lookback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now considering lookback when getting start time.
pkg/util/helpers.go
Outdated
@@ -31,3 +32,23 @@ func GenerateConditionChangeEvent(t string, status types.ConditionStatus, reason | |||
Message: fmt.Sprintf("Node condition %s is now: %s, reason: %s", t, status, reason), | |||
} | |||
} | |||
|
|||
func GetDelayedStartTime(delay string) (time.Time, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this can be combined with lookback, just pass in both delay and lookback, and return starttime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"--log-path=/var/log/journal", | ||
"--lookback=20m", | ||
"--count=5", | ||
"--pattern=Starting Docker Application Container Engine..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use Starting
for docker and containerd, but Started
for kubelet? Any particular reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is no Starting
message for kubelet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting.
aa53d4c
to
a7b6902
Compare
a7b6902
to
48cfb6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Random-Liu, I have updated this PR. PTAL.
cmd/logcounter/options/options.go
Outdated
} | ||
|
||
// AddFlags adds log counter command line options to pflag. | ||
func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) { | ||
fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cmd/logcounter/options/options.go
Outdated
fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up") | ||
fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"--log-path=/var/log/journal", | ||
"--lookback=20m", | ||
"--count=5", | ||
"--pattern=Starting Docker Application Container Engine..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is no Starting
message for kubelet.
glog.Fatalf("Failed to get system info: %v", err) | ||
startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
if err != nil { | ||
glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
// NewJournaldWatcher is the create function of journald watcher. | ||
func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher { | ||
startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
if err != nil { | ||
glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if lookback > uptime { | ||
lookback = uptime | ||
glog.Infof("Lookback changed to system uptime: %v", lookback) | ||
} | ||
// Seek journal client based on the lookback duration. | ||
start := time.Now().Add(-lookback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now considering lookback when getting start time.
pkg/util/helpers.go
Outdated
@@ -31,3 +32,23 @@ func GenerateConditionChangeEvent(t string, status types.ConditionStatus, reason | |||
Message: fmt.Sprintf("Node condition %s is now: %s, reason: %s", t, status, reason), | |||
} | |||
} | |||
|
|||
func GetDelayedStartTime(delay string) (time.Time, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/retest |
48cfb6d
to
e5ccf98
Compare
Rebased |
Just tested on a GKE node and it works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nits
if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) { | ||
// Discard messages before start time. | ||
if log.Timestamp.Before(s.startTime) { | ||
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/%v/%q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -112,6 +124,11 @@ func (j *journaldWatcher) watchLoop() { | |||
continue | |||
} | |||
|
|||
if entry.RealtimeTimestamp < startTimestamp { | |||
glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include message content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String()) | ||
// Discard messages before start time. | ||
if msg.Timestamp.Before(k.startTime) { | ||
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/%v/%q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
e5ccf98
to
1f63638
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. PTAL
if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) { | ||
// Discard messages before start time. | ||
if log.Timestamp.Before(s.startTime) { | ||
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -112,6 +124,11 @@ func (j *journaldWatcher) watchLoop() { | |||
continue | |||
} | |||
|
|||
if entry.RealtimeTimestamp < startTimestamp { | |||
glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String()) | ||
// Discard messages before start time. | ||
if msg.Timestamp.Before(k.startTime) { | ||
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Random-Liu, wangzhen127 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR adds customer plugin that watch systemd logs for frequent kubelet and container runtime restarts. It reuses log counter for frequent restart detection. Notice that kubelet and container runtime restart logs are actually in systemd logs, not in kubelet or container runtime logs.
Noticeable changes:
dockerd
as the source name.