Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect kubelet and container runtime frequent restarts #223

Merged
merged 1 commit into from
Nov 27, 2018

Conversation

wangzhen127
Copy link
Member

@wangzhen127 wangzhen127 commented Nov 20, 2018

This PR adds customer plugin that watch systemd logs for frequent kubelet and container runtime restarts. It reuses log counter for frequent restart detection. Notice that kubelet and container runtime restart logs are actually in systemd logs, not in kubelet or container runtime logs.

Noticeable changes:

  • Add delay option to log watchers, so that they can delay watching until the node is stable.
  • Change log counter to watch journald instead of kmsg.
  • Update docker monitor to use dockerd as the source name.
  • Fix vet errors
  • Update README on injecting systemd logs

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 20, 2018
@wangzhen127
Copy link
Member Author

/assign @Random-Liu
/cc @dashpole
/cc @jiayingz

@wangzhen127 wangzhen127 changed the title Detect kubelet and container runtime frequent crashes Detect kubelet and container runtime frequent restarts Nov 20, 2018
@wangzhen127 wangzhen127 force-pushed the kubelet-crash branch 2 times, most recently from ea26527 to aa53d4c Compare November 21, 2018 21:15
@wangzhen127
Copy link
Member Author

@Random-Liu, I have updated this PR with delay option. PTAL

}

// AddFlags adds log counter command line options to pflag.
func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/docker/dockerd?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up")
fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a bit more explanation here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

glog.Fatalf("Failed to get system info: %v", err)
startTime, err := util.GetDelayedStartTime(cfg.Delay)
if err != nil {
glog.Fatalf("%v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the error log more informative: "failed to get xxx"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// NewJournaldWatcher is the create function of journald watcher.
func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher {
startTime, err := util.GetDelayedStartTime(cfg.Delay)
if err != nil {
glog.Fatalf("%v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if lookback > uptime {
lookback = uptime
glog.Infof("Lookback changed to system uptime: %v", lookback)
}
// Seek journal client based on the lookback duration.
start := time.Now().Add(-lookback)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now considering lookback when getting start time.

@@ -31,3 +32,23 @@ func GenerateConditionChangeEvent(t string, status types.ConditionStatus, reason
Message: fmt.Sprintf("Node condition %s is now: %s, reason: %s", t, status, reason),
}
}

func GetDelayedStartTime(delay string) (time.Time, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this can be combined with lookback, just pass in both delay and lookback, and return starttime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"--log-path=/var/log/journal",
"--lookback=20m",
"--count=5",
"--pattern=Starting Docker Application Container Engine..."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use Starting for docker and containerd, but Started for kubelet? Any particular reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is no Starting message for kubelet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 22, 2018
Copy link
Member Author

@wangzhen127 wangzhen127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Random-Liu, I have updated this PR. PTAL.

}

// AddFlags adds log counter command line options to pflag.
func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up")
fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"--log-path=/var/log/journal",
"--lookback=20m",
"--count=5",
"--pattern=Starting Docker Application Container Engine..."
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is no Starting message for kubelet.

glog.Fatalf("Failed to get system info: %v", err)
startTime, err := util.GetDelayedStartTime(cfg.Delay)
if err != nil {
glog.Fatalf("%v", err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// NewJournaldWatcher is the create function of journald watcher.
func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher {
startTime, err := util.GetDelayedStartTime(cfg.Delay)
if err != nil {
glog.Fatalf("%v", err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if lookback > uptime {
lookback = uptime
glog.Infof("Lookback changed to system uptime: %v", lookback)
}
// Seek journal client based on the lookback duration.
start := time.Now().Add(-lookback)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now considering lookback when getting start time.

@@ -31,3 +32,23 @@ func GenerateConditionChangeEvent(t string, status types.ConditionStatus, reason
Message: fmt.Sprintf("Node condition %s is now: %s, reason: %s", t, status, reason),
}
}

func GetDelayedStartTime(delay string) (time.Time, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wangzhen127
Copy link
Member Author

/retest

@wangzhen127
Copy link
Member Author

Rebased

@wangzhen127
Copy link
Member Author

Just tested on a GKE node and it works.

Copy link
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits

if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) {
// Discard messages before start time.
if log.Timestamp.Before(s.startTime) {
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/%v/%q

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -112,6 +124,11 @@ func (j *journaldWatcher) watchLoop() {
continue
}

if entry.RealtimeTimestamp < startTimestamp {
glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include message content?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String())
// Discard messages before start time.
if msg.Timestamp.Before(k.startTime) {
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/%v/%q

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

@wangzhen127 wangzhen127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. PTAL

if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) {
// Discard messages before start time.
if log.Timestamp.Before(s.startTime) {
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -112,6 +124,11 @@ func (j *journaldWatcher) watchLoop() {
continue
}

if entry.RealtimeTimestamp < startTimestamp {
glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String())
// Discard messages before start time.
if msg.Timestamp.Before(k.startTime) {
glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Random-Liu
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 27, 2018
@Random-Liu
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu, wangzhen127

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit e807235 into kubernetes:master Nov 27, 2018
@wangzhen127 wangzhen127 deleted the kubelet-crash branch November 27, 2018 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants