
FrequentDockerRestart rule fine tuning #603

Closed
SergeyKanzhelev opened this issue Aug 4, 2021 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@SergeyKanzhelev
Member

Looking at the rule that detects frequent Docker restarts, it appears the detection parameters were picked when the rule was introduced and never changed since: #223

I observed an issue where docker was consistently restarting every ~5 minutes, and NPD did not flag this behavior as troublesome. So I wanted to discuss whether these parameters need to be tuned.

One suggestion would be to change the detection to count=3 in 20 minutes; I don't think 3 restarts in 20 minutes is ever expected. Another suggestion, which would affect performance, is to use a longer period, e.g. 40 minutes with no more than 5 restarts expected.

So I wonder whether there are valid scenarios where 3 restarts in 20 minutes is expected behavior.
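
For context, the rule being discussed is a custom-plugin-monitor rule that drives the log-counter plugin with a lookback window and a restart count. A rough sketch of the rule entry follows; the path, pattern, and surrounding fields are from memory and may differ from the shipped config, but lookback/count are the parameters in question:

```json
{
  "type": "permanent",
  "condition": "FrequentDockerRestart",
  "reason": "FrequentDockerRestart",
  "path": "/home/kubernetes/bin/log-counter",
  "args": [
    "--journald-source=dockerd",
    "--log-path=/var/log/journal",
    "--lookback=20m",
    "--count=5",
    "--pattern=Starting Docker Application Container Engine..."
  ],
  "timeout": "1m"
}
```

Tightening would mean either keeping --lookback=20m and dropping --count to 3, or stretching --lookback to 40m while keeping --count at 5.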

@mmiranda96
Contributor

I'm not sure if NPD handles this properly, but we could try creating two or more rules, each with a different time period and count. That way, we could catch problems that might not be detected with a single rule.

If the config does not support two or more rules for the same condition, I agree we should lower the count, maybe with an initial delay for booting nodes.
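
Illustratively, and assuming the custom-plugin-monitor config even accepts it (which is exactly the open question), two rules for the same condition would look something like the following; the log-path and pattern flags are omitted for brevity, and the values are just the ones floated above:

```json
{
  "rules": [
    {
      "type": "permanent",
      "condition": "FrequentDockerRestart",
      "reason": "FrequentDockerRestart",
      "path": "/home/kubernetes/bin/log-counter",
      "args": ["--journald-source=dockerd", "--lookback=20m", "--count=3"]
    },
    {
      "type": "permanent",
      "condition": "FrequentDockerRestart",
      "reason": "FrequentDockerRestart",
      "path": "/home/kubernetes/bin/log-counter",
      "args": ["--journald-source=dockerd", "--lookback=40m", "--count=5"]
    }
  ]
}
```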

@SergeyKanzhelev
Member Author

I think it would be best to have multiple rules.

If the config does not support two or more rules for the same condition

Can you check on this?

@mmiranda96
Contributor

Sure. I'll take a deep dive and update with my findings.

@elfinhe

elfinhe commented Aug 9, 2021

I'm not sure if NPD handles this properly, but we could try creating two or more rules, each with a different time period and count. That way, we could catch problems that might not be detected with a single rule.

If the config does not support two or more rules for the same condition, I agree we should lower the count, maybe with an initial delay for booting nodes.

Good idea! Thanks Mike.

@mmiranda96
Contributor

After some research, I've found that there is no restriction preventing a plugin from having more than one rule for a particular condition. However, only the first rule is executed; all other rules with the same condition are ignored.

Knowing this, these are the options we have:

  1. Add support in log-counter for handling more than one lookback/delay/count tuple. This could be done without altering the current interface, by allowing flags to be passed as comma-separated values, e.g. --lookback=20m --count=5 would become --lookback=20m,40m --count=5,8 (see the sketch after this list).
    • Pros:
      • Allows for more flexible rules.
      • Existing rules don't require any changes, since a single value would still be allowed.
    • Cons:
      • Configuration can become hard to maintain.
      • Would require a considerable refactor plus new tests.
      • Would require some non-trivial logic to match values across flags and handle empty cases.
  2. Create a different condition and rule.
    • Pros:
      • Trivial to set, requires no changes to code.
    • Cons:
      • Having different conditions for the same symptom might not be accurate.
      • If we need to add more cases, it becomes noisy.
  3. Adjust the existing rule.
    • Pros:
      • Cleanest option.
      • Trivial to set, requires no changes to code.
    • Cons:
      • Could potentially break existing use cases (reporting either false positives or false negatives).
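
To make option 1 concrete, a single rule would carry comma-separated tuples in its flags, along these lines (hypothetical syntax, not supported by log-counter today; other flags omitted):

```json
{
  "type": "permanent",
  "condition": "FrequentDockerRestart",
  "reason": "FrequentDockerRestart",
  "path": "/home/kubernetes/bin/log-counter",
  "args": [
    "--journald-source=dockerd",
    "--lookback=20m,40m",
    "--count=5,8"
  ]
}
```

log-counter would then presumably evaluate each (lookback, count) pair independently and report the condition as soon as any pair's threshold is exceeded.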

I would like to open the floor for discussion. What would be our best option here?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 15, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 15, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
