
FrequentDockerRestart rule fine tuning #603

Closed
SergeyKanzhelev opened this issue Aug 4, 2021 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@SergeyKanzhelev
Member

Looking at the rule that detects frequent Docker restarts, it appears the detection parameters were picked when the rule was introduced and never changed since: #223

I observed an issue where docker was consistently restarting every ~5 minutes, and NPD did not flag this behavior as troublesome. So I wanted to discuss whether these parameters need to be tuned.

One suggestion would be to change the detection to count=3 in 20 minutes; I don't think 3 restarts in 20 minutes is ever expected. Another suggestion, which would affect performance, is to use a longer period, e.g. 40 minutes with no more than 5 restarts expected.

So I wonder whether there are valid scenarios where 3 restarts in 20 minutes is expected behavior.
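
For context, the rule being discussed is a custom-plugin-monitor rule that drives the log-counter plugin with a lookback window and a restart count. A rough sketch of the rule entry follows; the path, pattern, and surrounding fields are from memory and may differ from the shipped config, but lookback/count are the parameters in question:

```json
{
  "type": "permanent",
  "condition": "FrequentDockerRestart",
  "reason": "FrequentDockerRestart",
  "path": "/home/kubernetes/bin/log-counter",
  "args": [
    "--journald-source=dockerd",
    "--log-path=/var/log/journal",
    "--lookback=20m",
    "--count=5",
    "--pattern=Starting Docker Application Container Engine..."
  ],
  "timeout": "1m"
}
```

Tightening would mean either keeping --lookback=20m and dropping --count to 3, or stretching --lookback to 40m while keeping --count at 5.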

@mmiranda96
Contributor

I'm not sure if NPD handles this properly, but we could try creating two or more rules, each with a different time period and count. That way, we could catch problems that might not be detected with a single rule.

If the config does not support two or more rules for the same condition, I agree we should lower the count, maybe with an initial delay for booting nodes.
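
Illustratively, and assuming the custom-plugin-monitor config even accepts it (which is exactly the open question), two rules for the same condition would look something like the following; the log-path and pattern flags are omitted for brevity, and the values are just the ones floated above:

```json
{
  "rules": [
    {
      "type": "permanent",
      "condition": "FrequentDockerRestart",
      "reason": "FrequentDockerRestart",
      "path": "/home/kubernetes/bin/log-counter",
      "args": ["--journald-source=dockerd", "--lookback=20m", "--count=3"]
    },
    {
      "type": "permanent",
      "condition": "FrequentDockerRestart",
      "reason": "FrequentDockerRestart",
      "path": "/home/kubernetes/bin/log-counter",
      "args": ["--journald-source=dockerd", "--lookback=40m", "--count=5"]
    }
  ]
}
```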

@SergeyKanzhelev
Member Author

I think it would be best to have multiple rules.

If the config does not support two or more rules for the same condition

Can you check on this?

@mmiranda96
Contributor

Sure. I'll take a deep dive and update with my findings.

@elfinhe

elfinhe commented Aug 9, 2021

I'm not sure if NPD handles this properly, but we could try creating two or more rules, each with a different time period and count. That way, we could catch problems that might not be detected with a single rule.

If the config does not support two or more rules for the same condition, I agree we should lower the count, maybe with an initial delay for booting nodes.

Good idea! Thanks Mike.

@mmiranda96
Contributor

After some research, I've found that there is no restriction preventing a plugin from having more than one rule for a particular condition. However, only the first rule is executed; all other rules with the same condition are ignored.

Knowing this, these are the options we have:

  1. Add support in log-counter for handling more than one lookback/delay/count tuple. This could be done without altering the current interface, by allowing flags to be passed as comma-separated values, e.g. --lookback=20m --count=5 would become --lookback=20m,40m --count=5,8 (see the sketch after this list).
    • Pros:
      • Allows for more flexible rules.
      • Existing rules don't require any changes, since a single value would still be allowed.
    • Cons:
      • Configuration can become hard to maintain.
      • Would require a considerable refactor plus new tests.
      • Would require some non-trivial logic to match values across flags and handle empty cases.
  2. Create a different condition and rule.
    • Pros:
      • Trivial to set, requires no changes to code.
    • Cons:
      • Having different conditions for the same symptom might not be accurate.
      • If we need to add more cases, it becomes noisy.
  3. Adjust the existing rule.
    • Pros:
      • Cleanest option.
      • Trivial to set, requires no changes to code.
    • Cons:
      • Could potentially break existing use cases (reporting either false positives or false negatives).
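
To make option 1 concrete, a single rule would carry comma-separated tuples in its flags, along these lines (hypothetical syntax, not supported by log-counter today; other flags omitted):

```json
{
  "type": "permanent",
  "condition": "FrequentDockerRestart",
  "reason": "FrequentDockerRestart",
  "path": "/home/kubernetes/bin/log-counter",
  "args": [
    "--journald-source=dockerd",
    "--lookback=20m,40m",
    "--count=5,8"
  ]
}
```

log-counter would then presumably evaluate each (lookback, count) pair independently and report the condition as soon as any pair's threshold is exceeded.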

I would like to open the floor for discussion. What would be our best option here?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 15, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 15, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
