
Improve waiting behaviour in case of potential container restarts #2933

Open
gdemonet opened this issue Nov 17, 2020 · 0 comments
Labels

  • complexity:medium (something that requires one or a few days to fix)
  • kind:debt (technical debt)
  • topic:flakiness (some tests are flaky and cause transient CI failures)
  • topic:lifecycle (issues related to upgrade or downgrade of MetalK8s)


@gdemonet
Contributor

Component: salt, scripts, kubelet

Why this is needed:

In #2928 we introduced a sleep 20 in the upgrade script, after the local kubelet is upgraded, to make sure any container restart has completed (especially for the Salt master).

This kind of hardcoded waiting time is problematic, however:

  • if the selected value is too large for the system, we waste time and slow down the user experience
  • if the selected value is too small for the system (e.g. in CI), we risk unwanted failures when waiting a little longer would have been enough
  • in any case, we cannot optimize for both extremes with this approach

What should be done:

The issue at hand is a case of "waiting for something that may happen": a kubelet restart may or may not happen (e.g. if this script is re-run after a flaky attempt), and if kubelet does restart, Pods may or may not change (if kubelet is upgraded, it may add labels/annotations, but perhaps in other situations as well?).

Implementation proposal:

Here's a wild suggestion (rough shell sketches for each step follow, after the list):

  1. Determine whether kubelet has restarted or not

    An option could be to parse the output of state.sls metalk8s.kubernetes.kubelet.standalone to check if the service.running state for kubelet has changes (not sure if that's enough, maybe we should check differently)

  2. If kubelet has restarted, determine if it has reconciled the Pod of interest

    We can look at status.startTime on the Pod, which is updated on restart of kubelet - not sure if enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs

  3. Once the Pod is reconciled, check if it changed

    We can compare the Pod's hash with its previous one (visible via crictl if needed) - we would need to remember it from before the attempt to update kubelet

  4. If the Pod changed, wait for the container to be up
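
For step 1, a minimal sketch of what the upgrade script could do, assuming jq is available on the node and that SALTENV is a placeholder for the target saltenv (the exact salt-call invocation used by the script may differ):

```shell
# Sketch for step 1: detect whether the standalone kubelet state restarted the
# kubelet service, by looking at the "changes" of its service.running state.
# SALTENV is a placeholder; the real script derives it from the destination version.
kubelet_was_restarted() {
    local out
    out=$(salt-call --local --retcode-passthrough --out=json \
        state.sls metalk8s.kubernetes.kubelet.standalone \
        saltenv="$SALTENV") || return 2

    # Salt state keys look like "service_|-<state id>_|-kubelet_|-running";
    # a non-empty "changes" object means the service was (re)started.
    jq --exit-status '
        .local
        | to_entries[]
        | select(.key | startswith("service_|-") and endswith("_|-kubelet_|-running"))
        | .value.changes != {}
    ' <<< "$out" > /dev/null
}
```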
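
For step 2, the reconciliation check could poll the Pod's status.startTime until it moves past the value recorded before the kubelet upgrade. The kubeconfig path, pod name, namespace and retry count below are placeholders, and startTime being a sufficient signal is exactly the open question raised above:

```shell
# Sketch for step 2: wait until kubelet has run a reconciliation pass on the Pod
# after its restart, using status.startTime as the (imperfect) signal.
wait_for_pod_reconciled() {
    local namespace=$1 pod=$2 previous_start_time=$3 retries=${4:-30}
    local current

    for ((i = 0; i < retries; i++)); do
        current=$(kubectl --kubeconfig=/etc/kubernetes/admin.conf \
            get pod "$pod" --namespace "$namespace" \
            --output 'jsonpath={.status.startTime}' 2> /dev/null)
        # A startTime different from the one recorded before the upgrade
        # suggests kubelet has looked at this Pod since its restart.
        if [[ -n "$current" && "$current" != "$previous_start_time" ]]; then
            return 0
        fi
        sleep 2
    done
    return 1
}
```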
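
For step 3, one possible way to compare the Pod's hash is through the kubernetes.io/config.hash annotation that kubelet sets on static Pods; this sketch reads it via the API server, and crictl should expose similar information directly on the node if needed. The annotation name, kubeconfig path and pod name are assumptions to be confirmed as part of this work:

```shell
# Sketch for step 3: read the static Pod's config hash, so the script can
# remember it before upgrading kubelet and compare it afterwards.
get_pod_config_hash() {
    local namespace=$1 pod=$2

    kubectl --kubeconfig=/etc/kubernetes/admin.conf \
        get pod "$pod" --namespace "$namespace" \
        --output "jsonpath={.metadata.annotations['kubernetes\.io/config\.hash']}"
}

# Illustrative usage: remember the hash, upgrade kubelet, then compare.
# before=$(get_pod_config_hash kube-system "salt-master-<node name>")
# ... upgrade kubelet, wait for reconciliation (steps 1 and 2) ...
# after=$(get_pod_config_hash kube-system "salt-master-<node name>")
# [[ "$after" != "$before" ]] && pod_changed=true
```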
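
Finally, for step 4, once we know the Pod changed, a bounded wait on its Ready condition would replace the current sleep 20 (the timeout value and kubeconfig path are only examples):

```shell
# Sketch for step 4: only if the Pod changed, wait (with a bounded timeout)
# for its containers to come back up instead of sleeping a fixed delay.
wait_for_pod_ready() {
    local namespace=$1 pod=$2 timeout=${3:-120s}

    kubectl --kubeconfig=/etc/kubernetes/admin.conf \
        wait --for=condition=Ready "pod/$pod" \
        --namespace "$namespace" --timeout "$timeout"
}
```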

Test plan:
