
Improve waiting behaviour in case of potential container restarts #2933

Open
gdemonet opened this issue Nov 17, 2020 · 0 comments
Labels

  • complexity:medium (something that requires one or a few days to fix)
  • kind:debt (technical debt)
  • topic:flakiness (some tests are flaky and cause transient CI failures)
  • topic:lifecycle (issues related to upgrade or downgrade of MetalK8s)


@gdemonet
Contributor

Component: salt, scripts, kubelet

Why this is needed:

In #2928 we introduced a sleep 20 in the upgrade script, after the local kubelet is upgraded, to make sure any container restart has completed (especially for the Salt master).

This kind of hardcoded waiting time is problematic, however:

  • if the selected value is too large for the system, we waste time and slow down the user experience
  • if the selected value is too small for the system (e.g. in CI), we risk unwanted failures when waiting a little longer would have been enough
  • in any case, we cannot optimize for both extremes with this approach

What should be done:

The issue at hand is a case of "waiting for something that may happen": a kubelet restart may or may not happen (e.g. if this script is re-run after a flaky attempt), and if kubelet does restart, Pods may or may not change (if kubelet is upgraded, it may add labels/annotations, but perhaps in other situations as well?).

Implementation proposal:

Here's a wild suggestion (rough shell sketches for each step follow, after the list):

  1. Determine whether kubelet has restarted or not

    An option could be to parse the output of state.sls metalk8s.kubernetes.kubelet.standalone to check if the service.running state for kubelet has changes (not sure if that's enough, maybe we should check differently)

  2. If kubelet has restarted, determine if it has reconciled the Pod of interest

    We can look at status.startTime on the Pod, which is updated on restart of kubelet - not sure if enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs

  3. Once the Pod is reconciled, check if it changed

    We can compare the Pod's hash with its previous one (visible via crictl if needed) - we would need to remember it from before the attempt to update kubelet

  4. If the Pod changed, wait for the container to be up
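
For step 1, a minimal sketch of what the upgrade script could do, assuming jq is available on the node and that SALTENV is a placeholder for the target saltenv (the exact salt-call invocation used by the script may differ):

```shell
# Sketch for step 1: detect whether the standalone kubelet state restarted the
# kubelet service, by looking at the "changes" of its service.running state.
# SALTENV is a placeholder; the real script derives it from the destination version.
kubelet_was_restarted() {
    local out
    out=$(salt-call --local --retcode-passthrough --out=json \
        state.sls metalk8s.kubernetes.kubelet.standalone \
        saltenv="$SALTENV") || return 2

    # Salt state keys look like "service_|-<state id>_|-kubelet_|-running";
    # a non-empty "changes" object means the service was (re)started.
    jq --exit-status '
        .local
        | to_entries[]
        | select(.key | startswith("service_|-") and endswith("_|-kubelet_|-running"))
        | .value.changes != {}
    ' <<< "$out" > /dev/null
}
```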
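
For step 2, the reconciliation check could poll the Pod's status.startTime until it moves past the value recorded before the kubelet upgrade. The kubeconfig path, pod name, namespace and retry count below are placeholders, and startTime being a sufficient signal is exactly the open question raised above:

```shell
# Sketch for step 2: wait until kubelet has run a reconciliation pass on the Pod
# after its restart, using status.startTime as the (imperfect) signal.
wait_for_pod_reconciled() {
    local namespace=$1 pod=$2 previous_start_time=$3 retries=${4:-30}
    local current

    for ((i = 0; i < retries; i++)); do
        current=$(kubectl --kubeconfig=/etc/kubernetes/admin.conf \
            get pod "$pod" --namespace "$namespace" \
            --output 'jsonpath={.status.startTime}' 2> /dev/null)
        # A startTime different from the one recorded before the upgrade
        # suggests kubelet has looked at this Pod since its restart.
        if [[ -n "$current" && "$current" != "$previous_start_time" ]]; then
            return 0
        fi
        sleep 2
    done
    return 1
}
```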
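
For step 3, one possible way to compare the Pod's hash is through the kubernetes.io/config.hash annotation that kubelet sets on static Pods; this sketch reads it via the API server, and crictl should expose similar information directly on the node if needed. The annotation name, kubeconfig path and pod name are assumptions to be confirmed as part of this work:

```shell
# Sketch for step 3: read the static Pod's config hash, so the script can
# remember it before upgrading kubelet and compare it afterwards.
get_pod_config_hash() {
    local namespace=$1 pod=$2

    kubectl --kubeconfig=/etc/kubernetes/admin.conf \
        get pod "$pod" --namespace "$namespace" \
        --output "jsonpath={.metadata.annotations['kubernetes\.io/config\.hash']}"
}

# Illustrative usage: remember the hash, upgrade kubelet, then compare.
# before=$(get_pod_config_hash kube-system "salt-master-<node name>")
# ... upgrade kubelet, wait for reconciliation (steps 1 and 2) ...
# after=$(get_pod_config_hash kube-system "salt-master-<node name>")
# [[ "$after" != "$before" ]] && pod_changed=true
```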
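
Finally, for step 4, once we know the Pod changed, a bounded wait on its Ready condition would replace the current sleep 20 (the timeout value and kubeconfig path are only examples):

```shell
# Sketch for step 4: only if the Pod changed, wait (with a bounded timeout)
# for its containers to come back up instead of sleeping a fixed delay.
wait_for_pod_ready() {
    local namespace=$1 pod=$2 timeout=${3:-120s}

    kubectl --kubeconfig=/etc/kubernetes/admin.conf \
        wait --for=condition=Ready "pod/$pod" \
        --namespace "$namespace" --timeout "$timeout"
}
```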

Test plan:
