Improve waiting behaviour in case of potential container restarts #2933
Labels:
- `complexity:medium`: Something that requires one or a few days to fix
- `kind:debt`: Technical debt
- `topic:flakiness`: Some tests are flaky and cause transient CI failures
- `topic:lifecycle`: Issues related to upgrade or downgrade of MetalK8s
Component: salt, scripts, kubelet
Why this is needed:
In #2928 we introduced a `sleep 20` in the upgrade script, after the local kubelet is upgraded, to make sure any container restart is complete (especially for the Salt master). This kind of hardcoded waiting time is, however, problematic: it slows down every upgrade even when nothing needs to restart, and there is no guarantee it waits long enough when something does.
What should be done:
The issue at hand is a case of "waiting for something that may happen": a kubelet restart may or may not happen (e.g. when this script is re-run after a flaky failure), and if kubelet restarts, Pods may or may not change (if kubelet is upgraded, it may add labels/annotations, but maybe in other situations as well?).
Implementation proposal:
Here's a wild suggestion (a rough shell sketch of the whole flow follows the list):

1. Determine whether kubelet has restarted or not.
   An option could be to parse the output of `state.sls metalk8s.kubernetes.kubelet.standalone` to check whether the `service.running` state for kubelet reports changes (not sure if that's enough, maybe we should check differently).
2. If kubelet has restarted, determine if it has reconciled the Pod of interest.
   We can look at `status.startTime` on the Pod, which is updated on restart of kubelet. Not sure if that is enough either, but I'd expect a single reconciliation pass for the Pod to include whatever new labels/annotations it needs.
3. Once the Pod is reconciled, check if it changed.
   We can compare the Pod's hash with its previous one (visible via `crictl` if needed); we would need to remember it from before the attempt to update kubelet.
4. If the Pod changed, wait for the container to be up.
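
Below is a rough bash sketch of that flow, only to make the proposal more concrete; it is not meant as the actual implementation. Everything node/Pod-specific is a placeholder (it hypothetically watches the local `salt-master-<node>` static Pod in `kube-system`), the `salt-call`/`jq` invocation and the `service_|-..._|-kubelet_|-running` key pattern are assumptions about how the state output looks, and instead of `crictl` it compares the `kubernetes.io/config.hash` annotation that kubelet puts on the mirror Pod.

```bash
#!/bin/bash
# Sketch of the proposed wait logic, meant to replace the hardcoded `sleep 20`
# that follows the local kubelet upgrade.
set -euo pipefail

# Placeholders: the real script knows which node it runs on and which Pod
# it needs to watch (here, hypothetically, the local salt-master static Pod).
NODE=${NODE:-$(hostname)}
NAMESPACE=${NAMESPACE:-kube-system}
POD=${POD:-salt-master-$NODE}
export KUBECONFIG=${KUBECONFIG:-/etc/kubernetes/admin.conf}

pod_start_time() {
    kubectl get pod -n "$NAMESPACE" "$POD" \
        -o jsonpath='{.status.startTime}' 2>/dev/null || true
}
pod_config_hash() {
    # For a static Pod, kubelet exposes its manifest hash on the mirror Pod.
    kubectl get pod -n "$NAMESPACE" "$POD" \
        -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}' \
        2>/dev/null || true
}

# Remember the Pod's state *before* attempting to upgrade kubelet.
old_start=$(pod_start_time)
old_hash=$(pod_config_hash)

# 1. Apply the kubelet state (the exact invocation is whatever the upgrade
#    script already uses) and detect whether the kubelet service actually
#    changed: a restart should show up as a non-empty `changes` dict on its
#    `service.running` state in the JSON output.
state_output=$(salt-call --local --retcode-passthrough --out=json \
    state.sls metalk8s.kubernetes.kubelet.standalone)
kubelet_changes=$(echo "$state_output" | jq '
    [ .local | to_entries[]
      | select(.key | startswith("service_|-")
               and endswith("_|-kubelet_|-running"))
      | .value.changes | length
    ] | add // 0')

if [ "$kubelet_changes" -eq 0 ]; then
    echo "kubelet was not restarted, nothing to wait for"
    exit 0
fi

# 2. Wait until kubelet has reconciled the Pod: its status.startTime should
#    move once kubelet re-registers it after the restart.
for _ in $(seq 1 150); do
    [ -n "$(pod_start_time)" ] && [ "$(pod_start_time)" != "$old_start" ] && break
    sleep 2
done

# 3. Check whether the Pod itself changed (a new manifest means a new hash).
if [ "$(pod_config_hash)" = "$old_hash" ]; then
    echo "Pod $POD unchanged after kubelet restart"
    exit 0
fi

# 4. The Pod changed: wait for its containers to be up and ready again.
kubectl wait --for=condition=Ready --timeout=300s -n "$NAMESPACE" "pod/$POD"
```

The polling loop and the hash/startTime checks would of course need proper timeouts and error reporting in the real script; the point is only that each wait is tied to an observable condition instead of a fixed sleep.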
Test plan: