Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart device plugin pod after policy apply #173

Merged
merged 1 commit into from
Sep 23, 2021

Conversation

e0ne
Copy link
Collaborator

@e0ne e0ne commented Aug 2, 2021

We need to restart device plugin pod after node policy applied
and all SR-IOV Network Config Daemon plugins finished. E.g.:

Generic plugin applies SR-IOV Network Node Policy and creates
VFs. That's mean we need to restart device plugin pod to found
newly created devices.

Related k8snetworkplumbingwg/sriov-network-device-plugin#276

Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
@zshi-redhat
Copy link
Collaborator

Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.

What are those changes a plugin may do without sriov policy being updated?

@adrianchiris
Copy link
Collaborator

I am not sure this change is needed. we have encountered some issues in our CI. but it may be env issue.
@e0ne is checking and will update.

for now this should be on hold

@e0ne e0ne marked this pull request as draft September 7, 2021 21:39
@e0ne
Copy link
Collaborator Author

e0ne commented Sep 8, 2021

What are those changes a plugin may do without sriov policy being updated?

If my understanding is correct, if policy is not updated this code won't be executed because the generation number will be the same

@zshi-redhat
Copy link
Collaborator

What are those changes a plugin may do without sriov policy being updated?

If my understanding is correct, if policy is not updated this code won't be executed because the generation number will be the same

Correct, only new generation of policy will trigger the device plugin restart.

@zshi-redhat
Copy link
Collaborator

What are those changes a plugin may do without sriov policy being updated?

If my understanding is correct, if policy is not updated this code won't be executed because the generation number will be the same

Correct, only new generation of policy will trigger the device plugin restart.

Do you see a situation that devices are updated but the conditional checks for reqDrain and DpConfigVersion are not met?

@e0ne
Copy link
Collaborator Author

e0ne commented Sep 8, 2021

@zshi-redhat yes, you can see logs of network-operator CI where this issue is reproduced: http://13.74.249.42/nic_operator_helm-ci/405/

@e0ne
Copy link
Collaborator Author

e0ne commented Sep 8, 2021

@zshi-redhat we've to a pretty specific use-case when we've got custom SriovOperatorConfig with configDaemonNodeSelector and labels could be dynamically added or removed

@e0ne e0ne marked this pull request as ready for review September 8, 2021 14:04
@e0ne
Copy link
Collaborator Author

e0ne commented Sep 14, 2021

@zshi-redhat I've got such steps to reproduce the issue:

  • SR-IOV should be disable in a firmware, so reboot is required
  • once node is rebooted and SR-IOV is enabled, we apply generic plugin but reqDrain is false and DpConfigVersion is not changed

so my patch just adds reboot each time we do changes to SR-IOV configuration

@e0ne e0ne changed the title Always restart device plugin pod Restart device plugin pod after policy apply Sep 21, 2021
@@ -519,14 +519,11 @@ func (dn *Daemon) nodeStateSyncHandler(generation int64) error {
}

// restart device plugin pod
if reqDrain || latestState.Spec.DpConfigVersion != dn.nodeState.Spec.DpConfigVersion {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@e0ne With your fix, the device plugin pod will be restarted in every reconcile loop. It will cause many unnecessary pod restarts.

The issue you mentioned only happens when the numVFs is changed from 0 to a non-zero value, as we don't need to drain the node when VFs are fresh created

So a better solution should be to add this condition to trigger the device plugin pod restart, instead of always triggering a restart.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pliurh
Copy link
Collaborator

pliurh commented Sep 23, 2021

@zshi-redhat This PR looks good to me. Do you have any other comments?

@zshi-redhat
Copy link
Collaborator

@zshi-redhat This PR looks good to me. Do you have any other comments?

no, looks good to me.

@pliurh pliurh merged commit 14bd335 into k8snetworkplumbingwg:master Sep 23, 2021
@e0ne e0ne deleted the device-plugin-restart branch September 23, 2021 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants