
Multiple SriovNetworkNodePolicy creation fails #230

Closed
e0ne opened this issue Jan 19, 2022 · 9 comments · Fixed by #232

Comments

@e0ne
Collaborator

e0ne commented Jan 19, 2022

Environment details:

  • the latest SR-IOV Network Operator
  • at least two worker nodes, each with two SR-IOV NICs: worker1 nic1, worker1 nic2, worker2 nic1, worker2 nic2
  • SR-IOV is disabled on the NICs

Steps to reproduce:

  • create two YAML files with policy definitions:
    • file1.yaml: it should contain policies for worker1 nic1 and worker2 nic1
    • file2.yaml: it should contain policies for worker1 nic2 and worker2 nic2
  • apply file1.yaml
  • wait until worker1 starts rebooting
  • apply file2.yaml
  • wait until worker1 comes back up

Expected results:
All policies are applied

Actual results:
Only some of the policies are applied. Leadership is elected for worker2, while worker1 still has the 'Draining' annotation, so the config daemons do not proceed with any configuration.

@SchSeba
Collaborator

SchSeba commented Jan 19, 2022

Hi @e0ne, can you please check the sriov-network-config-daemon logs?

@adrianchiris
Collaborator

adrianchiris commented Jan 19, 2022

I believe we are essentially hitting a deadlock in the case where worker1, after reboot, needs to drain the node while worker2 is holding the drain lock (it is the leader and is waiting for the 'Draining' annotation to be removed from worker1).

Or, generally speaking, any case where nodeStateSyncHandler is called in sriov-network-config-daemon while the node already has the draining annotation and, during execution, reqDrain is true and disableDrain is false.
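To make the scenario concrete, here is a minimal sketch of the problematic flow. nodeStateSyncHandler, reqDrain and disableDrain come from the discussion above; the Daemon type and all helper names are illustrative assumptions, not the operator's real API:

```go
package daemon

import "context"

// Daemon is a stripped-down stand-in for the config daemon's state;
// all helpers below are illustrative, not the operator's real code.
type Daemon struct {
	disableDrain bool
}

func (dn *Daemon) needDrainToApply() bool { return true } // e.g. a new VF config requires a drain

func (dn *Daemon) acquireDrainLock(ctx context.Context) error {
	// In the real daemon this blocks until leadership of the shared drain
	// lock is acquired; if worker2 never releases it, we block forever.
	<-ctx.Done()
	return ctx.Err()
}

func (dn *Daemon) drainNode(ctx context.Context) error { return nil }

// nodeStateSyncHandler sketches the problematic flow described above.
func (dn *Daemon) nodeStateSyncHandler(ctx context.Context) error {
	reqDrain := dn.needDrainToApply() // the second policy arrived after worker1's reboot
	if reqDrain && !dn.disableDrain {
		// worker1 is still annotated "Draining" from the first policy, yet it
		// tries to take the drain lock again...
		if err := dn.acquireDrainLock(ctx); err != nil {
			return err
		}
		// ...while worker2 holds the lock and waits for worker1's "Draining"
		// annotation to be removed, so neither side can make progress.
		return dn.drainNode(ctx)
	}
	return nil
}
```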

e0ne added a commit to e0ne/sriov-network-operator that referenced this issue Jan 20, 2022
Since the drain operation has already started, we don't need to
acquire the drain lock for this node because the node already has
the required annotation.

It's safe to continue the node drain procedure without the lock.

Closes: k8snetworkplumbingwg#230

Signed-off-by: Ivan Kolodyazhny <[email protected]>
@zshi-redhat
Collaborator

> I believe we are essentially hitting a deadlock in the case where worker1, after reboot, needs to drain the node while worker2 is holding the drain lock (it is the leader and is waiting for the 'Draining' annotation to be removed from worker1).

Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

@adrianchiris
Collaborator

> Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

As far as I saw, there is no place in the daemon code that prevents it. As soon as nodeStateSyncHandler() completes, the config daemon releases leadership.

@zshi-redhat
Collaborator

> Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

> As far as I saw, there is no place in the daemon code that prevents it. As soon as nodeStateSyncHandler() completes, the config daemon releases leadership.

It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

@adrianchiris
Collaborator

> It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

I see in daemon.go#L803 that it is called within OnStartedLeading, so at this point it has already got the lock, i.e. it has started leading.

It will loop until the node becomes drainable or the context is canceled, which never happens in this case.

Am I missing something?
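For reference, the pattern under discussion looks roughly like the sketch below, built on client-go's leader election. dn.drainable and OnStartedLeading come from the thread; the lock setup, timings and field names are assumptions rather than the actual daemon.go code:

```go
package daemon

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// Daemon is a stripped-down stand-in for the config daemon's state.
type Daemon struct {
	drainLock resourcelock.Interface // shared lease, one draining node at a time (assumed)
	drainable bool                   // set when no other node carries the "Draining" annotation
}

// getDrainLock sketches the behavior discussed above: leadership is acquired
// first, then the daemon waits until the node is drainable.
func (dn *Daemon) getDrainLock(ctx context.Context, done chan bool) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            dn.drainLock,
		ReleaseOnCancel: true,
		LeaseDuration:   5 * time.Second,
		RenewDeadline:   3 * time.Second,
		RetryPeriod:     1 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// At this point the lock is already held (we are leading).
				// Wait until no other node is annotated "Draining". If another
				// node can never clear that annotation because it is itself
				// waiting for this very lock, this loop spins forever.
				for {
					select {
					case <-ctx.Done():
						return
					default:
					}
					if dn.drainable {
						done <- true
						return
					}
					time.Sleep(time.Second)
				}
			},
			OnStoppedLeading: func() {},
		},
	})
}
```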

@zshi-redhat
Collaborator

> It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

> I see in daemon.go#L803 that it is called within OnStartedLeading, so at this point it has already got the lock, i.e. it has started leading.

> It will loop until the node becomes drainable or the context is canceled, which never happens in this case.

dn.drainable will be set once the other nodes complete their drain, right? So it will eventually be able to proceed.

> Am I missing something?

@adrianchiris
Collaborator

> dn.drainable will be set once the other nodes complete their drain, right? So it will eventually be able to proceed.

Yes, but the other node will not be able to complete its drain if it also attempts to get the drain lock (i.e. to acquire leadership).

PR #232 addresses that by skipping the lock in the daemon's nodeStateSyncHandler() when the node is already draining, which allows the daemon to complete the drain.
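Conceptually, the change amounts to something like the sketch below: skip acquiring the drain lock when this node already carries the 'Draining' annotation. This is a simplified illustration of the approach, not the literal diff in #232, and the type and helper names are assumptions:

```go
package daemon

import "context"

// Daemon is a stripped-down stand-in; the helpers are illustrative,
// not the exact code changed in PR #232.
type Daemon struct {
	draining bool // true when this node already carries the "Draining" annotation
}

func (dn *Daemon) acquireDrainLock(ctx context.Context) error { return nil } // blocks on leadership in the real daemon
func (dn *Daemon) drainNode(ctx context.Context) error        { return nil }

// applyDrainIfNeeded sketches the fix described above: a node that is already
// marked "Draining" keeps its drain slot from before the reboot, so it can
// continue draining without competing for the cluster-wide drain lock.
func (dn *Daemon) applyDrainIfNeeded(ctx context.Context) error {
	if dn.draining {
		// Skip the lock: re-acquiring it could deadlock against the current
		// leader, which is waiting for this node's annotation to clear.
		return dn.drainNode(ctx)
	}
	// Otherwise, take the drain lock first, as before.
	if err := dn.acquireDrainLock(ctx); err != nil {
		return err
	}
	return dn.drainNode(ctx)
}
```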

@pliurh
Collaborator

pliurh commented Jan 21, 2022

I've reproduced this issue on OCP.

zeeke pushed a commit to zeeke/sriov-network-operator-1 that referenced this issue Mar 7, 2022
Since the drain operation has already started, we don't need to
acquire the drain lock for this node because the node already has
the required annotation.

It's safe to continue the node drain procedure without the lock.

Closes: k8snetworkplumbingwg#230

Signed-off-by: Ivan Kolodyazhny <[email protected]>