
Multiple SriovNetworkNodePolicy creation fails #230

Closed
e0ne opened this issue Jan 19, 2022 · 9 comments · Fixed by #232

Comments

@e0ne
Collaborator

e0ne commented Jan 19, 2022

Environment details:

  • the latest SR-IOV Network Operator
  • at least two worker nodes, each with two SR-IOV NICs: worker1 nic1, worker1 nic2, worker2 nic1, worker2 nic2
  • SR-IOV is disabled on the NICs

Steps to reproduce:

  • create two YAML files with policy definitions:
    • file1.yaml: it should contain policies for worker1 nic1 and worker2 nic1
    • file2.yaml: it should contain policies for worker1 nic2 and worker2 nic2
  • apply file1.yaml
  • wait until worker1 starts rebooting
  • apply file2.yaml
  • wait until worker1 comes back up

Expected results:
All policies are applied

Actual results:
Only some of the policies are applied. Leadership is elected for worker2, while worker1 still has the 'Draining' annotation, so the config daemons do not proceed with any configuration.

@SchSeba
Collaborator

SchSeba commented Jan 19, 2022

Hi @e0ne, can you please check the sriov-network-config-daemon logs?

@adrianchiris
Collaborator

adrianchiris commented Jan 19, 2022

I believe we are essentially hitting a deadlock in the case where worker1, after reboot, needs to drain the node while worker2 is holding the drain lock (it is the leader and is waiting for the 'Draining' annotation to be removed from worker1).

Or, generally speaking, any case where nodeStateSyncHandler is called in sriov-network-config-daemon while the node already has the draining annotation and, during execution, reqDrain is true and disableDrain is false.
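To make the scenario concrete, here is a minimal sketch of the problematic flow. nodeStateSyncHandler, reqDrain and disableDrain come from the discussion above; the Daemon type and all helper names are illustrative assumptions, not the operator's real API:

```go
package daemon

import "context"

// Daemon is a stripped-down stand-in for the config daemon's state;
// all helpers below are illustrative, not the operator's real code.
type Daemon struct {
	disableDrain bool
}

func (dn *Daemon) needDrainToApply() bool { return true } // e.g. a new VF config requires a drain

func (dn *Daemon) acquireDrainLock(ctx context.Context) error {
	// In the real daemon this blocks until leadership of the shared drain
	// lock is acquired; if worker2 never releases it, we block forever.
	<-ctx.Done()
	return ctx.Err()
}

func (dn *Daemon) drainNode(ctx context.Context) error { return nil }

// nodeStateSyncHandler sketches the problematic flow described above.
func (dn *Daemon) nodeStateSyncHandler(ctx context.Context) error {
	reqDrain := dn.needDrainToApply() // the second policy arrived after worker1's reboot
	if reqDrain && !dn.disableDrain {
		// worker1 is still annotated "Draining" from the first policy, yet it
		// tries to take the drain lock again...
		if err := dn.acquireDrainLock(ctx); err != nil {
			return err
		}
		// ...while worker2 holds the lock and waits for worker1's "Draining"
		// annotation to be removed, so neither side can make progress.
		return dn.drainNode(ctx)
	}
	return nil
}
```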

e0ne added a commit to e0ne/sriov-network-operator that referenced this issue Jan 20, 2022
Since the drain operation has already started, we don't need to
acquire the drain lock for this node because the node already has
the required annotation.

It's safe to continue the node drain procedure without the lock.

Closes: k8snetworkplumbingwg#230

Signed-off-by: Ivan Kolodyazhny <[email protected]>
@zshi-redhat
Collaborator

> I believe we are essentially hitting a deadlock in the case where worker1, after reboot, needs to drain the node while worker2 is holding the drain lock (it is the leader and is waiting for the 'Draining' annotation to be removed from worker1).

Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

@adrianchiris
Collaborator

> Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

As far as I saw, there is no place in the daemon code that prevents it. As soon as nodeStateSyncHandler() completes, the config daemon releases leadership.

@zshi-redhat
Collaborator

> Wondering why worker-2 is able to get the drain lock while worker-1 still holds the Draining annotation?

> As far as I saw, there is no place in the daemon code that prevents it. As soon as nodeStateSyncHandler() completes, the config daemon releases leadership.

It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

@adrianchiris
Collaborator

> It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

I see in daemon.go#L803 that it is called within OnStartedLeading, so at this point it has already got the lock, i.e. it has started leading.

It will loop until the node becomes drainable or the context is canceled, which never happens in this case.

Am I missing something?
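For reference, the pattern under discussion looks roughly like the sketch below, built on client-go's leader election. dn.drainable and OnStartedLeading come from the thread; the lock setup, timings and field names are assumptions rather than the actual daemon.go code:

```go
package daemon

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// Daemon is a stripped-down stand-in for the config daemon's state.
type Daemon struct {
	drainLock resourcelock.Interface // shared lease, one draining node at a time (assumed)
	drainable bool                   // set when no other node carries the "Draining" annotation
}

// getDrainLock sketches the behavior discussed above: leadership is acquired
// first, then the daemon waits until the node is drainable.
func (dn *Daemon) getDrainLock(ctx context.Context, done chan bool) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            dn.drainLock,
		ReleaseOnCancel: true,
		LeaseDuration:   5 * time.Second,
		RenewDeadline:   3 * time.Second,
		RetryPeriod:     1 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// At this point the lock is already held (we are leading).
				// Wait until no other node is annotated "Draining". If another
				// node can never clear that annotation because it is itself
				// waiting for this very lock, this loop spins forever.
				for {
					select {
					case <-ctx.Done():
						return
					default:
					}
					if dn.drainable {
						done <- true
						return
					}
					time.Sleep(time.Second)
				}
			},
			OnStoppedLeading: func() {},
		},
	})
}
```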

@zshi-redhat
Collaborator

> It gets the drainLock when dn.drainable is true, but dn.drainable requires that other nodes do not have the Draining annotation.

> I see in daemon.go#L803 that it is called within OnStartedLeading, so at this point it has already got the lock, i.e. it has started leading.

> It will loop until the node becomes drainable or the context is canceled, which never happens in this case.

dn.drainable will be set once the other nodes complete their drain, right? So it will eventually be able to proceed.

> Am I missing something?

@adrianchiris
Collaborator

> dn.drainable will be set once the other nodes complete their drain, right? So it will eventually be able to proceed.

Yes, but the other node will not be able to complete its drain if it also attempts to get the drain lock (i.e. to acquire leadership).

PR #232 addresses that by skipping the lock in the daemon's nodeStateSyncHandler() when the node is already draining, which allows the daemon to complete the drain.
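Conceptually, the change amounts to something like the sketch below: skip acquiring the drain lock when this node already carries the 'Draining' annotation. This is a simplified illustration of the approach, not the literal diff in #232, and the type and helper names are assumptions:

```go
package daemon

import "context"

// Daemon is a stripped-down stand-in; the helpers are illustrative,
// not the exact code changed in PR #232.
type Daemon struct {
	draining bool // true when this node already carries the "Draining" annotation
}

func (dn *Daemon) acquireDrainLock(ctx context.Context) error { return nil } // blocks on leadership in the real daemon
func (dn *Daemon) drainNode(ctx context.Context) error        { return nil }

// applyDrainIfNeeded sketches the fix described above: a node that is already
// marked "Draining" keeps its drain slot from before the reboot, so it can
// continue draining without competing for the cluster-wide drain lock.
func (dn *Daemon) applyDrainIfNeeded(ctx context.Context) error {
	if dn.draining {
		// Skip the lock: re-acquiring it could deadlock against the current
		// leader, which is waiting for this node's annotation to clear.
		return dn.drainNode(ctx)
	}
	// Otherwise, take the drain lock first, as before.
	if err := dn.acquireDrainLock(ctx); err != nil {
		return err
	}
	return dn.drainNode(ctx)
}
```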

@pliurh
Collaborator

pliurh commented Jan 21, 2022

I've reproduced this issue on OCP.

zeeke pushed a commit to zeeke/sriov-network-operator-1 that referenced this issue Mar 7, 2022
Since the drain operation has already started, we don't need to
acquire the drain lock for this node because the node already has
the required annotation.

It's safe to continue the node drain procedure without the lock.

Closes: k8snetworkplumbingwg#230

Signed-off-by: Ivan Kolodyazhny <[email protected]>