NPL Controller shuts down if iptables-restore operation fails #2554

Closed
antoninbas opened this issue Aug 6, 2021 · 0 comments · Fixed by #2555
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@antoninbas
Contributor

Describe the bug
Thanks @alokmaurya88 for reporting this issue.

When restarting the Antrea Agent in a large-scale cluster, the iptables-restore operation used to restore all the DNAT rules previously installed by the NPL Controller (and saved as Pod annotations) may fail because of contention for the xtables lock. In this case, the following is observed in the Antrea Agent logs:

-A ANTREA-NODE-PORT-LOCAL -p tcp -m tcp --dport 61013 -j DNAT --to-destination 10.129.243.196:81
COMMIT

stderr:
Another app is currently holding the xtables lock; still 8s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 6s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 4s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 2s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock; still 0s 200000us time ahead to have a chance to grab the lock...
Another app is currently holding the xtables lock. Stopped waiting after 10s.
E0806 05:13:30.313148       1 npl_controller.go:136] Error in getting Pods and generating rules: error executing iptables-restore: exit status 4
I0806 05:13:30.313167       1 npl_controller.go:124] Shutting down AntreaAgentNPLController

To Reproduce
Reproducing this requires contention with other components / processes that need access to iptables, e.g. kube-proxy. Another process must hold the xtables lock for the full 10s wait timeout for iptables-restore to fail, so in the case of kube-proxy this requires a large number of Services / Endpoints.

Expected
In case of contention, the NPL Controller should either:

  1. keep retrying until the iptables-restore operation is successful, or
  2. let the controller event handler recover from the failed initialization, or
  3. cause agent initialization to fail so that the agent is restarted automatically

I plan to implement option 1 as part of a bug fix patch. Option 2 can be considered in the future.

Actual behavior
In this scenario, the NPL Controller shuts down and is not started again unless the Antrea Agent itself is restarted. This is not acceptable, because the desired NPL behavior is never realized.

Versions:
Antrea version: v1.2.0, v1.2.1, main

antoninbas added the kind/bug and priority/important-soon labels Aug 6, 2021
antoninbas added a commit to antoninbas/antrea that referenced this issue Aug 6, 2021
Add a retry mechanism in the Controller initialization, which will keep
trying to sync iptables rules until the operation is successful. On
success, the NPL Controller is notified through a channel and can start
its event handlers.

Fixes antrea-io#2554

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Aug 10, 2021
antoninbas added a commit that referenced this issue Aug 11, 2021
Add a retry mechanism in the Controller initialization, which will keep
trying to sync iptables rules until the operation is successful. On
success, the NPL Controller is notified through a channel and can start
its event handlers.

Fixes #2554

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Aug 11, 2021
antoninbas added a commit that referenced this issue Aug 11, 2021
…ectly in NPL (#2575)

Add a retry mechanism in the Controller initialization, which will keep
trying to sync iptables rules until the operation is successful. On
success, the NPL Controller is notified through a channel and can start
its event handlers.

Fixes #2554

Signed-off-by: Antonin Bas <[email protected]>
annakhm pushed a commit to annakhm/antrea that referenced this issue Aug 16, 2021
…io#2555)

Add a retry mechanism in the Controller initialization, which will keep
trying to sync iptables rules until the operation is successful. On
success, the NPL Controller is notified through a channel and can start
its event handlers.

Fixes antrea-io#2554

Signed-off-by: Antonin Bas <[email protected]>