Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NPM] [Linux] another race condition when editing a NetPol or deleting then readding it #2977

Closed
huntergregory opened this issue Aug 29, 2024 · 3 comments · Fixed by #2978 or #2990
Closed
Labels
bug linux npm Related to NPM.

Comments

@huntergregory
Copy link
Contributor

huntergregory commented Aug 29, 2024

NOTE: v1.5.37 will fix completely. In v1.5.36, this issue is fixed only for NetPols with CIDR rules

Similar issue: #2963

EDIT (9/5): based on #2977 (comment)

EDIT (9/9): based on #2977 (comment)

Summary

Under this race condition, NPM applies a single NetworkPolicy incorrectly.

This race can occur when editing a NetworkPolicy with "enough" rules or when deleting such a NetPol then readding it before any other changes occur to Pods, Namespaces, or NetworkPolicies in the cluster. This race condition occurs when the kernel is slow to process changes.

Symptoms

Unexpected connectivity after editing a NetPol, or after removing a NetPol then readding the NetworkPolicy.

NPM logs include the following lines in succession:

I0826 05:49:10.741331       1 networkPolicyController.go:225] Network Policy demo/demo-policy is not found, may be it is deleted
I0826 05:49:10.741348       1 dataplane.go:545] [DataPlane] Remove Policy called for demo/demo-policy
...
E0826 05:49:10.807012       1 networkPolicyController.go:195] error syncing 'demo/demo-policy': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] [APPLY-DP] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
] when network policy is not found, requeuing
I0826 05:49:10.812231       1 networkPolicyController.go:225] Network Policy demo/demo-policy is not found, may be it is deleted
I0826 05:49:10.812244       1 dataplane.go:545] [DataPlane] Remove Policy called for demo/demo-policy
I0826 05:49:10.812248       1 dataplane.go:560] [DataPlane] Policy demo/demo-policy is not found. Might been deleted already
I0826 05:49:10.812304       1 networkPolicyController.go:191] Successfully synced 'demo/demo-policy'
I0826 05:49:56.030931       1 dataplane.go:612] [DataPlane] Update Policy called for demo/demo-policy
I0826 05:49:56.030946       1 dataplane.go:615] [DataPlane] Policy demo/demo-policy is not found.
I0826 05:49:56.030951       1 dataplane.go:394] [DataPlane] Add Policy called for demo/demo-policy

Mitigation

Try deleting and readding the NetworkPolicy again. If that does not work, restart all NPM Pods.

To Avoid the Issue

  • Option 1: avoid editing/deleting NetworkPolicies, especially those with many CIDR rules or MatchExpression rules.
  • Option 2: Break a policy into multiple policies with no more than 4 rules (total, across ingress and egress).

Cause

When removing a NetPol, NPM deletes the Policy firewall rules, then flushes the IPs from the policy's CIDR and MatchExpression IPSets, then tries to delete those IPSets as well as any other policy IPSets which are empty (no associated Pod IPs). Sometimes the kernel still thinks that firewall rules reference the IPsets (hence the log line Set cannot be destroyed: it is in use by a kernel component), so after 5 attempts, NPM will fail to delete the IPSets. The previous "slow kernel" issue is temporary. NPM will not retry the IPSet operations until the next change in Pods, Namespaces, or NetworkPolicies. If the next change is adding back the same NetworkPolicy, then NPM may apply the NetworkPolicy incorrectly leading to unexpected connectivity. If the next change is anything else, then the IPSets will be properly deleted, and no issue occurs.

If the next change is adding back the same NetworkPolicy, then NPM may apply the NetworkPolicy incorrectly leading to unexpected connectivity

Why? Because the NetworkPolicy might use the same IPSets, and NPM thinks those IPSets still exist in the kernel as they did before (with all IPs included); however, NPM has already flushed all IPs from the IPSets, so NPM might not add back IPs that it should.

For instance, if the NetworkPolicy is added back with no edits, then NPM will think that every CIDR IPSet is properly configured in the kernel. However, all of the IPSets will be missing their IPs. Therefore, no IPs will be allowed by the policy's firewall rules.

Logs

I0826 05:49:10.741331       1 networkPolicyController.go:225] Network Policy demo/demo-policy is not found, may be it is deleted
I0826 05:49:10.741348       1 dataplane.go:545] [DataPlane] Remove Policy called for demo/demo-policy
I0826 05:49:10.741417       1 chain-management_linux.go:366] Executing iptables command with args [-w 60 -D AZURE-NPM-INGRESS -j AZURE-NPM-INGRESS-2461490616 -m set --match-set azure-npm-4137662606 dst -m set --match-set azure-npm-3252124261 dst -m comment --comment INGRESS-POLICY-demo/demo-policy-TO-podlabel-app:demo-AND-ns-demo-IN-ns-demo]
I0826 05:49:10.771495       1 chain-management_linux.go:366] Executing iptables command with args [-w 60 -D AZURE-NPM-EGRESS -j AZURE-NPM-EGRESS-2461490616 -m set --match-set azure-npm-4137662606 src -m set --match-set azure-npm-3252124261 src -m comment --comment EGRESS-POLICY-demo/demo-policy-app-FROM-podlabel-app:demo-AND-ns-demo-IN-ns-demo]
I0826 05:49:10.799526       1 restore.go:188] running this restore command: [iptables-nft-restore -w 60 -T filter --noflush]
I0826 05:49:10.801220       1 dataplane.go:330] [DataPlane] [ApplyDataPlane] [APPLY-DP] starting to apply ipsets
I0826 05:49:10.801276       1 ipsetmanager.go:455] [IPSetManager] dirty caches. toAddUpdateCache: to create: [], to update: [], toDeleteCache: map[cidr-demo-policy-in-ns-demo-0-0IN:0xc000ba91f0 cidr-demo-policy-in-ns-demo-1-10OUT:0xc000ba95a0 cidr-demo-policy-in-ns-demo-1-11OUT:0xc000ba95f0 cidr-demo-policy-in-ns-demo-1-12OUT:0xc000ba9640 cidr-demo-policy-in-ns-demo-1-13OUT:0xc000ba9690 cidr-demo-policy-in-ns-demo-1-14OUT:0xc000ba96e0 cidr-demo-policy-in-ns-demo-1-15OUT:0xc000ba9730 cidr-demo-policy-in-ns-demo-1-16OUT:0xc000ba9780 cidr-demo-policy-in-ns-demo-1-17OUT:0xc000ba97d0 cidr-demo-policy-in-ns-demo-1-18OUT:0xc000ba9820 cidr-demo-policy-in-ns-demo-1-19OUT:0xc000ba9870 cidr-demo-policy-in-ns-demo-1-1OUT:0xc000ba9280 cidr-demo-policy-in-ns-demo-1-20OUT:0xc000ba98c0 cidr-demo-policy-in-ns-demo-1-21OUT:0xc000ba9910 cidr-demo-policy-in-ns-demo-1-22OUT:0xc000ba9960 cidr-demo-policy-in-ns-demo-1-23OUT:0xc000ba9a00 cidr-demo-policy-in-ns-demo-1-24OUT:0xc000ba9aa0 cidr-demo-policy-in-ns-demo-1-2OUT:0xc000ba92d0 cidr-demo-policy-in-ns-demo-1-3OUT:0xc000ba9320 cidr-demo-policy-in-ns-demo-1-4OUT:0xc000ba9370 cidr-demo-policy-in-ns-demo-1-5OUT:0xc000ba93c0 cidr-demo-policy-in-ns-demo-1-6OUT:0xc000ba9410 cidr-demo-policy-in-ns-demo-1-7OUT:0xc000ba9460 cidr-demo-policy-in-ns-demo-1-8OUT:0xc000ba94b0 cidr-demo-policy-in-ns-demo-1-9OUT:0xc000ba9550]
I0826 05:49:10.801404       1 restore.go:188] running this restore command: [ipset restore]
I0826 05:49:10.802887       1 restore.go:299] continuing after line 26 for command [ipset restore]
2024/08/26 05:49:10 [1] skipping destroy line for set cidr-demo-policy-in-ns-demo-1-6OUT since the set is in use by a kernel component
I0826 05:49:10.803026       1 restore.go:188] running this restore command: [ipset restore]
2024/08/26 05:49:10 [1] error: on try number 1, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-4259919934]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 26: Set cannot be destroyed: it is in use by a kernel component
]]
I0826 05:49:10.803870       1 restore.go:299] continuing after line 1 for command [ipset restore]
2024/08/26 05:49:10 [1] skipping destroy line for set cidr-demo-policy-in-ns-demo-1-8OUT since the set is in use by a kernel component
2024/08/26 05:49:10 [1] error: on try number 2, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-2353438532]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
]]
I0826 05:49:10.803954       1 restore.go:188] running this restore command: [ipset restore]
I0826 05:49:10.804907       1 restore.go:299] continuing after line 1 for command [ipset restore]
2024/08/26 05:49:10 [1] skipping destroy line for set cidr-demo-policy-in-ns-demo-1-17OUT since the set is in use by a kernel component
2024/08/26 05:49:10 [1] error: on try number 3, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-1803963672]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
]]
I0826 05:49:10.805084       1 restore.go:188] running this restore command: [ipset restore]
I0826 05:49:10.805840       1 restore.go:299] continuing after line 1 for command [ipset restore]
I0826 05:49:10.805934       1 restore.go:188] running this restore command: [ipset restore]
2024/08/26 05:49:10 [1] skipping destroy line for set cidr-demo-policy-in-ns-demo-1-12OUT since the set is in use by a kernel component
2024/08/26 05:49:10 [1] error: on try number 4, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-43822893]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
]]
2024/08/26 05:49:10 [1] error: failed to apply ipsets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
]
E0826 05:49:10.807012       1 networkPolicyController.go:195] error syncing 'demo/demo-policy': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] [APPLY-DP] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
] when network policy is not found, requeuing
2024/08/26 05:49:10 [1] syncNetPol error due to error syncing 'demo/demo-policy': [syncNetPol] error: [cleanUpNetworkPolicy] Error: failed to remove policy due to [DataPlane] [APPLY-DP] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: Set cannot be destroyed: it is in use by a kernel component
] when network policy is not found, requeuing
I0826 05:49:10.812231       1 networkPolicyController.go:225] Network Policy demo/demo-policy is not found, may be it is deleted
I0826 05:49:10.812244       1 dataplane.go:545] [DataPlane] Remove Policy called for demo/demo-policy
I0826 05:49:10.812248       1 dataplane.go:560] [DataPlane] Policy demo/demo-policy is not found. Might been deleted already
I0826 05:49:10.812304       1 networkPolicyController.go:191] Successfully synced 'demo/demo-policy'
I0826 05:49:56.030931       1 dataplane.go:612] [DataPlane] Update Policy called for demo/demo-policy
I0826 05:49:56.030946       1 dataplane.go:615] [DataPlane] Policy demo/demo-policy is not found.
I0826 05:49:56.030951       1 dataplane.go:394] [DataPlane] Add Policy called for demo/demo-policy
I0826 05:49:56.030956       1 types.go:214] [DataPlane] enqueuing policy demo/demo-policy in netPolQueue
I0826 05:49:56.030959       1 dataplane.go:408] [DataPlane] [ADD-NETPOL] new pending netpol count: 1
I0826 05:49:56.030971       1 networkPolicyController.go:191] Successfully synced 'demo/demo-policy'
I0826 05:49:56.175624       1 dataplane.go:421] [DataPlane] adding policies [0xc00028e210]
I0826 05:49:56.175902       1 dataplane.go:330] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] starting to apply ipsets
I0826 05:49:56.175970       1 ipsetmanager.go:455] [IPSetManager] dirty caches. toAddUpdateCache: to create: [], to update: [cidr-demo-policy-in-ns-demo-1-4OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-5OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-17OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-24OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-2OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-9OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-23OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-0-0IN: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-6OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-7OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-13OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-21OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-12OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-18OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-10OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-11OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-14OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-22OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-8OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-20OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-15OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-1OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-3OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-16OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-demo-policy-in-ns-demo-1-19OUT: &{membersToAdd:map[] membersToDelete:map[]}], toDeleteCache: map[]
I0826 05:49:56.176077       1 restore.go:188] running this restore command: [ipset restore]
I0826 05:49:56.177952       1 dataplane.go:335] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] finished applying ipsets
I0826 05:49:56.178168       1 restore.go:188] running this restore command: [iptables-nft-restore -w 60 -T filter --noflush]
I0826 05:49:56.186682       1 dataplane.go:426] [DataPlane] [BACKGROUND] added policies successfully
@huntergregory huntergregory added npm Related to NPM. linux bug labels Aug 29, 2024
huntergregory added a commit that referenced this issue Sep 4, 2024
huntergregory added a commit that referenced this issue Sep 4, 2024
@huntergregory huntergregory changed the title [NPM] [Linux] race condition when deleting a NetPol with CIDR rules then readding it [NPM] [Linux] race condition when deleting a NetPol then readding it Sep 5, 2024
@huntergregory
Copy link
Contributor Author

This issue can actually occur for NetPol without CIDR rule. Most likely to occur for NetPol with CIDR rule or MatchExpressions. Yet could happen for other IPSets if they are not associated with any Pod IP.

@huntergregory huntergregory changed the title [NPM] [Linux] race condition when deleting a NetPol then readding it [NPM] [Linux] another race condition when editing a NetPol or deleting then readding it Sep 9, 2024
@huntergregory
Copy link
Contributor Author

This can also occur when editing the NetPol, not just deleting then readding the NetPol

@huntergregory
Copy link
Contributor Author

Pending fix for NetPols without CIDR rules

@huntergregory huntergregory reopened this Sep 9, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 11, 2024
* fix: [Linux] [NPM] handle #2977 for netpols without cidrs

Signed-off-by: Hunter Gregory <[email protected]>

* fix: lock and no need to track policy key

Signed-off-by: Hunter Gregory <[email protected]>

* style: remove dead code

Signed-off-by: Hunter Gregory <[email protected]>

---------

Signed-off-by: Hunter Gregory <[email protected]>
huntergregory added a commit that referenced this issue Sep 11, 2024
…2990)

* fix: [Linux] [NPM] handle #2977 for netpols without cidrs

Signed-off-by: Hunter Gregory <[email protected]>

* fix: lock and no need to track policy key

Signed-off-by: Hunter Gregory <[email protected]>

* style: remove dead code

Signed-off-by: Hunter Gregory <[email protected]>

---------

Signed-off-by: Hunter Gregory <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Sep 12, 2024
…3006)

[backport] fix: [Linux] [NPM] handle #2977 for netpols without cidrs (#2990)

* fix: [Linux] [NPM] handle #2977 for netpols without cidrs



* fix: lock and no need to track policy key



* style: remove dead code



---------

Signed-off-by: Hunter Gregory <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug linux npm Related to NPM.
Projects
None yet
1 participant