Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix duplicate IP case for NetworkPolicy #3467

Merged
merged 1 commit into from
Mar 18, 2022

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Mar 17, 2022

After a Node restarts, IP allocation storage used by host-local plugin
is reset and all Pods on the Node will be recreated. It could happen
that two Pods have same IP in K8s API if the Pod previously owning the
IP hasn't been recreated and another Pod gets the IP assigned when it's
recreated. This leads to an issue in NetworkPolicy as it calculated IPs
to add and remove by getting added and removed Pods first, then getting
IPs from the Pods.

NetworkPolicy rule reconciler didn't handle this case correctly as it
calculated added and removed Pods first, then got IPs to add and remove
from the added and removed Pods. For example, if both Pod A and Pod B
have IP 1.1.1.1, removing Pod B would cause IP 1.1.1.1 to be removed
from dataplane.

This patch changes to calculate IPs to add and remove based on the
difference between the old IPs and the new IPs directly.

Fixes #3468

Signed-off-by: Quan Tian [email protected]

@codecov-commenter
Copy link

codecov-commenter commented Mar 17, 2022

Codecov Report

Merging #3467 (a3f2edb) into main (1d27432) will decrease coverage by 11.82%.
The diff coverage is 100.00%.

❗ Current head a3f2edb differs from pull request most recent head f96733c. Consider uploading reports for the commit f96733c to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #3467       +/-   ##
===========================================
- Coverage   65.34%   53.51%   -11.83%     
===========================================
  Files         268      239       -29     
  Lines       26901    34253     +7352     
===========================================
+ Hits        17578    18331      +753     
- Misses       7415    14145     +6730     
+ Partials     1908     1777      -131     
Flag Coverage Δ
e2e-tests 53.51% <100.00%> (?)
kind-e2e-tests ?
unit-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/controller/networkpolicy/reconciler.go 65.67% <100.00%> (-11.32%) ⬇️
pkg/controller/egress/controller.go 0.00% <0.00%> (-88.45%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 4.58% <0.00%> (-86.85%) ⬇️
pkg/controller/ipam/validate.go 0.00% <0.00%> (-82.26%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-81.82%) ⬇️
pkg/controller/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-80.29%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam_controller.go 0.00% <0.00%> (-79.52%) ⬇️
pkg/controller/externalippool/validate.go 0.00% <0.00%> (-76.20%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 3.47% <0.00%> (-75.70%) ⬇️
pkg/cni/client.go 0.00% <0.00%> (-75.52%) ⬇️
... and 259 more

After a Node restarts, IP allocation storage used by host-local plugin
is reset and all Pods on the Node will be recreated. It could happen
that two Pods have same IP in K8s API if the Pod previously owning the
IP hasn't been recreated and another Pod gets the IP assigned when it's
recreated. This leads to an issue in NetworkPolicy as it calculated IPs
to add and remove by getting added and removed Pods first, then getting
IPs from the Pods.

NetworkPolicy rule reconciler didn't handle this case correctly as it
calculated added and removed Pods first, then got IPs to add and remove
from the added and removed Pods. For example, if both Pod A and Pod B
have IP 1.1.1.1, removing Pod B would cause IP 1.1.1.1 to be removed
from dataplane.

This patch changes to calculate IPs to add and remove based on the
difference between the old IPs and the new IPs directly.

Signed-off-by: Quan Tian <[email protected]>
@tnqn tnqn force-pushed the handle-duplicate-ip branch from b7d1738 to f96733c Compare March 17, 2022 12:55
@tnqn
Copy link
Member Author

tnqn commented Mar 17, 2022

/test-all

@tnqn tnqn requested review from Dyanngg and GraysonWu March 17, 2022 13:11
@tnqn tnqn added this to the Antrea v1.6 release milestone Mar 17, 2022
@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Mar 17, 2022
Copy link
Contributor

@Dyanngg Dyanngg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed LGTM. Thanks Quan for the quick fix. I do think the there's something we can follow up on for this issue, as one of the causes is that in updating OpenFlow rules, we add addresses first then delete addresses, w/o checking for address duplication. https://github.com/antrea-io/antrea/blob/main/pkg/agent/controller/networkpolicy/reconciler.go#L787 might still be valid and @GraysonWu and I will look into simplifying OF interfaces

Copy link
Contributor

@GraysonWu GraysonWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Sure, let's take a look into that @Dyanngg .

@tnqn
Copy link
Member Author

tnqn commented Mar 18, 2022

@Dyanngg @GraysonWu thanks for your review and follow-up. For this issue, it may still happen even if the TODO is resolved because it could happen we have two Pods with same IP in one AddressGroup first, then one of them is removed, so calculating difference between realized IPs and desired IPs is necessary:

// TODO: This might be unnecessarily complex and hard for error handling, consider revising the Openflow interfaces.

@tnqn
Copy link
Member Author

tnqn commented Mar 18, 2022

/skip-conformance
/skip-networkpolicy
They succeeded but result report failed.

@tnqn
Copy link
Member Author

tnqn commented Mar 18, 2022

/test-integration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NetworkPolicy may be not enforced correctly after restarting a Node
4 participants