Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid CmdAdd rollback in antrea-agent #5547

Closed
antoninbas opened this issue Oct 5, 2023 · 0 comments · Fixed by #5548
Closed

Invalid CmdAdd rollback in antrea-agent #5547

antoninbas opened this issue Oct 5, 2023 · 0 comments · Fixed by #5548
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. reported-by/end-user Issues reported by end users.

Comments

@antoninbas
Copy link
Contributor

Describe the bug
This was observed in antrea-agent logs:

I1002 10:15:34.581920       1 server.go:388] Received CmdAdd request cni_args:{container_id:"1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  netns:"/var/run/netns/cni-875c6469-71d1-5727-9132-29273c888716"  ifname:"eth0"  args:"K8S_POD_UID=3bd3551e-3bc0-4dc9-a251-9b2bd95abcd4;IgnoreUnknown=1;K8S_POD_NAMESPACE=<NAMESPACE>;K8S_POD_NAME=<NAME>;K8S_POD_INFRA_CONTAINER_ID=1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  path:"/opt/cni/bin"  network_configuration:"{\"cniVersion\":\"0.3.0\",\"ipam\":{\"type\":\"host-local\"},\"name\":\"antrea\",\"type\":\"antrea\"}"}
E1002 10:15:34.619728       1 server.go:445] Failed to request IP addresses for container 1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67: failed to allocate for range 0: no IP addresses available in range set: 198.18.137.1-198.18.137.254
W1002 10:15:34.619767       1 server.go:411] CmdAdd for container 1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67 failed, and try to rollback
I1002 10:15:34.619776       1 server.go:491] Received CmdDel request cni_args:{container_id:"1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  netns:"/var/run/netns/cni-875c6469-71d1-5727-9132-29273c888716"  ifname:"eth0"  args:"K8S_POD_UID=3bd3551e-3bc0-4dc9-a251-9b2bd95abcd4;IgnoreUnknown=1;K8S_POD_NAMESPACE=<NAMESPACE>;K8S_POD_NAME=<NAME>;K8S_POD_INFRA_CONTAINER_ID=1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  path:"/opt/cni/bin"  network_configuration:"{\"cniVersion\":\"0.3.0\",\"name\":\"antrea\",\"type\":\"antrea\",\"deviceID\":\"\",\"dns\":{},\"ipam\":{\"type\":\"host-local\",\"ranges\":[[{\"subnet\":\"198.18.137.0/24\",\"gateway\":\"198.18.137.1\"}]]},\"runtimeConfig\":{\"dns\":{}}}"}
E1002 10:15:34.622363       1 server.go:507] Failed to delete IP addresses for container 1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67: range set 0 overlaps with 1
I1002 10:15:34.647399       1 server.go:491] Received CmdDel request cni_args:{container_id:"1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  netns:"/var/run/netns/cni-875c6469-71d1-5727-9132-29273c888716"  ifname:"eth0"  args:"K8S_POD_UID=3bd3551e-3bc0-4dc9-a251-9b2bd95abcd4;IgnoreUnknown=1;K8S_POD_NAMESPACE=<NAMESPACE>;K8S_POD_NAME=<NAME>;K8S_POD_INFRA_CONTAINER_ID=1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67"  path:"/opt/cni/bin"  network_configuration:"{\"cniVersion\":\"0.3.0\",\"ipam\":{\"type\":\"host-local\"},\"name\":\"antrea\",\"type\":\"antrea\"}"}
I1002 10:15:34.659531       1 server.go:510] Deleted IP addresses for container 1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67
I1002 10:15:34.659555       1 server.go:516] CmdDel for container 1b220ec6347e4e28ac48f9dadbd8e0430b68734f04d4ce4cb2de4dade0dfaa67 succeeded

The sequence of events is as follows:

  • A CmdAdd fails (in this case because of an IP address shortage, but could be for a number of other reason)
  • Antrea (cniserver) attempts a rollback, by generating an internal / local call to CniDel. However, the network config included in the request already includes the local PodCIDR. The Antrea code unconditionally appends the local PodCIDR to the Ranges slice in the network config1, causing a duplicate and an error in host-local ipam2.
  • The container runtime sends a CmdDel request, which succeeds.

To Reproduce
Not straightforward: one needs to create a scenario in which the CmdAdd will fail (e.g. by creating enough Pods on a Node to run out of local IPs).

Expected
The rollback should be able to succeed. While this doesn't have a big impact on the system, it can make the logs confusing and harder to understand when troubleshooting an iisue.

Actual behavior
The rollback fails and an error is logged, which can be confusing for users (range set 0 overlaps with 1)

Versions:

  • Antrea version: probably all versions, including the latest one (v1.13)

Footnotes

  1. https://github.com/antrea-io/antrea/blob/bbec9d04dd426bdcf1a7af0024ac93fdc61debda/pkg/agent/cniserver/server.go#L264-L276.

  2. https://github.com/containernetworking/plugins/blob/f95505231a8209b1b41eac2d5c2399c6ab444785/plugins/ipam/host-local/backend/allocator/config.go#L157-L165

@antoninbas antoninbas added the kind/bug Categorizes issue or PR as related to a bug. label Oct 5, 2023
@antoninbas antoninbas self-assigned this Oct 5, 2023
@antoninbas antoninbas added this to the Antrea v1.14 release milestone Oct 5, 2023
antoninbas added a commit to antoninbas/antrea that referenced this issue Oct 5, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes antrea-io#5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Oct 9, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes antrea-io#5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this issue Oct 9, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes #5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Oct 9, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes antrea-io#5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Oct 9, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes antrea-io#5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this issue Oct 9, 2023
* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes antrea-io#5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this issue Oct 10, 2023
…failure in CNI (#5558)

* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes #5547

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this issue Oct 11, 2023
…failure in CNI (#5559)

* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes #5547

Signed-off-by: Antonin Bas <[email protected]>
tnqn pushed a commit that referenced this issue Oct 16, 2023
…failure in CNI server (#5560)

* When performing configuration rollback after an error in CmdAdd, we do
  not invoke CmdDel directly. Instead, we invoke an internal version of
  it which does not log a "Received CmdDel request" message (the message
  is confusing otherwise as it implies that we received a new CNI DEL
  command from the container runtime), and which does not process the
  network config again (as it was already processed at the beginning of
  CmdAdd). By not processing the config a second time, we ensure that
  there are no duplicate CIDRs in the IPAMConfig.
* Migrate klog calls in server.go to use structured logging.
* Improve unit tests for the CNI server to validate this fix.

Fixes #5547

Signed-off-by: Antonin Bas <[email protected]>
@tnqn tnqn added the reported-by/end-user Issues reported by end users. label Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. reported-by/end-user Issues reported by end users.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants