-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix rollback invocation after CmdAdd failure in CNI server #5548
Fix rollback invocation after CmdAdd failure in CNI server #5548
Conversation
/test-all |
Also tested manually on a K8s cluster, by injecting an error in the CNI server during interface configuration (after IPAM):
Notice how the rollback is now working (we release the IP address right away). |
if _, err := s.CmdDel(ctx, request); err != nil { | ||
klog.Warningf("Failed to rollback after CNI add failure: %v", err) | ||
klog.InfoS("CmdAdd for container failed, trying to rollback", "container", cniConfig.ContainerId) | ||
if _, err := s.cmdDel(ctx, cniConfig); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is an improvement to directly use the existing "cniConfig" in cmdDel to avoid unnecessary call on loadNetworkConfig
.
However, I didn't understand why the original code can append a duplicate PodCIDR range in the IPAM configuration. When we call function loadNetworkConfig
in CmdDel/CmdAdd/CmdCheck, it has generated a new CNIConfig
object, and the cniConfig.NetworkConfig
is unmarshaled from the json stringrequest.CniArgs.NetworkConfiguration
. So even if we call CmdDel in CmdAdd rollback, the cniConfig passed to IPAM plugin from a new call with loadNetworkConfig
is supposed to equal to the values existing in CmdAdd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code was calling validateRequest
for the rollback (because CmdDel
calls validateRequest
), using the same request
object. validateRequest
actually mutates the request
but it is not obvious when looking at the code:
loadNetworkConfig
has the following assignment:antrea/pkg/agent/cniserver/server.go
Line 198 in bbec9d0
cniConfig.CniCmdArgs = request.CniArgs cniConfig.CniCmdArgs
(a pointer to a protobuf struct) will mutate the request.updateLocalIPAMSubnet
has the following assignment:antrea/pkg/agent/cniserver/server.go
Line 275 in bbec9d0
cniConfig.NetworkConfiguration, _ = json.Marshal(cniConfig.NetworkConfig) cniConfig.NetworkConfiguration
is actually the same ascniConfig.CniCmdArgs.NetworkConfiguration
(it is a byte slice). So at this point we have mutated the request. The next time we callvalidateRequestMessage
(which we no longer do with my patch), the network configuration of the request already includes the Node subnets in the IPAM section.
I confirmed all of this with my unit test.
Note that I don't know why we are mutating the request in the first place; I don't know if it's intentional, but I think we have been doing it for 4 years. Since using the existing cniConfig
seemed like the right thing to do, that's what I did in this patch. We could consider a follow-up patch to clean up existing logic, but I didn't want to risk introducing a new bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify the statement above:
cniConfig.NetworkConfiguration
is actually the same ascniConfig.CniCmdArgs.NetworkConfiguration
This is because CniCmdArgs
is an embedded field:
antrea/pkg/agent/cniserver/server.go
Lines 127 to 136 in bbec9d0
type CNIConfig struct { | |
*types.NetworkConfig | |
// AntreaIPAM for an interface not managed by Antrea CNI. | |
secondaryNetworkIPAM bool | |
// CniCmdArgs received from the CNI plugin. IPAM data in CniCmdArgs can be updated with the | |
// Node's Pod CIDRs for NodeIPAM. | |
*cnipb.CniCmdArgs | |
// K8s CNI_ARGS passed to the CNI plugin. | |
*types.K8sArgs | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix.
pkg/agent/cniserver/server.go
Outdated
@@ -453,12 +453,12 @@ func (s *CNIServer) CmdAdd(ctx context.Context, request *cnipb.CniCmdRequest) (* | |||
// Rollback to delete configurations once ADD is failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your code, but change "is failure" to "fails"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* When performing configuration rollback after an error in CmdAdd, we do not invoke CmdDel directly. Instead, we invoke an internal version of it which does not log a "Received CmdDel request" message (the message is confusing otherwise as it implies that we received a new CNI DEL command from the container runtime), and which does not process the network config again (as it was already processed at the beginning of CmdAdd). By not processing the config a second time, we ensure that there are no duplicate CIDRs in the IPAMConfig. * Migrate klog calls in server.go to use structured logging. * Improve unit tests for the CNI server to validate this fix. Fixes antrea-io#5547 Signed-off-by: Antonin Bas <[email protected]>
Signed-off-by: Antonin Bas <[email protected]>
904a6b5
to
6ab005d
Compare
/test-all |
/test-e2e |
…failure in CNI (#5558) * When performing configuration rollback after an error in CmdAdd, we do not invoke CmdDel directly. Instead, we invoke an internal version of it which does not log a "Received CmdDel request" message (the message is confusing otherwise as it implies that we received a new CNI DEL command from the container runtime), and which does not process the network config again (as it was already processed at the beginning of CmdAdd). By not processing the config a second time, we ensure that there are no duplicate CIDRs in the IPAMConfig. * Migrate klog calls in server.go to use structured logging. * Improve unit tests for the CNI server to validate this fix. Fixes #5547 Signed-off-by: Antonin Bas <[email protected]>
…failure in CNI (#5559) * When performing configuration rollback after an error in CmdAdd, we do not invoke CmdDel directly. Instead, we invoke an internal version of it which does not log a "Received CmdDel request" message (the message is confusing otherwise as it implies that we received a new CNI DEL command from the container runtime), and which does not process the network config again (as it was already processed at the beginning of CmdAdd). By not processing the config a second time, we ensure that there are no duplicate CIDRs in the IPAMConfig. * Migrate klog calls in server.go to use structured logging. * Improve unit tests for the CNI server to validate this fix. Fixes #5547 Signed-off-by: Antonin Bas <[email protected]>
…failure in CNI server (#5560) * When performing configuration rollback after an error in CmdAdd, we do not invoke CmdDel directly. Instead, we invoke an internal version of it which does not log a "Received CmdDel request" message (the message is confusing otherwise as it implies that we received a new CNI DEL command from the container runtime), and which does not process the network config again (as it was already processed at the beginning of CmdAdd). By not processing the config a second time, we ensure that there are no duplicate CIDRs in the IPAMConfig. * Migrate klog calls in server.go to use structured logging. * Improve unit tests for the CNI server to validate this fix. Fixes #5547 Signed-off-by: Antonin Bas <[email protected]>
Fixes #5547