Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance.IP is set randomly when a server has multiple networks #926

Closed
mdbooth opened this issue Jun 30, 2021 · 9 comments · Fixed by #971 or #1004
Closed

Instance.IP is set randomly when a server has multiple networks #926

mdbooth opened this issue Jun 30, 2021 · 9 comments · Fixed by #971 or #1004
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mdbooth
Copy link
Contributor

mdbooth commented Jun 30, 2021

/kind bug

This is from code inspection while working on MAPO: https://github.com/shiftstack/machine-api-provider-openstack
There is no reproducer, I could have misread the code, and/or this may not be relevant to CAPO.

The problem is where we're iterating over a server's Addresses in here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/master/pkg/cloud/services/compute/instance.go#L916-L930

		for _, network := range networkList {
			var netInterface networkInterface
			b, _ := json.Marshal(network)
			err = json.Unmarshal(b, &netInterface)
			if err != nil {
				return nil, fmt.Errorf("extract IP from instance err: %v", err)
			}
			if netInterface.Version == 4.0 {
				if netInterface.Type == "floating" {
					addrMap["floating"] = netInterface.Address
				} else {
					addrMap["internal"] = netInterface.Address
				}
			}
		}

If the server has multiple networks we will return the last one returned by the API which is not a floating IP as the machine's control plane IP.

The problem here is that as this interface is currently used it doesn't have enough context to determine which network of multiple networks is the control plane network. When we call GetInstance() or InstanceExists() we're passing in an OpenStack identifier. However, to determine which is the control plane IP we additionally need to know which is the control plane network. AFAICT the canonical reference for this would be OpenStackMachine.Spec.Subnet.

I'm currently thinking of a refactor. I think serverToInstance() needs this additional context. I also note that GetIPFromInstance is called by an E2E test, and duplicates a bunch of code. I wonder if it might be cleaner to have it use an exported serverToInstance instead. I haven't got a firm plan in mind, yet.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 30, 2021
@jichenjc
Copy link
Contributor

jichenjc commented Jul 1, 2021

looks like the code you pointed out seems you are right, the last one will be selected
addrMap["internal"] = netInterface.Address

but I think floating ip in current implementation is mandatory? if so if we have 2 networks then
it means likely the internal IP will be selected by some chance? then it might be a severe bug..

@mdbooth mdbooth changed the title Instance.AccessIPv4 is set randomly when a server has multiple networks Instance.IP is set randomly when a server has multiple networks Aug 3, 2021
@mdbooth
Copy link
Contributor Author

mdbooth commented Aug 3, 2021

instance.IP is currently used in 2 places:

  • func (r *OpenStackMachineReconciler) reconcileLoadBalancerMember(logger logr.Logger, osProviderClient *gophercloud.ProviderClient, clientOpts *clientconfig.ClientOpts, openStackCluster *infrav1.OpenStackCluster, machine *clusterv1.Machine, openStackMachine *infrav1.OpenStackMachine, instance *infrav1.Instance, clusterName string) error {
    ip := instance.IP
    loadbalancerService, err := loadbalancer.NewService(osProviderClient, clientOpts, logger)
    if err != nil {
    return err
    }
    return loadbalancerService.ReconcileLoadBalancerMember(openStackCluster, machine, openStackMachine, clusterName, ip)
    }
  • address := []corev1.NodeAddress{{Type: corev1.NodeInternalIP, Address: instance.IP}}

The former identifies the IP of a load balancer member. The latter is the only address of type NodeInternalIP in OpenStackMachineStatus.Addresses. I'm not yet sure how this is used in CAPI, but IIRC in OpenShift I think MCO uses this to identify a set of valid IP addresses for kubelet on this machine.

If it lands, #924 adds a third use: to identify the port we should associate with a floating IP.

On Addresses, cluster-api-provider-aws is returning all addresses, not just one. I suspect we should do the same:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/e69740275ca5a14b51eeca3ed74cff2d88dd8cdc/pkg/cloud/services/ec2/instances.go#L804-L831

However, we'd need somewhere to put them. Instance.Networks doesn't look like what what we want. A new Instance.Addresses field, perhaps? Instance isn't internal-only because it's exposed in OpenStackClusterSpec.Bastion.Instance. What would this field mean when used in a Spec? Perhaps we need to pass it round in some other manner?

The 2 remaining uses are:

  • Load balancer member
  • Floating IP internal port

How about the following logic for this IP:

if OpenStackMachine.Subnet is set:
  return first port matching this subnet
  if no matching port, emit warning and fall-through(*)

return first port matching OpenStackClusterStatus.Network.Subnet
if no matching port, emit warning and fall-through(*)

return first port

(*) We could prevent future occurrences of these in a validating webhook?

@mdbooth
Copy link
Contributor Author

mdbooth commented Aug 16, 2021

If we merge #971 I believe we can address this by changing the signature of

func (is *InstanceStatus) NetworkStatus() (*InstanceNetworkStatus, error) {
to take OpenStackMachine and OpenStackClusterStatus as arguments. This would allow us to implement the logic described in #926 (comment).

@mdbooth
Copy link
Contributor Author

mdbooth commented Aug 20, 2021

This was close accidentally, and I can't reopen it 🤔

@tobiasgiese
Copy link
Member

/reopen

@k8s-ci-robot
Copy link
Contributor

@tobiasgiese: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Aug 20, 2021
@tobiasgiese
Copy link
Member

tobiasgiese commented Aug 20, 2021

@mdbooth you should never write the issue number and fix in the same line. Otherwise, Github Prow will close the mentioned issue :)

@mkjpryor
Copy link
Contributor

mkjpryor commented Sep 15, 2021

@mdbooth

I'm not sure this is entirely a Cluster API provider issue. Detecting the IP for the API load-balancer member definitely is, but for the node IP as it appears in Kubernetes, I think the OpenStack Cloud Controller Manager will override whatever Cluster API passes (not to say that we shouldn't make it deterministic/configurable anyway, of course!).

To that end, I notice in your templates that you are still using the deprecated in-tree OpenStack CCM. By using the newer out-of-tree external CCM, you get to use the configuration option internal-network-name that allows you to specify which network to get the internal address for the Kubernetes node from.

I have this working quite smoothly in the case where the control plane nodes share a VXLAN with the workers, but the workers are also attached to a VLAN with SR-IOV enabled. I can tell the OpenStack CCM to use the VXLAN as the Kubernetes node IP.

@mdbooth
Copy link
Contributor Author

mdbooth commented Sep 15, 2021

I need to fix this issue in CAPO due to how we use it in OpenShift. The interaction with in-tree/CCM cloud provider isn't my concern here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment