Add IPv4/IPv6 dual stack KEP #2254

Closed
wants to merge 1 commit into from


@leblancd leblancd commented Jun 12, 2018

Add a KEP for IPv4/IPv6 dual stack functionality to Kubernetes clusters. Dual-stack functionality includes the following concepts:

  • Awareness of multiple IPv4/IPv6 address assignments per pod
  • Native IPv4-to-IPv4 in parallel with IPv6-to-IPv6 communications to, from, and within a cluster

References:
kubernetes/kubernetes issue # 62822
kubernetes/features issue # 563

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2018
@k8s-ci-robot k8s-ci-robot requested review from dcbw and thockin June 12, 2018 16:49
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/network Categorizes an issue or PR as relevant to SIG Network. labels Jun 12, 2018
@feiskyer
Member

/cc @kubernetes/sig-network-proposals

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Jun 12, 2018
@rpothier

/area ipv6

Member

@caseydavenport caseydavenport left a comment

@leblancd thanks for making this! I've done a first pass and added some comments.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- Kubernetes needs to be configurable for up to two service CIDRs.
- Backend pods for a service can be dual stack. For the first release of dual-stack support, each IPv4/IPv6 address of a backend pod will be treated as a separate Kubernetes endpoint.
- Kube-proxy needs to support IPv4 and IPv6 services in parallel (e.g. drive iptables and ip6tables in parallel).
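To make the "drive iptables and ip6tables in parallel" point concrete, here is a minimal sketch of separating a service's endpoint addresses by IP family so that each group can go to the matching rule backend. The function name `splitByFamily` and the sample addresses are assumptions for illustration only; this is not kube-proxy code.

```go
package main

import (
	"fmt"
	"net"
)

// splitByFamily separates endpoint addresses into IPv4 and IPv6 groups.
// Hypothetical helper: each group would be programmed by the rule backend
// for its own family (iptables for IPv4, ip6tables for IPv6).
func splitByFamily(addrs []string) (v4, v6 []string, err error) {
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			return nil, nil, fmt.Errorf("invalid IP address %q", a)
		}
		if ip.To4() != nil {
			v4 = append(v4, a)
		} else {
			v6 = append(v6, a)
		}
	}
	return v4, v6, nil
}

func main() {
	// Two dual-stack backend pods would contribute one address per family each.
	v4, v6, err := splitByFamily([]string{"10.1.0.5", "fd00::5", "10.1.0.6"})
	if err != nil {
		panic(err)
	}
	fmt.Println("IPv4 endpoints:", v4)
	fmt.Println("IPv6 endpoints:", v6)
}
```

Treating each address as its own endpoint, as the bullet above proposes for the first release, keeps this split trivial because no endpoint ever spans two families.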
Member

We should also consider the impact to the IPVS proxy. I think at this point we need to maintain both (unless we say dual-stack is iptables only?)

Author

Yes, good point. I had added an IPVS section below, but forgot to add it to the proposal summary here.


FYI, I did some exploring into IPVS and have found one problem with using it so far: the team reports that it does not currently support IPv6. It seems we need to document that effort is needed there too.

Author

Thanks for looking into this, it's good to know up front!

### Awareness of Multiple IPs per Pod

Since Kubernetes Version 1.9, Kubernetes users have had the capability to use dual-stack-capable CNI network plugins (e.g. Bridge + Host Local, Calico, etc.), using the
[0.3.1 version of the CNI Networking Plugin API](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md), to configure multiple IPv4/IPv6 addresses on pods. However, Kubernetes currently captures and uses only the first IP address in the list of assigned pod IPs that a CNI plugin returns to Kubelet in the [CNI Results structure](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md#result).
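For reference, the sketch below shows what "multiple IPs in the CNI result" looks like in practice: it decodes the `ips` list of a CNI 0.3.1 result and returns every address, not just the first. The type and function names (`cniResult`, `podIPsFromResult`) and the sample addresses are assumptions made for illustration, and this is not how kubelet gathers pod IPs today.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// cniIP mirrors one entry of the "ips" list in a CNI 0.3.1 result.
type cniIP struct {
	Version string `json:"version"` // "4" or "6"
	Address string `json:"address"` // CIDR notation, e.g. "10.1.0.5/24"
}

// cniResult holds only the part of the result this sketch needs.
type cniResult struct {
	IPs []cniIP `json:"ips"`
}

// podIPsFromResult returns every pod IP in the result, in reported order.
func podIPsFromResult(raw []byte) ([]net.IP, error) {
	var res cniResult
	if err := json.Unmarshal(raw, &res); err != nil {
		return nil, err
	}
	ips := make([]net.IP, 0, len(res.IPs))
	for _, entry := range res.IPs {
		ip, _, err := net.ParseCIDR(entry.Address)
		if err != nil {
			return nil, fmt.Errorf("bad address %q: %v", entry.Address, err)
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	// Sample result with one IPv4 and one IPv6 address (made-up values).
	raw := []byte(`{"cniVersion":"0.3.1","ips":[
		{"version":"4","address":"10.1.0.5/24"},
		{"version":"6","address":"fd00::5/64"}]}`)
	ips, err := podIPsFromResult(raw)
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```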
Member

I believe Kubernetes uses only the IP address it reads from eth0 within the Pod and ignores the response from CNI altogether, right? Or did that get changed and I missed it? :)


That is still currently the case... for now. CNI GET is getting closer!

Author

Thanks for the clarification!
@squeed - If we do add the capability for the CNI plugin to pass up metadata (labels and such) to Kubernetes, does this have to be done via a CNI "get", or is there a way for kubelet to gather this information directly from the CNI results?


CNI doesn't allow for any kind of annotations (right now -- that could change) - it only has IPs and routes. Changing that is... out of scope :-)

I think this section is fine as-is; how kubelet gets the list of IPs from a running container is just an implementation detail.

Author

@squeed - Understood, thanks. This was more out of curiosity about where we're headed with the CNI API.


### Support of Health/Liveness/Readiness Probes

Currently, health, liveness, and readiness probes are defined without any concern for IP addresses or families. For the first release of dual-stack support, no configuration "knobs" will be added for probe definitions. A probe for a dual-stack pod will be deemed successful if either an IPv4 or IPv6 response is received. (QUESTION: Does the current probe implementation include DNS lookups, or are IP addresses hard coded?)
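The "successful if either an IPv4 or IPv6 response is received" rule can be sketched as follows. The names `probeFunc` and `probeAnyFamily` are hypothetical, and the probe itself is passed in as a function so the example stays independent of any real kubelet prober.

```go
package main

import "fmt"

// probeFunc reports whether a single probe attempt against addr succeeded.
// Hypothetical type: a real prober would do an HTTP GET, TCP connect, etc.
type probeFunc func(addr string) bool

// probeAnyFamily returns true as soon as any of the pod's addresses answers,
// regardless of whether that address is IPv4 or IPv6.
func probeAnyFamily(podIPs []string, probe probeFunc) bool {
	for _, ip := range podIPs {
		if probe(ip) {
			return true
		}
	}
	return false
}

func main() {
	// Stand-in probe: pretend only the IPv6 address responds.
	onlyV6 := func(addr string) bool { return addr == "fd00::5" }
	fmt.Println(probeAnyFamily([]string{"10.1.0.5", "fd00::5"}, onlyV6))
}
```

One consequence of this rule is that a pod whose IPv4 path is broken still passes its probe as long as IPv6 answers, which is worth keeping in mind when debugging.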
Member

> Does the current probe implementation include DNS lookups, or are IP addresses hard coded

I believe you can configure a host for the probe which will be resolved via DNS if it is a DNS name rather than an exact IP. See here


### Load Balancer Operation ???

### Network Policy Considerations ???
Member

Speaking strictly for Calico, I don't believe there are any (besides testing).

I don't think there are API impacts because the NP API selects using labels rather than addresses (though now that I think about it we should check if the NP CIDR support does validation on IPv4 / IPv6)

Author

Cool, thanks!


#### First Release of Dual Stack: No Service Address Configuration "Knobs"
For the first release of Kubernetes dual-stack support, no new configuration "knobs" will be added for service definitions. This greatly simplifies the design and implementation, but requires imposing the following behavior:
- Service IP allocation: Kubernetes will always allocate a service IP from each service CIDR for each service that is created. (In the future, we might want to consider adding configuration options to allow a user to select e.g. whether a given service should be assigned only IPv4, only IPv6, or both IPv4 and IPv6 service IPs.)
Contributor

If kube-proxy/plugins/etc are going to have to be updated to deal with multiple CIDR ranges anyway, then it would be useful to let the admin say "a.b.c.d/x should be treated as part of the service IP range, but the controller shouldn't allocate any new IPs out of that range". This would let you live-migrate the cluster from one service CIDR to another (something we (OpenShift) have had a handful of requests for).

(The same applies to the cluster CIDRs, though in that case the allocations are done by the plugin, not kube itself, so you could already implement this via plugin-specific configuration.)


Nit: the "In the future" clause in parens could be moved down, as a note, since it is not part of the behavior imposed by not using a knob.

Member

I have heard a VERY small number of similar requests, but I think it's orthogonal to multi-family. We could add multi-CIDR with metadata (e.g. allow existing, but no more allocations) without touching multi-family support.

That said, if we do add multiple CIDRs, we should at least make it possible to add metadata later (so []struct, not []string).
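A rough sketch of the "[]struct, not []string" idea, under the assumption that the metadata is a single "allow new allocations" switch as in the migration request above. The type `ServiceCIDR`, its field names, and the helper `allocatable` are all hypothetical and not part of any proposed API.

```go
package main

import "fmt"

// ServiceCIDR is a hypothetical list entry. Using a struct instead of a bare
// string leaves room to attach metadata later without changing the list type.
type ServiceCIDR struct {
	CIDR string
	// AllowNewAllocations is an illustrative metadata field: when false, the
	// range is still treated as service IP space, but nothing new is
	// allocated from it.
	AllowNewAllocations bool
}

// allocatable returns the CIDRs that new service IPs may be drawn from.
func allocatable(cidrs []ServiceCIDR) []string {
	var out []string
	for _, c := range cidrs {
		if c.AllowNewAllocations {
			out = append(out, c.CIDR)
		}
	}
	return out
}

func main() {
	cidrs := []ServiceCIDR{
		{CIDR: "10.0.0.0/24", AllowNewAllocations: false}, // range being drained
		{CIDR: "10.1.0.0/24", AllowNewAllocations: true},  // range taking over
	}
	fmt.Println(allocatable(cidrs))
}
```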

Author

I agree that we should add metadata within a PodIP structure, as @thockin described earlier, at least as a placeholder for the future. Support for live migration of pod CIDRs could then be done as a followup enhancement and make use of the metadata.

@squeed

squeed commented Jun 15, 2018

A few thoughts about ExtraPodIPs:

  • Should it always also include the default PodIP? So up-to-date clients don't have to manually concatenate
  • I would like some metadata attached to IPs. Some labels, perhaps?

Some time ago, before multi-network was tabled, I suggested that every IP should be labeled with the name of the network that created it. Then services could have an optional label selector.

Even if we don't use this right now, it would be good to have for room to grow.

The singular --service-cluster-ip-range argument will become deprecated.

#### controller-manager Startup Configuration for Multiple Service CIDRs
A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):

complete utter bikeshed: what about making --service-cluster-ip-range accept a comma-separated list?

Member

if we can do that transparently, that might be great for all of these flags (if a bit less obviously named)

Author

That sounds good to me! I wasn't sure if there'd be a problem with not having a trailing 's' for a plural argument, or any backwards compatibility headaches with command line arguments (e.g. old/new manifests working with new/old images).
I'll make this change for this and for the other command line arguments in this doc.
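As a sketch of what accepting a comma-separated list could involve, the function below splits a flag value into CIDRs and rejects a second CIDR of the same IP family. The name `parseServiceCIDRs` and the "at most one CIDR per family" policy are assumptions for illustration; the thread has not settled on that validation rule.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseServiceCIDRs splits a comma-separated flag value into CIDRs.
// Assumed policy (not decided in this thread): at most one CIDR per IP family.
func parseServiceCIDRs(value string) ([]*net.IPNet, error) {
	var cidrs []*net.IPNet
	seenV4, seenV6 := false, false
	for _, part := range strings.Split(value, ",") {
		_, ipNet, err := net.ParseCIDR(strings.TrimSpace(part))
		if err != nil {
			return nil, err
		}
		isV4 := ipNet.IP.To4() != nil
		if (isV4 && seenV4) || (!isV4 && seenV6) {
			return nil, fmt.Errorf("more than one CIDR for the same IP family: %s", ipNet)
		}
		if isV4 {
			seenV4 = true
		} else {
			seenV6 = true
		}
		cidrs = append(cidrs, ipNet)
	}
	return cidrs, nil
}

func main() {
	// A single value keeps working, so old manifests stay valid.
	for _, v := range []string{"10.0.0.0/24", "10.0.0.0/24,fd00:1234::/110"} {
		cidrs, err := parseServiceCIDRs(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(cidrs)
	}
}
```

Because a single CIDR is just a one-element list, old manifests that pass one value would parse unchanged under this approach.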

A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):
```
--service-cluster-ip-range ipNet [Singular IP CIDR, Default: 10.0.0.0/24]
--service-cluster-ip-ranges stringSlice [Multiple IP CIDRs, comma separated list of CIDRs, Default: []]

what happens if the user specifies two v4 ranges?

Member

good call - err on the side of over-specifying :)

@sb1975

sb1975 commented Oct 12, 2018

@leblancd - Thanks for the details. Do you have a link that explains the steps involved in creating multiple ingresses, for IPv4 and IPv6 separately, and how to configure NAT46 in the IPv4 ingress?

@leblancd
Author

leblancd commented Oct 22, 2018

@sb1975 - With the help of @aojea, we've put together an overview on how to install a dual-stack NGINX ingress controller on an (internally) IPv6-only cluster: "Installing a Dual-Stack Ingress Controller on an IPv6-Only Kubernetes Cluster". This requires that the nodes be configured with dual-stack public/global IPv4/IPv6 addresses, and it runs the ingress controller pods on the host network of each node.

I haven't configured Stateless NAT46 on a Kubernetes IPv6-only cluster, but you can find some good background references on the web. For example, Citrix has a helpful reference for configuring their NAT46 appliance here, and there's a video on configuring Stateless NAT46 on a Cisco ASA here.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- For simplicity, only a single family of service IPs per cluster will be supported (i.e. service IPs are either all IPv4 or all IPv6).
- Backend pods for a service can be dual stack.
- Endpoints for a dual-stack backend pod will be represented as a dual-stack address pair (i.e. 1 IPv4/IPv6 endpoint per backend pod, rather than 2 single-family endpoints per backend pod)
Member

On this and the previous: If we say that Services are always single-family and you said that "Cross-family connectivity" is a non-goal, what value do we get for endpoints to be dual-stack?

I guess I could see an argument for headless Services or external Services. Is that what motivates this? Is it worth the effort? Could it be deferred?

Or is this about NodePorts being available on both families?

Author

This is a very good question, and your point is well-taken that we probably don't get value out of having endpoints being dual-stack. Maybe you can confirm my thought process here. I had added this dual-stack endpoints with the thinking that maybe, somehow, ingress controllers or load balancers might need to know about V4 and V6 addresses for endpoints, in order to provide dual-stack access from outside. Thinking about this more, I don't think this is the case. For ingress controllers and load balancers to provide dual-stack access, support of dual-stack NodePorts and dual-stack externalIPs (and ingress controllers using hostnetwork pods) should be sufficient.

Let me know what you think, so I can modify the spec.

For headless services, I believe that we can get by with a single IP family. The IP assigned for a headless service will match the "primary" IP family. This would put headless services on par with non-headless Kube services.

Re. the "Cross-family connectivity", I should remove this from the non-goals. It's confusing and misleading. Family cross over will be supported e.g. with dual-stack ingress controller mapping to a single family endpoint inside the cluster. Cross-family connectivity won't be supported inside the cluster, but that's pretty obvious.

Member

I think the non-goal is correct -- Kubernetes itself is NOT doing address family translation. The fact that Ingress controllers can do that is merely a side-effect of the fact that they are almost universally full proxies which accept the frontside connection and open another for the backside. I don't think it is obvious that we won't convert a v4 to a v6 connection via service VIP, and we should be clear that it is NOT part of the scope.

When I wrote this question I was thinking about Service VIP -> Backends. That has to be same-family (because of the non-goal above), so alt-family endpoints is nonsensical. I think this is also true for external IPs and load-balancer IPs -- they are received and processed the same as cluster IPs, with no family conversion.

BUT...

  1. external IPs is a list, so could plausibly include both families.
  2. LB status.ingress is a list, so could plausibly include both families.
  3. nodePorts could reasonably be expected to work on any interface address.
  4. headless services could reasonably be expected to work on any interface address.

You suggest you might be willing to step back from (4), but unless we also step back from (1), (2), AND (3), that wouldn't save any work.

I think the only simplification here is if we can say "all service-related logic is single family". And I am not sure that is very useful -- tell me if you disagree.

Assuming we have ANY service-related functionality with both families, we need dualstack endpoints :(

Or am I missing something?

Member

...more to the point, is multi-family endpoints, nodeports, LBs, etc. something we can defer to a "phase 2" and iterate? Would it simplify this proposal or just make it not useful?

Author

Your analysis is spot on. I agree, we would need dual-stack endpoints to support (1), (2), (3), and (4) (although I'm not very familiar with how LB status.ingress works, it sounds like it's also driven by endpoint events/state), and if we support 1 of the 4 we might as well support all 4.

And regarding the idea of NOT supporting 1-4 as a simplification, I believe that would make this proposal not very useful. What we'd have left is informational visibility to dual stack pod addresses, as far as I can tell.

I'd say that the minimum useful subset of support would have to include dual-stack endpoints, nodeports, LBs, externalIPs, and headless services, IMO.

Member

OK, so any scope-reduction around not doing dual-stack endpoints is rendered moot, and all such comments should be ignored :)

- NodePort: Support listening on both IPv4 and IPv6 addresses
- ExternalIPs: Can be IPv4 or IPv6
- Kube-proxy IPVS mode will support dual-stack functionality similar to kube-proxy iptables mode as described above. IPVS kube-router support for dual stack, on the other hand, is considered outside of the scope of this proposal.
- For health/liveness/readiness probe support, a kubelet configuration will be added to allow a cluster administrator to select a preferred IP family to use for implementing probes on dual-stack pods.
Member

If we have a "primary" family (e.g. the one used for Services) do we need this flag?

Do we need a per-pod, per-probe flag to request address family?

Author

@leblancd leblancd Oct 27, 2018

I think we can do without a global kubelet configuration for preferred IP family for probes. I'll change this to say that health/liveness/readiness probes will use the IP family of the default IP for a pod (which should match the primary IP family in most cases).

I don't think we need a per-pod, per-probe flag for IP family for the initial release. In a future release, we can consider adding a per-pod, per-probe flag to allow e.g. a user to specify that probes can be dual stack, meaning probes are sent for both IP families, and success is declared if either probe is successful, or alternatively if both probes are successful.

// Properties: Arbitrary metadata associated with the allocated IP.
type PodIPInfo struct {
	IP string
	Properties map[string]string
Member

Unless we have examples of what we want to put in properties and we are willing to spec, validate, and test things like key formats, content size, etc, we should probably leave this out for now. Just a comment indicating this is left as a followup patch-set, perhaps?

Author

@thockin - Sure, I can take this out. If the Properties map is removed, should the PodIPInfo structure be removed, and just leave PodIPs as a simple slice of strings, to simplify?


The properties map would be very useful for multi-network. And no sense in changing the data structure twice? I'd prefer to keep it if possible.

Member

No matter what, I would keep the struct.

@squeed if we keep it, we need to have concrete use-cases for it such that we can flesh out the management of that data as I listed above. We can always ADD fields, with new validation, etc. I'd rather add it when we have real need. I am confident it's something we will want, eventually.


I see this has already generated some discussion :)
I'll add my own comments here - @leblancd you can ignore my other comment up above.

One pattern I've seen used successfully in internal interfaces is to have a mostly-strongly typed struct with a bag-of-strings at the end for "experimental" free-for-all properties. But that requires agreement that the bag-of-strings should not be relied upon, and can change at any time. I do not think we can enforce such a thing if we put this into an external facing API, so my vote is to only add fully-typed fields with validation and strong semantic meaning. Then we can argue about names all at once before they are used instead of after the fact :)

	Properties map[string]string
}

// IP addresses allocated to the pod with associated metadata. This list
Member

We will also want to document the sync logic. I finally sent a PR against docs.

#2838

Author

By "document the sync logic", you mean just adding references to that doc in this spec, right?

Member

I'd spell it out in the comments. I don't expect end-users to read our API devel docs :)

```

##### Default Pod IP Selection
Older servers and clients that were built before the introduction of full dual stack will only be aware of and make use of the original, singular PodIP field above. It is therefore considered to be the default IP address for the pod. When the PodIP and PodIPs fields are populated, the PodIPs[0] field must match the (default) PodIP entry. If a pod has both IPv4 and IPv6 addresses allocated, then the IP address chosen as the default IP address will match the IP family of the cluster's configured service CIDR. For example, if the service CIDR is IPv4, then the IPv4 address will be used as the default address.
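The selection rule described above (the default IP follows the IP family of the service CIDR and sits at index 0 of the plural list) could look roughly like this. The function name `orderPodIPs` is hypothetical, and the sketch assumes the pod's addresses and the service CIDR are already known as strings.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv6 reports whether ip is an IPv6 (non-IPv4) address.
func isIPv6(ip net.IP) bool { return ip != nil && ip.To4() == nil }

// orderPodIPs returns podIPs with the first address whose family matches the
// service CIDR moved to index 0, so that entry can serve as the default IP.
func orderPodIPs(podIPs []string, serviceCIDR string) ([]string, error) {
	_, svcNet, err := net.ParseCIDR(serviceCIDR)
	if err != nil {
		return nil, err
	}
	wantV6 := isIPv6(svcNet.IP)
	ordered := make([]string, 0, len(podIPs))
	found := false
	for _, s := range podIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return nil, fmt.Errorf("invalid pod IP %q", s)
		}
		if !found && isIPv6(ip) == wantV6 {
			// Prepend the first match; everything else keeps its order.
			ordered = append([]string{s}, ordered...)
			found = true
		} else {
			ordered = append(ordered, s)
		}
	}
	return ordered, nil
}

func main() {
	ips, err := orderPodIPs([]string{"fd00::5", "10.1.0.5"}, "10.0.0.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println("default:", ips[0], "all:", ips)
}
```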
Member

"When the PodIP and PodIPs fields are populated" implies no sync logic. I think we all settled on sync being a better path?

Author

By "sync logic" you mean how the singular value from old clients (and plural value from new clients) gets fixed up (as described in the "On Compatibility" section)?

I'll delete that line. What I meant to say is covered in your API change guide update.


- Because service IPs will remain single-family, pods will continue to access the CoreDNS server via a single service IP. In other words, the nameserver entries in a pod's /etc/resolv.conf will typically be a single IPv4 or single IPv6 address, depending upon the IP family of the cluster's service CIDR.
- Non-headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record) or an IPv6 entry (AAAA record), depending upon the IP family of the cluster's service CIDR.
- Headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record), an IPv6 entry (AAAA record), or both, depending on the service's endpointFamily configuration (see [Configuration of Endpoint IP Family in Service Definitions](#configuration-of-endpoint-ip-family-in-service-definitions)).
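The A versus AAAA distinction in the bullets above comes down to the family of each address being published. A minimal sketch, with a hypothetical helper name `recordType` (this is not CoreDNS code):

```go
package main

import (
	"fmt"
	"net"
)

// recordType returns the DNS record type that would carry addr:
// "A" for an IPv4 address, "AAAA" for an IPv6 address.
func recordType(addr string) (string, error) {
	ip := net.ParseIP(addr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP address %q", addr)
	}
	if ip.To4() != nil {
		return "A", nil
	}
	return "AAAA", nil
}

func main() {
	// A dual-stack headless service could publish one record per address.
	for _, addr := range []string{"10.1.0.5", "fd00::5"} {
		rt, err := recordType(addr)
		if err != nil {
			panic(err)
		}
		fmt.Println(addr, rt)
	}
}
```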
Member

Depends on previous question about this config being per-service vs per-pod

Author

I now think single-family headless services would work (on par with non-headless kubernetes services being single-family).

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).

#### GCE Ingress Controller: Out-of-Scope, Testing Deferred For Now
It is not clear whether the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) supports external, dual-stack access. Testing of dual-stack access to Kubernetes services via a GCE ingress controller is considered out-of-scope until after the initial implementation of dual-stack support for Kubernetes.
Member

I'd say this is Google's problem to implement.

@bowei

Author

I'll just say this is out-of-scope for this effort.


The unclear parts should at least be clarified :)
I can take this one.


#### Multiple bind addresses configuration

The existing "--bind-address" option for the will be modified to support multiple IP addresses in a comma-separated list (rather than a single IP string).
Member

Why is this a sub-heading of cloud-providers?

There are other components that support a flag like this - do we have a list?


Kube-proxy and the kubelet startup config have a similar requirement, so it's a good idea to list them.
Possibly also the controller manager, if we went with full dual stack.


Also, that line is missing the link to the cloud controller manager, so it should read
The existing "--bind-address" option for the cloud-controller-manager will be modified ...

- name: MY_POD_IPS
  valueFrom:
    fieldRef:
      fieldPath: status.podIPs
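From the container's side, consuming such a variable might look like the sketch below. It assumes the value arrives as a comma-separated list of addresses, which this thread does not confirm; `parsePodIPs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// parsePodIPs parses the value of MY_POD_IPS.
// Assumption: the value is a comma-separated list of IP addresses.
func parsePodIPs(value string) ([]net.IP, error) {
	if strings.TrimSpace(value) == "" {
		return nil, nil
	}
	var ips []net.IP
	for _, field := range strings.Split(value, ",") {
		ip := net.ParseIP(strings.TrimSpace(field))
		if ip == nil {
			return nil, fmt.Errorf("invalid IP %q in MY_POD_IPS", field)
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	ips, err := parsePodIPs(os.Getenv("MY_POD_IPS"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pod IPs:", ips)
}
```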
Member

@kubernetes/sig-cli-api-reviews We should get a consult as to whether this is right or whether the fieldpath should be something like status.podIPs[].ip - it was supposed to be a literal syntax.

@k8s-ci-robot k8s-ci-robot added sig/cli Categorizes an issue or PR as relevant to SIG CLI. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 23, 2018
Author

@leblancd leblancd left a comment

@thockin - Thank you for your thorough review! I think that eliminating the support for dual-stack endpoints makes sense, let me know if I should go ahead and remove this.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- For simplicity, only a single family of service IPs per cluster will be supported (i.e. service IPs are either all IPv4 or all IPv6).
- Backend pods for a service can be dual stack.
- Endpoints for a dual-stack backend pod will be represented as a dual-stack address pair (i.e. 1 IPv4/IPv6 endpoint per backend pod, rather than 2 single-family endpoints per backend pod)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good question, and your point is well-taken that we probably don't get value out of having endpoints being dual-stack. Maybe you can confirm my thought process here. I had added this dual-stack endpoints with the thinking that maybe, somehow, ingress controllers or load balancers might need to know about V4 and V6 addresses for endpoints, in order to provide dual-stack access from outside. Thinking about this more, I don't think this is the case. For ingress controllers and load balancers to provide dual-stack access, support of dual-stack NodePorts and dual-stack externalIPs (and ingress controllers using hostnetwork pods) should be sufficient.

Let me know what you think, so I can modify the spec.

For headless services, I believe that we can get by with a single IP family. The IP assigned for a headless service will match the "primary" IP family. This would put headless services on par with non-headless Kube services.

Re. the "Cross-family connectivity", I should remove this from the non-goals. It's confusing and misleading. Family cross over will be supported e.g. with dual-stack ingress controller mapping to a single family endpoint inside the cluster. Cross-family connectivity won't be supported inside the cluster, but that's pretty obvious.

- NodePort: Support listening on both IPv4 and IPv6 addresses
- ExternalIPs: Can be IPv4 or IPv6
- Kube-proxy IPVS mode will support dual-stack functionality similar to kube-proxy iptables mode as described above. IPVS kube-router support for dual stack, on the other hand, is considered outside of the scope of this proposal.
- For health/liveness/readiness probe support, a kubelet configuration will be added to allow a cluster administrator to select a preferred IP family to use for implementing probes on dual-stack pods.
Author

@leblancd leblancd Oct 27, 2018

I think we can do without a global kubelet configuration for preferred IP family for probes. I'll change this to say that health/liveness/readiness probes will use the IP family of the default IP for a pod (which should match the primary IP family in most cases).

I don't think we need a per-pod, per-probe flag for IP family for the initial release. In a future release, we can consider adding a per-pod, per-probe flag to allow, e.g., a user to specify that probes can be dual stack, meaning probes are sent for both IP families, and success is declared either when either probe succeeds or, alternatively, only when both probes succeed.
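To make the probe semantics discussed above concrete, here is a minimal Go sketch. It assumes probe results arrive as one boolean per pod IP, with the default pod IP first. The `ProbePolicy` type and its constants are hypothetical names used only for illustration; they are not existing Kubernetes or kubelet APIs.

```go
package main

import "fmt"

// ProbePolicy is a hypothetical per-probe setting (not an existing Kubernetes type).
type ProbePolicy int

const (
	// ProbeDefaultIP probes only the pod's default IP (the initial-release behavior discussed above).
	ProbeDefaultIP ProbePolicy = iota
	// ProbeEitherFamily declares success if the probe succeeds on at least one IP family.
	ProbeEitherFamily
	// ProbeBothFamilies declares success only if the probe succeeds on every IP family.
	ProbeBothFamilies
)

// probeSucceeded combines per-IP probe results under the given policy.
// results[0] is assumed to be the result for the pod's default IP.
func probeSucceeded(policy ProbePolicy, results []bool) bool {
	if len(results) == 0 {
		return false
	}
	switch policy {
	case ProbeEitherFamily:
		for _, ok := range results {
			if ok {
				return true
			}
		}
		return false
	case ProbeBothFamilies:
		for _, ok := range results {
			if !ok {
				return false
			}
		}
		return true
	default:
		return results[0]
	}
}

func main() {
	// Default-family probe failed, other family succeeded.
	results := []bool{false, true}
	fmt.Println("default IP only:", probeSucceeded(ProbeDefaultIP, results))
	fmt.Println("either family:  ", probeSucceeded(ProbeEitherFamily, results))
	fmt.Println("both families:  ", probeSucceeded(ProbeBothFamilies, results))
}
```

Under this sketch the initial release would correspond to `ProbeDefaultIP` everywhere, and the other two policies would only matter if a per-probe flag is added later.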

// Properties: Arbitrary metadata associated with the allocated IP.
type PodIPInfo struct {
IP string
Properties map[string]string
Author

@thockin - Sure, I can take this out. If the Properties map is removed, should the PodIPInfo structure be removed, and just leave PodIPs as a simple slice of strings, to simplify?

```

##### Default Pod IP Selection
Older servers and clients that were built before the introduction of full dual stack will only be aware of and make use of the original, singular PodIP field above. It is therefore considered to be the default IP address for the pod. When the PodIP and PodIPs fields are populated, the PodIPs[0] field must match the (default) PodIP entry. If a pod has both IPv4 and IPv6 addresses allocated, then the IP address chosen as the default IP address will match the IP family of the cluster's configured service CIDR. For example, if the service CIDR is IPv4, then the IPv4 address will be used as the default address.
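The selection rule in the paragraph above can be sketched roughly as follows. This is an illustration only: `defaultPodIP` and `isIPv6` are hypothetical helper names, and the sketch assumes pod IPs are available as plain strings and the service CIDR as a single CIDR string.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv6 reports whether ip is an IPv6 address (it has no 4-byte form).
func isIPv6(ip net.IP) bool {
	return ip != nil && ip.To4() == nil
}

// defaultPodIP returns the first pod IP whose family matches the family of
// the service CIDR. Hypothetical helper, not part of the proposal's API.
func defaultPodIP(podIPs []string, serviceCIDR string) (string, error) {
	_, svcNet, err := net.ParseCIDR(serviceCIDR)
	if err != nil {
		return "", fmt.Errorf("invalid service CIDR %q: %v", serviceCIDR, err)
	}
	wantV6 := isIPv6(svcNet.IP)
	for _, s := range podIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return "", fmt.Errorf("invalid pod IP %q", s)
		}
		if isIPv6(ip) == wantV6 {
			return s, nil
		}
	}
	return "", fmt.Errorf("no pod IP matches the family of %s", serviceCIDR)
}

func main() {
	podIPs := []string{"fd00:10::5", "10.244.1.5"}
	ip, err := defaultPodIP(podIPs, "10.96.0.0/12")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With an IPv4 service CIDR, the IPv4 address becomes the default,
	// so it would populate both PodIP and PodIPs[0].
	fmt.Println("default pod IP:", ip)
}
```

Whatever address this yields would then need to be placed first in the plural field, so that PodIPs[0] and PodIP agree as the text requires.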
Author

By "sync logic", do you mean how the singular value from old clients (and the plural value from new clients) gets fixed up (as described in the "On Compatibility" section)?

I'll delete that line. What I meant to say is covered in your API change guide update.

Properties map[string]string
}

// IP addresses allocated to the pod with associated metadata. This list
Author

By "document the sync logic", you mean just adding references to that doc in this spec, right?


Currently, health, liveness, and readiness probes are defined without any concern for IP addresses or families. For the first release of dual-stack support, a cluster administrator will be able to select the preferred IP family to use for probes when a pod has both IPv4 and IPv6 addresses. For this selection, a new "--preferred-probe-ip-family" argument for the [kubelet startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) will be added:
```
--preferred-probe-ip-family string ["ipv4", "ipv6", or "none". Default: "none", meaning use the pod's default IP]
Author

@leblancd leblancd Oct 28, 2018

See my response to your earlier comment. I think we can do without this configuration, and for probes Kubelet should use the family of the default IP for each pod. I don't think we need a per-pod or per-probe configuration in the initial release of dual-stack, maybe do this as a followup (including support of probes that work on both IP families, requiring either both V4 and V6 responses, or either V4 or V6 responses).


- Because service IPs will remain single-family, pods will continue to access the CoreDNS server via a single service IP. In other words, the nameserver entries in a pod's /etc/resolv.conf will typically be a single IPv4 or single IPv6 address, depending upon the IP family of the cluster's service CIDR.
- Non-headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record) or an IPv6 entry (AAAA record), depending upon the IP family of the cluster's service CIDR.
- Headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record), an IPv6 entry (AAAA record), or both, depending on the service's endpointFamily configuration (see [Configuration of Endpoint IP Family in Service Definitions](#configuration-of-endpoint-ip-family-in-service-definitions)).
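As a rough illustration of the record-type rule in the list above, the sketch below maps an address to the DNS record type a resolver would return for it. `recordType` is a hypothetical helper written for this note, not CoreDNS code.

```go
package main

import (
	"fmt"
	"net"
)

// recordType returns "A" for an IPv4 address and "AAAA" for an IPv6 address.
// Hypothetical helper for illustration; not taken from CoreDNS.
func recordType(addr string) (string, error) {
	ip := net.ParseIP(addr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP address %q", addr)
	}
	if ip.To4() != nil {
		return "A", nil
	}
	return "AAAA", nil
}

func main() {
	// Example service IPs; the values are made up for the sketch.
	for _, addr := range []string{"10.96.0.10", "fd00:1234::a"} {
		rt, err := recordType(addr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s record\n", addr, rt)
	}
}
```

Because service IPs stay single-family, a non-headless service would only ever hit one branch of this function in a given cluster.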
Author

I now think single-family headless services would work (on par with non-headless kubernetes services being single-family).

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).

#### GCE Ingress Controller: Out-of-Scope, Testing Deferred For Now
It is not clear whether the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) supports external, dual-stack access. Testing of dual-stack access to Kubernetes services via a GCE ingress controller is considered out-of-scope until after the initial implementation of dual-stack support for Kubernetes.
Author

I'll just say this is out-of-scope for this effort.


#### Multiple bind addresses configuration

The existing "--bind-address" option for the will be modified to support multiple IP addresses in a comma-separated list (rather than a single IP string).
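A minimal sketch of how such a comma-separated value could be parsed, assuming each entry must be a bare IP address (not a CIDR). `parseBindAddresses` is a hypothetical helper name; this is not the actual flag-handling code of any Kubernetes component.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseBindAddresses splits a comma-separated list and validates that every
// entry is an IP address. Hypothetical helper for illustration only.
func parseBindAddresses(value string) ([]net.IP, error) {
	var ips []net.IP
	for _, part := range strings.Split(value, ",") {
		part = strings.TrimSpace(part)
		ip := net.ParseIP(part)
		if ip == nil {
			return nil, fmt.Errorf("invalid bind address %q", part)
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	// Listen on all IPv4 and all IPv6 addresses.
	ips, err := parseBindAddresses("0.0.0.0,::")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ip := range ips {
		fmt.Println("bind address:", ip)
	}
}
```

Note that `net.ParseIP` rejects CIDR notation, so a value such as `10.0.0.0/24` would be reported as invalid rather than silently accepted.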
Author

This feature requires the use of the [CNI Networking Plugin API version 0.3.1](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md)
or later. The dual-stack feature requires no changes to this API.

The versions of CNI plugin binaries that must be used for proper dual-stack functionality (and IPv6 functionality in general) depend upon the version of Docker that is used in the cluster nodes (see [CNI issue #531](https://github.com/containernetworking/cni/issues/531) and [CNI plugins PR #113](https://github.com/containernetworking/plugins/pull/113)):
@squeed squeed Oct 29, 2018

Issue 531 was a weird docker interaction that we've fixed; CNI no longer has a Docker version dependency.

Author

@squeed (and @nyren) thanks for taking care of this! I think it's fair to say that CNI 0.7.0 (or newer) no longer has the Docker dependency, i.e. if you're using CNI 0.6.0, you'll still have the dependency? Kubernetes has a bunch of distro pointers that point to CNI 0.6.0 for plugin binaries, so those pointers should be bumped up in the near future.

sb1975 commented Oct 29, 2018

@sb1975 - With the help of @aojea, we've put together an overview on how to install a dual-stack NGINX ingress controller on an (internally) IPv6-only cluster: "Installing a Dual-Stack Ingress Controller on an IPv6-Only Kubernetes Cluster". This requires that the nodes be configured with dual-stack public/global IPv4/IPv6 addresses, and it runs the ingress controller pods on the host network of each node.

I haven't configured Stateless NAT46 on a Kubernetes IPv6-only cluster, but you can find some good background references on the web. For example, Citrix has a helpful reference for configuring their NAT46 appliance here, and there's a video on configuring Stateless NAT46 on a Cisco ASA here.

@leblancd: Thanks for the response, this is very helpful, but we have an additional use case: we need an IPv4 client to reach a Kubernetes IPv6 service over non-HTTP traffic (like SNMP). I understand that ingress only supports HTTP rules, so how do we enable this, please?

The kubeadm configuration options for advertiseAddress and podSubnet will need to be changed to handle a comma-separated list of CIDRs:
```
api:
advertiseAddress: "fd00:90::2,10.90.0.2" [Multiple IP CIDRs, comma separated list of CIDRs]

Nit: the advertiseAddresses are addresses, not CIDRs

Author

Yes indeedy.

@aojea
Member

aojea commented Oct 30, 2018

@leblancd: Thanks for the response, this is very helpful, but we have an additional use case: we need an IPv4 client to reach a Kubernetes IPv6 service over non-HTTP traffic (like SNMP). I understand that ingress only supports HTTP rules, so how do we enable this, please?

@sb1975 the NGINX ingress controller supports non-HTTP traffic over TCP and UDP; however, it seems that feature is going to be removed: kubernetes/ingress-nginx#3197

@thockin
Member

thockin commented Nov 2, 2018

@leblancd Can you make your edits and resolve any comment threads that are old and stale? Ping me and I'll do another top-to-bottom pass. Hopefully we can merge soon and iterate finer points.

@leblancd
Author

leblancd commented Nov 2, 2018

@thockin - Will get the next editing pass in by Monday. I'm working on some V6 CI changes today. Thanks!

@neolit123
Member

/assign @timothysc
@kubernetes/sig-cluster-lifecycle
for the kubeadm and kubespray related topics.

## Motivation

The adoption of IPv6 has increased in recent years, and customers are requesting IPv6 support in Kubernetes clusters. To this end, the support of IPv6-only clusters was added as an alpha feature in Kubernetes Version 1.9. Clusters can now be run in either IPv4-only, IPv6-only, or in a "single-pod-IP-aware" dual-stack configuration. This "single-pod-IP-aware" dual-stack support is limited by the following restrictions:
- Some CNI network plugins are capable of assigning dual-stack addresses on a pod, but Kubernetes is aware of only one address per pod.

Thanks for the detailed design! We are running some prototype dual stack configurations inside of GCE and starting to find ways to work around the fact that Kubernetes itself is unaware of the IPv6 addresses.
I'd like very much to stay close to the design in this doc, and reinforce our prototyping/testing efforts when the time comes. Please keep me in the loop :)

This proposal aims to extend the Kubernetes Pod Status API so that Kubernetes can track and make use of up to one IPv4 address and up to one IPv6 address assignment per pod.

#### Versioned API Change: PodStatus v1 core
In order to maintain backwards compatibility for the core V1 API, this proposal retains the existing (singular) "PodIP" field in the core V1 version of the [PodStatus V1 core API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#podstatus-v1-core), and adds a new array of structures that store pod IPs along with associated metadata for each IP. The metadata for each IP (refer to the "Properties" map below) will not be used by the dual-stack feature, but is added as a placeholder for future enhancements, e.g. to allow CNI network plugins to indicate which physical network an IP is associated with. Retaining the existing "PodIP" field for backwards compatibility is in accordance with the [Kubernetes API change guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md).

I've grown really leery of a bag-of-strings properties approach - I think that unless we commit to a naming scheme, or at least a conflict resolution mechanism if two components start using the same keys in incompatible ways I would really like to see these develop as fully specified types rather than a loose bag of strings.
What do you think?

// Properties: Arbitrary metadata associated with the allocated IP.
type PodIPInfo struct {
IP string
Properties map[string]string

I see this has already generated some discussion :)
I'll add my own comments here - @leblancd you can ignore my other comment up above.

One pattern I've seen used successfully in internal interfaces is to have a mostly-strongly typed struct with a bag-of-strings at the end for "experimental" free-for-all properties. But that requires agreement that the bag-of-strings should not be relied upon, and can change at any time. I do not think we can enforce such a thing if we put this into an external facing API, so my vote is to only add fully-typed fields with validation and strong semantic meaning. Then we can argue about names all at once before they are used instead of after the fact :)

```
--pod-cidr ipNetSlice [IP CIDRs, comma separated list of CIDRs, Default: []]
```
Only the first address of each IP family will be used; all others will be ignored.

Also, I think you mean "first CIDR" here, not the first address.
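The "first CIDR of each IP family" rule can be sketched as below. This is an illustration under the assumption that the flag value is a comma-separated string of CIDRs; `firstCIDRPerFamily` is a hypothetical helper name, not code from the kubelet or kube-controller-manager.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// firstCIDRPerFamily returns the first IPv4 CIDR and the first IPv6 CIDR in a
// comma-separated list; later CIDRs of the same family are ignored.
// Hypothetical helper for illustration only.
func firstCIDRPerFamily(list string) (v4, v6 *net.IPNet, err error) {
	for _, part := range strings.Split(list, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // tolerate an empty list or stray commas
		}
		_, cidr, perr := net.ParseCIDR(part)
		if perr != nil {
			return nil, nil, fmt.Errorf("invalid CIDR %q: %v", part, perr)
		}
		if cidr.IP.To4() != nil {
			if v4 == nil {
				v4 = cidr
			}
		} else if v6 == nil {
			v6 = cidr
		}
	}
	return v4, v6, nil
}

func main() {
	// The second IPv4 CIDR is ignored under the rule quoted above.
	v4, v6, err := firstCIDRPerFamily("10.244.0.0/16,fd00:10:244::/64,10.245.0.0/16")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("IPv4 pod CIDR:", v4)
	fmt.Println("IPv6 pod CIDR:", v6)
}
```

Silently ignoring extra CIDRs follows the quoted text; rejecting them with an error would be a stricter alternative worth considering.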

IP string `json:"ip" protobuf:"bytes,1,opt,name=ip"`
// The IPs for this endpoint. The zeroth element (IPs[0]) must match
// the default value set in the IP field.
IPs []string `json:"ips" protobuf:"bytes,5,opt,name=ips"`

If the pod IPs have metadata describing them (the PodIPs struct), then isn't that useful to surface here as well?
It seems like @squeed has a use case for labeling IPs with the network names - is it useful to endpoints controllers (like ingress?) to know that here too?

If we do have the metadata here though, it will need to be exactly the same structure as the PodIPs. I'm not sure if that raises any other API issues.

#### Configuration of Endpoint IP Family in Service Definitions
This proposal adds an option to configure an endpoint IP family for a Kubernetes service:
```
endpointFamily: <ipv4|ipv6|dual-stack> [Default: dual-stack]

If service addresses only come from a single address family, why does this belong in the Service definition?

Or to put it another way - shouldn't the default be the same address family as the service CIDR? If Kubernetes itself isn't going to do any 6/4 translation, could you say more about how this can be used in any other way?

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).

#### GCE Ingress Controller: Out-of-Scope, Testing Deferred For Now
It is not clear whether the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) supports external, dual-stack access. Testing of dual-stack access to Kubernetes services via a GCE ingress controller is considered out-of-scope until after the initial implementation of dual-stack support for Kubernetes.

The unclear parts should at least be clarified :)
I can take this one.

@timothysc
Member

@neolit123 This will affect sig-cluster-lifecycle, but it's squarely on sig-networking.
/assign @thockin

@timothysc timothysc removed their assignment Nov 6, 2018
@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

@aojea
Member

aojea commented Dec 5, 2018

Moving #2254 to kubernetes/enhancements#648

Labels
area/ipv6 cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.