-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support ipvs mode for kube-proxy #692
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@m1093782566 Thank you for reacting quickly, Could we add some statistics we collected during our experiment? Like a table that compares IpTables vs IPVS with few thousand services created. |
@dhilipkumars I think @haibinxie has the original statistics. |
IPVS vs. IPTables Latency to Add Rules Measured by iptables and ipvsadm, observation:
@dhilipkumars I am not sure if the statistics is sufficient. |
related to the google doc here? https://docs.google.com/document/d/1YEBWR4EWeCEWwxufXzRM0e82l_lYYzIXQiSayGaVQ8M |
@kubernetes/sig-network-proposals |
|
||
### Network policy | ||
|
||
For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means ipvs proxy should do SNAT in a L3-overlay network(such as flannel) for cross-host network communication. When a container request a cluster IP to visit an endpoint in another host, we should enable `--masquer-all` for ipvs proxy, **which will break network policy**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking a soon-to-be GA feature seems like an edge case that we probably cannot overlook. Even with something in alpha form, I think we should be sure that all existing functionality is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SNAT is required for cross host communication, there should be compromise in between, probably an awareness of the situation in release note. I am also open to any solution/fix that we can work towards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apart from soon to be a GA feature, network policy is really important that people probably would trade performance for it. A draft idea is to not do snat, rather, on each host, add a new routing table and fwmark all service traffic to this table. The routing table will take care of routing back the packet. There's a lot of caveats, i don't know if it will ever work at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I prototyped IPVS, and I just re-did my tests, I don't see a need to SNAT. I tcpdumped it at both ends.
client pod has IP P
service has IP S
backend has IP B
P -> S packet
leaves pod netns to root netns
IPVS rewrites destination to B (packet is src:P, dst:B)
arrives at B's node's root netns
forwarded to B
response is src:B, dst:P
arrives at P's nodes' root netns
conntrack converts packet to src:S, dest:P
This works (on GCP, at least) because the only route to P is back to P's node. I can see how it might not work in some environments, but frankly, if this can't work in most environments, we should maybe not put it in kube-proxy.
Preserving client IP has been a huge concern, and I am not inclined to throw that away. WRT NetworkPolicy, this doesn't break the API, but it does end up breaking almost every implementation (in a really not-obviously-fixable way).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if L3 overlay(flannel with vxlan backend) is the matter. Maybe I need a reverse traffic goes through IPVS proxy so that traffic reaches the source pod (after performing DNAT). I will re-test it in my environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@murali-reddy do you have any idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @thockin, there is no need for SNAT, atleast not for all traffic paths. So when a pod access cluster IP or node port, IPVS does DNAT. On reverse path pretty much (can not think of any pod networking solution where its not true) route to source pod is back to source pod node.
Where its gets tricky, is when a node port is accessed from outside the cluster. Node on which destination pod is running may route traffic directly to client through default gateway. To prevent that we need to SNAT traffic, so the return traffic goes through the same node through which client accessed node port. We do loose the source IP in this case. But AFAIK this is not unique to use of IPVS but even iptable kube-proxy will have to do the SNAT.
FWIW, i have implemented logic in kube-router to deal with external client accessing node port. I just tested Flannel VXLAN + IPVS service proxy + network policy, i dont see any issue. Dont see reason to do SNAT. Please test it with your POC, and see you can remove this restriction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @murali-reddy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-tested in my environment(Flannel VXLAN + IPVS service proxy) and found something different.
There is an ipvs service with 2 destinations which are in the different host.
[root@SHA1000130405 home]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.102.128.4:3080 rr
-> 10.244.0.235:8080 Masq 1 0 0
-> 10.244.2.123:8080 Masq 1 0 0
There was no SNAT rules applied and I curl VIP
in each container.
# In container 10.244.0.235
$curl 10.102.128.4:3080
# get response from 10.244.2.123:8080
I tcpdumped in another host to see what the source IP is.
$ tcpdump -i flannel.1
20:44:48.021765 IP 10.244.0.235.36844 > 10.244.2.123.webcache: Flags [S], seq 519767460, win 28200, options [mss 1410,sackOK,TS val 416
20:44:48.021998 IP 10.244.2.123.webcache > 10.244.0.235.36844: Flags [S.], seq 1131844123, ack 519767461, win 27960, options [mss 1410,76,nop,wscale 7], length 0
The output shows that the source IP no changed.
So, it seems that there is no need to do SNAT for cross-host communication.
However, when ipvs scheduled the self container, there is no response returned. It seems that the packet is dropped when the traffic reached the self container.
I can confirm that it has nothing to do with same host vs. cross host
, the issue is about whether the request hit the self container.
When I applied SNAT rules as iptables proxy does, it can reach self container via VIP.
iptables -t nat -A PREROUTING -s 10.244.0.235 -j KUBE-MARK-MASQ
Unfortunately, the source container lost its IP and the source IP became flannel.1's IP.
Anyway, it's a good news to me that SNAT is unnecessary for cross-host communication so that it won't break the network policy though I need more knowledge to fix the issue I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this is nothing specic to IPVS, please search for hairpin related issues with kube-proxy/kubelet https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip
- container -> host (cross host) | ||
|
||
|
||
## TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the TODO is necessary since it should be part of the PR steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks.
@spiffxp Yes. @haibinxie wrote the doc and I translated it to markdown file and added some details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together.
@@ -0,0 +1,152 @@ | |||
# Alpha Version IPVS Load Balancing Mode in Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd to see a proposal just for alpha version. what's the plan for beta and stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the proposal is not "for alpha". alpha is just a milestone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, will fix.
|
||
For more details about it, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm) | ||
|
||
In order to clean up inactive rules(includes iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules ` and mark the older `--cleanup-iptables` deprecated. Unfortunately, there is no way to distinguish whether an ipvs service is created by ipvs proxy or other process, `--cleanup-proxyrules` will clear all ipvs service in a host. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there absolutely no way? We tainted some cluster nodes with ipvs for external loadbalancing; if cleanup cleans all rules, things will break. Using ipvs for external loadbalance for external traffic is not uncommon i think.
--cleanup-proxyrules
will clear all ipvs service in a host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddysher Do you have any good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of. It's probably ok to just call out this in document or flag comment. User with existing ipvs services should use with caution. If one just wants to clean up kubernetes related ipvs services, just start the proxy and clear any unwanted k8s services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We can add this to document or flag comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you only clean rules with RC1918 addresses? Or only rules in the service IP range?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you only clean rules with RC1918 addresses?
[m1093782566] It still has a possibility of clearing ipvs rules created by other process although the the possibility is low.
[m1093782566] Kube-apiserver knows the service cluster IP range through its--service-cluster-ip-range parameter. However, kube-proxy knows nothing about that. I don't suggest to add a new --service-cluster-ip-range flag in kube-proxy since it will easily conflicts with kube-apiserver's flag. Even though kube-proxy clearing all ipvs rules in the service cluster IP range, it may leave some ipvs rules in external IP, exteral LB ingress IP and Node IP.
Clearing all ipvs rules with RFC1918 addresses probably is easier to implement and has lower possibility of clearing user's existing ipvs rules.
@fisherxu What do you think about it?
|
||
In order to clean up inactive rules(includes iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules ` and mark the older `--cleanup-iptables` deprecated. Unfortunately, there is no way to distinguish whether an ipvs service is created by ipvs proxy or other process, `--cleanup-proxyrules` will clear all ipvs service in a host. | ||
|
||
### Change to build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc linked somewhere uses seesaw, i suppose we changed to libnetwork afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We changed to use libnetwork to avoid cgo and libnl dependency.
|
||
### IPVS setup and network topology | ||
|
||
IPVS is a replacement or IPTables as load balancer, it’s assumed reader of this proposal is familiar with IPTables load balancer mode. We will create a dummy interface and assign all service Cluster IPs to the dummy interface(maybe called `kube0`). In alpha version, we will implicitly use NAT mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iptables is already widely used now; it's better to say 'an alternative to' instead of replacement IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks for reminding.
|
||
IPVS is a replacement or IPTables as load balancer, it’s assumed reader of this proposal is familiar with IPTables load balancer mode. We will create a dummy interface and assign all service Cluster IPs to the dummy interface(maybe called `kube0`). In alpha version, we will implicitly use NAT mode. | ||
|
||
We will create some ipvs services for each kubernetes service. The VIP of ipvs service corresponding to the accessable IP(such as cluster IP, external IP, nodeIP, ingress IP, etc.) of kubernetes service. Each destination of an ipvs service corresponding to an kubernetes service endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be more specific about "some ipvs services"? I suppose it's "one ipvs for each kubernetes service, port and protocol combination"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose kubernetes service is NodePort type. Then ipvs Proxier will create two ipvs serivice, one is NodeIP:NodePort and the other one is ClusterIP:Port.
Of course, I will explain it in the doc. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update:
Since ipvs proxier will fall back on iptables when support node port type service. I should give another example.
Note that the relationship between kubernetes service and ipvs service is 1:N。The address of ipvs service corresponding to service's access IP, such as Cluster IP, external IP and LB.ingress.IP. If a kubernetes service has more than one access IP, For example, a external IP type service has 2 access IP(ClusterIP and External IP), then ipvs proxier will create 2 ipvs serivice - one for Cluster IP and the other one for External IP.
ipvsadm -A -t 10.244.1.100:8080 -s rr -p [timeout] | ||
``` | ||
|
||
When a service specify session affinity, ipvs proxy will assign a timeout value(180min by default) to ipvs service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
180s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current default value for iptables proxy is 180min, see https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L205
|
||
## Other design considerations | ||
|
||
### IPVS setup and network topology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design mixes ipvs and iptables rules, can we have a section dedicated to explain the interaction between ipvs and iptables, and which is responsible for what requirements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 - this needs to detail every sort of flow and every feature of Services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I created a section "when fall back on iptables" to explain the interaction between ipvs and iptables. Thanks!
@@ -0,0 +1,152 @@ | |||
# Alpha Version IPVS Load Balancing Mode in Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the author's name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks!
|
||
### NodePort type service support | ||
|
||
For NodePort type service, IPVS proxy will take all accessable IPs in a host as the virtual IP of ipvs service. Specifically, accessable IP excludes `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are IPs bound to `eth{i}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this comment somehow gets lost.
Why do we enforce interface name? e.g. assuming ethx won't work for predictable network interface name for newer systemd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This is a challenging design constraint. Do we need to use IPVS for NodePorts or can that fall back on iptables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about take all UP state network interface(except the "vethxxx"s) address as the Node IP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates:
As discussed in sig-network meeting, IPVS proxier will fall back on iptables when support node port type service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc. This needs a LOT more detail, though. I flagged a few big issues, but I don't feel like this is covering the depth we need.
Is this built around exec ipvsadm?
I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.
How do you prevent dropped packets during updates?
How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)
How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)
How will you use/expose the different algorithms?
I am sure I have more questions. This is one of the most significant changes in the history of kube-proxy and kubernetes services. Please spend some time helping me not be scared of it. :)
@@ -0,0 +1,152 @@ | |||
# Alpha Version IPVS Load Balancing Mode in Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the proposal is not "for alpha". alpha is just a milestone.
|
||
For more details about it, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm) | ||
|
||
In order to clean up inactive rules(includes iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules ` and mark the older `--cleanup-iptables` deprecated. Unfortunately, there is no way to distinguish whether an ipvs service is created by ipvs proxy or other process, `--cleanup-proxyrules` will clear all ipvs service in a host. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you only clean rules with RC1918 addresses? Or only rules in the service IP range?
|
||
## Other design considerations | ||
|
||
### IPVS setup and network topology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 - this needs to detail every sort of flow and every feature of Services.
|
||
### NodePort type service support | ||
|
||
For NodePort type service, IPVS proxy will take all accessable IPs in a host as the virtual IP of ipvs service. Specifically, accessable IP excludes `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are IPs bound to `eth{i}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This is a challenging design constraint. Do we need to use IPVS for NodePorts or can that fall back on iptables?
|
||
### Sync period | ||
|
||
Similar to iptables proxy, IPVS proxy will do full sync loop every 10 seconds by default. Besides, every update on kubernetes service and endpoint will trigger an ipvs service and destination update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing seesion affinity will call UpdateService API which directly send update command to kernel via socket communication and won't take a service down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhilipkumars did a test and found ipvs update did not disrupt the service or even the existing connection.
|
||
### Network policy | ||
|
||
For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means ipvs proxy should do SNAT in a L3-overlay network(such as flannel) for cross-host network communication. When a container request a cluster IP to visit an endpoint in another host, we should enable `--masquer-all` for ipvs proxy, **which will break network policy**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I prototyped IPVS, and I just re-did my tests, I don't see a need to SNAT. I tcpdumped it at both ends.
client pod has IP P
service has IP S
backend has IP B
P -> S packet
leaves pod netns to root netns
IPVS rewrites destination to B (packet is src:P, dst:B)
arrives at B's node's root netns
forwarded to B
response is src:B, dst:P
arrives at P's nodes' root netns
conntrack converts packet to src:S, dest:P
This works (on GCP, at least) because the only route to P is back to P's node. I can see how it might not work in some environments, but frankly, if this can't work in most environments, we should maybe not put it in kube-proxy.
Preserving client IP has been a huge concern, and I am not inclined to throw that away. WRT NetworkPolicy, this doesn't break the API, but it does end up breaking almost every implementation (in a really not-obviously-fixable way).
|
||
For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means ipvs proxy should do SNAT in a L3-overlay network(such as flannel) for cross-host network communication. When a container request a cluster IP to visit an endpoint in another host, we should enable `--masquer-all` for ipvs proxy, **which will break network policy**. | ||
|
||
## Test validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to enumerate everything you have to test:
pod -> pod, same VM
pop -> pod, other VM
pod -> own VM, own hostPort
pod -> own VM, other hostPort
pod -> other VM, other hostPort
pod -> own VM
pod -> other VM
pod -> internet
pod -> http://metadata
VM -> pod, same VM
VM -> pod, other VM
VM -> same VM hostPort
VM -> other VM hostPort
pod -> own clusterIP, hairpin
pod -> own clusterIP, same VM, other pod, no port remap
pod -> own clusterIP, same VM, other pod, port remap
pod -> own clusterIP, other VM, other pod, no port remap
pod -> own clusterIP, other VM, other pod, port remap
pod -> other clusterIP, same VM, no port remap
pod -> other clusterIP, same VM, port remap
pod -> other clusterIP, other VM, no port remap
pod -> other clusterIP, other VM, port remap
pod -> own node, own nodePort, hairpin
pod -> own node, own nodePort, policy=local
pod -> own node, own nodePort, same VM
pod -> own node, own nodePort, other VM
pod -> own node, other nodePort, policy=local
pod -> own node, other nodePort, same VM
pod -> own node, other nodePort, other VM
pod -> other node, own nodeport, policy=local
pod -> other node, own nodeport, same VM
pod -> other node, own nodeport, other VM
pod -> other node, other nodeport, policy=local
pod -> other node, other nodeport, same VM
pod -> other node, other nodeport, other VM
pod -> own external LB, no remap, policy=local
pod -> own external LB, no remap, same VM
pod -> own external LB, no remap, other VM
pod -> own external LB, remap, policy=local
pod -> own external LB, remap, same VM
pod -> own external LB, remap, other VM
VM -> same VM nodePort, policy=local
VM -> same VM nbodePort, same VM
VM -> same VM nbodePort, other VM
VM -> other VM nodePort, policy=local
VM -> other VM nbodePort, same VM
VM -> other VM nbodePort, other VM
VM -> external LB
public -> nodeport, policy=local
public -> nodeport, policy=global
public -> external LB, no remap, policy=local
public -> external LB, no remap, policy=global
public -> external LB, remap, policy=local
public -> external LB, remap, policy=global
public -> nodeport, manual backend
public -> external LB, manual backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super!
@thockin Is this built around exec ipvsadm? I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like. How do you prevent dropped packets during updates? How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?) How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?) How will you use/expose the different algorithms? |
I re-tested and found something different. SNAT is not required for cross-host communication. So it won't break network policy. It's really a big finding to me :) I find the packet will be dropped when container vist VIP and the real backend is itself. I have no idea now but will try to find out why. |
I will update the proposal and try to fix the review comments in newer proposal. @thockin I will add pseudo-code explaining how the resync loop works. Thanks. |
c0a5b3e
to
ebb958f
Compare
@thockin @ddysher @cmluciano @murali-reddy I update the proposals according to review comments and add more details. PTAL. Any comments are welcomed. Thanks. |
@m1093782566: GitHub didn't allow me to request PR reviews from the following users: haibinxie. Note that only kubernetes members can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
934152c
to
a552531
Compare
According to @dhilipkumars 's test result. It shows ipvs update did not disrupt the service or even the existing connection
the real backedn service is redis-alpine conecting to the service
In a parllel session if i update the scheduler algorithm or persistence timeout the service is not disrupted. Ipvsadm and libnetwor's ipvs pkg works in the same principle firing netlink message to the kernal so the behaviour should be the same. Udpate persistance timout still no
what other parameters should we test? |
@thockin Could you confirm if @m1093782566 's comment above is the right thing to do? anything else is left on closing it. please expect me keep bothering you until it's closed :) If you get a chance we can have a quick phone call on reviewing and addressing issues on this, we want to commit to release the feature in 1.8. |
@danwinship Do you have interest in reviewing this design proposal? Any comments are welcome. :) |
I just come up with an idea about implementing nodeport type service via ipvs. Can we take all the IP address which ADDRTYPE match dst-type LOCAL as the address of ipvs service? For example,
Then, [100.106.179.225, 127.0.0.0/8, 127.0.0.1, 172.16.0.0, 172.17.0.1, 192.168.122.1] would be the address of ipvs service for nodeport service. I assume KUBE-NODEPORTS chain created by iptables proxier did the same thing? For example,
I am not opposed to implementing nodeport service via iptables, I just want to know if the approach mentioned above make sense? Or am I wrong? Looking forward to receiving your opinions. |
By doing this, I think we can remove the design constraint that assuming node IP is the address of |
@feiskyer Do you have bandwidth to take a look at this protosal? Thanks :) |
Using a list of IP addresses for nodePort has potential problems, e.g. ip addresses may be changed or new nics may be added later. And I don't think watching the changes of ip addresses and nics is a good idea. Maybe using iptables for nodePort services is a better choice. |
Glad to receive your feedback, @feiskyer
Yes, I agree. Thanks for your thoughts. |
ping @kubernetes/sig-network-feature-requests |
@luxas alpha in 1.8. we are working on beta release in 1.9 |
/keep-open |
/lifecycle frozen |
@m1093782566 Is there a PR that supersedes this one? |
NO. This PR is the only design proposal. IPVS proxier already reached beta while this document is still pending, unfortunately. |
/close |
Signed-off-by: kerthcet <[email protected]>
Implement IPVS-based in-cluster service load balancing. It can provide some performance enhancement and some other benefits to kube proxy while comparing iptables and userspace mode. Besides, it also support more sophisticated load balancing algorithms than iptables (least conns, weighted, hash and so on).
related issue: kubernetes/kubernetes#17470 kubernetes/kubernetes#44063
related PR: kubernetes/kubernetes#46580 kubernetes/kubernetes#48994
@thockin @quinton-hoole @wojtek-t