
Nodes don't route LoadBalancer traffic correctly #535

Open
danderson opened this issue Dec 17, 2018 · 18 comments
Labels
kind:bug Something isn't working legacy Anything related to MetalK8s 1.x state:help wanted Extra attention is needed topic:networking Networking-related issues

Comments

@danderson

Hi there! Author of MetalLB here. I've been receiving a bunch of reports of MetalLB not working correctly on MetalK8s. AFAICT, MetalLB is working correctly in these cases; the problem is that kube-proxy on the nodes is either misconfigured or otherwise outright broken, and isn't correctly handling traffic for type=LoadBalancer Services.

The symptom is simply that when packets destined for a LoadBalancer service IP arrive at the node, they're not getting routed correctly to the target pod(s). From the user's perspective, the service IP just doesn't respond at all.

Unfortunately I don't have time to debug in more detail right now, but I figured I'd get this filed to get it on the radar. What I would suggest as a next step is to compare your kube-proxy configuration with the one kubeadm generates, and adjust any discrepancies. You can also try installing MetalLB and using its L2 mode, to get a quick demonstration of the breakage.
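The suggested quick demonstration can be sketched as follows. The manifest URL, address pool, and test workload are illustrative assumptions for a MetalLB release of that era (0.7.x, which was configured via ConfigMap), not values from this thread:

```shell
# Hypothetical quick MetalLB L2-mode test (manifest version and address
# range are assumptions; adjust them for your cluster).
kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml

# MetalLB 0.7.x reads its configuration from a ConfigMap:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.20.20.0/24    # an unused range on the node LAN
EOF

# Expose a test workload and see whether the assigned IP responds:
kubectl create deployment echo --image=nginx
kubectl expose deployment echo --type=LoadBalancer --port=80
```

If the external IP assigned by MetalLB never answers while `kubectl get svc` shows it allocated, the breakage is on the kube-proxy side, as described above.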

@NicolasT NicolasT added kind:bug Something isn't working state:help wanted Extra attention is needed topic:networking Networking-related issues labels Dec 17, 2018
@NicolasT
Contributor

Hello @danderson, thanks for reaching out.

I haven't heard of any such reports, nor of issues with Services in general (we use them all the time), though as far as I'm aware nobody has tried MetalLB (yet), or any LoadBalancer-type Services.

I don't have time to try deploying MetalLB on my clusters right now (furthermore, this could be complicated since they run on OpenStack which doesn't really like MACs or IPs it didn't assign itself on the virtual network, by default...), but for reference, here's the kube-proxy Pod object as deployed by MetalK8s 1.0 (Kubernetes 1.10):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/config.hash: 44f08526ac5cac1d20d11cebd0c92953
    kubernetes.io/config.mirror: 44f08526ac5cac1d20d11cebd0c92953
    kubernetes.io/config.seen: 2018-12-17T20:15:21.385349621Z
    kubernetes.io/config.source: file
    kubespray.kube-proxy-cert/serial: A3E727B9A6023607
  creationTimestamp: null
  labels:
    k8s-app: kube-proxy
  name: kube-proxy-metalk8s-01
  selfLink: /api/v1/namespaces/kube-system/pods/kube-proxy-metalk8s-01
spec:
  containers:
  - command:
    - /hyperkube
    - proxy
    - --v=2
    - --kubeconfig=/etc/kubernetes/kube-proxy-kubeconfig.yaml
    - --bind-address=10.200.4.185
    - --cluster-cidr=10.233.64.0/18
    - --proxy-mode=ipvs
    - --oom-score-adj=-998
    - --healthz-bind-address=127.0.0.1
    - --masquerade-all
    - --ipvs-min-sync-period=5s
    - --ipvs-sync-period=5s
    - --ipvs-scheduler=rr
    image: gcr.io/google-containers/hyperkube:v1.10.11
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10256
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: kube-proxy
    resources:
      limits:
        cpu: 500m
        memory: 2G
      requests:
        cpu: 150m
        memory: 64M
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ssl-certs-host
      readOnly: true
    - mountPath: /etc/kubernetes/ssl
      name: etc-kube-ssl
      readOnly: true
    - mountPath: /etc/kubernetes/kube-proxy-kubeconfig.yaml
      name: kubeconfig
      readOnly: true
    - mountPath: /var/run/dbus
      name: var-run-dbus
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
  dnsPolicy: ClusterFirst
  hostNetwork: true
  nodeName: metalk8s-01
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    operator: Exists
  volumes:
  - hostPath:
      path: /etc/pki/tls
      type: ""
    name: ssl-certs-host
  - hostPath:
      path: /etc/kubernetes/ssl
      type: ""
    name: etc-kube-ssl
  - hostPath:
      path: /etc/kubernetes/kube-proxy-kubeconfig.yaml
      type: ""
    name: kubeconfig
  - hostPath:
      path: /var/run/dbus
      type: ""
    name: var-run-dbus
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
status:
  phase: Pending
  qosClass: Burstable

Likewise, I quickly set up a cluster using kubeadm (1.13, though), configured it to use ipvs as we do in MetalK8s, which resulted in the following two relevant objects being deployed:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  generateName: kube-proxy-
  labels:
    controller-revision-hash: 68b57dcf7d
    k8s-app: kube-proxy
    pod-template-generation: "3"
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: kube-proxy
    uid: 434b9d9f-0015-11e9-96df-fa163e7325df
  selfLink: /api/v1/namespaces/kube-system/pods/kube-proxy-l26r5
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - kubeadm-node01
  containers:
  - command:
    - /usr/local/bin/kube-proxy
    - --config=/var/lib/kube-proxy/config.conf
    - --hostname-override=$(NODE_NAME)
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: k8s.gcr.io/kube-proxy:v1.13.0
    imagePullPolicy: IfNotPresent
    name: kube-proxy
    resources: {}
    securityContext:
      privileged: true
      procMount: Default
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kube-proxy
      name: kube-proxy
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-proxy-token-8qnhs
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: kubeadm-node01
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: kube-proxy
  serviceAccountName: kube-proxy
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      name: kube-proxy
    name: kube-proxy
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - name: kube-proxy-token-8qnhs
    secret:
      defaultMode: 420
      secretName: kube-proxy-token-8qnhs
status:
  phase: Pending
  qosClass: BestEffort

---

apiVersion: v1
data:
  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    clientConnection:
      acceptContentTypes: ""
      burst: 10
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 5
    clusterCIDR: 10.22.0.0/16
    configSyncPeriod: 15m0s
    conntrack:
      max: null
      maxPerCore: 32768
      min: 131072
      tcpCloseWaitTimeout: 1h0m0s
      tcpEstablishedTimeout: 24h0m0s
    enableProfiling: false
    healthzBindAddress: 0.0.0.0:10256
    hostnameOverride: ""
    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      minSyncPeriod: 0s
      syncPeriod: 30s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: ""
      syncPeriod: 30s
    kind: KubeProxyConfiguration
    metricsBindAddress: 127.0.0.1:10249
    mode: ipvs
    nodePortAddresses: null
    oomScoreAdj: -999
    portRange: ""
    resourceContainer: /kube-proxy
    udpIdleTimeout: 250ms
  kubeconfig.conf: |-
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://10.18.0.12:6443
      name: default
    contexts:
    - context:
        cluster: default
        namespace: default
        user: default
      name: default
    current-context: default
    users:
    - name: default
      user:
        tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    app: kube-proxy
  name: kube-proxy
  selfLink: /api/v1/namespaces/kube-system/configmaps/kube-proxy

With kubeadm, kube-proxy runs as a DaemonSet whilst in a MetalK8s deployment these are kubelet static manifests.
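The difference is easy to spot on a running cluster; a sketch (paths and pod names follow the manifests above and common kubelet defaults):

```shell
# A kubeadm-style cluster manages kube-proxy through a DaemonSet:
kubectl -n kube-system get daemonset kube-proxy

# A static-pod deployment (as in MetalK8s 1.x) instead has a manifest
# file on each node (path is the usual kubelet default; adjust if your
# kubelet is configured differently):
ls /etc/kubernetes/manifests/ | grep -i proxy

# Mirror pods of static manifests carry the node name as a suffix,
# e.g. kube-proxy-metalk8s-01, rather than a DaemonSet hash like
# kube-proxy-l26r5.
```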

@NicolasT
Contributor

So, some more insight. I set up a cluster in an environment where MetalLB in ARP mode can work.

First of all, MetalK8s sets up a couple of sysctl parameters during installation, which may have an impact:

net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_source_route=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv6.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_redirects=0
kernel.randomize_va_space=2
net.ipv4.ip_forward=1
net.ipv4.ip_local_reserved_ports=30000-32767
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-arptables=1
net.bridge.bridge-nf-call-ip6tables=1

In my test environment, I disabled everything up to and including kernel.randomize_va_space (all host-hardening settings) to rule them out as the cause of this issue (I still have to retest with them enabled, read on). The remaining settings are required for K8s and kube-proxy to function properly.
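The reverting step can be sketched as below. The "default" values shown are assumptions about common kernel defaults, not values taken from this thread, so check your distribution before relying on them:

```shell
# Revert the hardening-related settings for the test (values are
# assumed kernel defaults):
sysctl -w net.ipv4.conf.all.accept_source_route=1
sysctl -w net.ipv4.conf.default.accept_source_route=1
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0
sysctl -w net.ipv4.conf.all.send_redirects=1
sysctl -w net.ipv4.conf.default.send_redirects=1
sysctl -w net.ipv6.conf.all.accept_source_route=1
sysctl -w net.ipv4.conf.default.accept_redirects=1

# The Kubernetes-required settings stay enabled; confirm them:
sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables
```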

After making these changes and reverting the values to the kernel defaults, accessing my test LoadBalancer Service managed by MetalLB still didn't work: things work fine when accessing from one of my cluster nodes (as expected: the MetalLB-assigned Service IP address shows up in ipvsadm), but not from an external node. tcpdump brought some insight:

$ tcpdump -n -i eth0 port 80 or dst 10.20.20.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:23:35.834684 IP 10.20.0.14.55342 > 10.20.20.1.http: Flags [S], seq 2312616073, win 29200, options [mss 1460,sackOK,TS val 111053792 ecr 0,nop,wscale 7], length 0
06:23:35.834758 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28
06:23:36.836155 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28
06:23:36.837703 IP 10.20.0.14.55342 > 10.20.20.1.http: Flags [S], seq 2312616073, win 29200, options [mss 1460,sackOK,TS val 111054796 ecr 0,nop,wscale 7], length 0
06:23:37.838176 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28
06:23:38.840767 IP 10.20.0.14.55342 > 10.20.20.1.http: Flags [S], seq 2312616073, win 29200, options [mss 1460,sackOK,TS val 111056799 ecr 0,nop,wscale 7], length 0
06:23:38.840827 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28
06:23:39.842171 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28
06:23:40.841820 ARP, Request who-has 10.20.20.1 tell 10.20.0.14, length 28
06:23:40.844149 ARP, Request who-has 10.20.20.1 tell 10.20.0.10, length 28

So, ARP is working fine (also validated by checking the ARP table on the client), and packets are correctly routed to the host, but then dropped (my test Pod is a plain HTTP server). To my surprise, the address assigned by MetalLB (10.20.20.1 in my case) doesn't show up as the address of any interface on my system, so I'd be surprised if anything accepts any of those packets anyway.

One Google search later I found kubernetes/kubernetes#59976 which is exactly what's going wrong on this deployment. After manually applying the fix on the host (ip addr add 10.20.20.1/32 dev kube-ipvs0), I can curl the LoadBalancer address without any problems from remote hosts, and the response is properly served.
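The diagnosis and workaround above can be sketched as:

```shell
# In ipvs mode, kube-proxy binds every Service IP to the kube-ipvs0
# dummy interface. Check whether the LoadBalancer IP made it there
# (10.20.20.1 is the address from this thread):
ip addr show dev kube-ipvs0

# If the address is missing (the bug in kubernetes/kubernetes#59976),
# add it manually as a temporary workaround; note that kube-proxy may
# rewrite the interface on its next sync:
ip addr add 10.20.20.1/32 dev kube-ipvs0

# Then verify from an external host:
curl http://10.20.20.1/
```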

So, I guess the core bug is a combination of MetalLB with Kubernetes 1.10 running kube-proxy in ipvs mode. Maybe the sysctl settings have an impact as well, I will test later, though they're not the root cause.

Are you aware of this issue in other MetalLB deployments @danderson ?

@danderson
Author

Great debugging! Sounds like that k8s bug is indeed the root cause. I did have some other people reporting that particular problem (and you can see me doing a bunch of debugging in that bug :) ).

I didn't realize MetalK8s is configuring IPVS mode. My general advice for IPVS mode right now is: don't use it unless you really need the scalability benefits. The implementation has historically had bug after bug where LoadBalancer behavior is completely broken... And because there are no conformance tests for LoadBalancer in OSS k8s (only cloud provider tests that exercise their own custom implementations), the bugs go undetected until someone tries using MetalLB (none of the main cloud providers enable IPVS, for exactly the same reason: too many bugs).

In theory, with k8s 1.13, IPVS mode should finally work correctly, but until I set up my e2e test framework, I can't guarantee it :/. Until you can positively verify that kube-proxy in ipvs mode works correctly, my advice is to revert to iptables mode, or warn MetalK8s users that load-balancers like MetalLB just won't work on MetalK8s.
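One way to positively verify which mode a node's kube-proxy is actually running in is its metrics endpoint. A sketch, assuming the default metrics bind address of 127.0.0.1:10249 and the pod name from the manifest earlier in this thread:

```shell
# kube-proxy reports its active proxier on the metrics port:
curl http://127.0.0.1:10249/proxyMode
# prints the active mode, e.g. "ipvs" or "iptables"

# Alternatively, the startup logs state which proxier was chosen:
kubectl -n kube-system logs kube-proxy-metalk8s-01 | grep -i proxier
```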

@NicolasT
Contributor

Thanks for the confirmation and the insight in iptables vs. ipvs mode.

Reverting to iptables as-is would require quite some investigation w.r.t. upgrades (will kube-proxy clean up all ipvs configuration correctly? Remove the dummy interface? ...), so that may take some time. However, we could document the patch one can apply to deploy with iptables mode in the meantime. Sadly, I don't think this can be selected in the Ansible inventory being used, because of Ansible variable precedence order 😔

I'll also need to investigate whether the custom sysctls we set up cause more trouble on top of the ipvs mode issue.

@danderson
Author

Unfortunately no, you can't switch kube-proxy modes cleanly. The safest way I know to do that is to reconfigure kube-proxy, and then reboot the node to start from clean state :(. Since you're deploying with ansible, you could add some more workflow to try and clean up, but again that's a bit risky.
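A rough sketch of what that manual cleanup involves; this is an assumption-laden outline rather than a tested procedure, and the reboot remains the safer route:

```shell
# After stopping the ipvs-mode kube-proxy and reconfiguring it for
# iptables mode, leftover ipvs state has to be removed by hand:
ipvsadm --clear              # drop all IPVS virtual services
ip link delete kube-ipvs0    # remove the dummy interface and its Service IPs

# kube-proxy also programs iptables rules in ipvs mode (masquerading
# and the like); recent versions can clean up their own rules:
kube-proxy --cleanup
```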

Did you test this with MetalK8s 1.1.0-alpha1? I see you're upgrading to k8s 1.11 in that release. I don't know if 1.11 works correctly, but it should fix at least some of the ipvs bugs that broke MetalLB in the past.

@NicolasT
Contributor

I didn't try with Kubernetes 1.11. We may, however, not do a 1.1.x release, but skip straight to 1.2 (K8s 1.12). I'll give that a try.

At the same time, I'll check with the team whether it's possible to add a test to our suite which deploys MetalLB and validates things are (or are not) working as expected. However, given our CI runs on OpenStack, this may again require some tweaking of the current test environment to work around Neutron-enforced default network security policies.

@danderson
Author

For running on OpenStack, at least for MetalLB's L2 mode you need to disable IP spoofing protection on the VMs, otherwise the OpenStack network layer drops ARP responses sent by MetalLB. I think BGP mode should just work out of the box, although it's more complex to set up because now you need to set up a BGP router as well.
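For reference, the Neutron side of this can be sketched with the openstack CLI; the port ID and address range below are placeholders:

```shell
# Allow the node's port to answer for the MetalLB address pool
# (finer-grained than disabling port security outright):
openstack port set --allowed-address ip-address=10.20.20.0/24 <port-id>

# Or disable anti-spoofing entirely on that port (security groups must
# be removed from the port first):
openstack port set --no-security-group --disable-port-security <port-id>
```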

@NicolasT
Contributor

Yeah, given our systems, BGP will likely not work. I did get L2 mode to work by disabling spoofing protection and the like, though in our environment this requires setting up custom networks to be able to test it. We'll sort it out :)

Anyway, I just checked the impact of the various sysctls. Those look OK: after re-enabling them, rebooting, and validating they're set correctly, the MetalLB-managed IP stops working again (as somewhat expected), but after adding it to the dummy interface, everything works as expected.

I'll try upgrading to our 1.2 development branch (Kubernetes 1.12) and see what that gives.

Thanks for all input!

@danderson
Author

Thank you for the quick and thorough response! If I can ever get my e2e testing stack set up, I'll try throwing MetalK8s into the test matrix on the MetalLB side as well :)

@NicolasT
Contributor

If I can ever get my e2e testing stack set up, I'll try throwing MetalK8s into the test matrix on the MetalLB side as well :)

That'd be really cool. Let us know if there's anything we can do to help with that.

Also, any chance you can share some details about the MetalLB users who tried to run on MetalK8s? Always happy to hear more user stories!

@NicolasT
Contributor

I upgraded my cluster to the development/1.2 branch (Kubernetes 1.12), and kube-proxy assigned the expected LoadBalancer addresses to its dummy interface. Also after a reboot they were configured correctly.

A curl of that address also works.

So, I guess this is kind-of sorted out... If someone wants to use MetalK8s 1.0.0, one can change the default inventory/var by applying the following patch:

diff --git a/playbooks/group_vars/k8s-cluster/10-metal-k8s.yml b/playbooks/group_vars/k8s-cluster/10-metal-k8s.yml
index b48de38cc3..cb614ab6b2 100644
--- a/playbooks/group_vars/k8s-cluster/10-metal-k8s.yml
+++ b/playbooks/group_vars/k8s-cluster/10-metal-k8s.yml
@@ -3,7 +3,7 @@ kube_basic_auth: True
 kubeconfig_localhost: True

 dns_mode: 'coredns'
-kube_proxy_mode: 'ipvs'
+kube_proxy_mode: 'iptables'

 kube_version: 'v1.10.11'

The same may be required for the 1.1 branch. Starting with the 1.2 branch, MetalLB should work out-of-the-box on MetalK8s.

Given some ideas we have w.r.t. the future directions of MetalK8s (which may include MetalLB), I'd rather not spend time on switching to iptables mode for kube-proxy right now, unless other bugs pop up, because of the upgrade complexity.

@kellymenzel

kellymenzel commented Dec 19, 2018

Thanks to you both! By changing the kube-proxy mode to iptables, I was able to get MetalLB working in my MetalK8s cluster (I'm using the 1.0.0 release). This will work for now while I am very early on in my project and very much just learning Kubernetes. I'm looking forward to MetalK8s 1.2!

@jacobsmith928

@NicolasT would it be useful to have access to any bare metal resources from Packet? We support local BGP, so it might be useful for testing, etc. Happy to support the community. Let me know!

@ominsign

So, I guess this is kind-of sorted out... If someone wants to use MetalK8s 1.0.0, one can change the default inventory/var by applying the following patch: (...)

Thanks for this feedback @NicolasT, I was getting worried as I couldn't get MetalLB to work on a MetalK8s-based cluster. Integrating MetalLB into your project would really be a great feature.

Short question: is it possible to apply the update of 10-metal-k8s.yml (ipvs to iptables) to an already-running cluster? I just tried running the playbook again with different arguments, but it seems the change is being ignored (also after a full restart/reboot).

Merci & kind regards.

@gdemonet gdemonet added the legacy Anything related to MetalK8s 1.x label Feb 4, 2020
@samcv

samcv commented Feb 11, 2020

Is this still an issue?

@NicolasT
Contributor

Hello @samcv!

Is this still an issue?

It shouldn't be: in MetalK8s 2.x we're using iptables mode for kube-proxy.

We're not deploying MetalLB by default, because it doesn't suit our needs (#1788). However, you should be able to deploy it on any cluster using its Chart, or other means!
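A Chart-based install might look like the following. The repository URL and flags are assumptions about MetalLB's own Helm packaging (Helm 3 syntax), so check the MetalLB documentation for your version:

```shell
# Add the MetalLB chart repository and install into its own namespace:
helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb \
  --namespace metallb-system --create-namespace
```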

@samcv

samcv commented Feb 11, 2020

I will check in the next few days if I can find an issue with this. I know I had issues with MetalLB and kubernetes, but it may have been my own setup. So far I am really enjoying MetalK8s's philosophy, and hoping very much it will fit my needs. ATM it's not required for me to use MetalLB, but it would be nice to know it can work in the future. Thanks!

@NicolasT
Contributor

So far I am really enjoying MetalK8s's philosophy, and hoping very much it will fit my needs.

That's super cool to hear, thanks for sharing. Please let us know whenever there's something missing to suit your needs, issues you'd run into,...
