coredns UDP service port points to TCP pod port when upgrading from 23.* #10860
Comments
We are running into the same issue. The issue seems to have been introduced in #10617, but the problem does not originate there. There is an underlying problem that leads to a corrupted coredns service during the upgrade process. From what we gathered, the process looks like this:
The theory here is that kubeadm patches the coredns service, which leaves it corrupted until kubespray overwrites it again. There might be some odd YAML merging going on which produces the corrupted result. The corrupted service contains ports that look like this:
ports:
- name: dns
  port: 53
  protocol: UDP
  targetPort: dns-tcp # Here is the issue
- name: dns-tcp
  port: 53
  protocol: TCP
  targetPort: 53
- name: metrics
  port: 9153
  protocol: TCP
  targetPort: 9153

Our quick fix was removing the targetPorts from the coredns svc template.
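For anyone hitting this on a live cluster, a minimal sketch of repairing the Service in place (rather than editing the template) might look like the following; it assumes the UDP "dns" entry is the first item in `.spec.ports`, as in the excerpt above:

```bash
# Point the UDP port back at the pod's UDP port. The named port "dns" matches
# the intent of the template; a plain 53 would also work.
# The index 0 assumes the UDP "dns" entry is listed first, as shown above.
kubectl -n kube-system patch svc coredns --type=json \
  -p='[{"op": "replace", "path": "/spec/ports/0/targetPort", "value": "dns"}]'
```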
@simon-wessel Thanks for your fast reply. I have two k8s clusters, one of which is the test one.
Hello again
As you said, it is a quick fix and all is working as expected. This workaround is suitable as there is no cluster interruption during the upgrade process. Here is a part of
@simon-wessel do you have the logs of the kubeadm invocation? (Referenced: kubespray/roles/kubespray-defaults/defaults/main/main.yml, lines 30 to 45 at dce68e6.)
Do I get your meaning correctly that the error is transient (during the upgrade)? That would explain why it was not caught by CI. Note: this was exposed by the cleanup here (I think): #10695, in addition to the dual coredns stuff.
I could not reproduce this on top of master by creating a new cluster and then upgrading with upgrade-cluster.yml.
@VannTen Hello, I will try to test it this evening on my test cluster.
@VannTen Hello, I have checked with master. No, it's still broken for the coredns setup.
(because I haven't been able to reproduce it so far)
@VannTen please find my configuration attached. The command I used to upgrade the cluster is as follows:
You're using a seven-node etcd cluster whose members are also the cluster nodes? Is that intended?
Yes, that is intended. I wanted to increase the redundancy of the cluster that way, as I have only two master nodes.
Not to my knowledge, that's just a bit unusual^.
Just read the documentation on this. I will rebuild my cluster so it has 3 etcd nodes.
I can confirm the bug on a cluster with 3 control planes and 3 worker nodes, also deployed with `dns_mode: coredns`. I can also confirm the "workaround" of deleting the targetPorts from the coredns svc template. But I don't have a clue what is causing this after poking around in kubespray a bit...
On which version? Can you provide an inventory and a more precise description?
For instance, I just tried running upgrade-cluster.yml on a fresh cluster while running the following on the first master:
(the jq expression checks that we get the correct ports on each change)
# kubectl get svc -n kube-system coredns -o json -w | jq '.spec.ports | [if(map((.protocol == "UDP" and .targetPort == "dns") or (.protocol == "TCP" and .targetPort == "dns-tcp") or (.name == "metrics")) | all) then "good" else . end, (now |strftime("%H:%M:%S"))]'
[
"good",
"11:14:59"
]
As you can see, it does not report any changes.
This was with a cluster built on master using Vagrant, then with upgrade-cluster.yml on the same branch.
So if you have the problem, what's the origin version (the tag on which the cluster was built/last upgraded) and the destination (the tag on which upgrade-cluster.yml triggers that bug)?
I upgraded from the release-2.23 branch to the recently released v2.24.1 tag.
I'll try with that, see if I can finally get a reproducer, thanks 👍
So the transition (which isn't entirely complete) goes something like this:
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": 53
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:20:59"
]
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": "dns"
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:21:57"
]
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": "dns-tcp"
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:32:21"
]

The only problematic one is the last one. I guess the patching logic does not work as planned 🤔
Apparently, this also happens when using `--tags coredns` (so kubeadm is out, IMO). I think this has something to do with the way kubectl does its apply.
We're probably hitting something like this: kubernetes/kubernetes#39188 (comment)
Another potentially relevant link: https://ben-lab.github.io/kubernetes-UDP-TCP-bug-same-port/
TL;DR: kubectl apply does the wrong thing because it does not use the correct key to merge the array (I understand it uses port, which is `53` in both cases, instead of `name`).
Apparently server-side apply might fix the problem.
So I guess I should revisit #10701 ...
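As a rough sketch of the idea (not the exact code path kubespray takes): with the default client-side apply, the strategic merge patch for `Service.spec.ports` cannot tell the two port-53 entries apart, while server-side apply lets the API server do the merge. Here `coredns-svc.yml` is a placeholder for the rendered coredns Service manifest:

```bash
# Default client-side apply: kubectl computes a strategic merge patch locally.
# Both DNS entries share port 53, so the change of targetPort can be merged onto
# the wrong entry, leaving the UDP port pointing at "dns-tcp".
kubectl apply -f coredns-svc.yml

# Server-side apply: the API server performs the merge itself, which avoids the
# mix-up described above.
kubectl apply --server-side -f coredns-svc.yml
```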
Using server-side apply **does** fix the issue. I'm working on a PR (mentioned above), but it needs to refactor some stuff in bootstrap-os and kubernetes/preinstall.
We're also seeing this, but one of our clusters ended up stuck in a broken state, with the coredns service setting the UDP TargetPort to dns-tcp. Here's how we found it:
Notice that the TargetPort is dns-tcp/UDP and the Endpoints are all empty. We had to manually correct this.
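The inspection output itself is not reproduced in this thread; checking for the same symptom would look roughly like this:

```bash
# Show the Service's ports; in the broken state the UDP "dns" port targets the
# named TCP port "dns-tcp".
kubectl -n kube-system describe svc coredns

# If the UDP targetPort points at a TCP container port, the corresponding
# endpoint ports cannot be resolved and show up empty.
kubectl -n kube-system get endpoints coredns
```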
This is the workaround, as suggested by @amogilny.
Using `kubectl apply -f <coredns manifest> --server-side=true` (check the flag name, I'm writing this from memory) should also work; it's essentially what the fix I'm working on does.
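For reference, the boolean flag is `--server-side` (so `--server-side=true` also works); something like:

```bash
# Depending on which field managers already own the ports, --force-conflicts
# may be needed so server-side apply can take ownership of those fields.
kubectl apply --server-side --force-conflicts -f <coredns manifest>
```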
I think I have a different problem: I removed the target ports from the live manifest and it's still in a crash loop. Also I'm using
For some reason, coredns and nodelocaldns pods are in CrashLoopBackOff, and there are no logs as to why. Also I
What is the status of this issue? It is very annoying and causes a major impact on cluster upgrades. There is #10701, which would be a good long-term solution, but what about using server-side apply as a short-term one?
I'm working on the last prep PR for #10701 (that PR itself is not complicated; getting to the state where we can easily use the kubernetes.core.k8s module is).

> but what about using server-side apply as a short-term one?

You mean in our custom k8s module? I haven't looked, because I'd rather focus on a long-term solution. I don't know how easy it would be to tweak that module, but I won't be the one working on it.
Can the CoreDNS setup be moved to kubeadm? EDIT: this is currently skipped.
I don't think kubeadm can handle the flexibility we have, and I also don't know if it handles nodelocaldns.
If it's possible, that would be great, but I'm not hopeful.
@VannTen You are right.
So, this should be fixed in master by #10701. Alternatively, we could also tweak the current
What happened?
Upgraded Kubespray to v2.24.0, then applied the upgrade-cluster.yml playbook on a coredns-enabled Kubernetes cluster.
Checked issues regarding DNS and found report #10816
After examining the cluster configuration, I found that the coredns service has the configuration bug mentioned in that report:
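The service excerpt did not make it into this thread; a quick way to check a live cluster for the symptom (service and namespace names as used earlier in this thread) is roughly:

```bash
# Print name, protocol and targetPort for each port of the coredns Service.
# The buggy state shows the UDP "dns" port targeting the named TCP port "dns-tcp".
kubectl -n kube-system get svc coredns \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.protocol}{"\t"}{.targetPort}{"\n"}{end}'
```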
What did you expect to happen?
Cluster to be upgraded with no issue
How can we reproduce it (as minimally and precisely as possible)?
Set `dns_mode: coredns` in `group_vars/k8s_cluster/k8s-cluster.yml`, then install or upgrade a cluster.
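Based on the versions discussed above (built on release-2.23, upgraded to v2.24.1), an end-to-end reproduction sketch might look like this; the inventory path is a placeholder:

```bash
# Deploy a cluster from the release-2.23 branch with dns_mode: coredns.
git checkout release-2.23
ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml

# Upgrade using the v2.24.1 tag.
git checkout v2.24.1
ansible-playbook -i inventory/mycluster/hosts.yaml -b upgrade-cluster.yml

# Inspect the coredns Service afterwards; the UDP port's targetPort may end up
# as "dns-tcp" instead of "dns" (or 53).
kubectl -n kube-system get svc coredns -o yaml
```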
OS
MacOS X Sonoma
Version of Ansible
Version of Python
Python 3.11.7
Version of Kubespray (commit)
64447e7
Network plugin used
calico
Full inventory with variables
No response
Command used to invoke ansible
No response
Output of ansible run
No response
Anything else we need to know
No response