coredns UDP service port points to TCP pod port when upgrading from 23.* #10860
Comments
We are running into the same issue. The issue seems to have been introduced in #10617, but the problem does not originate there. There is an underlying problem that leads to a corrupted coredns service during the upgrade process. From what we gathered, the process looks like this:
The theory here is that kubeadm patches the coredns service, which leaves it corrupted until kubespray overwrites it again. There might be some odd YAML merging going on which produces the corrupted result. The corrupted service contains ports that look like this:
ports:
- name: dns
  port: 53
  protocol: UDP
  targetPort: dns-tcp # Here is the issue
- name: dns-tcp
  port: 53
  protocol: TCP
  targetPort: 53
- name: metrics
  port: 9153
  protocol: TCP
  targetPort: 9153

Our quick fix was removing the targetPorts from the coredns svc template.
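For anyone hitting this on a live cluster, a minimal sketch of repairing the Service in place (rather than editing the template) might look like the following; it assumes the UDP "dns" entry is the first item in `.spec.ports`, as in the excerpt above:

```bash
# Point the UDP port back at the pod's UDP port. The named port "dns" matches
# the intent of the template; a plain 53 would also work.
# The index 0 assumes the UDP "dns" entry is listed first, as shown above.
kubectl -n kube-system patch svc coredns --type=json \
  -p='[{"op": "replace", "path": "/spec/ports/0/targetPort", "value": "dns"}]'
```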
@simon-wessel Thanks for your fast reply. I have two k8s clusters, one of which is the test one.
Hello again
As you said, it is a quick fix and all is working as expected. This workaround is suitable as there is no cluster interruption during the upgrade process. Here is a part of
@simon-wessel do you have the logs of the kubeadm invocation? (Referenced: kubespray/roles/kubespray-defaults/defaults/main/main.yml, lines 30 to 45 at dce68e6.)
Do I get your meaning correctly that the error is transient (during the upgrade)? That would explain why it was not caught by CI. Note: this was exposed by the cleanup here (I think): #10695, in addition to the dual coredns stuff.
I could not reproduce this on top of master by creating a new cluster and then upgrading with upgrade-cluster.yml.
@VannTen Hello, I will try to test it this evening on my test cluster.
@VannTen Hello, I have checked with master. No, it's still broken for the coredns setup.
(because I haven't been able to reproduce it so far)
@VannTen please find my configuration attached. The command I used to upgrade the cluster is as follows:
You're using a seven-node etcd cluster whose members are also the cluster nodes? Is that intended?
Yes, that is intended. I wanted to increase the redundancy of the cluster that way, as I have only two master nodes.
Not to my knowledge, that's just a bit unusual^.
Just read the documentation on this. I will rebuild my cluster so it has 3 etcd nodes.
I can confirm the bug on a cluster with 3 control planes and 3 worker nodes, also deployed with `dns_mode: coredns`. I can also confirm the "workaround" of deleting the targetPorts from the coredns svc template. But I don't have a clue what is causing this after poking around in kubespray a bit...
On which version? Can you provide an inventory and a more precise description?
For instance, I just tried running upgrade-cluster.yml on a fresh cluster while running the following on the first master:
(the jq expression checks that we get the correct ports on each change)
# kubectl get svc -n kube-system coredns -o json -w | jq '.spec.ports | [if(map((.protocol == "UDP" and .targetPort == "dns") or (.protocol == "TCP" and .targetPort == "dns-tcp") or (.name == "metrics")) | all) then "good" else . end, (now |strftime("%H:%M:%S"))]'
[
"good",
"11:14:59"
]
As you can see, it does not report any changes.
This was with a cluster built on master using Vagrant, then with upgrade-cluster.yml on the same branch.
So if you have the problem, what's the origin version (the tag on which the cluster was built/last upgraded) and the destination (the tag on which upgrade-cluster.yml triggers that bug)?
I upgraded from the release-2.23 branch to the recently released v2.24.1 tag.
I'll try with that, see if I can finally get a reproducer, thanks 👍
So the transition (which isn't entirely complete) goes something like this:
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": 53
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:20:59"
]
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": "dns"
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:21:57"
]
[
[
{
"name": "dns",
"port": 53,
"protocol": "UDP",
"targetPort": "dns-tcp"
},
{
"name": "dns-tcp",
"port": 53,
"protocol": "TCP",
"targetPort": 53
},
{
"name": "metrics",
"port": 9153,
"protocol": "TCP",
"targetPort": 9153
}
],
"13:32:21"
]

The only problematic one is the last one. I guess the patching logic does not work as planned 🤔
Apparently, this also happens when using `--tags coredns` (so kubeadm is out, IMO). I think this has something to do with the way kubectl does its apply.
We're probably hitting something like this: kubernetes/kubernetes#39188 (comment)
Another potentially relevant link: https://ben-lab.github.io/kubernetes-UDP-TCP-bug-same-port/
TL;DR: kubectl apply does the wrong thing because it does not use the correct key to merge the array (I understand it uses port, which is `53` in both cases, instead of `name`).
Apparently server-side apply might fix the problem.
So I guess I should revisit #10701 ...
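As a rough sketch of the idea (not the exact code path kubespray takes): with the default client-side apply, the strategic merge patch for `Service.spec.ports` cannot tell the two port-53 entries apart, while server-side apply lets the API server do the merge. Here `coredns-svc.yml` is a placeholder for the rendered coredns Service manifest:

```bash
# Default client-side apply: kubectl computes a strategic merge patch locally.
# Both DNS entries share port 53, so the change of targetPort can be merged onto
# the wrong entry, leaving the UDP port pointing at "dns-tcp".
kubectl apply -f coredns-svc.yml

# Server-side apply: the API server performs the merge itself, which avoids the
# mix-up described above.
kubectl apply --server-side -f coredns-svc.yml
```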
Using server-side apply **does** fix the issue. I'm working on a PR (mentioned above), but it needs to refactor some stuff in bootstrap-os and kubernetes/preinstall.
We're also seeing this, but one of our clusters ended up stuck in a broken state, with the coredns service setting the UDP TargetPort to dns-tcp. Here's how we found it:
Notice that the TargetPort is dns-tcp/UDP and the Endpoints are all empty. We had to manually correct this.
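The inspection output itself is not reproduced in this thread; checking for the same symptom would look roughly like this:

```bash
# Show the Service's ports; in the broken state the UDP "dns" port targets the
# named TCP port "dns-tcp".
kubectl -n kube-system describe svc coredns

# If the UDP targetPort points at a TCP container port, the corresponding
# endpoint ports cannot be resolved and show up empty.
kubectl -n kube-system get endpoints coredns
```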
This is the workaround, as suggested by @amogilny.
Using `kubectl apply -f <coredns manifest> --server-side=true` (check the flag name, I'm writing this from memory) should also work; it's essentially what the fix I'm working on does.
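For reference, the boolean flag is `--server-side` (so `--server-side=true` also works); something like:

```bash
# Depending on which field managers already own the ports, --force-conflicts
# may be needed so server-side apply can take ownership of those fields.
kubectl apply --server-side --force-conflicts -f <coredns manifest>
```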
I think I have a different problem: I removed the target ports from the live manifest and it's still in a crash loop. Also I'm using
For some reason, coredns and nodelocaldns pods are in CrashLoopBackOff, and there are no logs as to why. Also I
What is the status of this issue? It is very annoying and causes a major impact on cluster upgrades. There is #10701, which would be a good long-term solution, but what about using server-side apply as a short-term one?
I'm working on the last prep PR for #10701 (that PR itself is not complicated; getting to the state where we can easily use the kubernetes.core.k8s module is).

> but what about using server-side apply as a short-term one?

You mean in our custom k8s module? I haven't looked, because I'd rather focus on a long-term solution. I don't know how easy it would be to tweak that module, but I won't be the one working on it.
Can the CoreDNS setup be moved to kubeadm? EDIT: this is currently skipped.
I don't think kubeadm can handle the flexibility we have, and I also don't know if it handles nodelocaldns.
If it's possible, that would be great, but I'm not hopeful.
@VannTen You are right.
So, this should be fixed in master by #10701. Alternatively, we could also tweak the current
What happened?
Upgraded Kubespray to v2.24.0, then applied the upgrade-cluster.yml playbook on a coredns-enabled Kubernetes cluster.
Checked issues regarding DNS and found report #10816
After examining the cluster configuration, I found that the coredns service has the configuration bug mentioned in that report:
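The service excerpt did not make it into this thread; a quick way to check a live cluster for the symptom (service and namespace names as used earlier in this thread) is roughly:

```bash
# Print name, protocol and targetPort for each port of the coredns Service.
# The buggy state shows the UDP "dns" port targeting the named TCP port "dns-tcp".
kubectl -n kube-system get svc coredns \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.protocol}{"\t"}{.targetPort}{"\n"}{end}'
```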
What did you expect to happen?
Cluster to be upgraded with no issue
How can we reproduce it (as minimally and precisely as possible)?
Set `dns_mode: coredns` in `group_vars/k8s_cluster/k8s-cluster.yml`, then install or upgrade a cluster.
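Based on the versions discussed above (built on release-2.23, upgraded to v2.24.1), an end-to-end reproduction sketch might look like this; the inventory path is a placeholder:

```bash
# Deploy a cluster from the release-2.23 branch with dns_mode: coredns.
git checkout release-2.23
ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml

# Upgrade using the v2.24.1 tag.
git checkout v2.24.1
ansible-playbook -i inventory/mycluster/hosts.yaml -b upgrade-cluster.yml

# Inspect the coredns Service afterwards; the UDP port's targetPort may end up
# as "dns-tcp" instead of "dns" (or 53).
kubectl -n kube-system get svc coredns -o yaml
```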
OS
MacOS X Sonoma
Version of Ansible
Version of Python
Python 3.11.7
Version of Kubespray (commit)
64447e7
Network plugin used
calico
Full inventory with variables
No response
Command used to invoke ansible
No response
Output of ansible run
No response
Anything else we need to know
No response