talos_cluster_health fails, while talosctl health is fine #153
Comments
I have the same issue when using
Almost the same issue here. The Terraform data source does not find the cluster healthy, while the command talosctl health does. The error is followed by the list of the private IPs for the worker nodes.
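For reference, a minimal sketch of how the data source is typically wired up with the siderolabs/talos provider (the resource names and IPs below are placeholders, not taken from this issue):

```hcl
# Minimal sketch, assuming the siderolabs/talos provider.
# Node IPs and resource names are hypothetical placeholders.
data "talos_cluster_health" "this" {
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = ["10.1.0.2"]                         # endpoints reachable from where Terraform runs
  control_plane_nodes  = ["10.1.0.2", "10.1.0.6", "10.1.0.7"] # private control plane IPs
  worker_nodes         = ["10.1.0.3"]                         # private worker IPs
}
```

The health check compares what the cluster reports (etcd members, registered k8s nodes) against these configured IP lists, which is presumably where the mismatch with public IPs comes from.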
I have the same issue. In my case, I'm using a Cilium CNI on a KVM/Libvirt lab. talosctl health works:
I can also see the
Seems to be related to the provider only. I can try to take a look in ~2 weeks, but for now this is the info I could get.
A small but important update: as soon as I changed
Hi,
First problem:
│ waiting for etcd members to be control plane nodes: etcd member ips ["10.1.0.6" "XX.75.176.68" "10.1.0.2"] are not subset of control plane node ips ["10.1.0.2" "10.1.0.6" "10.1.0.7"]
I added advertisedSubnets set to the internal CIDR.
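For context, the change amounts to a machine config patch like the following minimal sketch; the 10.1.0.0/24 subnet is my assumption of the internal CIDR based on the node IPs above, and the referenced resource names are placeholders:

```hcl
# Minimal sketch, assuming the siderolabs/talos provider; resource names, node IP,
# and the 10.1.0.0/24 subnet are assumptions — adjust to the actual private network.
resource "talos_machine_configuration_apply" "controlplane" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = "10.1.0.2"
  config_patches = [
    <<-EOT
      cluster:
        etcd:
          advertisedSubnets:
            - 10.1.0.0/24
    EOT
  ]
}
```

With this patch, etcd advertises only addresses from the internal subnet, so the etcd member IPs match the configured control plane node IPs.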
Now etcd is OK, but there is an unexpected k8s node:
│ waiting for all k8s nodes to report: can't find expected node with IPs ["10.1.0.3"]
│ waiting for all k8s nodes to report: unexpected nodes with IPs ["XX.75.176.68"]
(I reduced the number of nodes.)
But when I check this with talosctl:
talosctl -n 10.1.0.3 -e xx.13.164.153 health
discovered nodes: ["10.1.0.3" "xx.75.176.68"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: SKIP
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK
Or with the public control plane IP:
talosctl -n xx.13.164.153 -e xx.13.164.153 health
discovered nodes: ["10.1.0.3" "xx.75.176.68"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: SKIP
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK
So what is the problem?