Config map errors in kubelet after 365 days of operation #2937
Some comments for whoever will look into this further: we're likely either not restarting the … We may want to disable the latter completely (assuming it's enabled today), and solely rely on Salt-provisioned certs. Also note …
Just wanted to add some additional info here. Nov 5th: the certs were updated, state.highstate was applied, and we then restarted kubelet on those servers.
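For reference, a minimal sketch of those Nov 5th steps as shell commands; any saltenv/pillar arguments the MetalK8s highstate may require aren't shown in this thread, so treat this as an approximation:

```sh
# Re-apply the Salt highstate (renews the Salt-managed certs),
# then restart kubelet so it picks up the renewed certificates.
# saltenv/pillar arguments, if any, are omitted here.
salt-call state.highstate
systemctl restart kubelet
systemctl is-active kubelet
```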
Nov 11th-12th: three servers ended up in a NotReady state. Here is the log info from November 12th on the third server, a short time before the restart of kubelet. These messages were also on the first two servers, but with a lot of other logging between them and the actual restart of kubelet, since the hang had happened the day before:
Six minutes later, kubelet was restarted by the customer (HOSTNAME - Restarted kubelet):
If additional info is needed from the 'journalctl -u kubelet' output that was dumped to a file on each of the three servers, please let me know.
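For completeness, a hedged example of the kind of per-server log dump referred to above; the date range (and year) is an assumption based on the Nov 5th-12th timeline:

```sh
# Dump kubelet logs for the window of interest to one file per host.
# The --since/--until values below are placeholders for the actual incident window.
journalctl -u kubelet --since "2020-11-05" --until "2020-11-13" -o short-iso \
  > "kubelet-$(hostname -s).log"
```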
That … So, what could be the case is that … So, we either need to …
Approach 1 above is more in line with other certificates we manage throughout the system, and likely 'easier' to implement. Approach 2 would be useful if at some point we move away from Salt managing all kinds of (client) certs, including, e.g., those used in a Calico client config file, and instead use provisioning through …
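As a quick aside, one hedged way to check whether kubelet's built-in client certificate rotation (the mechanism behind approach 2) is even enabled on a node; the config file path below is an assumption based on kubeadm-style defaults and may differ in MetalK8s:

```sh
# Look for a --rotate-certificates flag in the kubelet unit and its drop-ins...
systemctl cat kubelet | grep -i rotate

# ...and for rotateCertificates in the kubelet config file (path is an assumption).
sudo grep -i rotatecertificates /var/lib/kubelet/config.yaml
```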
I did some tests; as @NicolasT already said, kubelet is generating CSRs, but these requests are simply ignored by the controller manager because kubelet does not have sufficient rights.
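To illustrate (not as the agreed fix, given the reservations below): the CSRs can be listed directly, and one known way to give node client certificate renewals an approver is to bind Kubernetes' stock auto-approver ClusterRole to the system:nodes group. The binding name here is arbitrary:

```sh
# CSRs submitted by kubelet stay Pending when nothing is entitled to approve them.
kubectl get csr

# Bind the built-in auto-approver ClusterRole for node client cert *renewals*
# to the system:nodes group (binding name is arbitrary).
kubectl create clusterrolebinding auto-approve-node-cert-renewals \
  --clusterrole=system:certificates.k8s.io:certificatesigningrequests:selfnodeclient \
  --group=system:nodes
```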
@alexandre-allard-scality Well, I'm not entirely convinced that what you suggest as a fix is definitely the way to go: indeed, this would make …
@NicolasT AFAICT, there is nothing to maintain, and for the monitoring part it's more or less the same as what we could do for other certificates (something checking the expiration date of the certificate, whether we are managing this cert or not).
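A minimal sketch of the kind of expiration check described here, assuming the certificate path is known (the path below is a placeholder):

```sh
# Fail (non-zero exit) if the certificate expires within the next 30 days.
CERT=/path/to/kubelet-client.pem   # placeholder path
if ! sudo openssl x509 -noout -checkend $((30*24*3600)) -in "$CERT"; then
  echo "WARNING: $CERT expires within 30 days" >&2
fi
```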
Certificate rotation is handled by a Salt beacon, so we don't need this feature. Refs: #2937
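For anyone verifying this on a node, a hedged check that a certificate-watching beacon is actually configured on the minion (the beacon's exact name isn't given in this thread):

```sh
# List the beacons configured on the local minion; a certificate-watching
# beacon should appear here once the fix is deployed.
salt-call beacons.list
```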
Kubelet, MetalK8s version 2.4.0:
Nodes went into "NotReady" state with kubelet logs indicating problems with configmaps:
The issue started exactly 365 days after the system was installed. Just the prior week we had run 'salt-call state.highstate' on the servers to fix some certs (Calico) that were due to expire at the 365-day mark.
A restart of the kubelet process fixed the issue. We were not able to capture how long the kubelet process itself had been running before we restarted it.
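For future occurrences, a couple of hedged ways to capture kubelet's uptime before restarting it:

```sh
# When the kubelet unit last (re)entered the active state...
systemctl show kubelet --property=ActiveEnterTimestamp

# ...and the elapsed running time of the kubelet process itself.
ps -o etime= -p "$(pidof kubelet)"
```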