Unable to start k3s service in the master node #4187
Based on all the messages reporting 10+ second datastore operations, it would appear that your node has poor IO latency:
You didn't fill out the Bug Report issue template, so I don't have any information about what your cluster or nodes look like. What else is going on with this node? Do you need to do an fsck, wait for a RAID scrub to complete, etc.? Is it possible your disk was damaged during the outage, or is it perhaps just not able to keep up with the load of both starting workload pods and servicing the datastore?
Hi Brandond, many thanks for your quick answer and your tip. Poor performance, interesting; I'll check the SD card. Since it was an outage, perhaps it's performing a deep check. What else should I check? I didn't fill out the Bug Report because I don't think it's a bug; the cluster was working fine for more than 200 days. Best, i5Js
SD cards are pretty easy to burn out with the sort of repetitive writes that the Kubernetes datastore does. As they fill up and/or run out of write cycles, the performance dips as they try to find fresh cells that can be written to. In general I recommend against using them for anything critical. If this is a Raspberry Pi, I usually recommend picking up a USB SATA or NVMe device to use for storage on the server node.
Here is the Bug Report: Environmental Info:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration: 2 master nodes. After a power outage, the master nodes can't start the k3s service, with the logs attached in the first post.
Expected behavior: The k3s service should start.
Actual behavior: The k3s service is trying to restart all the time.
Backporting:
Thanks for your quick comment, Brandond. The datastores are mounted via NFS on another server, which uses an NVMe drive:
Can you test your SD card as described here? #2903 (comment)
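(Not necessarily the exact invocation from that comment, but a generic fio fsync-latency check looks roughly like the sketch below; the directory is a placeholder and should point at a path on the SD card.)

```
# Rough sync-write latency check against the SD card.
# Small writes with an fdatasync after each one approximate the commit
# pattern of the Kubernetes datastore; watch the reported fsync latencies.
mkdir -p /var/lib/rancher/fio-test   # any directory on the SD card will do
fio --name=fsync-test --directory=/var/lib/rancher/fio-test \
    --rw=write --ioengine=sync --fdatasync=1 --bs=4k --size=64m
rm -rf /var/lib/rancher/fio-test
```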
I'll also note that I've not had great results with datastores hosted on NFS. If you're going to host the datastore on another server, you might get better results from running mysql or postgres on that node and pointing at that via --datastore-endpoint, as opposed to just keeping the sqlite or etcd data on an NFS volume.
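For illustration, pointing the server at an external MySQL datastore would look roughly like this; the hostname, credentials, and database name below are placeholders, not values from this cluster:

```
# Run the k3s server against an external MySQL database instead of the
# embedded sqlite/etcd datastore. All connection details are placeholders.
k3s server \
  --datastore-endpoint='mysql://k3suser:changeme@tcp(nfs-server.local:3306)/k3s'
```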
I'm unable to install fio :(
The MySQL server is on the NFS server too.
Yes, but the difference would be that the disk is local to the server when using MySQL on that server, as opposed to running a database on top of NFS.
I'm sorry, but I don't understand your last comment. Anyway, I was trying a different test yesterday with one of the masters, and after uninstalling the current version and installing the latest one, the service started successfully, although there were a lot of certificate issues with the worker nodes, so I guess it might not be a database issue.
I've tested the card... it seems it's time to buy a new one or reinstall it:
I'll try on the other master.
The other master seems better, but...
I've performed an fsck on one of the masters; after that:
I'm going to try to install it again.
Same:
I've installed the new version on master1, and after a lot of troubleshooting, I have all the pods working. Honestly, I don't know what the problem was, but it's working now. @brandond, just decide if you want to keep this ticket open. Cheers.
I have seen Avahi trying to attach itself to all the pod interfaces cause really bad behavior. Could you try disabling/uninstalling avahi?
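On a systemd-based distro that would typically be something like the following; unit and package names can vary by distribution:

```
# Stop Avahi now and keep it from starting on boot, so it no longer
# binds to the pod interfaces.
sudo systemctl disable --now avahi-daemon.service avahi-daemon.socket

# Or remove it entirely on Debian / Raspberry Pi OS:
# sudo apt-get remove avahi-daemon
```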
Sure thing @brandond, I'll give it a try. Many thanks
Same issue here. What kind of logs should I share?
@mglants have you tried any of the several things suggested in the comments above?
@brandond Solved it by vacuuming the database. My SQLite WAL file was over 80 GB (lol); after vacuuming it sits at 1 GB. So read operations were timing out because the DB was struggling to handle selects against 80 GB.
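For anyone hitting the same thing, the vacuum was roughly the following; the path below is the default k3s sqlite location (adjust if yours differs), and k3s should be stopped while it runs:

```
# Stop k3s so nothing writes to the datastore while it is compacted.
sudo systemctl stop k3s

# VACUUM rewrites the database file; with k3s stopped, the oversized
# WAL file also gets checkpointed and shrunk.
sudo sqlite3 /var/lib/rancher/k3s/server/db/state.db 'VACUUM;'

sudo systemctl start k3s
```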
That sounds like #4044 |
@jdbogdan Your logs are truncated, both in line length and number of lines. Please run
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Hi,
After a power outage, I've realized my master nodes can't boot up properly. Please, could you help me?
It's a Raspberry Pi cluster running v1.20.4+k3s1, with two master nodes and 5 workers.