2 x DietPi (geographically apart) on v6.33.3 are halting almost daily - Rock64 #3939
Comments
Many thanks for your report. Did you explicitly enable persistent journald logs? Yours look as if they only cover the time since the latest boot. Note that timestamps might match those from before the crash/boot until network time sync corrects them. Please do the following to enable persistent journald logs:
|
Done:
Do you want me to update journalctl next time they halt, or run another dietpi-bugreport? Thanks a lot for your help! |
Yes that would be great. I'll have a look into the logs then. |
Ok so it halted again... this time I noticed different behavior of the red LED after enabling persistent journald logs. It kept blinking every second or so (after halting, I assume), then after a while it stopped blinking. Here is the bug report and journalctl attached. Thanks. Bug report sent, reference code: cc2fedee-0cff-4d44-a687-113ee8e59186 journalctl: |
Ah sorry, I forgot DietPi-RAMlog. For persistent journald logs this needs to be disabled, of course. Please do the following:
The reboot is required since the uninstall does not remove the tmpfs mount on /var/log directly (which would fail or break any service that currently writes to logs, like Pi-hole in your case) but prepares it to be done cleanly on reboot. Another thing I noticed is an obsolete dhcpcd, which does nothing but reapply the already static IP address over and over again. Luckily Pi-hole is about to remove its dependency on it. You should disable it: |
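Regarding the tmpfs mount on /var/log: after the reboot it may be worth confirming that the mount is really gone, since anything written to a tmpfs is lost on power-off. A minimal read-only sketch, assuming bash and awk are available; `mount_fstype` is a hypothetical helper name, not a DietPi command:

```shell
#!/bin/bash
# Hypothetical helper: print the filesystem type of a mount point.
# Reads mount table lines in /proc/mounts format from stdin:
#   <device> <mountpoint> <fstype> <options> <dump> <pass>
mount_fstype() {
    local target="$1"
    # Keep the last match, since a later mount shadows an earlier one
    awk -v t="$target" '$2 == t { fs = $3 } END { if (fs != "") print fs }'
}

# Usage on a live system (changes nothing):
#   mount_fstype /var/log < /proc/mounts
```

If this prints `tmpfs`, logs are still kept in RAM. If it prints nothing, /var/log is not a separate mount and simply lives on the root filesystem.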
reference code: cc2fedee-0cff-4d44-a687-113ee8e59186 journalctl |
The other SBC also halted, here is the bug report: reference code: 7a7e1557-0ead-4d6e-8c6b-50aa00c7217d |
Okay, I didn't find a good explanation why those systems crash, but here are some recommendations to start with:
And one thing to test, a little enhancement on ROCK64: On the first board, or on the second one after updating it to the latest Linux version, could you try replacing haveged with the hardware random generator daemon and see if this works fine?
|
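Before swapping haveged for a hardware RNG daemon, it can help to check whether the kernel exposes a hardware random generator at all. A small sketch, assuming the standard Linux hw_random sysfs file; `current_hwrng` is a hypothetical helper name, and the path is a parameter only so the function can be tried against a sample file:

```shell
#!/bin/bash
# Print the active hardware RNG driver, or "none" if the kernel reports none
# or the sysfs file is missing.
current_hwrng() {
    local f="${1:-/sys/class/misc/hw_random/rng_current}"
    local name=""
    if [ -r "$f" ]; then
        name=$(cat "$f")
    fi
    if [ -z "$name" ] || [ "$name" = "none" ]; then
        echo "none"
    else
        echo "$name"
    fi
}

# Usage:
#   current_hwrng
#   cat /proc/sys/kernel/random/entropy_avail   # entropy estimate reported by the kernel
```

If this reports `none`, a hardware RNG daemon has nothing to read from and haveged remains the safer choice.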
Thank you so much for your time on this one, I ran all suggested commands. Here are the results:
====== Rock64 # 1 ======
root@Rock64:~# dmesg | grep random
====== 2nd Rock64 ======
Reading state information... Done
E: Sub-process /usr/bin/dpkg returned an error code (1)
root@SBC2:~# dmesg | grep random
======================= |
Ok
Were the dmesg random outputs taken after installing rng-tools5 and rebooting? To be sure: |
Thank you very much.... ROCK64 #2
================================
===================================================== Thanks |
Okay, on
Then On
I noticed something strange after the last boot:
Setting the CPU governor worked before, but not on the latest boot. Can you retry this:
|
ROCK64 # 1
============================================= Yes, the 3 kernel packages installed successfully
################################
|
Okay, so far so good now that everything is up-to-date. Let's hope future reboots on
Good to know about the hardware generator. If you are in the mood, you could test an older rng-tools package (it would still be better than haveged):
Else revert to haveged:
Generally, keep an eye on CPU temperature and RAM usage around the times when the halts still happen:
The logs currently do not give any hint, it seems to halt without any previous error message or specific action 🤔. |
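For watching CPU temperature and RAM usage over time, a small logger can help, since the last entries before a halt are often the interesting ones. A rough sketch, assuming the CPU sensor is `thermal_zone0` (this varies between boards) and a kernel that reports `MemAvailable`; the function names and the log path are made up for illustration:

```shell
#!/bin/bash
# Convert a sysfs thermal reading (millidegrees Celsius) to whole degrees.
cpu_temp_c() {
    local f="${1:-/sys/class/thermal/thermal_zone0/temp}"  # assumed CPU zone
    local milli
    milli=$(cat "$f") || return 1
    echo $(( milli / 1000 ))
}

# Print MemAvailable from a /proc/meminfo-style file, in MiB.
mem_available_mib() {
    local f="${1:-/proc/meminfo}"
    awk '$1 == "MemAvailable:" { print int($2 / 1024) }' "$f"
}

# Hypothetical logging loop, one line per minute:
#   while sleep 60; do
#       echo "$(date '+%F %T') temp=$(cpu_temp_c)C avail=$(mem_available_mib)MiB"
#   done >> /root/halt-watch.log
```

The log file needs to live on persistent storage (not a tmpfs), otherwise it is gone after the reset.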
Rock64 # 1
Rock64 # 2
I decided to try to install wireguard on Rock64 # 2 and it seems like the service is not starting
Do you think a fresh OS re-install would be good at this point? |
The install process with
|
Server:
|
Never mind my last comment... I realized I still had OpenVPN installed. After running pivpn -u I managed to remove OpenVPN, rebooted, and WG works now. I will update next time either of them halts. Thanks |
I think we have made some progress.....this is already a record (Rock64#1 - The one that halted more often)
I'll keep an eye on them, if # 1 goes beyond 48 hours, that will be a great improvement. Will continue to update this thread. |
That is great. What do RAM usage and CPU temperature say? |
|
Looks like this little swap file does not have much purpose. By default, swap files <100 MiB are not created when auto-estimating the size, but yours is slightly larger now (mem + swap sum up to 2048 MiB = 2 GiB, which is the auto-size goal). However, with that much free memory, I'd simplify things: |
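To make the sizing rule from the comment above concrete: the goal is RAM + swap = 2048 MiB, and a swap file below 100 MiB is skipped. The following is only a sketch of that rule as described here, not DietPi's actual code, and `auto_swap_mib` is a hypothetical name:

```shell
#!/bin/bash
# Estimate the swap size in MiB for a given amount of usable RAM in MiB.
# Rule as described in this thread: RAM + swap should reach 2048 MiB,
# and swap files smaller than 100 MiB are not created at all.
auto_swap_mib() {
    local ram_mib="$1"
    local swap=$(( 2048 - ram_mib ))
    if [ "$swap" -lt 100 ]; then
        echo 0
    else
        echo "$swap"
    fi
}

# Examples:
#   auto_swap_mib 1990   # 58 MiB would be needed, below the threshold
#   auto_swap_mib 1900   # 148 MiB, just above the threshold
```

This also shows why a swap file only slightly above 100 MiB can appear on a 2 GiB board: the usable RAM reported by the kernel is a bit less than 2048 MiB, and a small change in that value moves the result across the threshold.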
So far so good.... I don't think I've had 2 days solid. What you suggested seems to have done the trick
Also thanks for your memory suggestion, I ran the command
I also left a volunteering note on 6.34 thread as an appreciation to your time and suggestions. Thanks |
Ok so here is today's update:
Rock64 # 2 - Halted this morning, realized that around 7 am, rebooted it and it halted again a couple hours later, here is the bug report reference code: 7a7e1557-0ead-4d6e-8c6b-50aa00c7217d Thanks |
The machine has a few obsolete package config files left:
EDIT: Ah wait, do you actively use haveged failed again 🤔:
Again, no CPU governor was applied during boot. It seems that the related /sys files only get created after the service has started, which is strange:
Ah, and finally we have something relevant:
Interestingly, the same kernel error repeats exactly every three minutes (07:49:00, then 07:52:00 etc). In case you wonder, I think the timestamps are off because of the different time zone here, but you should be able to find those entries as well:
During boot up there is also some error I'm not happy with:
and
I'll have a look at those. Due to time sync, messages from your current bootup and the errors prior to the crash got mixed up, while I think the error did not appear again after reboot. I'll also try to find out something about this. |
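The three-minute pattern mentioned above can be checked mechanically by computing the gaps between the timestamps of matching log lines. A minimal sketch, assuming plain HH:MM:SS timestamps that all fall on the same day; `timestamp_gaps` is a hypothetical helper and `PATTERN` stands for whatever text identifies the kernel error:

```shell
#!/bin/bash
# Read HH:MM:SS timestamps (one per line) from stdin and print the gap in
# seconds between each entry and the previous one.
timestamp_gaps() {
    awk -F: '{
        t = $1 * 3600 + $2 * 60 + $3
        if (NR > 1) print t - prev
        prev = t
    }'
}

# Hypothetical usage with the previous boot's kernel log (needs persistent
# journald logs; the third field of the default output is the time of day):
#   journalctl -k -b -1 --no-pager | grep 'PATTERN' | awk '{ print $3 }' | timestamp_gaps
```

A steady run of `180` would confirm the three-minute interval; irregular gaps would suggest the messages are not timer-driven.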
Thanks. I have purged as suggested.
Thanks for your help so far, I will capture bugs if it halts again and will be pending on anything you can find. Thanks a million! |
Update: Rock64 # 1
Rock64 # 2
At this point I feel very confident about the issues being resolved, I will continue to pay attention to it, but running more than 2 days it's something that I never experienced since initially flashing images to these. Thanks a million for all the help provided so far, you rock! |
Many thanks for the kind feedback, though I was a bit distracted by getting DietPi v6.34 ready, so I didn't do research about the error messages yet. Will do that soon. |
Rock 64 # 1 has been stable, no recent crashes.
Rock 64 # 2 halted twice this morning, see logs below:
|
Do you run some service(s) or cron job with raised nice/priority levels or real-time scheduler (round-robin or first-in-first-out)? https://stackoverflow.com/a/35403677
The crash occurred with the same error. I now noticed something else: after the first part with the call trace (shown above) had repeated a few times, exactly every three minutes, another error came on top a few seconds later:
Then some additional error lines came on top:
Again, a higher priority/real-time scheduled process seems to be a typical reason for such errors: https://unix.stackexchange.com/questions/252045 |
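To answer the question about raised priorities without guessing, the process list can be filtered for real-time scheduling classes and negative nice values. A sketch assuming the procps `ps` shipped with Debian; `filter_elevated` is a hypothetical helper name:

```shell
#!/bin/bash
# Filter the output of `ps -eo pid,cls,rtprio,ni,comm`:
# keep the header, processes in the FIFO (FF) or round-robin (RR) class,
# and processes with a negative nice value.
filter_elevated() {
    awk 'NR == 1 { print; next }
         $2 == "FF" || $2 == "RR" || $4 ~ /^-[0-9]+$/ { print }'
}

# Usage:
#   ps -eo pid,cls,rtprio,ni,comm | filter_elevated
```

Some kernel threads (for example `migration/N`) legitimately run with real-time priority, so only unexpected userspace entries are worth a closer look.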
No, I have not raised any nice priority or use any real time scheduler. Crontab -l only shows this
I was thinking if I should go for an OS reinstall at this point, thoughts? |
I think so. Let me take the chance to update our image first, it is more than half a year old. |
Okay, if you reinstall DietPi, please try the new image: https://dietpi.com/downloads/images/testing/DietPi_ROCK64-ARMv8-Buster.7z |
Right on, I will try the new image now. I will let you know if any issues arise. On the other hand, Rock64 # 1 has been pretty stable:
|
Dammit, the issue persists with the new image... I'll give you instructions tomorrow, if you don't figure it out yourself first. |
I would like to report that after updating to v6.34.3 both Rock64s are fully functional and no longer halting / crashing. Thank you very much for your support and patience. Happy holidays to you and the DietPi team, you guys Rock!(64) lol 🥇 |
Great to hear, let's hope that it's finally persistent. Enjoy your Christmas/Holidays. |
Creating a bug report/issue
root@Rock64:~# dietpi-bugreport
[ INFO ] DietPi-Bugreport | Packing upload archive, please wait...
[ OK ] DietPi-Bugreport | Checking URL: ssh.dietpi.com
[ OK ] DietPi-Bugreport | Bug report sent, reference code: cc2fedee-0cff-4d44-a687-113ee8e59186
Required Information
root@Rock64:~# cat /boot/dietpi/.version
G_DIETPI_VERSION_CORE=6
G_DIETPI_VERSION_SUB=33
G_DIETPI_VERSION_RC=3
G_GITBRANCH='master'
G_GITOWNER='MichaIng'
root@Rock64:~# cat /etc/debian_version
10.6
root@Rock64:~# uname -a
Linux Rock64 5.8.17-rockchip64 #20.08.21 SMP PREEMPT Sat Oct 31 08:22:59 CET 2020 aarch64 GNU/Linux
root@Rock64:~# echo $G_HW_MODEL_NAME
ROCK64 (aarch64)
Power supply used: Stock 5V 3000mA
Additional Information (if applicable)
cc2fedee-0cff-4d44-a687-113ee8e59186
Steps to reproduce
Expected behaviour
Actual behaviour
Extra details
journalctl details of the times it halts and the moment I reset it right after