2 x DietPi (geographically apart) on v6.33.3 are halting almost daily - Rock64 #3939

Closed
arpegius5555 opened this issue Dec 1, 2020 · 37 comments

Comments

@arpegius5555

arpegius5555 commented Dec 1, 2020

Creating a bug report/issue

root@Rock64:~# dietpi-bugreport
[ INFO ] DietPi-Bugreport | Packing upload archive, please wait...
[ OK ] DietPi-Bugreport | Checking URL: ssh.dietpi.com
[ OK ] DietPi-Bugreport | Bug report sent, reference code: cc2fedee-0cff-4d44-a687-113ee8e59186

Required Information

root@Rock64:~# cat /boot/dietpi/.version
G_DIETPI_VERSION_CORE=6
G_DIETPI_VERSION_SUB=33
G_DIETPI_VERSION_RC=3
G_GITBRANCH='master'
G_GITOWNER='MichaIng'

root@Rock64:~# cat /etc/debian_version
10.6

root@Rock64:~# uname -a
Linux Rock64 5.8.17-rockchip64 #20.08.21 SMP PREEMPT Sat Oct 31 08:22:59 CET 2020 aarch64 GNU/Linux

root@Rock64:~# echo $G_HW_MODEL_NAME
ROCK64 (aarch64)

Power supply used: Stock 5V 3000mA

  • SDcard used: None - using eMMC

Additional Information (if applicable)

  • Software title | (EG: Nextcloud) PiVPN, PiHole, Webmin, NFS
  • Was the software title installed freshly or updated/migrated?
  • Can this issue be replicated on a fresh installation of DietPi? Yes, it is happening on two SBCs that are geographically apart. One was freshly installed 2 weeks after the other.

cc2fedee-0cff-4d44-a687-113ee8e59186

Steps to reproduce

  1. Boot and let it idle. It halts after a few hours and never runs longer than 8 to 12 hours.

Expected behaviour

  • Remain active and working.

Actual behaviour

  • ...it halts totally, doesn't respond to SSH, VPN or ping, and the red LED stops flashing

Extra details

journalctl details of the times it halts and the moment I reset it right after

Nov 22 14:17:04 Rock64 kernel: rockchip-drm display-subsystem: [drm] Cannot find any crtc or sizes
Nov 22 14:17:04 Rock64 systemd-timesyncd[645]: System clock time unset or jumped backwards, restoring from recorded timestamp: Sun 2020-11-22 18:13:12 EST
Nov 22 18:13:12 Rock64 systemd[1]: Starting Clean php session files...

Nov 22 18:13:16 Rock64 dhcpcd[454]: eth0: soliciting an IPv6 router
Nov 22 18:13:17 Rock64 DietPi-Boot[567]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (5/60)
Nov 23 23:34:45 Rock64 systemd-timesyncd[645]: Synchronized to time server for the first time 199.182.221.110:123 (0.debian.pool.ntp.org).

Nov 26 14:17:09 Rock64 systemd[1]: Mounted /mnt/hiddenfoldername
Nov 27 11:37:16 Rock64 systemd-timesyncd[658]: Synchronized to time server for the first time 216.232.132.102:123 (0.debian.pool.ntp.org).

Nov 29 23:17:09 Rock64 DietPi-Boot[612]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (5/60)
Nov 30 19:50:08 Rock64 systemd-timesyncd[672]: Synchronized to time server for the first time 138.197.135.239:123 (0.debian.pool.ntp.org).
@MichaIng
Owner

MichaIng commented Dec 1, 2020

Many thanks for your report. Did you enable persistent journald logs explicitly? Yours look like they only cover the time from the last boot onwards. Note that timestamps may still match those from before the crash/boot until network time sync corrects them.

Please do the following to enable persistent journald logs:

mkdir /var/log/journal

@arpegius5555
Author

arpegius5555 commented Dec 1, 2020

Done:

root@Rock64:~# mkdir /var/log/journal
root@Rock64:~#

Do you want me to update journalctl next time they halt, or run another dietpi-bugreport?

Thanks a lot for your help!

@MichaIng
Owner

MichaIng commented Dec 2, 2020

Do you want me to update journalctl next time they halt, or run another dietpi-bugreport?

Yes that would be great. I'll have a look into the logs then.

@arpegius5555
Author

arpegius5555 commented Dec 3, 2020

Ok, so it halted again... This time I noticed a different behavior of the red LED after enabling persistent journald logs. It kept on blinking every second or so (after halting, I assume), then after a while it stopped blinking. Here is the bug report and journalctl attached. Thanks.

Bug report sent, reference code: cc2fedee-0cff-4d44-a687-113ee8e59186

journalctl:
Dec 02 04:17:04 Rock64 DietPi-Boot[602]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (1/60)
Dec 02 04:17:04 Rock64 kernel: sda: sda1
Dec 02 04:17:04 Rock64 kernel: sd 0:0:0:0: [sda] Attached SCSI disk
Dec 02 04:17:05 Rock64 DietPi-Boot[602]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (2/60)
Dec 02 04:17:06 Rock64 DietPi-Boot[602]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (3/60)
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: carrier acquired
Dec 02 04:17:07 Rock64 kernel: rk_gmac-dwmac ff540000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: IAID 13:14:d3:38
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: using static address 192.168.17.111/24
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: adding route to 192.168.17.0/24
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: adding default route via 192.168.17.1
Dec 02 04:17:07 Rock64 dhcpcd[462]: eth0: soliciting an IPv6 router
Dec 02 04:17:07 Rock64 DietPi-Boot[602]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (4/60)
Dec 02 04:17:08 Rock64 DietPi-Boot[602]: [ INFO ] DietPi-Run_NTPD | Waiting for completion of systemd-timesyncd (5/60)
Dec 03 00:02:21 Rock64 systemd-timesyncd[662]: Synchronized to time server for the first time 159.203.8.72:123 (0.debian.pool.ntp.org).
Dec 03 00:02:21 Rock64 systemd[1]: Starting Clean php session files...
Dec 03 00:02:21 Rock64 DietPi-Boot[602]: [ OK ] DietPi-Run_NTPD | systemd-timesyncd synced
Dec 03 00:02:21 Rock64 systemd[1]: Stopping Network Time Synchronization...
Dec 03 00:02:21 Rock64 systemd[1]: systemd-timesyncd.service: Succeeded.
Dec 03 00:02:21 Rock64 systemd[1]: Stopped Network Time Synchronization.

@MichaIng
Owner

MichaIng commented Dec 3, 2020

Ah sorry I forgot DietPi-RAMlog. For persistent journald log this needs to be disabled of course. Please do the following:

dietpi-software uninstall 103
mkdir /var/log/journal
reboot

The reboot is required since the uninstall does not remove the tmpfs mount on /var/log directly (which would fail or break any service that currently writes to logs, like Pi-hole in your case) but prepares it to be done cleanly on reboot.
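Whether /var/log still sits on a RAM-backed filesystem can be checked before relying on the journal. This is only a minimal sketch: `is_volatile_fstype` is a hypothetical helper and not part of DietPi, and it assumes `findmnt` from util-linux is available on the board.

```shell
#!/bin/sh
# Sketch: decide whether logs under /var/log can survive a reboot.
# is_volatile_fstype is a hypothetical helper, not a DietPi function.
is_volatile_fstype() {
    case "$1" in
        tmpfs|ramfs) return 0 ;;   # RAM-backed: content is lost on reboot or crash
        *) return 1 ;;             # anything else is treated as persistent
    esac
}

# Usage on the board (assumes findmnt from util-linux):
#   fstype=$(findmnt -n -o FSTYPE -T /var/log)
#   if is_volatile_fstype "$fstype"; then
#       echo "/var/log is on $fstype, the journal will not survive a halt"
#   fi
```

If the check still reports tmpfs after the reboot, the journal directory created with mkdir lives in RAM and will be gone after the next halt.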

Another thing I noticed is an obsolete dhcpcd, which does nothing but reapply the already static IP address over and over again. Luckily Pi-hole is about to remove its dependency on this. You should disable it: systemctl disable --now dhcpcd

@arpegius5555
Author

reference code: cc2fedee-0cff-4d44-a687-113ee8e59186

journalctl
Dec 04 12:32:56 Rock64 systemd[1]: Stopping DHCP Client Daemon...
Dec 04 12:32:56 Rock64 dhcpcd[445]: dhcpcd exited
Dec 04 12:32:56 Rock64 systemd[1]: dhcpcd.service: Succeeded.
Dec 04 12:32:56 Rock64 systemd[1]: Stopped DHCP Client Daemon.
Dec 04 12:35:01 Rock64 CRON[1596]: pam_unix(cron:session): session opened for user root by (uid=0)
Dec 04 12:35:01 Rock64 CRON[1597]: (root) CMD (~/duckdns/duck.sh >/dev/null 2>&1)
Dec 04 12:35:03 Rock64 CRON[1596]: pam_unix(cron:session): session closed for user root
Dec 04 12:39:01 Rock64 CRON[1612]: pam_unix(cron:session): session opened for user root by (uid=0)
Dec 04 12:39:01 Rock64 CRON[1613]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Dec 04 12:39:01 Rock64 CRON[1612]: pam_unix(cron:session): session closed for user root
-- Reboot --
Dec 04 16:36:41 Rock64 systemd-timesyncd[641]: Synchronized to time server for the first time 207.34.48.31:123 (0.debian.pool.ntp.org).
Dec 04 16:36:41 Rock64 systemd[1]: Starting Clean php session files...
Dec 04 16:36:42 Rock64 DietPi-Boot[569]: [ OK ] DietPi-Run_NTPD | systemd-timesyncd synced
Dec 04 16:36:42 Rock64 systemd[1]: Stopping Network Time Synchronization...
Dec 04 16:36:42 Rock64 systemd[1]: systemd-timesyncd.service: Succeeded.
Dec 04 16:36:42 Rock64 systemd[1]: Stopped Network Time Synchronization.
Dec 04 16:36:42 Rock64 DietPi-Boot[569]: [ OK ] Network time sync | Completed

@arpegius5555
Author

The other SBC also halted, here is the bug report:

reference code: 7a7e1557-0ead-4d6e-8c6b-50aa00c7217d

@MichaIng
Owner

MichaIng commented Dec 5, 2020

Okay, I didn't find a good explanation for why those systems crash, but here are some recommendations to start with:

  • dhcpcd is still active: systemctl disable --now dhcpcd
  • You use htpdate for network time sync, but DietPi-Run_NTPD is active as well. Either remove or disable the second to avoid interfering:
    G_CONFIG_INJECT 'CONFIG_NTP_MODE=' 'CONFIG_NTP_MODE=0' /boot/dietpi.txt
    
  • There is a swap file assigned, but it has holes. Please try to remove or, if required, recreate it:
    /boot/dietpi/func/dietpi-set_swapfile 1
    
    This will auto-size it so that 2 GiB overall memory are assured, which means none in your case since it's a 2 GiB board. Otherwise replace 1 with any other number which will then be the swap file size in MiB.
  • You have WireGuard and OpenVPN both running. One VPN server should be enough, or do you use one as server and one as client?
  • There are a few error messages around NFS client and RPC bind, which you should probably have a look at. If not used, you should disable the service(s): systemctl disable nfs-client.target; systemctl disable --now rpcbind
  • On the second machine, you have Dropbear and OpenSSH server installed. The first succeeds (and is obviously used) while the second of course fails to bind to the same port. I suggest purging OpenSSH: apt purge openssh-server
  • The second machine is on a very old kernel version. You could try to upgrade it to the current Armbian Linux 5.X version:
    apt install linux-image-current-rockchip64 linux-dtb-current-rockchip64 linux-u-boot-rock64-current linux-stretch-root-current-rock64
    # If the above fails with package conflicts, you first need to purge the old ones
    apt purge linux-dtb-rockchip64 linux-image-rockchip64 linux-stretch-root-rock64 linux-u-boot-rock64-default
    # Then flash u-boot to be sure
    . /usr/lib/u-boot/platform_install.sh
    write_uboot_platform /usr/lib/linux-u-boot-current-rock64_20.11_arm64 /dev/mmcblk0
    
    That will also allow you to install WireGuard on that machine to replace the OpenVPN server 😉.

And one thing to test as a little enhancement on ROCK64: on the first board, or on the second after updating it to the latest Linux version, could you try replacing haveged with the hardware random generator daemon and see if this works fine?

apt -y purge haveged
apt -y install rng-tools5
reboot
# after reboot
dmesg | grep random

@arpegius5555
Author

Thank you so much for your time on this one, I ran all suggested commands...... here are the results:

======Rock64 # 1===========

root@Rock64:~# dmesg | grep random
[ 0.000000] random: get_random_bytes called from start_kernel+0x674/0x82c with crng_init=0
[ 3.338237] random: fast init done
[ 5.588847] random: systemd: uninitialized urandom read (16 bytes read)
[ 5.711134] random: systemd: uninitialized urandom read (16 bytes read)
[ 5.726272] random: systemd: uninitialized urandom read (16 bytes read)
[ 12.102060] random: crng init done
[ 12.102382] random: 7 urandom warning(s) missed due to ratelimiting

====2nd Rock64======
Errors were encountered while processing:
/var/cache/apt/archives/linux-image-current-rockchip64_20.11.1_arm64.deb

Reading state information... Done
E: Unable to locate package linux-u-boot-rock64-default

E: Sub-process /usr/bin/dpkg returned an error code (1)

root@SBC2:~# dmesg | grep random
[ 0.000000] random: get_random_bytes called from start_kernel+0x690/0x848 with crng_init=0
[ 3.282689] random: fast init done
[ 3.383377] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[ 3.384706] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[ 3.385459] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[ 10.897422] random: crng init done
[ 10.897757] random: 7 urandom warning(s) missed due to ratelimiting

=======================
I do have Wireguard running on Rock64 # 1, (the one that crashes more often) and have OpenVPN on Rock64 # 2
I am intending to run Wireguard on both, once the crashing stops. Thanks a million again.

@MichaIng
Owner

MichaIng commented Dec 7, 2020

Ok, the ROCK64 #2 kernel upgrade did not yet finish. Please try the following, which should work regardless of which U-Boot package is installed. If the purge still fails, check dpkg -l | grep 'u-boot' and if nothing is listed, remove the package from the command arguments:

apt purge linux-dtb-rockchip64 linux-image-rockchip64 linux-stretch-root-rock64 linux-u-boot-rock64-*
apt install linux-image-current-rockchip64 linux-dtb-current-rockchip64 linux-u-boot-rock64-current linux-stretch-root-current-rock64

Were the dmesg random outputs taken after installing rng-tools5 and rebooting? To be sure: systemctl status rngd
If it works, we can add this by default to our ROCK64 images for more efficient entropy collection.

@arpegius5555
Author

arpegius5555 commented Dec 7, 2020

Thank you very much.... ROCK64 #2

The following NEW packages will be installed:
  linux-u-boot-rock64-current
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
Need to get 0 B/298 kB of archives.
After this operation, 1,024 B of additional disk space will be used.
Do you want to continue? [Y/n]
Selecting previously unselected package linux-u-boot-rock64-current.
(Reading database ... 83192 files and directories currently installed.)
Preparing to unpack .../linux-u-boot-rock64-current_20.11_arm64.deb ...
Unpacking linux-u-boot-rock64-current (20.11) ...
Setting up linux-u-boot-rock64-current (20.11) ...

================================

root@SBC2:~# systemctl status rngd
● rngd.service - Start entropy gathering daemon (rngd)
   Loaded: loaded (/lib/systemd/system/rngd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-12-07 15:07:05 EST; 1min 25s ago
     Docs: man:rngd(8)
  Process: 464 ExecStart=/usr/sbin/rngd -f (code=exited, status=1/FAILURE)
 Main PID: 464 (code=exited, status=1/FAILURE)

Dec 07 15:07:05 SBC2 systemd[1]: Started Start entropy gathering daemon (rngd).
Dec 07 15:07:05 SBC2 rngd[464]: Unable to open file: /dev/tpm0
Dec 07 15:07:05 SBC2 rngd[464]: can't open any entropy source
Dec 07 15:07:05 SBC2 rngd[464]: Maybe RNG device modules are not loaded
Dec 07 15:07:05 SBC2 systemd[1]: rngd.service: Main process exited, code=exited, status=1/FAILURE
Dec 07 15:07:05 SBC2 systemd[1]: rngd.service: Unit entered failed state.
Dec 07 15:07:05 SBC2 systemd[1]: rngd.service: Failed with result 'exit-code'.

=====================================================
ROCK64#1 halted again. Here is the bug report:
cc2fedee-0cff-4d44-a687-113ee8e59186

Thanks

@MichaIng
Owner

MichaIng commented Dec 7, 2020

Okay, on ROCK64 #2 the other three kernel packages were installed already + successfully?
Then don't forget to flash the bootloader:

. /usr/lib/u-boot/platform_install.sh
write_uboot_platform /usr/lib/linux-u-boot-current-rock64_20.11_arm64 /dev/mmcblk0

Then reboot. The new kernel probably makes the entropy daemon work as well (check after reboot, with the new kernel loaded): systemctl status rngd


On ROCK64 #1, is the kernel already on the latest version?

apt update
apt install linux-image-current-rockchip64 linux-dtb-current-rockchip64 linux-u-boot-rock64-current linux-buster-root-current-rock64

I noticed something strange after the last boot:

Dec 07 17:03:09 Rock64 systemd[1]: Starting DietPi-PreBoot...
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [ SUB1 ] DietPi-CPU_set > Applying CPU governor settings: ondemand
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [ INFO ] DietPi-CPU_set | Setting CPU frequency limits : Max = Disabled MHz | Min = Disabled MHz
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [ INFO ] DietPi-CPU_set | Setting up_threshold: 50 %
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [ INFO ] DietPi-CPU_set | Setting sampling_rate: 25000 microseconds
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [ INFO ] DietPi-CPU_set | Setting sampling_down_factor: 40
Dec 07 17:03:09 Rock64 DietPi-PreBoot[423]: [  OK  ] DietPi-CPU_set | Applied CPU governor settings: ondemand
Dec 07 17:03:09 Rock64 systemd[1]: Started DietPi-PreBoot.
-- Reboot --
Dec 07 18:17:02 Rock64 systemd[1]: Starting DietPi-PreBoot...
Dec 07 18:17:02 Rock64 DietPi-PreBoot[422]: DietPi-CPU_set | CPU governors are not supported on this device. Aborting...
Dec 07 18:17:02 Rock64 systemd[1]: Started DietPi-PreBoot.

Setting the CPU governor worked before, but not on the latest boot. Can you retry this:

/boot/dietpi/func/dietpi-set_cpu
ls -l /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

@arpegius5555
Author

arpegius5555 commented Dec 7, 2020

ROCK64 # 1

root@Rock64: # apt install linux-image-current-rockchip64 linux-dtb-current-rockchip64 linux-u-boot-rock64-current linux-buster-root-current-rock64
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  armbian-config
Recommended packages:
  toilet
The following packages will be upgraded:
  linux-buster-root-current-rock64 linux-dtb-current-rockchip64
  linux-image-current-rockchip64 linux-u-boot-rock64-current
4 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.
Need to get 42.6 MB of archives.
After this operation, 1,690 kB disk space will be freed.
Get:4 https://armbian.hosthatch.com/apt buster/main arm64 linux-u-boot-rock64-current arm64 20.11 [298 kB]
Get:2 https://mirrors.netix.net/armbian/apt buster/main arm64 linux-dtb-current-rockchip64 arm64 20.11.1 [294 kB]
Get:3 https://mirrors.dotsrc.org/armbian-apt buster/main arm64 linux-image-current-rockchip64 arm64 20.11.1 [41.6 MB]
Get:1 https://armbian.systemonachip.net/apt buster/main arm64 linux-buster-root-current-rock64 arm64 20.11 [413 kB]
Fetched 42.6 MB in 7s (5,720 kB/s)
(Reading database ... 80987 files and directories currently installed.)
Preparing to unpack .../linux-buster-root-current-rock64_20.11_arm64.deb ...
Unpacking linux-buster-root-current-rock64 (20.11) over (20.08.17) ...
Preparing to unpack .../linux-dtb-current-rockchip64_20.11.1_arm64.deb ...
Unpacking linux-dtb-current-rockchip64 (20.11.1) over (20.08.21) ...
Preparing to unpack .../linux-image-current-rockchip64_20.11.1_arm64.deb ...
update-initramfs: Deleting /boot/initrd.img-5.8.17-rockchip64
Removing obsolete file uInitrd-5.8.17-rockchip64
Unpacking linux-image-current-rockchip64 (20.11.1) over (20.08.21) ...
Preparing to unpack .../linux-u-boot-rock64-current_20.11_arm64.deb ...
Unpacking linux-u-boot-rock64-current (20.11) over (20.08.13) ...
Setting up linux-buster-root-current-rock64 (20.11) ...
Failed to enable unit: Unit file /etc/systemd/system/armbian-ramlog.service is masked.
Setting up linux-image-current-rockchip64 (20.11.1) ...
update-initramfs: Generating /boot/initrd.img-5.9.11-rockchip64
update-initramfs: Converting to u-boot format
Setting up linux-u-boot-rock64-current (20.11) ...
Setting up linux-dtb-current-rockchip64 (20.11.1) ...
Processing triggers for initramfs-tools (0.133+deb10u1) ...
update-initramfs: Generating /boot/initrd.img-5.9.11-rockchip64
update-initramfs: Converting to u-boot format
root@Rock64:~#

=============================================
ROCK64 # 2

Yes, the 3 kernel packages installed successfully

root@SBC2:~# . /usr/lib/u-boot/platform_install.sh
root@SBC2:~# write_uboot_platform /usr/lib/linux-u-boot-current-rock64_20.11_arm64 /dev/mmcblk0
root@SBC2:~# reboot
#######Did not reboot, had to ask someone to power cycle it

root@SBC2:~# systemctl status rngd
● rngd.service - Start entropy gathering daemon (rngd)
   Loaded: loaded (/lib/systemd/system/rngd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-12-07 15:59:42 EST; 2h 5min ago
     Docs: man:rngd(8)
  Process: 462 ExecStart=/usr/sbin/rngd -f (code=exited, status=1/FAILURE)
 Main PID: 462 (code=exited, status=1/FAILURE)

Dec 07 15:59:42 SBC2 systemd[1]: Started Start entropy gathering daemon (rngd).
Dec 07 15:59:42 SBC2 rngd[462]: Unable to open file: /dev/tpm0
Dec 07 15:59:42 SBC2 rngd[462]: can't open any entropy source
Dec 07 15:59:42 SBC2 rngd[462]: Maybe RNG device modules are not loaded
Dec 07 15:59:42 SBC2 systemd[1]: rngd.service: Main process exited, code=exited, status=1/FAILURE
Dec 07 15:59:42 SBC2 systemd[1]: rngd.service: Unit entered failed state.
Dec 07 15:59:42 SBC2 systemd[1]: rngd.service: Failed with result 'exit-code'.
root@SBC2:~#

################################

root@SBC2:~# ls -l /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
-r--r--r-- 1 root root 4096 Dec  7 15:59 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors


root@Rock64: # ls -l /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
-r--r--r-- 1 root root 4096 Dec  7 15:58 /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
root@Rock64: #

@MichaIng
Owner

MichaIng commented Dec 8, 2020

Okay, so far so good now that everything is up to date. Let's hope future reboots on #2 succeed and that the failed one was only due to the large upgrade.

/boot/dietpi/func/dietpi-set_cpu now succeeds on both boards?

Good to know about the hardware generator. If you are in the mood, you could test an older rng-tools package (which would still be better than haveged):

apt purge rng-tools5
apt install rng-tools
systemctl status rng-tools

Else revert to haveged:

apt purge rng-tools*
apt install haveged

Generally, keep an eye on CPU temperature and RAM usage around the times when the halts still happen:

cpu
htop

The logs currently do not give any hint, it seems to halt without any previous error message or specific action 🤔.
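Since the boards halt unattended, the last temperature and memory readings before a halt can be kept in a file instead of being watched live. The following is only a sketch under assumptions: the helper names, the log path /root/halt-watch.log and the thermal zone path are illustrative, and the sysfs path may differ depending on the kernel.

```shell
#!/bin/sh
# Hypothetical helpers for logging temperature and free memory periodically.

millideg_to_c() {
    # sysfs thermal zones report millidegrees Celsius; convert to whole degrees
    echo "$(( $1 / 1000 ))"
}

mem_available_mib() {
    # Print MemAvailable in MiB from a meminfo-style file (default: /proc/meminfo)
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

log_sample() {
    # Append one timestamped sample to the log file given as $1 (assumed default path)
    # and sync, so the line is on disk before a possible halt.
    temp_raw=$(cat /sys/class/thermal/thermal_zone0/temp)
    echo "$(date '+%F %T') temp=$(millideg_to_c "$temp_raw")C mem_avail=$(mem_available_mib)MiB" \
        >> "${1:-/root/halt-watch.log}"
    sync
}

# Usage (e.g. in a screen session or a systemd unit):
#   while true; do log_sample; sleep 60; done
```

After the next halt, the tail of that file shows whether temperature or memory was trending anywhere unusual beforehand.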

@arpegius5555
Author

Rock64 # 1

root@Rock64:~# /boot/dietpi/func/dietpi-set_cpu
[ SUB1 ] DietPi-CPU_set > Applying CPU governor settings: ondemand
[ INFO ] DietPi-CPU_set | Setting CPU frequency limits : Max = Disabled MHz | Min = Disabled MHz
[ INFO ] DietPi-CPU_set | Setting up_threshold: 50 %
[ INFO ] DietPi-CPU_set | Setting sampling_rate: 25000 microseconds
[ INFO ] DietPi-CPU_set | Setting sampling_down_factor: 40
[  OK  ] DietPi-CPU_set | Applied CPU governor settings: ondemand
=========================
root@Rock64:~# systemctl status rng-tools
● rng-tools.service
   Loaded: loaded (/etc/init.d/rng-tools; generated)
   Active: failed (Result: exit-code) since Mon 2020-12-07 19:20:26 EST; 9s ago
     Docs: man:systemd-sysv-generator(8)

Dec 07 19:20:26 Rock64 systemd[1]: Starting rng-tools.service...
Dec 07 19:20:26 Rock64 rng-tools[2704]: Starting Hardware RNG entropy gatherer daemon: (Hardware RNG device inode not found)
Dec 07 19:20:26 Rock64 rng-tools[2704]: /etc/init.d/rng-tools: Cannot find a hardware RNG device to use.
Dec 07 19:20:26 Rock64 systemd[1]: rng-tools.service: Control process exited, code=exited, status=1/FAILURE
Dec 07 19:20:26 Rock64 systemd[1]: rng-tools.service: Failed with result 'exit-code'.
Dec 07 19:20:26 Rock64 systemd[1]: Failed to start rng-tools.service.
========================
apt install haveged
#did run successfully

Rock64 # 2

root@SBC2:~# /boot/dietpi/func/dietpi-set_cpu
[ SUB1 ] DietPi-CPU_set > Applying CPU governor settings: ondemand
[ INFO ] DietPi-CPU_set | Setting CPU frequency limits : Max = Disabled MHz | Min = Disabled MHz
[ INFO ] DietPi-CPU_set | Setting up_threshold: 50 %
[ INFO ] DietPi-CPU_set | Setting sampling_rate: 25000 microseconds
[ INFO ] DietPi-CPU_set | Setting sampling_down_factor: 80
[  OK  ] DietPi-CPU_set | Applied CPU governor settings: ondemand

===========================
root@SBC2:~# systemctl status rng-tools
● rng-tools.service
   Loaded: loaded (/etc/init.d/rng-tools; generated; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-12-07 19:24:58 EST; 13s ago
     Docs: man:systemd-sysv-generator(8)

Dec 07 19:24:58 SBC2 systemd[1]: Starting rng-tools.service...
Dec 07 19:24:58 SBC2 rng-tools[1885]: Starting Hardware RNG entropy gatherer daemon: (Hardware RNG device inode not found)
Dec 07 19:24:58 SBC2 rng-tools[1885]: /etc/init.d/rng-tools: Cannot find a hardware RNG device to use.
Dec 07 19:24:58 SBC2 systemd[1]: rng-tools.service: Control process exited, code=exited status=1
Dec 07 19:24:58 SBC2 systemd[1]: Failed to start rng-tools.service.
Dec 07 19:24:58 SBC2 systemd[1]: rng-tools.service: Unit entered failed state.
Dec 07 19:24:58 SBC2 systemd[1]: rng-tools.service: Failed with result 'exit-code'.
======================
apt install haveged
#did run successfully

I decided to try to install wireguard on Rock64 # 2 and it seems like the service is not starting

:: [ERR] WireGuard is not running, try to start now? [Y/n]
Job for wg-quick@wg0.service failed because the control process exited with error code.
See "systemctl status wg-quick@wg0.service" and "journalctl -xe" for details.
======================
Job for wg-quick@wg0.service failed because the control process exited with error code.

Do you think a fresh OS re-install would be good at this point?

@MichaIng
Owner

MichaIng commented Dec 8, 2020

The install process with dietpi-software went through without errors? Did you set it up as client or as server?
Can you try to start it manually:

wg-quick down wg0 # failsafe
wg-quick up wg0

@arpegius5555
Author

Server:

root@SBC2:~# systemctl status wg-quick@wg0.service
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
   Loaded: loaded (/lib/systemd/system/wg-quick@.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/wg-quick@wg0.service.d
           └─override.conf
   Active: failed (Result: exit-code) since Mon 2020-12-07 19:35:30 EST; 1min 48s ago
     Docs: man:wg-quick(8)
           man:wg(8)
           https://www.wireguard.com/
           https://www.wireguard.com/quickstart/
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg-quick.8
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg.8
  Process: 2658 ExecStart=/usr/bin/wg-quick up wg0 (code=exited, status=2)
 Main PID: 2658 (code=exited, status=2)

Dec 07 19:35:30 SBC2 wg-quick[2658]: [#] ip link add wg0 type wireguard
Dec 07 19:35:30 SBC2 wg-quick[2658]: [#] wg setconf wg0 /dev/fd/63
Dec 07 19:35:30 SBC2 wg-quick[2658]: [#] ip -4 address add 10.6.0.1/24 dev wg0
Dec 07 19:35:30 SBC2 wg-quick[2658]: [#] ip link set mtu 1420 up dev wg0
Dec 07 19:35:30 SBC2 wg-quick[2658]: RTNETLINK answers: Address already in use
Dec 07 19:35:30 SBC2 wg-quick[2658]: [#] ip link delete dev wg0
Dec 07 19:35:30 SBC2 systemd[1]: wg-quick@wg0.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 07 19:35:30 SBC2 systemd[1]: Failed to start WireGuard via wg-quick(8) for wg0.
Dec 07 19:35:30 SBC2 systemd[1]: wg-quick@wg0.service: Unit entered failed state.
Dec 07 19:35:30 SBC2 systemd[1]: wg-quick@wg0.service: Failed with result 'exit-code'.

============

wg-quick down wg0 # failsafe
wg-quick: `wg0' is not a WireGuard interface

@arpegius5555
Author

Never mind my last comment... I realized I still had OpenVPN installed. After running pivpn -u I managed to remove OpenVPN, rebooted, and WG works now. I will update next time either of them halts.

Thanks

@arpegius5555
Author

I think we have made some progress.....this is already a record (Rock64#1 - The one that halted more often)

root@Rock64:~# uptime
 14:48:03 up 22:49,  1 user,  load average: 0.00, 0.00, 0.00
===============
root@SBC2:~# uptime
 14:51:19 up 19:09,  2 users,  load average: 1.01, 1.02, 1.00

I'll keep an eye on them, if # 1 goes beyond 48 hours, that will be a great improvement. Will continue to update this thread.

@MichaIng
Owner

MichaIng commented Dec 8, 2020

That is great. What do RAM usage and CPU temperature say?

@arpegius5555
Author

 - CPU temp : 51'C : 123'F (Running warm, but safe)
MiB Mem :   1918.6 total,   1486.7 free,    144.7 used,    287.2 buff/cache
MiB Swap:    130.0 total,    130.0 free,      0.0 used.   1633.1 avail Mem

@MichaIng
Owner

MichaIng commented Dec 9, 2020

Looks like there is not much reason for this little swap file. By default, swap files <100 MiB are not created when auto-estimating the size, but yours is slightly larger now (mem + swap sum up to 2048 MiB = 2 GiB, which is the auto-size goal). However, with that much free memory, I'd simplify things: /boot/dietpi/func/dietpi-set_swapfile 0
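The sizing rule described here can be written out as a small calculation. This is a sketch of the rule as stated in this comment, not the actual dietpi-set_swapfile code. The function name `auto_swap_mib` is hypothetical, and it takes the physical RAM as a whole number of MiB.

```shell
#!/bin/sh
# Sketch of the auto-size rule: fill RAM up to 2048 MiB with swap,
# but skip swap files that would be smaller than 100 MiB.
auto_swap_mib() {
    target=2048   # overall memory goal in MiB (2 GiB)
    min=100       # swap files below this size are not created
    size=$(( target - $1 ))
    if [ "$size" -lt "$min" ]; then
        size=0
    fi
    echo "$size"
}

# Usage: auto_swap_mib 1918   # RAM of the board above, rounded down to whole MiB
```

With the 1918 MiB reported above, this yields 130 MiB, which matches the existing swap file. A board reporting 1990 MiB would get no swap file at all.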

@arpegius5555
Author

So far so good.... I don't think I've had 2 days solid. What you suggested seems to have done the trick

root@Rock64:~# uptime
 16:38:24 up 2 days, 39 min,  1 user,  load average: 0.00, 0.00, 0.00

Also thanks for your memory suggestion, I ran the command

root@Rock64:~# /boot/dietpi/func/dietpi-set_swapfile 0
[ SUB1 ] DietPi-Set_swapfile > Applying 0 /var/swap
[ INFO ] DietPi-Set_swapfile | Disabling and deleting all existing swap files
[  OK  ] DietPi-Set_swapfile | swapoff -a
removed '/var/swap'
[  OK  ] DietPi-Set_swapfile | Setting in /boot/dietpi.txt adjusted: AUTO_SETUP_SWAPFILE_SIZE=0
[  OK  ] DietPi-Set_swapfile | Desired setting in /boot/dietpi.txt was already set: AUTO_SETUP_SWAPFILE_LOCATION=/var/swap
[ INFO ] DietPi-Set_swapfile | Setting /tmp tmpfs size: 959 MiB
[  OK  ] DietPi-Set_swapfile | mount -o remount /tmp

I also left a volunteering note on 6.34 thread as an appreciation to your time and suggestions. Thanks

@arpegius5555
Author

Ok so here is today's update:
Rock64 #1

root@Rock64:~# uptime
 10:02:44 up 2 days, 18:04,  1 user,  load average: 0.00, 0.00, 0.00

Rock64 # 2 - Halted this morning, realized that around 7 am, rebooted it and it halted again a couple hours later, here is the bug report

reference code: 7a7e1557-0ead-4d6e-8c6b-50aa00c7217d

Thanks

@MichaIng
Owner

MichaIng commented Dec 10, 2020

The machine has a few obsolete package config files left:

apt purge apt-show-versions openssh-client ddclient armbian-tools-stretch

EDIT: Ah wait, do you actively use ddclient? It's strange, as dpkg -l ddclient shows it as uninstalled with only config files left (rc), while systemctl status ddclient shows it started up. If you use it, reinstall it with apt install --reinstall ddclient, otherwise purge it together with the above.
EDIT2: Ah, you use DuckDNS (good choice!), which updates via a simple cron job and shell script. So ddclient is almost certainly obsolete.

haveged failed again 🤔:

systemctl restart haveged
sleep 1
systemctl status haveged

Again, no CPU governor was applied during boot. It seems that the related /sys files get created only after the service starts, which is strange:

systemctl status dietpi-preboot
# shows:
Dec 10 09:17:02 SBC2 DietPi-PreBoot[474]: DietPi-CPU_set | CPU governors are not supported on this device. Aborting...
/boot/dietpi/func/dietpi-set_cpu
# works fine
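
If the cpufreq files really do show up late, one way to confirm it is to poll for them after boot. A rough sketch, assuming the usual cpufreq sysfs location /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor; the helper wait_for_file is hypothetical and not a DietPi function:

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for file $1 to appear.
# Returns 0 as soon as the file exists, 1 if the timeout is reached first.
wait_for_file() {
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage (commented out; the path is an assumption about this kernel):
#   gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
#   wait_for_file "$gov" 30 && cat "$gov"
```

If the file only appears after several seconds, that would support the theory that dietpi-preboot simply runs too early.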

Ah, and finally we have something relevant:

Dec 10 07:46:00 SBC2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Dec 10 07:46:00 SBC2 kernel: rcu:         3-...0: (1 GPs behind) idle=9de/0/0x1 softirq=269307/269308 fqs=7149
Dec 10 07:46:00 SBC2 kernel:         (t=15000 jiffies g=964349 q=164)
Dec 10 07:46:00 SBC2 kernel: Task dump for CPU 3:
Dec 10 07:46:00 SBC2 kernel: task:swapper/3       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
Dec 10 07:46:00 SBC2 kernel: Call trace:
Dec 10 07:46:00 SBC2 kernel:  dump_backtrace+0x0/0x1f0
Dec 10 07:46:00 SBC2 kernel:  show_stack+0x18/0x28
Dec 10 07:46:00 SBC2 kernel:  sched_show_task+0x13c/0x168
Dec 10 07:46:00 SBC2 kernel:  dump_cpu_task+0x44/0x54
Dec 10 07:46:00 SBC2 kernel:  rcu_dump_cpu_stacks+0xb0/0xf0
Dec 10 07:46:00 SBC2 kernel:  rcu_sched_clock_irq+0xb34/0xe70
Dec 10 07:46:00 SBC2 kernel:  update_process_times+0x30/0x70
Dec 10 07:46:00 SBC2 kernel:  tick_sched_handle.isra.0+0x34/0x58
Dec 10 07:46:00 SBC2 kernel:  tick_sched_timer+0x58/0xb0
Dec 10 07:46:00 SBC2 kernel:  __hrtimer_run_queues+0x148/0x3b0
Dec 10 07:46:00 SBC2 kernel:  hrtimer_interrupt+0xf4/0x258
Dec 10 07:46:00 SBC2 kernel:  arch_timer_handler_phys+0x34/0x48
Dec 10 07:46:00 SBC2 kernel:  handle_percpu_devid_irq+0xa0/0x2b8
Dec 10 07:46:00 SBC2 kernel:  generic_handle_irq+0x30/0x48
Dec 10 07:46:00 SBC2 kernel:  __handle_domain_irq+0x94/0x108
Dec 10 07:46:00 SBC2 kernel:  gic_handle_irq+0x54/0xa8
Dec 10 07:46:00 SBC2 kernel:  el1_irq+0xb8/0x180
Dec 10 07:46:00 SBC2 kernel:  arch_cpu_idle+0x14/0x20
Dec 10 07:46:00 SBC2 kernel:  do_idle+0x210/0x260
Dec 10 07:46:00 SBC2 kernel:  cpu_startup_entry+0x28/0x60
Dec 10 07:46:00 SBC2 kernel:  secondary_start_kernel+0x148/0x180

Interestingly, the same kernel error repeats exactly every three minutes (07:49:00, then 07:52:00, etc.). In case you wonder, I think the timestamps are off due to the different time zone here, but you should be able to find those entries as well with journalctl.
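
To check the three-minute cadence on your side, the stall messages can be filtered out of the kernel log and the gaps between them printed. A small sketch; stall_intervals is a hypothetical helper that expects journalctl's default short output format (month, day, time, host, ...) on stdin:

```shell
#!/bin/sh
# Hypothetical helper: for every "self-detected stall" line on stdin, print
# the number of seconds since the previous such line.
# Limitation: only the time of day is compared, so a gap that crosses
# midnight is reported incorrectly.
stall_intervals() {
    awk '/rcu_preempt self-detected stall/ {
        split($3, t, ":")                      # $3 is HH:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev
        prev = now
        seen = 1
    }'
}

# Usage (commented out): kernel messages of the current boot
#   journalctl -k | stall_intervals
```

A steady three-minute repeat shows up as a column of 180s.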

During boot there is also an error I'm not happy with:

Dec 10 13:17:03 SBC2 systemd[1]: Starting LSB: Start htpdate daemon...
Dec 10 13:17:03 SBC2 kernel: Unable to handle kernel paging request at virtual address ffff8000775b6240
Dec 10 13:17:03 SBC2 kernel: Mem abort info:
Dec 10 13:17:03 SBC2 kernel:   ESR = 0x96000005
Dec 10 13:17:03 SBC2 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
Dec 10 13:17:03 SBC2 kernel:   SET = 0, FnV = 0
Dec 10 13:17:03 SBC2 kernel:   EA = 0, S1PTW = 0
Dec 10 13:17:03 SBC2 kernel: Data abort info:
Dec 10 13:17:03 SBC2 kernel:   ISV = 0, ISS = 0x00000005
Dec 10 13:17:03 SBC2 kernel:   CM = 0, WnR = 0
Dec 10 13:17:03 SBC2 kernel: swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000000361e000
Dec 10 13:17:03 SBC2 kernel: [ffff8000775b6240] pgd=00000000fefff003, p4d=00000000fefff003, pud=0000000000000000
Dec 10 13:17:03 SBC2 kernel: Internal error: Oops: 96000005 [#1] PREEMPT SMP
Dec 10 13:17:03 SBC2 kernel: Modules linked in: libblake2s libcurve25519_generic libblake2s_generic snd_soc_hdmi_codec rc_cec dw_hdmi_cec dw_hdmi_i2s_audio hantro_vpu(C) sg v4l2_h264 videobuf2_dma_contig v4l2_mem2mem videobuf2_vmalloc videobuf2_memops snd_soc_audio_graph_card snd_soc_simple_card_utils snd_soc_spdif_tx videobuf2_v4l2 rockchipdrm videobuf2_common dw_mipi_dsi dw_hdmi videodev analogix_dp gpio_ir_recv drm_kms_helper cec snd_soc_rk3328 lima mc gpu_sched rc_core snd_soc_rockchip_spdif snd_soc_rockchip_i2s drm drm_panel_orientation_quirks snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer snd soundcore cpufreq_dt iptable_filter xt_MASQUERADE xt_comment iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter ip_tables x_tables autofs4 realtek dwmac_rk stmmac_platform stmmac mdio_xpcs gpio_syscon uas
Dec 10 13:17:03 SBC2 kernel: CPU: 2 PID: 734 Comm: modprobe Tainted: G         C        5.9.11-rockchip64 #20.11.1
Dec 10 13:17:03 SBC2 kernel: Hardware name: Pine64 Rock64 (DT)
Dec 10 13:17:03 SBC2 kernel: pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--)
Dec 10 13:17:03 SBC2 kernel: pc : __pi_strcmp+0x8c/0x154
Dec 10 13:17:03 SBC2 kernel: lr : cmp_name+0x18/0x28
Dec 10 13:17:03 SBC2 kernel: sp : ffff8000129fba30
Dec 10 13:17:03 SBC2 kernel: x29: ffff8000129fba30 x28: ffff80000925c03c
Dec 10 13:17:03 SBC2 kernel: x27: ffff8000129fbb58 x26: ffff0000fd66c880
Dec 10 13:17:03 SBC2 kernel: x25: ffff8000101537f8 x24: ffff80000925c172
Dec 10 13:17:03 SBC2 kernel: x23: ffff800008f60078 x22: 000000000000000c
Dec 10 13:17:03 SBC2 kernel: x21: 0000000000000001 x20: ffff800008f60084
Dec 10 13:17:03 SBC2 kernel: x19: 0000000000000001 x18: 0000000000000002
Dec 10 13:17:03 SBC2 kernel: x17: ffff800011ae3000 x16: 0000000000000000
Dec 10 13:17:03 SBC2 kernel: x15: ffff8000122f42d0 x14: ffff8000122f4470
Dec 10 13:17:03 SBC2 kernel: x13: ffff800008f93a70 x12: ffff0000fd66c880
Dec 10 13:17:03 SBC2 kernel: x11: 0000000000000008 x10: 0101010101010101
Dec 10 13:17:03 SBC2 kernel: x9 : fffffffffffffffe x8 : 0000000000000008
Dec 10 13:17:03 SBC2 kernel: x7 : 0000000000000006 x6 : 0000000000000000
Dec 10 13:17:03 SBC2 kernel: x5 : 0000000000000000 x4 : ffff8000101537f8
Dec 10 13:17:03 SBC2 kernel: x3 : 0000000000000073 x2 : 0000000000000075
Dec 10 13:17:03 SBC2 kernel: x1 : ffff8000775b6240 x0 : ffff80000925c173
Dec 10 13:17:03 SBC2 kernel: Call trace:
Dec 10 13:17:03 SBC2 kernel:  __pi_strcmp+0x8c/0x154
Dec 10 13:17:03 SBC2 kernel:  bsearch+0x50/0xb8
Dec 10 13:17:03 SBC2 kernel:  find_exported_symbol_in_section+0x4c/0xf8
Dec 10 13:17:03 SBC2 kernel:  each_symbol_section.constprop.0+0x13c/0x1c0
Dec 10 13:17:03 SBC2 kernel:  find_symbol+0x4c/0xd8
Dec 10 13:17:03 SBC2 kernel:  load_module+0x1cd8/0x22c8
Dec 10 13:17:03 SBC2 kernel:  __do_sys_finit_module+0xb4/0x120
Dec 10 13:17:03 SBC2 kernel:  __arm64_sys_finit_module+0x20/0x30
Dec 10 13:17:03 SBC2 kernel:  el0_svc_common.constprop.0+0x70/0x188
Dec 10 13:17:03 SBC2 kernel:  do_el0_svc+0x24/0x90
Dec 10 13:17:03 SBC2 kernel:  el0_sync_handler+0x90/0x198
Dec 10 13:17:03 SBC2 kernel:  el0_sync+0x158/0x180
Dec 10 13:17:03 SBC2 kernel: Code: 91002108 eb0800e9 9a8880eb 38401402 (38401423)
Dec 10 13:17:03 SBC2 kernel: ---[ end trace 7b1140a45cfe8ba4 ]---

and

Dec 10 15:17:01 SBC2 kernel: rockchip-pinctrl pinctrl: pin gpio0-2 already requested by vcc-host-5v-regulator; cannot claim for vcc-host1-5v-regulator
Dec 10 15:17:01 SBC2 kernel: rockchip-pinctrl pinctrl: pin-2 (vcc-host1-5v-regulator) status -22
Dec 10 15:17:01 SBC2 kernel: rockchip-pinctrl pinctrl: could not request pin 2 (gpio0-2) from group usb20-host-drv  on device rockchip-pinctrl
Dec 10 15:17:01 SBC2 kernel: reg-fixed-voltage vcc-host1-5v-regulator: Error applying setting, reverse things back
Dec 10 15:17:01 SBC2 kernel: reg-fixed-voltage: probe of vcc-host1-5v-regulator failed with error -22

I'll have a look at those.

Due to the time sync, messages from your current boot and the errors prior to the crash got mixed up, though I think the error did not appear again after the reboot. I'll also try to find out something about this.

@arpegius5555
Author

Thanks. I have purged as suggested.

apt purge apt-show-versions openssh-client ddclient armbian-tools-stretch

Thanks for your help so far. I will capture a bug report if it halts again, and I'll wait for anything you can find. Thanks a million!

@arpegius5555
Author

Update:

Rock64 # 1

root@Rock64:~# uptime
 16:32:40 up 2 days, 15:23,  1 user,  load average: 0.03, 0.03, 0.00

Rock64 # 2

root@SBC2:~# uptime
 16:34:07 up 2 days,  9:29,  2 users,  load average: 0.00, 0.00, 0.00

At this point I feel very confident that the issues are resolved. I will continue to keep an eye on it, but running for more than 2 days is something I have never experienced since initially flashing images to these boards. Thanks a million for all the help provided so far, you rock!

@MichaIng
Owner

Many thanks for the kind feedback, though I was a bit distracted by getting DietPi v6.34 ready, so I haven't researched the error messages yet. Will do that soon.

@arpegius5555
Author

Rock 64 # 1 has been stable, no recent crashes

Rock 64 # 2 halted twice this morning, see logs below:

7a7e1557-0ead-4d6e-8c6b-50aa00c7217d

@MichaIng
Owner

MichaIng commented Dec 15, 2020

Do you run some service(s) or cron job with raised nice/priority levels or real-time scheduler (round-robin or first-in-first-out)? https://stackoverflow.com/a/35403677

htop shows the nice level and via Setup > Columns the PRIORITY and IO_PRIORITY can be added.
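
As a non-interactive alternative to htop, ps can print the scheduling class and nice value of every process. A minimal sketch; find_elevated is a hypothetical helper that expects the output of ps -eo pid,cls,rtprio,ni,comm (procps) on stdin:

```shell
#!/bin/sh
# Hypothetical helper: from `ps -eo pid,cls,rtprio,ni,comm` output, keep only
# processes that use a real-time class (FF = FIFO, RR = round-robin) or have
# a negative nice value. The header line is skipped.
find_elevated() {
    awk 'NR > 1 && ($2 == "FF" || $2 == "RR" || $4 + 0 < 0) { print }'
}

# Usage (commented out):
#   ps -eo pid,cls,rtprio,ni,comm | find_elevated
```

Some kernel threads (for example migration/N or certain kworker threads) normally appear in this list, so the entries of interest are userspace services and scripts.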

The rockchip-pinctrl error, by the way, seems to be nothing to worry about; you'll find it in many boot log pastes/gists, but no one pays attention to it. It seems to be a module that is called twice or a conflicting GPIO pin usage. IMO it is bad for developers to ignore such errors, but investigating and fixing such kernel internals is a bit out of scope for us.


The crash occurred with the same error. I now noticed something else: after the first part with the call trace (shown above) had repeated a few times, exactly every three minutes, another error came on top a few seconds later:

Dec 15 09:28:51 SBC2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Dec 15 09:28:51 SBC2 kernel: rcu:         3-...0: (1 GPs behind) idle=212/0/0x1 softirq=403147/403148 fqs=7376
Dec 15 09:28:51 SBC2 kernel:         (t=15000 jiffies g=1512601 q=11829)
Dec 15 09:28:51 SBC2 kernel: Task dump for CPU 3:
Dec 15 09:28:51 SBC2 kernel: task:swapper/3       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
Dec 15 09:28:51 SBC2 kernel: Call trace:
Dec 15 09:28:51 SBC2 kernel:  dump_backtrace+0x0/0x1f0
Dec 15 09:28:51 SBC2 kernel:  show_stack+0x18/0x28
Dec 15 09:28:51 SBC2 kernel:  sched_show_task+0x13c/0x168
Dec 15 09:28:51 SBC2 kernel:  dump_cpu_task+0x44/0x54
Dec 15 09:28:51 SBC2 kernel:  rcu_dump_cpu_stacks+0xb0/0xf0
Dec 15 09:28:51 SBC2 kernel:  rcu_sched_clock_irq+0xb34/0xe70
Dec 15 09:28:51 SBC2 kernel:  update_process_times+0x30/0x70
Dec 15 09:28:51 SBC2 kernel:  tick_sched_handle.isra.0+0x34/0x58
Dec 15 09:28:51 SBC2 kernel:  tick_sched_timer+0x58/0xb0
Dec 15 09:28:51 SBC2 kernel:  __hrtimer_run_queues+0x148/0x3b0
Dec 15 09:28:51 SBC2 kernel:  hrtimer_interrupt+0xf4/0x258
Dec 15 09:28:51 SBC2 kernel:  arch_timer_handler_phys+0x34/0x48
Dec 15 09:28:51 SBC2 kernel:  handle_percpu_devid_irq+0xa0/0x2b8
Dec 15 09:28:51 SBC2 kernel:  generic_handle_irq+0x30/0x48
Dec 15 09:28:51 SBC2 kernel:  __handle_domain_irq+0x94/0x108
Dec 15 09:28:51 SBC2 kernel:  gic_handle_irq+0x54/0xa8
Dec 15 09:28:51 SBC2 kernel:  el1_irq+0xb8/0x180
Dec 15 09:28:51 SBC2 kernel:  arch_cpu_idle+0x14/0x20
Dec 15 09:28:51 SBC2 kernel:  do_idle+0x210/0x260
Dec 15 09:28:51 SBC2 kernel:  cpu_startup_entry+0x24/0x60
Dec 15 09:28:51 SBC2 kernel:  secondary_start_kernel+0x148/0x180
Dec 15 09:28:53 SBC2 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 3-... } 15359 jiffies s: 649 root: 0x8/.
Dec 15 09:28:53 SBC2 kernel: rcu: blocking rcu_node structures:
Dec 15 09:28:53 SBC2 kernel: Task dump for CPU 3:
Dec 15 09:28:53 SBC2 kernel: task:swapper/3       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
Dec 15 09:28:53 SBC2 kernel: Call trace:
Dec 15 09:28:53 SBC2 kernel:  __switch_to+0x13c/0x198
  • self-detected stall became detected expedited stalls.

Then some additional error lines came on top:

Dec 15 09:34:51 SBC2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Dec 15 09:34:51 SBC2 kernel: rcu:         3-...0: (1 GPs behind) idle=212/0/0x1 softirq=403147/403148 fqs=30836
Dec 15 09:34:51 SBC2 kernel:         (t=105006 jiffies g=1512601 q=12746)
Dec 15 09:34:51 SBC2 kernel: rcu: rcu_preempt kthread starved for 42091 jiffies! g1512601 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
Dec 15 09:34:51 SBC2 kernel: rcu:         Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Dec 15 09:34:51 SBC2 kernel: rcu: RCU grace-period kthread stack dump:
Dec 15 09:34:51 SBC2 kernel: task:rcu_preempt     state:I stack:    0 pid:   10 ppid:     2 flags:0x00000028
Dec 15 09:34:51 SBC2 kernel: Call trace:
Dec 15 09:34:51 SBC2 kernel:  __switch_to+0x13c/0x198
Dec 15 09:34:51 SBC2 kernel:  __schedule+0x2f8/0x810
Dec 15 09:34:51 SBC2 kernel:  schedule+0x48/0x108
Dec 15 09:34:51 SBC2 kernel:  schedule_timeout+0x198/0x368
Dec 15 09:34:51 SBC2 kernel:  rcu_gp_kthread+0x4dc/0x1498
Dec 15 09:34:51 SBC2 kernel:  kthread+0x118/0x150
Dec 15 09:34:51 SBC2 kernel:  ret_from_fork+0x10/0x34
  • Note: "Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior."

Again, a higher priority/real-time scheduled process seems to be a typical reason for such errors: https://unix.stackexchange.com/questions/252045
All systemd units at least seem to run with default scheduling policy/priorities 🤔.

@arpegius5555
Author

No, I have not raised any nice priority or use any real time scheduler.

crontab -l only shows this:

*/10 * * * * ~/duckdns/duck.sh >/dev/null 2>&1

I was thinking if I should go for an OS reinstall at this point, thoughts?

@MichaIng
Owner

MichaIng commented Dec 16, 2020

I think so. Let me take the chance to update our image first, it is more than half a year old.
Especially since it is a Stretch system, for which our support would have been dropped next summer anyway, this is a good reason to move to the current stable Debian and get the related software package updates with it.
EDIT: Might take a little longer, Armbian APT mirrors are currently broken: https://forum.armbian.com/topic/16504-bionic-apt-update-file-has-unexpected-size/

@MichaIng
Owner

Okay, if you reinstall DietPi, please try the new image: https://dietpi.com/downloads/images/testing/DietPi_ROCK64-ARMv8-Buster.7z
Newest kernel, bootloader and packages, so should be the smoothest first boot as well. But it's good to have one test on a real ROCK64 (I have none) before replacing the old image 😉.

@arpegius5555
Copy link
Author

Right on, I will try the new image now. I will let you know if any issues arise.

On the other hand, Rock64 # 1 has been pretty stable:

root@Rock64:~# uptime
 23:43:59 up 4 days, 22:34,  1 user,  load average: 0.12, 0.03, 0.01

@MichaIng
Owner

Dammit, the issue persists with the new image...
I found something which might be related: https://forum.armbian.com/topic/15082-rock64-focal-fossa-memory-frequency/
I'm a bit too tired now, but it's about flashing a new bootloader to clock down the memory, which is quite overclocked by default with the Armbian kernel. A CPU stall does not sound related, but it's worth a try.

I'll give you instructions tomorrow, if you don't figure it out yourself first.

@arpegius5555
Author

I would like to report that after updating to v6.34.3 both Rock64s are fully functional and no longer halting or crashing. Thank you very much for your support and patience. Happy holidays to you and the DietPi team, you guys Rock!(64) lol 🥇

@MichaIng
Owner

Great to hear, let's hope that it's finally persistent. Enjoy your Christmas/Holidays.
