Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lockup when copying files on NFS share on RBP3B+ #2788

Closed
graysky2 opened this issue Dec 22, 2018 · 49 comments
Closed

Lockup when copying files on NFS share on RBP3B+ #2788

graysky2 opened this issue Dec 22, 2018 · 49 comments

Comments

@graysky2
Copy link

graysky2 commented Dec 22, 2018

Describe the bug
Errors and timeouts when using rsync to copy files from an NFS share to the same NFS share.
Perhaps this is related to #2482

To reproduce

  • Mount the NFS share.
  • Use rysnc as described below.
  • Monitor with dmesg -e -w as the rsync job grinds on.

Expected behaviour
Rsync should complete without errors.

rsync -a --delete-after -W -x -q /scratch/armc8/root/ /scratch/armc8/facade

The source dir, /scratch/armc8/root/ contains approx 26,600 files around 760 MB.

Actual behaviour
Rsync does not complete; dmesg reports a number of problems.

I attached the complete dmesg output in the logs section below but here is a partial output illustrating the error:

[  +0.000005] nfs: server ease.lan not responding, still trying
[  +1.061951] nfs: server ease.lan OK
[  +0.000007] nfs: server ease.lan OK
[  +0.000019] nfs: server ease.lan OK
[  +0.000667] nfs: server ease.lan OK
[  +0.000037] nfs: server ease.lan OK
[  +0.000036] nfs: server ease.lan OK
[  +0.000015] nfs: server ease.lan OK
[  +0.000012] nfs: server ease.lan OK
[  +0.000012] nfs: server ease.lan OK
[  +0.000012] nfs: server ease.lan OK
[Dec22 10:44] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [10.1.8.228-mana:1151]
[  +0.008170] Modules linked in: rpcsec_gss_krb5 btsdio brcmfmac rc_cec brcmutil vc4 cec rc_core nf_log_ipv4 nf_log_common ipt_REJECT drm_kms_helper xt_LOG drm microchip hci_uart cfg80211 xt_limit btqca drm_panel_orientation_quirks syscopyarea btbcm xt_addrtype btintel sysfillrect lan78xx sysimgblt fb_sys_fops bluetooth xt_conntrack raspberrypi_hwmon ecdh_generic bcm2835_thermal pwm_bcm2835 rfkill i2c_bcm2835 bcm2835_wdt ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter
[  +0.055229] CPU: 2 PID: 1151 Comm: 10.1.8.228-mana Not tainted 4.19.10-1-ARCH #1
[  +0.007777] Hardware name: Raspberry Pi 3 Model B+ (DT)
[  +0.005500] pstate: 20000005 (nzCv daif -PAN -UAO)
[  +0.005054] pc : nfs_sb_active+0x98/0xf0
[  +0.004133] lr : nfs_delegation_reap_unclaimed+0x3c/0x108
[  +0.005676] sp : ffff00000dea3c50
[  +0.003484] x29: ffff00000dea3c50 x28: 0000000000000000
[  +0.005592] x27: ffff800030d064c8 x26: ffff80002f5d0a40
[  +0.005592] x25: ffff0000097c96c8 x24: 0000000000000000
[  +0.005590] x23: 0000000000000000 x22: ffff800030d06400
[  +0.005591] x21: ffff800030d064c8 x20: ffff8000313aa400
[  +0.005591] x19: ffff80002f485800 x18: 0000000000000000
[  +0.005590] x17: ffff7e000082e880 x16: 00000000fffffff8
[  +0.005592] x15: 0000000000000003 x14: 1800000000000000
[  +0.005590] x13: 0000000000000000 x12: 0100000000000000
[  +0.015436] x11: 00000006756df2ec x10: 00000000000009a0
[  +0.015328] x9 : ffff00000dea3960 x8 : ffff800035e39780
[  +0.015345] x7 : 0000000000000000 x6 : 0000000000000000
[  +0.015122] x5 : ffff80002f48589c x4 : 000000000000000c
[  +0.014783] x3 : ffff8000313aa400 x2 : ffff80002f48589c
[  +0.014487] x1 : 0000000000000000 x0 : 0000000000000001
[  +0.014115] Call trace:
[  +0.010777]  nfs_sb_active+0x98/0xf0
[  +0.011860]  nfs_delegation_reap_unclaimed+0x3c/0x108
[  +0.013324]  nfs4_state_end_reclaim_reboot+0x174/0x240
[  +0.013310]  nfs4_recovery_handle_error+0x5c/0x1b0
[  +0.012956]  nfs4_do_reclaim+0x214/0x250
[  +0.012076]  nfs4_state_manager+0x438/0x960
[  +0.012373]  nfs4_run_state_manager+0x2c/0x40
[  +0.012579]  kthread+0x130/0x138
[  +0.011429]  ret_from_fork+0x10/0x1c

System
Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:

  • Model = Pi3B+
  • OS and version = Both Arch ARM aarch64 and armv7h
  • Firmware version (vcgencmd version) = Dec 17 2018 23:56:39 version da468960fe03ecbaa8e3f1ee01c7217c3bd01fa8 (clean) (release)

Note that the firmware package is only available on the armv7h flavor; no firmware is available on aarch64.

  • Kernel on armv7h = Linux workbench 4.14.89-1-ARCH #1 SMP Tue Dec 18 14:02:18 EST 2018 armv7l GNU/Linux
  • Kernel on aarch64 = Linux workbench 4.19.10-1-ARCH #1 SMP Tue Dec 18 19:48:51 MST 2018 aarch64 GNU/Linux

I have been reviewing #2482 and came across this comment suggesting this setting: ethtool -K eth0 tx-tcp-segmentation off prior to the copy command.

I tested this with my rsync command above and still the bug is present.

Logs
Link to complete dmesg.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

If you could reproduce the lockup on a Raspbian image (or a stock Pi kernel from the firmware repo) it would help a lot.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

@pelwell - Testing with Raspbian Stretch Lite (November 2018) seems to successfully run the rsync task without errors (tested 9 times in a row).

$ uname -a
Linux workbench 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7l GNU/Linux

Lots of uncontrolled variables with this comparison thought.... same hardware but different architecture and kernel versions. I believe rsync and nfs-utils are also different versions. In any case, where to go from here to troubleshoot?

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

The next step would be to take the kernel and modules from our firmware repo and drop them into your ARCH system. As long as the version names differ slightly the modules should coexist happily - you just need to overwrite kernel7.img to switch.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

The next step would be to take the kernel and modules from our firmware repo and drop them into your ARCH system. As long as the version names differ slightly the modules should coexist happily - you just need to overwrite kernel7.img to switch.

OK. I have my Arch ARM armv7h image and confirmed the bug on it as with aarch64.

For the firmware, is this the file you recommend I use: https://github.com/raspberrypi/firmware/archive/1.20181112.tar.gz

I believe I have that from the raspberrypi-firmware package provided by Arch ARM.... but I am happy to manually place files to help figure out this error.

Can you provide me with some more specific steps so we're clear that I am doing the right experiment? Thanks!

EDIT: For reference, here is a complete list of the files provided by the Arch ARM raspberrypi-firmware package.

Note that /boot/kernel7.img is not provided by that package but by the linux-raspberrypi package.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

I suggest downloading the zip file from https://github.com/raspberrypi/firmware?files=1 (see Clone or Download), and using just the modules, and the kernel from /boot.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

OK. I downloaded and extracted the zip file:

/scratch/firmware-master % ls -lh
drwxr-xr-x 3 facade users  480 Dec 21 09:10 boot/
drwxr-xr-x 3 facade users   60 Dec 21 09:10 documentation/
drwxr-xr-x 2 facade users  200 Dec 21 09:10 extra/
drwxr-xr-x 3 facade users   60 Dec 21 09:10 hardfp/
drwxr-xr-x 4 facade users   80 Dec 21 09:10 modules/
drwxr-xr-x 3 facade users   60 Dec 21 09:10 opt/

To be clear, on the RPi3B+, you want to do the following?

cp -r /scratch/firmware-master/boot/* /boot/

EDIT: OK... I did it. Booting into the system got lots of systemd failures (rngd, systemd-nostnamed, systemd-modules-load, systemd-resolved, systemd-timesyncd, systemd-update-utmp) and a read-only root partition which likely caused the aforementioned failures.

I was able to log in locally and:

# mount -o remount rw,relatime /
# echo "nameserver 9.9.9.9" > /etc/resolv.conf
# systemctl restart sshd

Now I am able to connect via sshd and conduct the test. Running now. Will post back. BTW, the uname output changed as expected Linux workbench 4.14.90-v7+ #1183 SMP Fri Dec 21 14:03:50 GMT 2018 armv7l GNU/Linux

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

You only really needed to copy the kernel IMG from the boot directory. Did you copy the modules as well?

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

You only really needed to copy the kernel IMG from the boot directory. Did you copy the modules as well?

I did not copy the modules... I read your previous line too literally (ie only the things under /boot). In any case, the test completed 8 iterations without a single error.

I will replace the Arch-provided /boot contents and then copy only /scratch/firmware-master/boot/kernel7.img over and repeat the tests. Let me know if you disagree with this testing plan.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

And the modules - you do need the modules.

@graysky2
Copy link
Author

And the modules - you do need the modules.

..are you sure? Perhaps they are the reason why my root filesystem was readonly, but even without them, I can do the rsync task 8 times in a row without an error.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

You are clearly managing to do basic things without them, but lots of tasks may not be running which would make the test less representative.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

You are clearly managing to do basic things without them, but lots of tasks may not be running which would make the test less representative.

OK... where shall I copy them? The distro default is /usr/lib/modules/4.14.89-1-ARCH/ so /usr/lib/modules/ ?

Like:

cp -r /scratch/firmware-master/modules /usr/lib/modules/

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

Yes, that looks right to me.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

Interesting.

We actually missed a step in the Raspbian test, which was to use rpi-update to get the latest kernel. If you still have the Raspbian Lite card around it would help to narrow down whether this is a straight kernel issue introduced with newer builds or whether Arch is using it in a different way.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

We actually missed a step in the Raspbian test, which was to use rpi-update to get the latest kernel. If you still have the Raspbian Lite card around it would help to narrow down whether this is a straight kernel issue introduced with newer builds or whether Arch is using it in a different way.

I actually ran the following on the Raspbian box:

# apt-get update
# apt-get upgrade

But now I booted to it and ran rpi-update:

$ sudo rpi-update
 *** Raspberry Pi firmware updater by Hexxeh, enhanced by AndrewS and Dom
 *** Performing self-update
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13545  100 13545    0     0  22628      0 --:--:-- --:--:-- --:--:-- 22650
 *** Relaunching after update
 *** Raspberry Pi firmware updater by Hexxeh, enhanced by AndrewS and Dom
 *** We're running for the first time
 *** Backing up files (this will take a few minutes)
 *** Backing up firmware
 *** Backing up modules 4.14.79-v7+
#############################################################
This update bumps to rpi-4.14.y linux tree
Be aware there could be compatibility issues with some drivers
Discussion here:
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=197689
##############################################################
 *** Downloading specific firmware revision (this will take a few minutes)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   168    0   168    0     0    491      0 --:--:-- --:--:-- --:--:--   492
100 56.3M  100 56.3M    0     0  3126k      0  0:00:18  0:00:18 --:--:-- 3071k
 *** Updating firmware
 *** Updating kernel modules
 *** depmod 4.14.90-v7+
 *** depmod 4.14.90+
 *** Updating VideoCore libraries
 *** Using HardFP libraries
 *** Updating SDK
 *** Running ldconfig
 *** Storing current firmware revision
 *** Deleting downloaded files
 *** Syncing changes to disk
 *** If no errors appeared, your firmware was successfully updated to da5948d8ef8354557732d9c8f5ad5e7e24374a69
 *** A reboot is needed to activate the new firmware

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

The Raspbian kernel and firmware packages are periodically updated from the firmware repo, but rpi-update always gets the most recent builds. Please run rpi-update, but first make a note of the kernel version before updating.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

Before:

$ uname -a
Linux workbench 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7l GNU/Linux

After:

$ uname -a
Linux workbench 4.14.90-v7+ #1183 SMP Fri Dec 21 14:03:50 GMT 2018 armv7l GNU/Linux

Test running now... I will be unavailable for the next few hours unfortunately but I have it set to run for 21 iterations.

@pelwell
Copy link
Contributor

pelwell commented Dec 22, 2018

No problem. You should now have identical kernels and modules in Raspbian and Arch.

@graysky2
Copy link
Author

graysky2 commented Dec 22, 2018

OK... all 21 iterations completed without error on Raspbian (updated). I am repeating the Arch stock image + kernel7.img/modules from your zip file as a double-check.

@graysky2
Copy link
Author

@pelwell - OK. I took the fresh Arch ARM armv7h image, updated it (kernel version 4.14.90 and firmware dated 20181221). Then I replaced:

  • the distro provided /boot/kernel7.img with the one from your zip file.
  • Copied over modules/* from your zip files to /usr/lib/modules/ on the live filesystem.
  • No other modifications.

It boots fine and completed 21 iterations of the rsync copy test without error. I rebooted it and the nfs server and again, it completed 13 iterations of the rsync copy test without error (I only ran 13 this time).

Conclusion: booting with the zip file kernel and modules on an otherwise native Arch ARM armv7h image does not experience the bug.

@pelwell
Copy link
Contributor

pelwell commented Dec 23, 2018

That's good because the results are consistent, but it puts the ball in the Arch court - I don't think there is much I can do here.

@pelwell
Copy link
Contributor

pelwell commented Dec 23, 2018

Remember that, as well as a potentially different kernel commit, Arch may be using a different toolchain.

@graysky2
Copy link
Author

graysky2 commented Dec 23, 2018

@pelwell - I'm sure we are building from the same commits, for example but you are correct about the tool chain. Our current one is based on gcc v8.2.0.

What version of gcc are you using?

... is there anything obvious to you in our kernel config that might be driving this? Here is an online diff via diffchecker of the two configs to aid in reading them.

@graysky2
Copy link
Author

graysky2 commented Dec 23, 2018

I substituted the raspbian config (zcat /proc/config.gz > config) in the Arch ARM PKGBUILD and built 4.14.90 with it. Booted just fine but it causes the same errors I experience with the default Arch kernel. Does that rule-out any config differences?

[Dec23 16:20] INFO: rcu_sched self-detected stall on CPU
[  +0.000181] 	1-...: (2099 ticks this GP) idle=766/140000000000001/0 softirq=6439/6439 fqs=1050 
[  +0.000237] 	 (t=2100 jiffies g=2940 c=2939 q=28)
[  +0.000145] NMI backtrace for cpu 1
[  +0.000110] CPU: 1 PID: 583 Comm: 10.1.8.228-mana Not tainted 4.14.90-1-ARCH #1
[  +0.000199] Hardware name: BCM2835
[  +0.000129] [<801100f8>] (unwind_backtrace) from [<8010c084>] (show_stack+0x20/0x24)
[  +0.000225] [<8010c084>] (show_stack) from [<8078b4c8>] (dump_stack+0xcc/0x110)
[  +0.000213] [<8078b4c8>] (dump_stack) from [<807912e0>] (nmi_cpu_backtrace+0x90/0xc4)
[  +0.000229] [<807912e0>] (nmi_cpu_backtrace) from [<8079142c>] (nmi_trigger_cpumask_backtrace+0x118/0x12c)
[  +0.000272] [<8079142c>] (nmi_trigger_cpumask_backtrace) from [<8010e77c>] (arch_trigger_cpumask_backtrace+0x20/0x24)
[  +0.000298] [<8010e77c>] (arch_trigger_cpumask_backtrace) from [<80186224>] (rcu_dump_cpu_stacks+0xa8/0xd0)
[  +0.004244] [<80186224>] (rcu_dump_cpu_stacks) from [<80185b9c>] (rcu_check_callbacks+0x808/0x9c0)
[  +0.008345] [<80185b9c>] (rcu_check_callbacks) from [<8018be40>] (update_process_times+0x44/0x74)
[  +0.008265] [<8018be40>] (update_process_times) from [<8019e348>] (tick_sched_handle+0x64/0x70)
[  +0.008403] [<8019e348>] (tick_sched_handle) from [<8019e5a4>] (tick_sched_timer+0x50/0xac)
[  +0.008413] [<8019e5a4>] (tick_sched_timer) from [<8018ce98>] (__hrtimer_run_queues+0x17c/0x310)
[  +0.008604] [<8018ce98>] (__hrtimer_run_queues) from [<8018d2d0>] (hrtimer_interrupt+0xbc/0x224)
[  +0.008549] [<8018d2d0>] (hrtimer_interrupt) from [<806412ac>] (arch_timer_handler_phys+0x38/0x40)
[  +0.008650] [<806412ac>] (arch_timer_handler_phys) from [<8017acf0>] (handle_percpu_devid_irq+0x8c/0x240)
[  +0.008686] [<8017acf0>] (handle_percpu_devid_irq) from [<80175648>] (generic_handle_irq+0x34/0x44)
[  +0.008728] [<80175648>] (generic_handle_irq) from [<80175c98>] (__handle_domain_irq+0x6c/0xc4)
[  +0.008756] [<80175c98>] (__handle_domain_irq) from [<801014ec>] (bcm2836_arm_irqchip_handle_irq+0xac/0xb0)
[  +0.008707] [<801014ec>] (bcm2836_arm_irqchip_handle_irq) from [<807a59fc>] (__irq_svc+0x5c/0x7c)
[  +0.008815] Exception stack(0xae507e00 to 0xae507e48)
[  +0.004433] 7e00: add2d400 add2ddac 00000001 add2d460 00000000 00000000 add2dd98 ad1b98a8
[  +0.008931] 7e20: ad1b9800 ad1b98a8 ad1b9800 ae507e64 add2dc00 ae507e50 00000003 8039efa8
[  +0.009459] 7e40: 200f0013 ffffffff
[  +0.004730] [<807a59fc>] (__irq_svc) from [<8039efa8>] (nfs_sb_active+0x84/0xa8)
[  +0.009474] [<8039efa8>] (nfs_sb_active) from [<803d648c>] (nfs_delegation_reap_unclaimed+0x28/0xd4)
[  +0.009426] [<803d648c>] (nfs_delegation_reap_unclaimed) from [<803d143c>] (nfs4_state_clear_reclaim_reboot+0x108/0x110)
[  +0.009466] [<803d143c>] (nfs4_state_clear_reclaim_reboot) from [<803d1cb0>] (nfs4_recovery_handle_error+0x58/0x15c)
[  +0.009271] [<803d1cb0>] (nfs4_recovery_handle_error) from [<803d2bd8>] (nfs4_do_reclaim+0x43c/0x7b8)
[  +0.009148] [<803d2bd8>] (nfs4_do_reclaim) from [<803d32e4>] (nfs4_run_state_manager+0x390/0x74c)
[  +0.009116] [<803d32e4>] (nfs4_run_state_manager) from [<8013dfe0>] (kthread+0x170/0x178)
[  +0.009033] [<8013dfe0>] (kthread) from [<8010820c>] (ret_from_fork+0x14/0x28)

@graysky2
Copy link
Author

To try to get at the differences in the tool chain, I used the Arch packages from March 12th, 2017 (actually the entire system from that date) which provides gcc version 6.3.1. I see that the Raspbian image I have is using gcc 6.3.0-18+rpi+deb9u1 and the Arch ARM version I cited is the nearest match although obviously not identical.

I built kernel version 4.14.90 using it and after I booted into it, the rsync test failed with similar errors (system was frozen so I could not ssh in to paste out the dmesg output/I didn't see anything related to the errors in the journalctl output.

@lategoodbye
Copy link
Contributor

@graysky2 Did you try to copy the ARCH build kernel on the Raspbian SD card?

@graysky2
Copy link
Author

graysky2 commented Dec 26, 2018

@lategoodbye - Good idea. I copied the Arch modules and kernel7.img onto the raspbian image and the test completed without error (21 iterations). Puzzling indeed. Points to NFS or ...?

@lategoodbye
Copy link
Contributor

Good. This is definitely a kernel issue, but only the Arch rootfs seems to trigger it.
Here some ideas to narrow down the issue:

  • try to disable all unused processes on your Arch system and check if it's still occur
  • is this issue reproducible with a RPi 3 B?
  • is this issue reproducible with a mainline kernel (suggest 4.19)?
  • try to compare all relevant kernel & network configuration

The most important is to identify the trigger.

@graysky2
Copy link
Author

graysky2 commented Dec 26, 2018

@lategoodbye - Thanks for your help. Let me take these in order:

  • I tried stripping it down to a minimal set with the same results (list shown at the bottom).
  • I cannot reproduce this on a RPi3B on Arm ARM... it seems unique to the RPi3B+ only.
  • I don't think Arch ARM offers an armv7h kernel that isn't in the 4.14.x series BUT I have confirmed the bug on Arch ARM aarch64 (4.19.10 at the time).
  • To compare the config options differences, I built the Arch ARM kernel using the Raspbian config and got the same result (bug). See this post for details.

As to networking.... by default, Arch ARM uses systemd-networkd to manage the onboard NIC whereas Raspbian uses ifupdown to do it. I compiled ifupdown on Arch ARM but I couldn't get the resulting binaries to work. I tried on Arch ARM using dhcpcd to manage the NIC, as well as manually calling ip from iproute2 to do it, but the system locks up when I tried to rsync using both of these indepednely.

Network summary: counting systemd-networkd, that's 3 different network managing tools all of which don't seem to change the result.

I did notice that Raspbian uses an older version of NFS: nfs-common-1:1.3.4-2.1 whereas Arch ARM is using the current upstream of 2.3.3... but there are many differences in the distro-provided versions of things so without more to go on, it's a fishing expedition.

I did install nfs-utils-1.3.4-1 for Arch ARM and tried the same test. It too failed. I made sure the mount options between Raspbian and Arch ARM were identical for this test.

% mount | grep nfs
10.1.9.228:/scratch on /scratch type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.9.108,local_lock=none,addr=10.1.9.228)

List of user-enabled systemd services I mentioned as the minimal set I tried on ArchARM

% tree /etc/systemd/system
/etc/systemd/system
├── dbus-org.freedesktop.network1.service -> /usr/lib/systemd/system/systemd-networkd.service
├── dbus-org.freedesktop.resolve1.service -> /usr/lib/systemd/system/systemd-resolved.service
├── getty.target.wants
│   └── [email protected] -> /usr/lib/systemd/system/[email protected]
├── multi-user.target.wants
│   ├── remote-fs.target -> /usr/lib/systemd/system/remote-fs.target
│   ├── sshd.service -> /usr/lib/systemd/system/sshd.service
│   ├── systemd-networkd.service -> /usr/lib/systemd/system/systemd-networkd.service
│   ├── systemd-resolved.service -> /usr/lib/systemd/system/systemd-resolved.service
├── network-online.target.wants
│   └── systemd-networkd-wait-online.service -> /usr/lib/systemd/system/systemd-networkd-wait-online.service
├── sockets.target.wants
│   └── systemd-networkd.socket -> /usr/lib/systemd/system/systemd-networkd.socket
├── sysinit.target.wants
│   ├── rngd.service -> /usr/lib/systemd/system/rngd.service
│   └── systemd-timesyncd.service -> /usr/lib/systemd/system/systemd-timesyncd.service

Same output from Raspbian:

$ tree /etc/systemd/system
/etc/systemd/system
├── [email protected]
├── bluetooth.target.wants
│   └── bluetooth.service -> /lib/systemd/system/bluetooth.service
├── dbus-org.bluez.service -> /lib/systemd/system/bluetooth.service
├── dbus-org.freedesktop.Avahi.service -> /lib/systemd/system/avahi-daemon.service
├── dhcpcd.service.d
│   └── wait.conf
├── dhcpcd5.service -> /lib/systemd/system/dhcpcd.service
├── getty.target.wants
│   └── [email protected] -> /lib/systemd/system/[email protected]
├── [email protected]
│   └── noclear.conf
├── halt.target.wants
│   └── rpi-display-backlight.service -> /lib/systemd/system/rpi-display-backlight.service
├── multi-user.target.wants
│   ├── avahi-daemon.service -> /lib/systemd/system/avahi-daemon.service
│   ├── console-setup.service -> /lib/systemd/system/console-setup.service
│   ├── cron.service -> /lib/systemd/system/cron.service
│   ├── dhcpcd.service -> /lib/systemd/system/dhcpcd.service
│   ├── hciuart.service -> /lib/systemd/system/hciuart.service
│   ├── networking.service -> /lib/systemd/system/networking.service
│   ├── nfs-client.target -> /lib/systemd/system/nfs-client.target
│   ├── raspberrypi-net-mods.service -> /lib/systemd/system/raspberrypi-net-mods.service
│   ├── remote-fs.target -> /lib/systemd/system/remote-fs.target
│   ├── rsync.service -> /lib/systemd/system/rsync.service
│   ├── rsyslog.service -> /lib/systemd/system/rsyslog.service
│   ├── ssh.service -> /lib/systemd/system/ssh.service
│   ├── sshswitch.service -> /lib/systemd/system/sshswitch.service
│   ├── triggerhappy.service -> /lib/systemd/system/triggerhappy.service
│   └── wifi-country.service -> /lib/systemd/system/wifi-country.service
├── network-online.target.wants
│   └── networking.service -> /lib/systemd/system/networking.service
├── poweroff.target.wants
│   └── rpi-display-backlight.service -> /lib/systemd/system/rpi-display-backlight.service
├── rc-local.service.d
│   └── ttyoutput.conf
├── reboot.target.wants
│   └── rpi-display-backlight.service -> /lib/systemd/system/rpi-display-backlight.service
├── remote-fs.target.wants
│   └── nfs-client.target -> /lib/systemd/system/nfs-client.target
├── sockets.target.wants
│   ├── avahi-daemon.socket -> /lib/systemd/system/avahi-daemon.socket
│   └── triggerhappy.socket -> /lib/systemd/system/triggerhappy.socket
├── sysinit.target.wants
│   ├── fake-hwclock.service -> /lib/systemd/system/fake-hwclock.service
│   ├── keyboard-setup.service -> /lib/systemd/system/keyboard-setup.service
│   └── systemd-timesyncd.service -> /lib/systemd/system/systemd-timesyncd.service
├── syslog.service -> /lib/systemd/system/rsyslog.service
└── timers.target.wants
    ├── apt-daily-upgrade.timer -> /lib/systemd/system/apt-daily-upgrade.timer
    └── apt-daily.timer -> /lib/systemd/system/apt-daily.timer

@lategoodbye
Copy link
Contributor

@graysky2 Can you point me to the Arch armv7h images ( https://archlinuxarm.org/platforms/armv8/broadcom/raspberry-pi-3 )?

@graysky2
Copy link
Author

graysky2 commented Dec 26, 2018

For armv7h, grab this/4.14.x series of kernels and the officially supported firmware. Despite the filename, it is for RPi2 and RPi3. Alternatively, if you want the 64-bit image (aarch64), grab this/4.19.x series of kernels but no firware/hardware decoding etc. I see the bug on both images.

@graysky2
Copy link
Author

graysky2 commented Dec 26, 2018

This is interesting: if I use rysnc over ssh to either a read or to write, I do not experience the bug. I only get the bug if I use rsync from the mount to the mount. Does that have any implications?

In other words:
/scratch is the NFS mount (on the RPi3B+):
This triggers the bug usually during the first iteration but sometimes it takes up to the 4th:

 if ! mountpoint -q /scratch; then
   mount 10.1.9.228:/scratch /scratch
 fi

 for i in {1..9}; do
    rsync -a --delete-after -W -x -q /scratch/armc8/root/ /scratch/armc8/facade.$i
  fone

BUT neither this nor the one after it triggers the bug (I have run it up to 21 iterations):

 if ! mountpoint -q /scratch; then
   mount 10.1.9.228:/scratch /scratch
 fi

 # read from NFS and write over ssh
 for i in {1..21}; do
   rsync -a --delete-after -W -x -q /scratch/armc8/root/ [email protected]:/scratch/armc8/facade.$i
  done

Or

 if ! mountpoint -q /scratch; then
   mount 10.1.9.228:/scratch /scratch
 fi

 # read from ssh and write over NFS
 for i in {1..21}; do
   rsync -a --delete-after -W -x -q [email protected]:/scratch/armc8/root /scratch/armc8/facade.$i
 done

@graysky2
Copy link
Author

graysky2 commented Dec 27, 2018

Perhaps this bug is triggered by saturating the lan78xx bus. I was reading the rysnc man page and found the --bwlmit switch:

--bwlimit=RATE limit socket I/O bandwidth

I ran the script increasing the RATE value from 2000 up to 11000 in steps of 1000. To my surprise, I found I could get the box to run through 9 iterations (no errors) until I set it to 10000. Just to verify, I also used a value of 11000.

#!/bin/bash
if ! mountpoint -q /scratch; then
  mount 10.1.9.228:/scratch /scratch || exit 1
fi

# values tried and successful completion include 
# 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000
# a setting of 10000 triggers the bug
# a setting of 11000 triggers the bug

for i in {1..9}; do
 time rsync -a --delete-after -W -x --bwlimit=11000 \
   --info=progress2 --info=name0 /scratch/armc8/root/ /scratch/armc8/facade.$i
 sleep 1s
done

@lategoodbye
Copy link
Contributor

@graysky2 Are you able to make a wireshark trace (i think a minute before until 4 minutes after the lockup should be sufficient)?

@graysky2
Copy link
Author

graysky2 commented Dec 28, 2018 via email

@lategoodbye
Copy link
Contributor

lategoodbye commented Dec 30, 2018

wireshark is a graphical to capture network traffic. There are two options where you can capture the traffic, either on the server side or on the client side. I assume you want to capture on the Raspberry Pi side because of privacy or permission. In this case i would recommend to use tcpdump as a commandline replacement for wireshark.

You can start tracing the network with following command (not recommend via SSH connection, make sure there is enough disc space)
tcpdump -i eth0 -w trace.pcap

With Strg+C you can finish the trace. Drawback in this case is that you don't know when issue appears. So you will need to restart with every rsync attempt.

@graysky2
Copy link
Author

graysky2 commented Dec 31, 2018

I will try tcpdump and find a way to share the output.

EDIT: I have a trace now but even compressed with gzip -9 it's over 130 MB. The resulting trace.pcap files was over 450 MB. I don't know what data are included in there after browsing it through wireshark-qt or whether it will be helpful in diagnosing this...

@pelwell - Are you still following this ticket? If you see this comment I posted introducing the bandwidth saturation possible triggering the bug hypothesis?

@lategoodbye
Copy link
Contributor

@graysky2 This size is expected. You only need to provide a download link and a timestamp when the issue occurred.

@JamesH65
Copy link
Contributor

JamesH65 commented Jan 7, 2019

Do you have this patch, along with turning TSO off via the module parameter? This was a fix required when transferring large files on the 3B+

5762758

@graysky2
Copy link
Author

graysky2 commented Jan 7, 2019

@JamesH65 - I believe that patch is part of the kernel now, no? Can you share the module parameter you referenced and how you are applying it? I am happy to test.

@pelwell
Copy link
Contributor

pelwell commented Jan 7, 2019

The patch disables TSO by default, with a module parameter (enable_tso) to re-enable it.

@JamesH65
Copy link
Contributor

JamesH65 commented Jan 7, 2019

@graysky2 I don't know which kernel you are building, and the symptoms are similar to the ones that patch fixed, so thought I would suggest it. It may well be in your kernel already, in which case a red herring.

@graysky2
Copy link
Author

graysky2 commented Jan 7, 2019

@JamesH65 - I get this bug on armv7h kernels as recent as 4.14.90. I'm pretty sure the patch was applied months ago to the 4.14.x series.

@graysky2
Copy link
Author

For others experiencing this bug, refer to this comment. For me, a work-around is to append the --bwlimit=9000 switch to the rsync command. For whatever reason, it seems to prevent triggering this bug.

@JamesH65
Copy link
Contributor

That's pretty interesting, good find.

@lategoodbye
Copy link
Contributor

@graysky2 Still no chance to upload the tcpdump trace to a cloud service? Another possibility to reduce the filesize is to open the pcap in Wireshark and re-export only the relevant time.

@phil0u
Copy link

phil0u commented Apr 27, 2019

Hi,
Let me know if you want me to open a dedicated thread, but I feel this is the same problem I'm facing:

Newly bought RBP3B+, with a AUKRU 5V3A power supply.
I'm trying to use this setup for a second Kodi media player (other one is a Pi 3, works 24/365 without a glitch)
I first tried LibreELEC distro LibreELEC-RPi2.arm-9.0.1.img then LibreELEC-RPi2.arm-8.2.5.img then I tried the latest official RASPBIAN distro: 2019-04-08-raspbian-stretch-full.img, which I'm using right now for tests.

Symptom is the same every time : any light load on the ethernet link will lockup the Pi3B+ within seconds. A simple "apt-get update" will lockup the Pi3B+

No problem if using the WLAN. So the problem is definitely linked to the lan78xx/Ethernet adapter

I've switched SD cards, same behavior.
I've read most threads I could find regarding this issue : I tried forcing the link to 10/100/1000. Lockups each time.
I disabled EEE (dtparam=eee=off), no change
I tried disabling tx-checksumming ( ethtool -K eth0 tx off), no change
I'm running latest firmware so TSO is already disabled :
rpi-update has the latest firmware.

$ uname -a
Linux raspberrypi 4.19.36-v7+ #1213 SMP Thu Apr 25 15:08:02 BST 2019 armv7l GNU/Linux

$ vcgencmd version
Mar 27 2019 15:45:53
Copyright (c) 2012 Broadcom
version 2e98b31d18547962e564bdf88e57b3df7085c29b (clean) (release) (start)

$ cat /etc/debian_version
9.8

I've connected the Pi3B+ to:
1/ a Power Line Communication ethernet adapter , 100Mbits -> apt-get update will lockup
2/ a ISP router -> apt-get update will lockup
3/ a managed Cisco switch with flow control enabled on the port -> apt-get update will lockup

I made a tcpdump capture using a network SPAN on the Cisco Switch, and i can see some packets being DUP ack'ed up to five times and multiple TCP window update packets.

The file being retrieved is http://raspbian.raspberrypi.org/raspbian/dists/stretch/main/binary-armhf/Packages.xz and the capture stops after 5Mo (out of 11Mo)

Let me know anything I could try (new kernel, other firmware,...) in order to solve this bug, as for now, this makes my Pi3B+ totally unusable.

Thanks,

@JamesH65
Copy link
Contributor

@phil0u This sound like a HW fault - ethernet should simply work well out of the box with the standard Raspbian. I just tried gabbing that file on a Pi3B+ and all was fine. I'd suggest sending it back for replacement.

@phil0u
Copy link

phil0u commented May 1, 2019

@JamesH65 Thanks for the advice. I spent some more time trying to install older LibreELEC version which included kernel fixes (after reading the following thread #2608 ), still same problem (4.14 kernel).
Tried with latest LibreELEC nightly build with 5.0.7 kernel, same problem.
Will send the unit back.
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants