Lockup when copying files on NFS share on RBP3B+ #2788
Comments
If you could reproduce the lockup on a Raspbian image (or a stock Pi kernel from the firmware repo) it would help a lot. |
@pelwell - Testing with Raspbian Stretch Lite (November 2018) seems to successfully run the rsync task without errors (tested 9 times in a row).
Lots of uncontrolled variables with this comparison though... same hardware but different architecture and kernel versions. I believe rsync and nfs-utils are also different versions. In any case, where to go from here to troubleshoot? |
The next step would be to take the kernel and modules from our firmware repo and drop them into your ARCH system. As long as the version names differ slightly the modules should coexist happily - you just need to overwrite kernel7.img to switch. |
OK. I have my Arch ARM armv7h image and confirmed the bug on it as with aarch64. For the firmware, is this the file you recommend I use: https://github.com/raspberrypi/firmware/archive/1.20181112.tar.gz I believe I have that from the raspberrypi-firmware package provided by Arch ARM.... but I am happy to manually place files to help figure out this error. Can you provide me with some more specific steps so we're clear that I am doing the right experiment? Thanks! EDIT: For reference, here is a complete list of the files provided by the Arch ARM raspberrypi-firmware package. Note that |
I suggest downloading the zip file from https://github.com/raspberrypi/firmware?files=1 (see Clone or Download), and using just the modules, and the kernel from /boot. |
OK. I downloaded and extracted the zip file:
To be clear, on the RPi3B+, you want to do the following?
EDIT: OK... I did it. Booting into the system produced lots of systemd failures (rngd, systemd-hostnamed, systemd-modules-load, systemd-resolved, systemd-timesyncd, systemd-update-utmp) and a read-only root partition, which likely caused the aforementioned failures. I was able to log in locally and:
Now I am able to connect via sshd and conduct the test. Running now. Will post back. BTW, the uname output changed as expected |
You only really needed to copy the kernel IMG from the boot directory. Did you copy the modules as well? |
I did not copy the modules... I read your previous line too literally (ie only the things under /boot). In any case, the test completed 8 iterations without a single error. I will replace the Arch-provided /boot contents and then copy only |
And the modules - you do need the modules. |
..are you sure? Perhaps they are the reason why my root filesystem was readonly, but even without them, I can do the rsync task 8 times in a row without an error. |
You are clearly managing to do basic things without them, but lots of tasks may not be running which would make the test less representative. |
OK... where shall I copy them? The distro default is Like:
|
Yes, that looks right to me. |
Interesting. We actually missed a step in the Raspbian test, which was to use rpi-update to get the latest kernel. If you still have the Raspbian Lite card around it would help to narrow down whether this is a straight kernel issue introduced with newer builds or whether Arch is using it in a different way. |
I actually ran the following on the Raspbian box:
But now I booted to it and ran
|
The Raspbian kernel and firmware packages are periodically updated from the firmware repo, but rpi-update always gets the most recent builds. Please run rpi-update, but first make a note of the kernel version before updating. |
Before:
After:
Test running now... I will be unavailable for the next few hours unfortunately but I have it set to run for 21 iterations. |
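A repeated-run test like the one described above can be driven by a small wrapper. This is only a minimal sketch, not the script used in this thread: the function name `run_iterations` and the `SRC`/`DEST` placeholders are assumptions made for illustration.

```shell
#!/bin/sh
# Run a command up to N times, stopping at the first failure.
# Usage: run_iterations N command [args...]
# NOTE: run_iterations is a hypothetical helper, not part of any tool
# mentioned in this thread.
run_iterations() {
    total=$1
    shift
    i=1
    while [ "$i" -le "$total" ]; do
        echo "iteration $i of $total"
        if ! "$@"; then
            echo "iteration $i failed" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $total iterations completed"
}

# Hypothetical invocation. SRC and DEST stand for two directories on the
# same NFS mount; they are placeholders, not the paths used in this thread.
# rsync skips files that are already up to date, so DEST would need to be
# emptied between runs for every iteration to copy the full tree.
# run_iterations 21 rsync -a "$SRC" "$DEST"
```

Stopping at the first failure keeps the system in the failing state so that `dmesg` can be inspected before anything else runs.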
No problem. You should now have identical kernels and modules in Raspbian and Arch. |
OK... all 21 iterations completed without error on Raspbian (updated). I am repeating the Arch stock image + kernel7.img/modules from your zip file as a double-check. |
@pelwell - OK. I took the fresh Arch ARM armv7h image, updated it (kernel version 4.14.90 and firmware dated 20181221). Then I replaced:
It boots fine and completed 21 iterations of the rsync copy test without error. I rebooted it and the nfs server and again, it completed 13 iterations of the rsync copy test without error (I only ran 13 this time). Conclusion: booting with the zip file kernel and modules on an otherwise native Arch ARM armv7h image does not experience the bug. |
That's good because the results are consistent, but it puts the ball in the Arch court - I don't think there is much I can do here. |
Remember that, as well as a potentially different kernel commit, Arch may be using a different toolchain. |
@pelwell - I'm sure we are building from the same commits, for example but you are correct about the tool chain. Our current one is based on gcc v8.2.0. What version of gcc are you using? ... is there anything obvious to you in our kernel config that might be driving this? Here is an online diff via diffchecker of the two configs to aid in reading them. |
I substituted the raspbian config (
|
To try to get at the differences in the tool chain, I used the Arch packages from March 12th, 2017 (actually the entire system from that date) which provides gcc version 6.3.1. I see that the Raspbian image I have is using gcc 6.3.0-18+rpi+deb9u1 and the Arch ARM version I cited is the nearest match although obviously not identical. I built kernel version 4.14.90 using it and after I booted into it, the rsync test failed with similar errors (the system was frozen so I could not ssh in to paste the dmesg output, and I didn't see anything related to the errors in the journalctl output). |
@graysky2 Did you try to copy the ARCH build kernel on the Raspbian SD card? |
@lategoodbye - Good idea. I copied the Arch modules and |
Good. This is definitely a kernel issue, but only the Arch rootfs seems to trigger it.
The most important thing is to identify the trigger. |
@lategoodbye - Thanks for your help. Let me take these in order:
As to networking.... by default, Arch ARM uses Network summary: counting I did notice that Raspbian uses an older version of NFS: I did install
List of user-enabled systemd services I mentioned as the minimal set I tried on ArchARM
Same output from Raspbian:
|
@graysky2 Can you point me to the Arch armv7h images ( https://archlinuxarm.org/platforms/armv8/broadcom/raspberry-pi-3 )? |
This is interesting: if I use rsync over ssh to either read or write, I do not experience the bug. I only get the bug if I use rsync from the mount to the mount. Does that have any implications? In other words:
BUT neither this nor the one after it triggers the bug (I have run it up to 21 iterations):
Or
|
Perhaps this bug is triggered by saturating the lan78xx bus. I was reading the
I ran the script increasing the RATE value from 2000 up to 11000 in steps of 1000. To my surprise, I found I could get the box to run through 9 iterations (no errors) until I set it to 10000. Just to verify, I also used a value of 11000.
|
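The sweep described above (2000 up to 11000 in steps of 1000) can be sketched as follows. The original script is not preserved here, so this is a guess at its shape: it assumes that RATE maps to rsync's `--bwlimit` option, and `sweep_rates`, `plan_sweep`, `SRC` and `DEST` are hypothetical names.

```shell
#!/bin/sh
# Print every rate from START to END (inclusive) in steps of STEP.
# Usage: sweep_rates START END STEP
sweep_rates() {
    current=$1
    while [ "$current" -le "$2" ]; do
        echo "$current"
        current=$((current + $3))
    done
}

# Print the rsync command that would be run for each rate, without
# running anything.
# ASSUMPTION: RATE corresponds to rsync's --bwlimit option.
# SRC and DEST are placeholders for the two directories on the NFS mount.
plan_sweep() {
    for r in $(sweep_rates "$1" "$2" "$3"); do
        echo "rsync -a --bwlimit=$r SRC DEST"
    done
}

# Example: list the ten commands for the sweep described in the comment.
# plan_sweep 2000 11000 1000
```

Printing the plan first makes it easy to confirm the range and step before committing to a long series of runs.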
@graysky2 Are you able to make a wireshark trace (I think from a minute before until 4 minutes after the lockup should be sufficient)? |
I've never used wireshark before but I'd be happy to do it if you can send me the stanza to provide the output you're seeking.
|
wireshark is a graphical tool to capture network traffic. There are two options for where to capture the traffic: either on the server side or on the client side. I assume you want to capture on the Raspberry Pi side because of privacy or permissions. In this case I would recommend using tcpdump as a command-line replacement for wireshark. You can start tracing the network with the following command (not recommended via an SSH connection; make sure there is enough disk space) With Ctrl+C you can finish the trace. The drawback in this case is that you don't know when the issue will appear, so you will need to restart the capture with every rsync attempt. |
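The exact command from the comment above is not preserved in this copy of the thread. As a rough sketch only, a typical capture limited to traffic between the Pi and the NFS server might be assembled like this; the interface name, server address, output path and the helper name `capture_cmd` are all assumptions.

```shell
#!/bin/sh
# Print a tcpdump command line that captures traffic to and from one host
# on one interface and writes the packets to a pcap file.
#   -i  interface to capture on
#   -s  snapshot length (0 selects tcpdump's default, i.e. whole packets)
#   -w  write raw packets to a file instead of printing them
# NOTE: capture_cmd is a hypothetical helper; every argument is a placeholder.
capture_cmd() {
    iface=$1
    server=$2
    outfile=$3
    echo "tcpdump -i $iface -s 0 -w $outfile host $server"
}

# Example: print the command for review, then run it as root from a
# local console.
# capture_cmd eth0 192.0.2.10 /tmp/nfs-lockup.pcap
```

Restricting the capture with a `host` filter keeps unrelated traffic out of the file, which helps with both size and privacy.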
I will try EDIT: I have a trace now but even compressed with @pelwell - Are you still following this ticket? Did you see this comment I posted introducing the hypothesis that bandwidth saturation may trigger the bug? |
@graysky2 This size is expected. You only need to provide a download link and a timestamp when the issue occurred. |
Do you have this patch, along with turning TSO off via the module parameter? This was a fix required when transferring large files on the 3B+ |
@JamesH65 - I believe that patch is part of the kernel now, no? Can you share the module parameter you referenced and how you are applying it? I am happy to test. |
The patch disables TSO by default, with a module parameter ( |
@graysky2 I don't know which kernel you are building, and the symptoms are similar to the ones that patch fixed, so thought I would suggest it. It may well be in your kernel already, in which case a red herring. |
@JamesH65 - I get this bug on armv7h kernels as recent as 4.14.90. I'm pretty sure the patch was applied months ago to the 4.14.x series. |
For others experiencing this bug, refer to this comment. For me, a work-around is to append the |
That's pretty interesting, good find. |
@graysky2 Still no chance to upload the tcpdump trace to a cloud service? Another possibility to reduce the filesize is to open the pcap in Wireshark and re-export only the relevant time. |
Hi,
Newly bought RBP3B+, with an AUKRU 5V3A power supply. The symptom is the same every time: any light load on the ethernet link will lock up the Pi3B+ within seconds. A simple "apt-get update" will lock up the Pi3B+. No problem if using the WLAN, so the problem is definitely linked to the lan78xx/Ethernet adapter. I've switched SD cards, same behavior.
$ uname -a
$ vcgencmd version
$ cat /etc/debian_version
I've connected the Pi3B+ to:
I made a tcpdump capture using a network SPAN on the Cisco switch, and I can see some packets being DUP ack'ed up to five times and multiple TCP window update packets. The file being retrieved is http://raspbian.raspberrypi.org/raspbian/dists/stretch/main/binary-armhf/Packages.xz and the capture stops after 5 MB (out of 11 MB).
Let me know anything I could try (new kernel, other firmware,...) in order to solve this bug, as for now, this makes my Pi3B+ totally unusable.
Thanks, |
@phil0u This sounds like a HW fault - ethernet should simply work well out of the box with the standard Raspbian. I just tried grabbing that file on a Pi3B+ and all was fine. I'd suggest sending it back for replacement. |
@JamesH65 Thanks for the advice. I spent some more time trying to install an older LibreELEC version which included kernel fixes (after reading thread #2608); still the same problem (4.14 kernel). |
Describe the bug
Errors and timeouts when using rsync to copy files from an NFS share to the same NFS share.
Perhaps this is related to #2482
To reproduce
dmesg -e -w
as the rsync job grinds on.
Expected behaviour
Rsync should complete without errors.
The source dir,
/scratch/armc8/root/
contains approx 26,600 files around 760 MB.
Actual behaviour
Rsync does not complete; dmesg reports a number of problems.
I attached the complete dmesg output in the logs section below but here is a partial output illustrating the error:
System
Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:
vcgencmd version = Dec 17 2018 23:56:39 version da468960fe03ecbaa8e3f1ee01c7217c3bd01fa8 (clean) (release)
Note that the firmware package is only available on the armv7h flavor; no firmware is available on aarch64.
Linux workbench 4.14.89-1-ARCH #1 SMP Tue Dec 18 14:02:18 EST 2018 armv7l GNU/Linux
Linux workbench 4.19.10-1-ARCH #1 SMP Tue Dec 18 19:48:51 MST 2018 aarch64 GNU/Linux
I have been reviewing #2482 and came across this comment suggesting this setting:
ethtool -K eth0 tx-tcp-segmentation off
prior to the copy command.
I tested this with my rsync command above and the bug is still present.
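When testing a setting like this, it can help to confirm that it actually took effect before starting the copy. The following is a minimal sketch that reads the output of `ethtool -k` (lower-case `k`, which lists features) and reports the state of one feature; the helper name `feature_state` is hypothetical and the interface name `eth0` is taken from the command above.

```shell
#!/bin/sh
# Read `ethtool -k <iface>` output on stdin and print the state ("on" or
# "off") of the named offload feature.
# Usage: ethtool -k eth0 | feature_state tx-tcp-segmentation
# NOTE: feature_state is a hypothetical helper, not part of ethtool.
feature_state() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # drop leading indentation
            if (index(line, name ":") == 1) {   # line starts with "name:"
                split(line, parts, /:[ \t]*/)   # parts[2] holds the value
                split(parts[2], words, " ")     # drop a trailing "[fixed]"
                print words[1]
            }
        }'
}

# Example: disable the feature, then confirm it now reads "off".
# ethtool -K eth0 tx-tcp-segmentation off
# ethtool -k eth0 | feature_state tx-tcp-segmentation
```

Matching on the exact feature name avoids picking up similarly named entries such as `tx-tcp-ecn-segmentation`.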
Logs
Link to complete dmesg.