Kernel warning and network failure when attempting to use the network after bootloader times out. #144

Closed
dickontoo opened this issue Jun 3, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@dickontoo

If the Pi 4 is booted with a boot order that tries the network before the ultimately successful boot device, and no network cable is plugged in, the kernel crashes when it attempts to use the network once the link is restored. I am using NFS boot, but I don't believe that is related. On a control-alt-delete reboot, the bootloader no longer displays the diagnostics screen.

Kernel version:
[ 0.000000] Linux version 5.4.42-v7l+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)) #1319 SMP Wed May 20 14:12:03 BST 2020

To Reproduce:

Set BOOT_ORDER to, say, 0xf124. Remove all USB bootable devices. Install a bootable uSD card. Remove the network cable. Boot the Pi.

The bootloader will time out waiting for the USB and network stages, then fall back to the uSD card. At this point, the kernel should boot. With an NFS root, it will wait indefinitely for a cable. Plugging the cable in at any time after the network boot timeout, including during the rainbow screen, will cause the kernel to crash when it attempts to send a packet.
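
For readers unfamiliar with BOOT_ORDER, the sequence described above can be sketched as follows. This is only an illustration, assuming the nibble meanings given in the Raspberry Pi bootloader documentation (they are not stated in this thread) and assuming the value is read one hex digit at a time from the least significant digit. The table `BOOT_MODES` and the function `decode_boot_order` are hypothetical names, and only the modes relevant to this report are listed.

```python
# Boot-mode values assumed from the Pi 4 bootloader documentation.
# Only the modes used in this report are listed; anything else is
# reported as UNKNOWN by this sketch.
BOOT_MODES = {
    0x1: "SD CARD",
    0x2: "NETWORK",
    0x4: "USB-MSD",
    0xF: "RESTART",
}


def decode_boot_order(value):
    """Return the boot modes in the order the bootloader tries them.

    Assumption: BOOT_ORDER is consumed one hex digit at a time,
    starting from the least significant digit.
    """
    modes = []
    while value:
        nibble = value & 0xF
        modes.append(BOOT_MODES.get(nibble, f"UNKNOWN(0x{nibble:x})"))
        value >>= 4
    return modes


print(decode_boot_order(0xF124))
# ['USB-MSD', 'NETWORK', 'SD CARD', 'RESTART']
```

Under those assumptions, 0xf124 means USB first, then network, then the uSD card, which matches the order of timeouts seen here.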

Expected behaviour

The kernel runs as usual.

Screenshots

Backtrace:

[    0.000000] Kernel command line: coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_compat_alsa=0 snd_bcm2835.enable_hdmi=1 snd_bcm2835.enable_headphones=1 video=HDMI-A-1:3840x2160M@60,margin_left=48,margin_right=48,margin_top=48,margin_bottom=48 smsc95xx.macaddr=DC:A6:32:03:0F:D3 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000  dwc_otg.lpm_enable=0 console=tty1 console=ttyS0,115200 root=/dev/nfs elevator=deadline rootwait rw nfsroot=172.29.23.1:/var/local/nfsroot/telly-buster ip=dhcp consoleblank=0 net.ifnames=0 netbooted
[...]
[    2.413704] bcmgenet fd580000.ethernet: failed to get enet clock
[    2.413788] bcmgenet fd580000.ethernet: GENET 5.0 EPHY: 0x0000
[    2.413869] bcmgenet fd580000.ethernet: failed to get enet-wol clock
[    2.413954] bcmgenet fd580000.ethernet: failed to get enet-eee clock
[    2.414045] bcmgenet: Skipping UMAC reset
[    2.425245] libphy: bcmgenet MII bus: probed
[...]
[    4.200646] bcmgenet: Skipping UMAC reset
[    4.223694] bcmgenet fd580000.ethernet: configuring instance for external RGMII
[    4.225243] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[    4.249894] bcmgenet fd580000.ethernet eth0: Link is Down
[..]
[   14.635407] bcmgenet fd580000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   14.675229] Sending DHCP requests .
[   17.035192] ------------[ cut here ]------------
[   17.055474] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x310/0x314
[   17.069941] NETDEV WATCHDOG: eth0 (bcmgenet): transmit queue 1 timed out
[   17.082776] Modules linked in:
[   17.091964] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.42-v7l+ #1319
[   17.104794] Hardware name: BCM2711
[   17.114268] Backtrace:
[   17.122825] [<c020d46c>] (dump_backtrace) from [<c020d768>] (show_stack+0x20/0x24)
[   17.136618]  r6:ef914000 r5:00000000 r4:c129c338 r3:194aeea3
[   17.148415] [<c020d748>] (show_stack) from [<c0a35ae4>] (dump_stack+0xe0/0x124)
[   17.161928] [<c0a35a04>] (dump_stack) from [<c0221c50>] (__warn+0xec/0x104) 
[   17.175114]  r8:000001c0 r7:00000009 r6:c0e356c0 r5:00000000 r4:ef915d3c r3:194aeea3
[   17.189106] [<c0221b64>] (__warn) from [<c0221cec>] (warn_slowpath_fmt+0x84/0xc0)
[   17.202859]  r9:c0e356c0 r8:000001c0 r7:c0943684 r6:00000009 r5:c0e356d8 r4:c1204f88
[   17.216841] [<c0221c6c>] (warn_slowpath_fmt) from [<c0943684>] (dev_watchdog+0x310/0x314)
[   17.231264]  r9:ffff9178 r8:ef2c0000 r7:00000002 r6:c1203d00 r5:ef2c02a8 r4:00000001
[   17.245329] [<c0943374>] (dev_watchdog) from [<c02a0538>] (call_timer_fn+0x40/0x180)
[   17.259423]  r8:c0943374 r7:00000100 r6:ef914000 r5:ef2c02a8 r4:eff34440
[   17.272491] [<c02a04f8>] (call_timer_fn) from [<c02a1688>] (run_timer_softirq+0x288/0x654)
[   17.287161]  r9:00000000 r8:ef2c02a8 r7:ef914000 r6:ffff9178 r5:ef915e18 r4:eff34440
[   17.301279] [<c02a1400>] (run_timer_softirq) from [<c020249c>] (__do_softirq+0x1a4/0x418)
[   17.315839]  r10:00000004 r9:00000282 r8:ef848800 r7:00000100 r6:ef914000 r5:00000001
[   17.330019]  r4:c1203084
[   17.338881] [<c02022f8>] (__do_softirq) from [<c0227ce8>] (irq_exit+0x100/0x110)
[   17.352695]  r10:00000000 r9:ef914000 r8:ef848800 r7:00000001 r6:00000000 r5:00000000
[   17.366883]  r4:c10a82e4
[   17.375726] [<c0227be8>] (irq_exit) from [<c0282fa8>] (__handle_domain_irq+0x70/0xc4)
[   17.389995] [<c0282f38>] (__handle_domain_irq) from [<c02022b8>] (gic_handle_irq+0x4c/0x88)
[   17.404812]  r8:f0815000 r7:f0814000 r6:ef915f38 r5:f081400c r4:c1205a14 r3:ef915f38
[   17.419027] [<c020226c>] (gic_handle_irq) from [<c0201a3c>] (__irq_svc+0x5c/0x7c)
[   17.432964] Exception stack(0xef915f38 to 0xef915f80)
[   17.444478] 5f20:                                                       c0209b94 00000000
[   17.459205] 5f40: 60000093 c021c160 c1204fb4 ef914000 c1204ffc 00000004 c12a2ce5 410fd083
[   17.473948] 5f60: 00000000 ef915f94 c12053ac ef915f88 00000000 c0209b98 60000013 ffffffff
[   17.488757]  r8:c12a2ce5 r7:ef915f6c r6:ffffffff r5:60000013 r4:c0209b98 r3:194aeea3
[   17.503150] [<c0209b64>] (arch_cpu_idle) from [<c0a56ae4>] (default_idle_call+0x34/0x48)
[   17.517850] [<c0a56ab0>] (default_idle_call) from [<c0255a84>] (do_idle+0xec/0x170)
[   17.532103] [<c0255998>] (do_idle) from [<c0255de4>] (cpu_startup_entry+0x28/0x2c)
[   17.546264]  r8:00007000 r7:c12b5390 r6:30c0387d r5:00000002 r4:0000008a r3:194aeea3
[   17.560744] [<c0255dbc>] (cpu_startup_entry) from [<c0210b10>] (secondary_start_kernel+0x138/0x144)
[   17.576605] [<c02109d8>] (secondary_start_kernel) from [<002027ac>] (0x2027ac)
[   17.590627]  r5:00000000 r4:2f86dac0
[   17.600858] ---[ end trace 4f2b174eb0e5e481 ]---

Full log attached.
bt.txt
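
For anyone checking whether their own log shows the same symptom, a minimal sketch of scanning a saved kernel log for the watchdog message might look like this. The regular expression is written against the exact line format shown in the backtrace above; `WATCHDOG_RE` and `find_tx_timeouts` are hypothetical names, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
# [   17.069941] NETDEV WATCHDOG: eth0 (bcmgenet): transmit queue 1 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return one dict per transmit-queue timeout found in a kernel log."""
    hits = []
    for match in WATCHDOG_RE.finditer(log_text):
        hits.append({
            "time": float(match.group("time")),
            "iface": match.group("iface"),
            "driver": match.group("driver"),
            "queue": int(match.group("queue")),
        })
    return hits


# Two lines copied from the log above.
sample_log = (
    "[   14.675229] Sending DHCP requests .\n"
    "[   17.069941] NETDEV WATCHDOG: eth0 (bcmgenet): transmit queue 1 timed out\n"
)
print(find_tx_timeouts(sample_log))
```

The same function could be run over the text of a saved `dmesg` dump, for example `find_tx_timeouts(open("bt.txt").read())`.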

Bootloader version and configuration

root@pi4:~# vcgencmd bootloader_version
Jun  3 2020 13:53:47
version b5de8c32f4f45a12a1fdfe107254df82965f9d56 (release)
timestamp 1591188827
root@pi4:~# vcgencmd bootloader_config
[all]
BOOT_UART=1
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
DHCP_TIMEOUT=45000
DHCP_REQ_TIMEOUT=4000
TFTP_FILE_TIMEOUT=15000
TFTP_IP=
BOOT_ORDER=0xf124
SD_BOOT_MAX_RETRIES=3
NET_BOOT_MAX_RETRIES=2
[none]
FREEZE_VERSION=0

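
As a rough sketch, the `vcgencmd bootloader_config` output above can be read as INI-style text, which makes it easy to compare settings between boards. This assumes the output always follows the `[section]` / `KEY=value` layout shown here; `parse_bootloader_config` is a hypothetical helper, not part of any Raspberry Pi tool.

```python
import configparser


def parse_bootloader_config(text):
    """Parse `vcgencmd bootloader_config` output into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as BOOT_ORDER in upper case
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


# Configuration copied from the report above.
reported_config = """\
[all]
BOOT_UART=1
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
DHCP_TIMEOUT=45000
DHCP_REQ_TIMEOUT=4000
TFTP_FILE_TIMEOUT=15000
TFTP_IP=
BOOT_ORDER=0xf124
SD_BOOT_MAX_RETRIES=3
NET_BOOT_MAX_RETRIES=2
[none]
FREEZE_VERSION=0
"""

config = parse_bootloader_config(reported_config)
print(config["all"]["BOOT_ORDER"])
print(hex(int(config["all"]["BOOT_ORDER"], 16)))
```

Values are kept as strings, so numeric settings such as BOOT_ORDER need an explicit `int(..., 16)` or `int(...)` conversion before use.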
Also present in 2020-05-28. I reported it on the thread over the weekend, but I believe it was missed. Niche, yes. Hope this helps.

@timg236
Collaborator

timg236 commented Jun 3, 2020

I think this should be raised as a Linux issue. Once the kernel is loaded the network interface is reset before starting the kernel. There is no shared network state between the firmware and the kernel.

@dickontoo
Author

I thought that, but the reason I raised it here is because it does seem to matter. If you insert the cable in the window when the bootloader is loading everything from the uSD card, the link comes up but the kernel still crashes when it tries to send a packet. If there really is no state maintained, then it shouldn't do that, surely?

@timg236
Collaborator

timg236 commented Jun 3, 2020

I think the kernel is assuming that the GENET registers are in a particular state. That's probably either a regression in the 5.4 kernel or something nobody has spotted in a 4.19 kernel.
Anyway, I'll take a look. If it's just a matter of zapping a few registers in start.elf, it's a quick fix, but UMAC setup for Ethernet in Linux has been a bit fragile for a long time. That's one reason why I haven't touched the network boot driver for a long time!

@lurch
Contributor

lurch commented Jun 3, 2020

ping @pelwell to see if he thinks this ought to be moved to the Linux repo?

@pelwell
Collaborator

pelwell commented Jun 4, 2020

It might be a kernel problem, but I'm not being given the option to move it, and demanding that it be closed and another issue opened in raspberrypi/linux seems a bit draconian.

@timg236
Collaborator

timg236 commented Jun 4, 2020

Leave it here until we know what the issue is. If it really is a kernel problem, then that's a separate discussion for an upstream driver.
I suspect the kernel assumes some part of the MAC is in the reset state, which might not be true if netboot has been started. The driver should really handle that, because u-boot might do the same thing, but that's a nice-to-have.

@timg236
Collaborator

timg236 commented Jun 11, 2020

I was able to reproduce the failure. The problem only seems to occur if network boot was attempted but the link was never established, i.e. the cable was always disconnected until waiting in Linux. Adding genet.skip_umac_reset=n doesn't help.

Anyway, this now works for me (boot without an Ethernet cable, then insert the cable when rootfs timeout messages start to appear). It resets the MAC and PHY in the bootloader if network boot was selected. The MAC/PHY reset sequence has been a bit fragile on Pi 4, so I think it's going to need to be tested on various boards + switches for us to be sure. I also don't know why Linux is crashing, other than that TX gets stuck.

rpi-eeprom-recovery-ce4ad3c14.zip

@timg236 added the bug label Jun 11, 2020
@timg236 changed the title from "Kernel crashes when attempting to use the network after bootloader times out." to "Kernel warning and network failure when attempting to use the network after bootloader times out." Jun 12, 2020
@timg236
Collaborator

timg236 commented Jun 12, 2020

Updated title to be more specific. The backtrace is a warning that the Ethernet driver is stuck. With NFS boot the OS will be inoperable, but I think you could get an Ethernet freeze from an SD boot if the timing was just "right" when the kernel does DHCP.

timg236 added a commit to timg236/rpi-eeprom that referenced this issue Jun 12, 2020
* Reset Ethernet MAC + PHY if final boot mode is not network boot
  See: Kernel warning and network failure when attempting to use
       the network after bootloader times out. raspberrypi#144
* Improve handling of multiple bootable USB devices and remove USB_MSD_BOOT_MAX_RETRIES
* Resolve: No DHCPACK with DHCP relay agent raspberrypi#58
* Toggle USB root hub port power for 200ms on the first USB MSD boot attempt
  See: Bootloader can't boot via USB-HDD after system reboot raspberrypi#151
* Update bootloader handover to support uart_2ndstage - requires a newer
  start.elf firmware which will be via rpi-update.
* Assert PCIe fundamental reset if the final bootmode was not USB-MSD
  because the OS might not do this before starting XHCI.
@timg236
Collaborator

timg236 commented Jun 16, 2020

Resolved in pieeprom-2020-06-15
@dickontoo please re-open if this still fails for you
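
To check whether a board is already running a bootloader at least as new as that release, something like the following sketch could be applied to the `vcgencmd bootloader_version` output. It assumes the `timestamp` line is a Unix time for the build, as in the output quoted earlier in this issue, and it assumes that any build dated on or after 2020-06-15 contains the fix; neither assumption is confirmed in this thread. `bootloader_build_date` and `has_fix` are hypothetical names.

```python
import re
from datetime import date, datetime, timezone

# Date taken from the release name mentioned above (pieeprom-2020-06-15).
FIX_RELEASE_DATE = date(2020, 6, 15)


def bootloader_build_date(version_output):
    """Extract the build date (UTC) from `vcgencmd bootloader_version` output."""
    match = re.search(r"^timestamp\s+(\d+)\s*$", version_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'timestamp' line found")
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc).date()


def has_fix(version_output):
    # Assumption: a build dated on or after the release date contains the fix.
    return bootloader_build_date(version_output) >= FIX_RELEASE_DATE


# Output copied from the original report.
reported_version = """\
Jun  3 2020 13:53:47
version b5de8c32f4f45a12a1fdfe107254df82965f9d56 (release)
timestamp 1591188827
"""

print(bootloader_build_date(reported_version))  # 2020-06-03
print(has_fix(reported_version))                # False
```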


4 participants