RPi 5 bootloader doesn't load kernel with GPT partition array not at LBA 2 #585

Closed
sairon opened this issue Jun 27, 2024 · 10 comments
Labels
bug Something isn't working

sairon commented Jun 27, 2024

Describe the bug

Home Assistant OS started using Genimage to create OS images in release 12.4. After that change, we started getting reports that freshly imaged SD cards fail to boot on Raspberry Pi 5 (home-assistant/operating-system#3437). The change introduced by Genimage was that the first usable LBA field in the GPT header went from 34 (which is what sgdisk typically sets) to 2048 (which is actually more appropriate with 1 MiB partition alignment), something that looked quite harmless.

This worked perfectly fine when the image was flashed to the card from Linux; however, it failed to boot when Raspberry Pi Imager was used on Windows. It turned out that the Imager itself isn't to blame, but rather the Windows kernel (presumably on Windows 10+; Windows 7 doesn't show that behavior), which alters the partition table as soon as a drive is plugged into the computer. It moves the backup LBA to the real end of the drive and, if the first usable LBA isn't 34, it also moves the start of the partition entry array to a different block (it seems to be first_LBA - 32). In the second case, the array of partition entries still exists in both places, at LBA 2 and at LBA 2016 (i.e. the old data from the image is not zeroed).

The problem is that the Raspberry Pi bootloader can't boot if the partition entries are relocated. It doesn't look like a simple assumption that the partition table is at LBA 2, because the entries are still there, so maybe something is miscalculated or a read is attempted outside of the expected boundaries? Anyway, although what Windows does is quite nasty, the GPT is still perfectly valid, so I'd expect it not to cause trouble.
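
The `first_LBA - 32` observation is consistent with the array being placed immediately before the first usable block. A minimal sketch of that arithmetic, assuming the common GPT defaults of 128 entries of 128 bytes each and 512-byte sectors (none of these values are shown in the dumps in this report, and `relocated_array_lba` is a hypothetical helper describing the observed behaviour, not a documented Windows algorithm):

```python
SECTOR_SIZE = 512   # assumption: 512-byte logical sectors
NUM_ENTRIES = 128   # assumption: common GPT default, not read from the header
ENTRY_SIZE = 128    # assumption: common GPT default, not read from the header


def array_sectors(num_entries=NUM_ENTRIES, entry_size=ENTRY_SIZE,
                  sector_size=SECTOR_SIZE):
    """Sectors occupied by the partition entry array, rounded up."""
    total_bytes = num_entries * entry_size
    return (total_bytes + sector_size - 1) // sector_size


def relocated_array_lba(first_usable_lba):
    """Hypothetical: place the array directly before the first usable LBA."""
    return first_usable_lba - array_sectors()


print(array_sectors())            # sectors needed by the entry array
print(relocated_array_lba(2048))  # Genimage layout from this report
print(relocated_array_lba(34))    # sgdisk-style layout
```

Under these assumptions the array needs 32 sectors, so a first usable LBA of 2048 gives 2016, while a first usable LBA of 34 gives 2, which would explain why the sgdisk-style layout is left in place.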

Here's the diff of the part of the GPT table before and after it's modified by Windows - note the change in the last 8 bytes:

-00000210: 7e39 180f 0000 0000 0100 0000 0000 0000  ~9..............
+00000210: cf3d ab66 0000 0000 0100 0000 0000 0000  .=.f............
          | crc32   | reserved|current LBA        |
-00000220: ffff 3f00 0000 0000 0008 0000 0000 0000  ..?.............
+00000220: ffff ba03 0000 0000 0008 0000 0000 0000  ................
          | backup LBA        | first LBA         |
-00000230: deff 3f00 0000 0000 a211 3e2d a149 fa44  ..?.......>-.I.D
+00000230: deff ba03 0000 0000 a211 3e2d a149 fa44  ..........>-.I.D
          | last LBA          | GUID  0-7B        |
-00000240: a850 55b6 af53 4f62 0200 0000 0000 0000  .PU..SOb........
+00000240: a850 55b6 af53 4f62 e007 0000 0000 0000  .PU..SOb........
          | GUID 8-15B        | partition array   |
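
To make the annotated fields easier to compare, here is a small sketch that decodes the four rows of the diff above as little-endian GPT header fields. The rows are copied from the dump; the `decode` helper and the dictionary keys are illustrative names, not part of any tool mentioned in this issue.

```python
import struct

# Header bytes 0x10..0x4f (disk offsets 0x210..0x24f), copied from the diff.
BEFORE = [
    "7e39 180f 0000 0000 0100 0000 0000 0000",
    "ffff 3f00 0000 0000 0008 0000 0000 0000",
    "deff 3f00 0000 0000 a211 3e2d a149 fa44",
    "a850 55b6 af53 4f62 0200 0000 0000 0000",
]
AFTER = [
    "cf3d ab66 0000 0000 0100 0000 0000 0000",
    "ffff ba03 0000 0000 0008 0000 0000 0000",
    "deff ba03 0000 0000 a211 3e2d a149 fa44",
    "a850 55b6 af53 4f62 e007 0000 0000 0000",
]


def decode(rows):
    """Decode the header fields covered by the four dumped rows."""
    raw = bytes.fromhex("".join(rows).replace(" ", ""))
    # crc32 (4), reserved (4), then four 64-bit LBAs, all little-endian.
    crc32, _reserved, current, backup, first_usable, last_usable = \
        struct.unpack_from("<IIQQQQ", raw, 0)
    # The 16-byte disk GUID follows; the partition entry LBA starts at byte 56.
    (entry_lba,) = struct.unpack_from("<Q", raw, 56)
    return {
        "crc32": crc32,
        "current_lba": current,
        "backup_lba": backup,
        "first_usable_lba": first_usable,
        "last_usable_lba": last_usable,
        "partition_entry_lba": entry_lba,
    }


for label, rows in (("before", BEFORE), ("after", AFTER)):
    print(label, decode(rows))
```

Only the CRC, the backup LBA, the last usable LBA and the partition entry LBA differ between the two headers; the first usable LBA stays at 2048.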

I can confirm that the partition entries start at LBA 2016 (0x7e0 = byte offset 0xfc000) after that change:

-000fc000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
+000fc000: 2873 2ac1 1ff8 d211 ba4b 00a0 c93e c93b  (s*......K...>.;
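
As a sanity check on the offset and on what those 16 bytes are, the following sketch converts the LBA to a byte offset (assuming 512-byte sectors) and decodes the bytes from the `+` line as a mixed-endian GPT type GUID. The helper name `lba_to_offset` is illustrative.

```python
import uuid

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of the start of a given LBA."""
    return lba * sector_size


# First 16 bytes found at 0xfc000 after the change, copied from the dump.
first_entry_type = bytes.fromhex("28732ac11ff8d211ba4b00a0c93ec93b")

# GPT stores GUIDs with the first three fields little-endian, which is
# exactly what uuid.UUID(bytes_le=...) expects.
type_guid = uuid.UUID(bytes_le=first_entry_type)

print(hex(lba_to_offset(2016)))
print(type_guid)
```

The decoded value is c12a7328-f81f-11d2-ba4b-00a0c93ec93b, the EFI System Partition type GUID, so the data at LBA 2016 is a real partition entry rather than leftover noise.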

For completeness, this is what Windows does to the partition table if the first usable LBA is set to 34:

-00000210: 99f9 2492 0000 0000 0100 0000 0000 0000  ..$.............
+00000210: deb0 923f 0000 0000 0100 0000 0000 0000  ...?............
          | crc32   | reserved|current LBA        |
-00000220: ffff 3f00 0000 0000 2200 0000 0000 0000  ..?.....".......
+00000220: ffff ba03 0000 0000 2200 0000 0000 0000  ........".......
          | backup LBA        | first LBA         |
-00000230: deff 3f00 0000 0000 faec 4e7f 7d9a 0f4b  ..?.......N.}..K
+00000230: deff ba03 0000 0000 faec 4e7f 7d9a 0f4b  ..........N.}..K
          | last LBA          | GUID  0-7B        |
 00000240: 8bbc 2d37 8b85 a2f0 0200 0000 0000 0000  ..-7............
          | GUID 8-15B        | partition array   |

Steps to reproduce the behaviour

Install Home Assistant OS 12.4 image (e.g. using Raspberry Pi Imager) on Windows.

- alternatively -

Run system image with GUID partition table with array of partition entries starting at LBA != 2.
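
To check whether a given image or card dump falls into the second case, a minimal sketch like the following reads the primary GPT header at LBA 1 and returns its partition entry LBA. It assumes 512-byte sectors; `partition_entry_lba` and the `disk.img` path are hypothetical names used only for illustration.

```python
import struct

GPT_SIGNATURE = b"EFI PART"
GPT_HEADER_SIZE = 92          # size of the defined part of the GPT header
ENTRY_LBA_OFFSET = 72         # offset of the partition entry LBA field


def partition_entry_lba(image_path, sector_size=512):
    """Return the partition entry array LBA from the primary GPT header."""
    with open(image_path, "rb") as f:
        f.seek(sector_size)   # the primary header lives at LBA 1
        header = f.read(GPT_HEADER_SIZE)
    if len(header) < GPT_HEADER_SIZE or header[:8] != GPT_SIGNATURE:
        raise ValueError("no GPT header found at LBA 1")
    (entry_lba,) = struct.unpack_from("<Q", header, ENTRY_LBA_OFFSET)
    return entry_lba


if __name__ == "__main__":
    import sys
    path = sys.argv[1] if len(sys.argv) > 1 else "disk.img"  # hypothetical path
    lba = partition_entry_lba(path)
    print(f"partition entry array starts at LBA {lba}")
    if lba != 2:
        print("array is not at LBA 2; affected bootloader versions fail to boot")
```

The sketch does not validate the header CRC or the backup header; it only reports the field that this issue is about.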

Device(s)

Raspberry Pi 5

Bootloader configuration.

Bootloader versions up to latest 6fe0b091 2024/06/05.

System

No response

Bootloader logs

(screenshot of the bootloader output; image not included)

USB boot

No response

NVMe boot

No response

Network (TFTP boot)

No response

Collaborator

timg236 commented Jul 2, 2024

No plans to support this right now but leaving it open as possibly part of 4K native sector / large boot drive support if/when that's supported.

Author

sairon commented Jul 2, 2024

@timg236 Okay, it's a pity, but I planned on accommodating it anyway. However, do you have any more insight into why that happens? The GPT records are perfectly valid and the drive can be used in Windows and Linux without issues; it's just the bootloader that can't read data from the boot partition.

Collaborator

timg236 commented Jul 2, 2024

The bootloader has never supported the backup LBA and, IIRC, expects to be able to use LBA 2. Maybe that constraint can be relaxed on Pi5 (only), since it's only read by the SPI bootloader.

Collaborator

timg236 commented Jul 2, 2024

No promises, but if someone could post images then we could see if a quick fix is possible. The image just needs to include the whole GPT rather than all the data.

Author

sairon commented Jul 3, 2024

Not sure we're on the same page. The backup LBA is not the problem here - the problem is when the PartitionEntryLBA (of the primary GPT) points to LBA 2016. Usually this is LBA 2, but if FirstUsableLBA is not 34, Windows relocates it. Also, the actual partition entries are then at both LBA 2 and LBA 2016, so it seems something else goes wrong there.

The issue can be reproduced by flashing the Home Assistant OS 12.4 image from a Windows environment (or by plugging an SD card with that image into a Windows 10+ PC). If you want, I can prepare a minimal image for reproducing the bootloader issue, i.e. without the whole rootfs.

Collaborator

timg236 commented Jul 3, 2024

I think the bootloader only understands the normal/default case. That might be fixable.
I've asked someone to reproduce it and look at it as a background task, since this sounds like something that possibly used to work until Windows started doing relocation. It will probably take a couple of weeks though, due to other projects.

timg236 added the bug label and removed the enhancement label Sep 30, 2024

learmj commented Oct 2, 2024

Hi @sairon
Thanks for reporting this. A fix is in the works. fyi it can be reproduced without involving Windows. Use genimage to create a disk image with GPT and set the location of the Partition Entry Array to something other than LBA 2.
For example:

image disk.img {
   hdimage {
      partition-table-type = "gpt"
      gpt-location = 2048
   }
}

This locates it at LBA 4 (512B sectors) rather than the default of 2, which prevents the partition entries from being read.


learmj commented Oct 3, 2024

The attached should resolve the problem you're having with the HA images. Please feel free to give it a try. It would be great to know whether it resolves the problem you reported.

sudo rpi-eeprom-update -d -f /path/to/new/pieeprom.bin

Should yield:

$ sudo vcgencmd bootloader_version
2024/10/03 11:45:08
version 5fe3f5dc0a3ea9983fb42927170827c0935727ce (release)

The usual caveats apply about updating your eeprom.
rpi-eeprom-recovery.zip

Author

sairon commented Oct 9, 2024

Hi @learmj, thanks for looking into this. I can confirm the provided bootloader fixes the problem - a card that failed to boot (right before I updated the bootloader) now boots without issues. Thank you!

timg236 added a commit that referenced this issue Oct 10, 2024
…latest)

* Introduce a new boot-menu feature where pressing SPACE at power on
  gives the user a one-shot option to select a different boot mode.
  e.g. Select USB boot if the default SD card is corrupted or unavailable.
* Display the bootloader network-install UI for longer on a cold boot to make
  this feature more visible to first time users.
  To revert to the previous behaviour remove NET_INSTALL_AT_POWER_ON=1
  from the bootloader config.
* Support non-UUID HAT mapping
  Extend the HAT map support to allow matching on product and vendor
  strings, as well as product ID and version. As a minimum, there must
  be a product string - if that matches, the other keys are considered.
  Without a product key, the UUID is compared as before.
* Remove requirement for GPT ptable array  to be at LBA-2
  See: #585
* 2712C1 clock manager improvements to slightly reduce idle power ~50mW saving
* Adjust SDRAM page-hold and auto-precharge to improve performance.
  ~2% improvement with Geekbench 6
* armstubs: 2712: Rebuild with updated max-power throttle and direct stream settings
  See: raspberrypi/arm-trusted-firmware@fc45bc4
* debug: Only display the program_pubkey log if configuring secure-boot
Collaborator

timg236 commented Oct 17, 2024

Thanks @sairon for the test. Closing this issue as fixed.

timg236 closed this as completed Oct 17, 2024