
Ubuntu 20.04 Root on ZFS: Mistake in how bpool is configured causes boot to get garbled, ends up with unbootable machine #54

ned14 opened this issue Sep 9, 2020 · 4 comments

ned14 commented Sep 9, 2020

@rlaager

Firstly, thank you SO MUCH for the Ubuntu 20.04 Root on ZFS HOWTO. Using qemu, last month I installed an Ubuntu 20.04 root on ZFS on two budget Intel Atom dedicated servers from their rescue boot, and they work surprisingly well, considering.

Two days ago, however, one of the servers suddenly vanished. It took some effort to figure out why, but I narrowed it down to a problem in the current HOWTO. I used the HOWTO from August, so it postdates the current Erratum.

Right now, you say:

zpool create \
    -o ashift=12 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@zpool_checkpoint=enabled \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpool ${DISK}-part3
...
zfs create -o canmount=noauto -o mountpoint=/boot \
    bpool/BOOT/ubuntu_$UUID

Note that two separate datasets both have a mountpoint of /boot.

Now, I'm not sure exactly how it happened, but I believe that on some boot or other of that server, the boot mounting service got confused and didn't mount /boot. The system booted just fine regardless. However, when unattended-upgrades ran at some point, it called update-grub, which installed GRUB into the root ZFS pool, a pool with feature flags GRUB can't parse, and BOOM, bye bye server.
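
As an aside, a check along these lines before update-grub runs might have caught it; a sketch, assuming grub-probe behaves as on stock 20.04:

# If /boot silently failed to mount, this probes the root pool instead of
# bpool; on a pool with feature flags GRUB cannot read, it should error out
# rather than printing "zfs".
grub-probe --target=fs /boot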

So, firstly, can I suggest not setting the same mountpoint on two datasets?

Secondly, can I suggest that you recommend in the guide that people reboot a few times and make SURE that /boot, /boot/efi, and /boot/efi/grub are all coming up every time?
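
Something like this after each reboot would do, as a minimal sketch (assuming each of the three paths is its own mount point, as in the HOWTO):

for m in /boot /boot/efi /boot/efi/grub; do
    findmnt "$m" >/dev/null && echo "$m: mounted" || echo "$m: NOT MOUNTED"
done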

Finally, thirdly, I got badly caught out on my first install by missing the step which makes space for MBR GRUB. May I suggest that you fuse the EFI and MBR partitioning instructions into one set which works for both schemes, so it's a single configuration right up to the point where you choose to install UEFI or MBR GRUB, and that's the only difference?

Thanks once again for the instructions, and for taking all that time to write and maintain them. Indeed, if ZFS native encryption in Ubuntu 20.04 weren't so slow, I'd recommend a lot more hard-coded-key encryption, plus PAM-unlocked home directory encryption, so that if your remote server ever dies, you don't leak all your secrets. However, ZFS native encryption really is very slow in Ubuntu 20.04. It gets much faster if you choose the GCM variant in future ZFS releases.

rlaager self-assigned this Sep 9, 2020
rlaager added a commit that referenced this issue Sep 9, 2020
This makes it easier to follow (and specifically, harder to miss the
second step).

Reported-by: Niall Douglas <ned14>
Issue #54

rlaager commented Sep 9, 2020

zpool create \
...
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpool ${DISK}-part3
...
zfs create -o canmount=noauto -o mountpoint=/boot \
    bpool/BOOT/ubuntu_$UUID

Note that two separate datasets both have a mountpoint of /boot.

This is intentional. Note that bpool is canmount=off. The boot pool has a mountpoint of /boot so that child datasets (created by the admin, not the HOWTO) would inherit nice mountpoints. This comes from the Ubuntu installer design, so I don't intend to change it.
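
For illustration, with a hypothetical child dataset (not part of the HOWTO), the inheritance looks roughly like this:

# bpool itself never mounts (canmount=off), but a child created later
# inherits a sensible mountpoint under /boot:
zfs create bpool/extra
zfs get -o name,value,source mountpoint bpool bpool/extra
# NAME         VALUE        SOURCE
# bpool        /boot        local
# bpool/extra  /boot/extra  inherited from bpool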

Now, I'm not sure exactly how it happened, but I believe that on some boot or other of that server, the boot mounting service got confused and didn't mount /boot.

That's not good, and I can see how that set off a cascade of problems resulting in an unbootable server, but without knowing the root cause of this, it's hard to know what to do next.

Secondly, can I suggest that you recommend in the guide that people reboot a few times and make SURE that /boot, /boot/efi, and /boot/efi/grub are all coming up every time?

I could, but without more detail on what happened, it's really hard to say how many times a person should reboot. And that's really not a great step to write into instructions... "Hey, reboot a few times because this might intermittently break." That's scary and not particularly actionable. If there are intermittent problems that are reproducible enough that such an instruction would help, I'd like to just fix them.

Finally, thirdly, I got badly caught out on my first install by missing the step which makes space for MBR GRUB. May I suggest that you fuse the EFI and MBR partitioning instructions into one set which works for both schemes, so it's a single configuration right up to the point where you choose to install UEFI or MBR GRUB, and that's the only difference?

When I read the email from this issue on my phone, my thought was, "Didn't I already do that?" I'm looking at the HOWTO now, and I think I see what you're asking for. I've flattened two steps and the associated notes:

sgdisk     -n1:1M:+512M   -t1:EF00 $DISK   # EFI System Partition (UEFI GRUB)
sgdisk -a1 -n5:24K:+1000K -t5:EF02 $DISK   # BIOS boot partition (MBR/legacy GRUB)

Thanks once again for the instructions, and for taking all that time to write and maintain them. Indeed, if ZFS native encryption in Ubuntu 20.04 weren't so slow, I'd recommend a lot more hard-coded-key encryption, plus PAM-unlocked home directory encryption, so that if your remote server ever dies, you don't leak all your secrets. However, ZFS native encryption really is very slow in Ubuntu 20.04. It gets much faster if you choose the GCM variant in future ZFS releases.

Great news on that front. "The AES-GCM patches [to userspace] were applied to zfs 0.8.3-1ubuntu12.1" That is in focal-updates, which currently has 0.8.3-1ubuntu12.4. The userspace side is primarily making encryption=on mean GCM. The kernel changes were backported in 5.4.0-43.47 "zfs: backport AES-GCM performance accelleration (LP: #1881107)". So an up-to-date 20.04 system should be much faster already.
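
A quick way to confirm a given system has both halves, as a sketch:

dpkg-query -W zfsutils-linux   # want 0.8.3-1ubuntu12.1 or later
uname -r                       # want 5.4.0-43 or later for the kernel side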

There was a PAM module merged upstream, but that's not in 20.04 and I haven't tested it: openzfs/zfs@221e670

The Ubuntu folks are (or were, last we talked) working on encryption integration.

You mentioned a remote server, so the dropbear SSH support might be interesting to you: #46 (comment)


ned14 commented Sep 9, 2020

This is intentional. Note that bpool is canmount=off. The boot pool has a mountpoint of /boot so that child datasets (created by the admin, not the HOWTO) would inherit nice mountpoints. This comes from the Ubuntu installer design, so I don't intend to change it.

Ah, you're right. That can't be the cause then.

That's not good, and I can see how that set off a cascade of problems resulting in an unbootable server, but without knowing the root cause of this, it's hard to know what to do next.

OK, my next idea for the cause is that there is some sort of race in the ZFS dataset mounting. /etc/zfs/zfs-list.cache/bpool and /etc/zfs/zfs-list.cache/rpool say what order to mount in, right? Could this excerpt from my rpool cache be the cause?

rpool/USERDATA  /       off     on      on      on      on      off     on      off     rpool/USERDATA  file:///boot/userdata.key
rpool/USERDATA/ned_enc  /home/ned       on      on      on      on      on      off     on      off     rpool/USERDATA/ned_enc  file:///boot/userdata.key
rpool/USERDATA/root     /root   on      on      on      on      on      off     on      off     rpool/USERDATA  none

Y'see, if the mounting process mounts bpool and rpool concurrently, then there is a race on whether /boot/userdata.key is available by the time rpool/USERDATA gets mounted. If it can't find the key, the whole mounting process would surely abort right there. If sufficient datasets have already mounted to allow the system to boot, it comes up, and the next time we touch /root or /home/ned those get automounted. But the remainder of the mounting session, i.e. the entries in /etc/fstab, never gets mounted. Thus /boot/efi never gets mounted, and the GRUB install then goes to the wrong place.
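
The dependency is visible straight off the pool; a sketch using the dataset names above:

# keylocation points into /boot, which is only populated once bpool's
# dataset is mounted:
zfs get -r -o name,value keylocation rpool/USERDATA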

Does this sound plausible to you?

When I read the email from this issue on my phone, my thought was, "Didn't I already do that?" I'm looking at the HOWTO now, and I think I see what you're asking for. I've flattened two steps and the associated notes:

Cool, thanks.

Great news on that front. "The AES-GCM patches [to userspace] were applied to zfs 0.8.3-1ubuntu12.1" That is in focal-updates, which currently has 0.8.3-1ubuntu12.4. The userspace side is primarily making encryption=on mean GCM. The kernel changes were backported in 5.4.0-43.47 "zfs: backport AES-GCM performance accelleration (LP: #1881107)". So an up-to-date 20.04 system should be much faster already.

Mine is zfs-0.8.3-1ubuntu12.4 with kernel 5.4.0-47-generic, using aes-128-gcm. I see an 8x performance loss with encryption enabled. This is on a 1.7 GHz Intel Atom with AES-NI. More suspiciously, I get identically slow results for every crypto algorithm. It could just be the Atom CPU, of course; they have unusual bottlenecks.

There was a PAM module merged upstream, but that's not in 20.04 and I haven't tested it: openzfs/zfs@221e670

You can replicate that same functionality right now using https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/. It obviously doesn't support SSH key authentication, because the dataset is only mounted when the password is checked.

The Ubuntu folks are (or were, last we talked) working on encryption integration.

You mentioned a remote server, so the dropbear SSH support might be interesting to you: #46 (comment)

Ultimately, for a remote server I only care about the server dying suddenly and the cheapo hosting provider failing to securely wipe the drive before it goes onto eBay. You don't need secure crypto for this, just obfuscation to defeat the automated scanner programs eBay buyers use to hunt for personal info, credit cards, etc. For that, a global static crypto key is just fine, though I would rather not have put it in the root directory with a .key suffix. Looks like I might have to now, though.

I'll close this issue now as I think your HOWTO is no longer the cause. Thanks for the useful response.

ned14 closed this as completed Sep 9, 2020

rlaager commented Sep 9, 2020

OK, my next idea for the cause is that there is some sort of race in the ZFS dataset mounting. /etc/zfs/zfs-list.cache/bpool and /etc/zfs/zfs-list.cache/rpool say what order to mount in, right?

They are involved, but there's relevant indirection. Those are cache files which are read by a systemd mount generator, /lib/systemd/system-generators/zfs-mount-generator. It generates unit files into /run/systemd/generator. Look there for the mount units. The dependencies expressed in those mount units are what actually control the ordering. Also check your systemd journal for messages about dependency loops.
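
For example, something along these lines shows what was generated for /boot (unit names follow systemd path escaping, so /boot becomes boot.mount):

ls /run/systemd/generator/*.mount                # all generated mount units
systemctl cat boot.mount                         # the unit for /boot, with its dependencies
systemctl list-dependencies --after boot.mount   # everything it is ordered after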

There have been quite a few changes to the zfs-mount-generator recently. It's hard for me to keep track in my head which ones have landed in 20.04. There are definitely things that haven't landed that could be relevant. You might want to grab OpenZFS from git master and look at: git log etc/systemd/system-generators/zfs-mount-generator.in

You might try something like this too, which avoids the need to ./configure and build OpenZFS master:

# in a checkout of OpenZFS master:
cd etc/systemd/system-generators
# fill in the paths that ./configure would normally substitute:
sed "s|@sbindir@|/usr/sbin|;s|@sysconfdir@|/etc|" \
    zfs-mount-generator.in > zfs-mount-generator
chmod +x zfs-mount-generator

# systemd generators are invoked with three output directories (normal, early, late):
mkdir /tmp/before /tmp/after
/lib/systemd/system-generators/zfs-mount-generator /tmp/before /tmp/before /tmp/before
./zfs-mount-generator /tmp/after /tmp/after /tmp/after

diff -urN /tmp/before /tmp/after

That gives you a working zfs-mount-generator in the tree, uses both it and the system one to generate mount units, and then shows you the difference between the two sets.

If the changes look sane, try running with the newer mount generator:
sudo install -m 755 zfs-mount-generator /etc/systemd/system-generators

Putting the mount generator in /etc should cause it to override the one in /lib. Reboot and verify that the mounts in /run/systemd/generator have the expected changes.
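
Something like this after the reboot would confirm it, as a sketch:

# confirm the override is in place and inspect the regenerated unit:
ls -l /etc/systemd/system-generators/zfs-mount-generator
cat /run/systemd/generator/boot.mount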


ned14 commented Sep 9, 2020

They are involved, but there's relevant indirection. Those are cache files which are read by a systemd mount generator, /lib/systemd/system-generators/zfs-mount-generator. It generates unit files into /run/systemd/generator. Look there for the mount units. The dependencies expressed in those mount units are what actually control the ordering. Also check your systemd journal for messages about dependency loops.

That was useful. I traced through all the files, building up a dependency graph. There is no dependency chain from the dataset needing the key in /boot onto the /boot dataset, so I guess it's reasonable that both trees could be executed in any order, or concurrently. I fixed this by moving the key into the root directory.
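
Roughly, the fix was (a sketch; paths are mine, and this assumes the dataset is currently unlocked):

# move the wrapping key out of /boot so that unlocking rpool/USERDATA
# never depends on bpool being mounted first:
cp /boot/userdata.key /userdata.key
chmod 600 /userdata.key
zfs set keylocation=file:///userdata.key rpool/USERDATA
rm /boot/userdata.key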

I did check the systemd journal for dependency loops long before filing this issue. It complains about cryptsetup, but that's totally unrelated to this (cryptsetup cannot determine the root mount if it's ZFS on 20.04, so it adds an unnecessary entry; that's been fixed upstream, and the cycle gets broken during boot in a non-harmful way). There were no other entries about dependency loops.

Thanks for your help on this. Here's hoping these servers stay up longer this time!
