
Upgrading Fedora Atomic Host 20180515.1 to 20180722.0 failed #1471

Closed
shannara opened this issue Jul 25, 2018 · 31 comments

@shannara

shannara commented Jul 25, 2018

Host system details

State: idle; auto updates disabled
Deployments:
  ostree://fedora-atomic:fedora/28/x86_64/atomic-host
                   Version: 28.20180722.0 (2018-07-23 00:38:05)
                    Commit: 106a7095fd46741948ba60dbd0586668ae7abe6336671444ec790d7541a88778
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

● ostree://fedora-atomic:fedora/28/x86_64/atomic-host
                   Version: 28.20180515.1 (2018-05-15 16:32:35)
                    Commit: a29367c58417c28e2bd8306c1f438b934df79eba13706e078fe8564d9e0eb32b
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

Steps to reproduce it

Install Fedora Atomic 28 from the 20180515.1 ISO.

First issue: /var/cache/rpm-ostree has to be created manually (fixed by #1366).

Second issue: after running rpm-ostree upgrade and rebooting into 20180722.0, various errors appear.

Boot waits to mount /var/home, /boot and /boot/efi, then hangs at
"Configure read-only root support" (5 minutes and counting / no timeout).

I have to hard reboot and switch back to the previous ostree (20180515), which works.

@shannara shannara changed the title Uppgrade ostree from 20180515.1 to 20180722.0 failed. Upgrade ostree from 20180515.1 to 20180722.0 failed. Jul 25, 2018
@cgwalters cgwalters added the bug label Jul 25, 2018
@jlebon
Member

jlebon commented Jul 25, 2018

So, it looks like "Configure read-only root support" comes from fedora-readonly.service, which I hadn't come across before. It sounds like something we don't actually need given that root is already read-only, though I haven't looked deeper into it yet.

Can you also paste /proc/cmdline and /etc/sysconfig/readonly-root?

@shannara
Author

/proc/cmdline

BOOT_IMAGE=/ostree/fedora-atomic-eeed31e4c67469d3fe08d0fe25a704f4208aec0702dce0c98b5e4efa9bddaca2/vmlinuz-4.16.8-300.fc28.x86_64 rd.lvm.lv=fedora_sle-fedora/root rd.luks.uuid=luks-0d7cb32b-2ff7-4fca-9788-4449ff70fbf0 root=/dev/mapper/fedora_sle--fedora-root ostree=/ostree/boot.1/fedora-atomic/eeed31e4c67469d3fe08d0fe25a704f4208aec0702dce0c98b5e4efa9bddaca2/0

/etc/sysconfig/readonly-root

# Set to 'yes' to mount the system filesystems read-only.
READONLY=no
# Set to 'yes' to mount various temporary state as either tmpfs
# or on the block device labelled RW_LABEL. Implied by READONLY
TEMPORARY_STATE=no
# Place to put a tmpfs for temporary scratch writable space
RW_MOUNT=/var/lib/stateless/writable
# Label on local filesystem which can be used for temporary scratch space
RW_LABEL=stateless-rw
# Options to use for temporary mount
RW_OPTIONS=
# Label for partition with persistent data
STATE_LABEL=stateless-state
# Where to mount to the persistent data
STATE_MOUNT=/var/lib/stateless/state
# Options to use for persistent mount
STATE_OPTIONS=
# NFS server to use for persistent data?
CLIENTSTATE=
# Use slave bind-mounts
SLAVE_MOUNTS=yes

@cgwalters
Member

Hmm. First, we should nuke fedora-readonly.service from orbit - OSTree is just superior to it and they overlap a lot.

That said, that looks like the default config, and AFAICS from reading /usr/lib/systemd/fedora-readonly it should be a no-op service if READONLY=no.

I think this is a symptom of something else; can you paste your /etc/fstab? I wonder if we regressed something related to the /var generator in a different way.
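
(As an aside, a quick way to check what the /var generator actually produced on the booted deployment — a diagnostic sketch; the var.mount unit name and the standard /run/systemd/generator locations are assumptions about where the generated unit ends up:)

# Show the /var mount unit as systemd sees it, including generated units
systemctl cat var.mount
# List everything the system generators wrote for this boot
ls -l /run/systemd/generator/ /run/systemd/generator.late/
# Current state of the unit
systemctl status var.mount --no-pager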

@shannara
Author

Hi @cgwalters,

This is my /etc/fstab

#
# /etc/fstab
# Created by anaconda on Wed Jul 25 14:07:11 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/fedora_sle--fedora-root /                       xfs     defaults,x-systemd.device-timeout=0 0 0
UUID=6ccb7110-c197-47ec-b26b-4e406872b0fa /boot                   ext4    defaults        1 2
UUID=7B30-E2CA          /boot/efi               vfat    umask=0077,shortname=winnt 0 2
/dev/mapper/fedora_sle--fedora-home /home                   xfs     defaults,x-systemd.device-timeout=0 0 0

The fstab lists / and /home, but the effective mount points are /sysroot and /var/home respectively.

@jlebon
Member

jlebon commented Jul 25, 2018

Hmm, I can't reproduce this. I tried upgrading a cloud VM from that same version to the latest with /var/roothome and /var/home mounts. Though I didn't try to install from an older ISO. (Any reason you're not using the latest ISO BTW? See the bottom of https://getfedora.org/en/atomic/download/ for links).

@shannara
Author

@jlebon, thanks for trying to reproduce it.

I used this ISO because it was already in my download directory; I didn't think to download the latest one.
Since this is a new machine I can reinstall with the latest ISO, but it's strange that the same version works for you. Maybe it's the environment (UEFI / Secure Boot / encrypted LVM).

@jlebon
Member

jlebon commented Jul 25, 2018

For context, from IRC, @shannara also tried the latest ISO but got issues booting:

11:49:28 < shannara> with last iso, not grub display, boot seem break.

Though I just tried the latest ISO here as well in a UEFI VM with encrypted VG and a /home mount, and it worked OK (I did have to wait a bit after entering the passphrase at boot for some units that were taking a while). So it's potentially a VM vs bare metal issue, which would make this especially tricky to nail down.

@cgwalters
Member

Assuming the latest ISO picked up the updated kernel, I'm guessing this is a kernel regression; maybe something going wrong with the driver for whatever disks are in use.

You can test this by using rpm-ostree override to install just the kernel from 28.20180722.0.
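
Roughly like this (a sketch, not verified on this exact host: the kernel NVR in 28.20180722.0 is assumed here to be 4.17.7-200.fc28 based on a later comment, and koji is just one way to fetch the RPMs):

# Fetch the newer kernel packages (requires the koji client)
koji download-build --arch=x86_64 kernel-4.17.7-200.fc28

# Stay on the 20180515.1 tree but swap in only the newer kernel
rpm-ostree override replace \
    kernel-4.17.7-200.fc28.x86_64.rpm \
    kernel-core-4.17.7-200.fc28.x86_64.rpm \
    kernel-modules-4.17.7-200.fc28.x86_64.rpm
systemctl reboot

# To undo the override later:
#   rpm-ostree override reset kernel kernel-core kernel-modules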

@shannara
Author

shannara commented Jul 27, 2018

Hi,

Sorry for the late response, I was out of the office yesterday :(. But good news: I ran rpm-ostree upgrade this morning and it worked.
kernel: 4.17.7-200

:/
Docker failed to start and has to be started manually, and the /var/log/firewalld folder is missing:

Jul 27 10:00:48 sle-fedora systemd[1]: Starting Docker Application Container Engine...
Jul 27 10:00:48 sle-fedora dockerd-current[1035]: time="2018-07-27T10:00:48.241016706+02:00" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
Jul 27 10:00:48 sle-fedora dockerd-current[1035]: time="2018-07-27T10:00:48.241301402+02:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting -tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!>
Jul 27 10:00:48 sle-fedora dockerd-current[1035]: listen tcp 10.192.10.10:2376: bind: cannot assign requested address
Jul 27 10:00:48 sle-fedora systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jul 27 10:00:48 sle-fedora systemd[1]: docker.service: Failed with result 'exit-code'.
Jul 27 10:00:48 sle-fedora systemd[1]: Failed to start Docker Application Container Engine.
Jul 27 10:02:40 sle-fedora systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 27 10:02:40 sle-fedora firewalld[1681]: ERROR: Failed to open log file '/var/log/firewalld': [Errno 13] Permission denied: '/var/log/firewalld'

@cgwalters
Member

It sounds like there's a problem with /var in general. Any errors in systemctl status systemd-tmpfiles-setup?

@shannara
Author

This is the output of systemctl status systemd-tmpfiles-setup:

[root@sle-fedora ~]# systemctl status systemd-tmpfiles-setup
● systemd-tmpfiles-setup.service - Create Volatile Files and Directories
   Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-setup.service; static; vendor preset: disabled)
   Active: active (exited) since Fri 2018-07-27 10:00:47 CEST; 7h ago
     Docs: man:tmpfiles.d(5)
           man:systemd-tmpfiles(8)
  Process: 858 ExecStart=/usr/bin/systemd-tmpfiles --create --remove --boot --exclude-prefix=/dev (code=exited, status=0/SUCCESS)
 Main PID: 858 (code=exited, status=0/SUCCESS)

Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/systemd.conf:43] Duplicate line for path "/var/lib/systemd/coredump", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/tmp.conf:12] Duplicate line for path "/var/tmp", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/var.conf:19] Duplicate line for path "/var/cache", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/var.conf:21] Duplicate line for path "/var/lib", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: [/usr/lib/tmpfiles.d/var.conf:23] Duplicate line for path "/var/spool", ignoring.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: "/home" already exists and is not a directory.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: "/srv" already exists and is not a directory.
Jul 27 10:00:47 sle-fedora systemd-tmpfiles[858]: "/tmp" already exists and is not a directory.
Jul 27 10:00:47 sle-fedora systemd[1]: Started Create Volatile Files and Directories.

@cgwalters cgwalters changed the title Upgrade ostree from 20180515.1 to 20180722.0 failed. Upgrading Fedora Atomic Host 20180515.1 to 20180722.0 failed Jul 30, 2018
@cgwalters
Member

FWIW https://src.fedoraproject.org/rpms/initscripts/c/7276c52c2cf057f5a82fd73498b1c7137a615e54?branch=master split out the fedora-readonly.service - though I don't think it's really related here.

@dustymabe
Member

I hit an issue when trying out Atomic Workstation (Silverblue). This was on a MacBook Pro (2015), so bare metal EFI and no dual boot. I installed from an ISO (which happened to be the beta ISO, Fedora-AtomicWorkstation-ostree-x86_64-28_Beta-1.3.iso) and then did an rpm-ostree upgrade from a local mirror. After the upgrade I got an error with ostree-remount.service; attempting to boot the older tree worked fine. Attached are all the logs.

I should really try this with the officially released ISO, but I'm tethered to my hotspot right now since the local ISP is having an outage.

journal.txt
ostree-remount-journal.txt
rpm-ostree-status.txt
systemctl-status-ostree-remount.txt

@dustymabe
Member

also, no /var/ mount.

@cgwalters
Member

That's quite strange...the service exits successfully according to the audit logs. But for some reason systemd (probably?) is trying to execute it again.

I wonder if there's a core race condition here somewhere in systemd where ostree-remount is going in and changing the mount status, systemd detects that and decides that it hasn't completed local-fs.target or something, and then tries again?

@jlebon
Member

jlebon commented Jul 31, 2018

Are you able to log in on the serial console when you reach emergency.target? If so, can you post the output of findmnt?

Alternatively/additionally, before running upgrade (run cleanup -p if already done), can you override ostree-remount.service so that the ExecStart is prefixed with /usr/bin/strace?
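
Something like this drop-in would do it (a sketch: the /usr/lib/ostree/ostree-remount path is an assumption — check the real command with systemctl cat ostree-remount.service — and strace has to be in the image, e.g. layered via rpm-ostree install strace):

# Wrap the existing ExecStart with strace; write the trace to /run, which is
# writable even before /var is fully set up
mkdir -p /etc/systemd/system/ostree-remount.service.d
cat > /etc/systemd/system/ostree-remount.service.d/strace.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/strace -f -o /run/ostree-remount.strace /usr/lib/ostree/ostree-remount
EOF
systemctl daemon-reload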

@cgwalters
Member

Hmm... wait, I have an idea. Today our fstab generator does:

                               /* We need /sysroot mounted writable first */
                               "After=ostree-remount.service\n"

But we don't have RemainAfterExit=yes in ostree-remount...I bet systemd may try to run it once for each mount...
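
If that's the cause, a local drop-in should be enough to confirm it before any upstream fix lands (a sketch of the idea, not the actual ostree patch):

# Keep the oneshot unit "active" after it exits, so later units ordered
# After=ostree-remount.service don't make systemd start it again
mkdir -p /etc/systemd/system/ostree-remount.service.d
cat > /etc/systemd/system/ostree-remount.service.d/remain.conf <<'EOF'
[Service]
RemainAfterExit=yes
EOF
systemctl daemon-reload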

@cgwalters
Member

Testing out that theory now 🔬

@dustymabe
Member

dustymabe commented Jul 31, 2018

Waiting on Colin to come back with the results of his test; after that I'll try to answer @jlebon's questions.

cgwalters added a commit to cgwalters/ostree that referenced this issue Jul 31, 2018
This is standard practice for units like this; e.g. it's what
`systemd-remount-fs.service` does.  I think it may be part of
or the whole cause for
coreos/rpm-ostree#1471

I haven't reproduced the problem exactly but it seems to me that
if the unit starts and is GC'd, then when systemd goes to execute
a later unit it might end up restarting it.

A noticeable side effect of this is that `systemctl status ostree-remount`
exits with code `0` as expected.
@jlebon
Member

jlebon commented Jul 31, 2018

At least https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750683 confirms start-limit-hit can happen if not using RemainAfterExit.

@cgwalters
Member

Yeah...I wasn't able to reproduce this in some quick testing but it's clearly a race condition per the quote:

Without RemainAfterExit, when another service is started and depends on the
oneshot service (directly or indirectly, typically by depending on
something like basic.target which depends on the oneshot), systemd will
not consider the oneshot to be already active if it has finished
executing and exited, and it will thus be started again.

I actually find myself wondering now what the heck a use case is for a Type=oneshot without RemainAfterExit=yes apart from systemd-run type things.

rh-atomic-bot pushed a commit to ostreedev/ostree that referenced this issue Jul 31, 2018

Closes: #1697
Approved by: jlebon
@dustymabe
Member

Are you able to log on the serial console when you reach emergency.target? If so, can you post findmnt?

findmnt.txt

Alternatively/additionally, before running upgrade (run cleanup -p if already done), can you override ostree-remount.service so that the ExecStart is prefixed with /usr/bin/strace?

Unfortunately there's no strace in the initramfs.

@dustymabe
Member

dustymabe commented Aug 1, 2018

ostreedev/ostree#1697 seems to have solved it for me: ostreedev/ostree#1697 (comment).

I just wonder why we only see this sometimes? I'd expect to see this a lot more for some reason.

@cgwalters
Member

I just wonder why we only see this sometimes? I'd expect to see this a lot more for some reason.

I don't know offhand; do you have a lot of mount units configured? Any unusual configuration?
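
(For reference, a quick way to answer that on the affected machine:)

# List and count the mount units systemd knows about on this boot
systemctl list-units --type=mount --all --no-pager
systemctl list-units --type=mount --all --no-legend | wc -l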

@dustymabe
Member

I don't know offhand; do you have a lot of mount units configured? Any unusual configuration?

I just did a kickstart install with the following partition setup, then ran rpm-ostree upgrade after first login and after configuring the remote.

install
cdrom
reboot
lang en_US.UTF-8
keyboard us
rootpw  foo
authconfig --enableshadow --passalgo=sha512
selinux --enforcing
timezone --utc America/New_York
bootloader --location=mbr --driveorder=sda --append="crashkernel=auto"

# Partition info
zerombr
clearpart --all
reqpart
part /boot --size=1024 --fstype="ext4"
part pv.01 --grow --encrypted --passphrase foo
volgroup vgroot pv.01
logvol / --size=10240 --fstype="xfs" --name=root --vgname=vgroot
logvol swap --size=4096 --fstype="swap" --name=swap --vgname=vgroot

ostreesetup --osname="fedora" --remote="fedora" --url="file:///ostree/repo" --ref="fedora/28/x86_64/workstation" --nogpg

Maybe the fact that the disk is encrypted?

@cgwalters
Member

@shannara Do you see something like this in your log?

Jul 31 16:24:57 localhost.localdomain systemd[1]: Starting OSTree Remount OS/ bind mounts...
Jul 31 16:24:57 localhost.localdomain ostree-remount[689]: Remounted: /sysroot
Jul 31 16:24:57 localhost.localdomain systemd[1]: Started OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: Starting OSTree Remount OS/ bind mounts...
Jul 31 16:24:58 localhost.localdomain systemd[1]: Started OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: Starting OSTree Remount OS/ bind mounts...
Jul 31 16:24:58 localhost.localdomain systemd[1]: Started OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: Starting OSTree Remount OS/ bind mounts...
Jul 31 16:24:58 localhost.localdomain systemd[1]: Started OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: Starting OSTree Remount OS/ bind mounts...
Jul 31 16:24:58 localhost.localdomain systemd[1]: Started OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: ostree-remount.service: Start request repeated too quickly.
Jul 31 16:24:58 localhost.localdomain systemd[1]: ostree-remount.service: Failed with result 'start-limit-hit'.
Jul 31 16:24:58 localhost.localdomain systemd[1]: Failed to start OSTree Remount OS/ bind mounts.
Jul 31 16:24:58 localhost.localdomain systemd[1]: ostree-remount.service: Triggering OnFailure= dependencies.

?

@shannara
Author

shannara commented Aug 3, 2018

Hi @cgwalters,

Because I reinstalled with the 20180625 ISO, I no longer have the previous journal from the 20180515 install, but I do remember that during boot there were a few lines about mounting /var/home ("already mounted" or similar).

@dustymabe
Member

Do we think ostreedev/ostree#1697 may have fixed the problem for this issue? Should we consider closing it?

@cgwalters
Member

Maybe? I dunno. The OP's issue had a different symptom but it could have been related, it's hard to say for sure.

@cgwalters
Member

@shannara Any luck with a more recent FAH release?

@dustymabe
Member

I'm going to close this since I think ostreedev/ostree#1697 fixed the problem. @shannara - please re-open if you have more info to add here or if you think the problem is not fixed.
