
worker VM comes up in emergency mode when using tectonic install #318

Closed
bparees opened this issue Sep 26, 2018 · 28 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bparees

bparees commented Sep 26, 2018

The bootstrap and master nodes come up OK, but the worker goes into emergency mode.

The only tweak is that I'm resizing the image filesystem:

wget http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/rhcos-qemu.qcow2.gz
gzip -d rhcos-qemu.qcow2.gz 
cp rhcos-qemu.qcow2 rhcos-qemu.new.qcow2 
qemu-img resize rhcos-qemu.new.qcow2  +20G
virt-resize --expand /dev/vda2 rhcos-qemu.qcow2 rhcos-qemu.new.qcow2 
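
For reference, a quick way to see how the resized image looks to libguestfs before booting it (a sketch; this only lists partitions and detected filesystems, it does not prove the filesystem is healthy):

virt-filesystems --long --parts --filesystems -a rhcos-qemu.new.qcow2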
@dustymabe
Member

FYI: @jlebon and I are taking point on the investigation.

@dustymabe
Member

> virt-resize --expand /dev/vda2 rhcos-qemu.qcow2 rhcos-qemu.new.qcow2

Note that with #293 this step shouldn't be necessary.

@ashcrow
Member

ashcrow commented Sep 26, 2018

Do we need to let those folks doing testing know to hold off for the time being?

@dustymabe
Member

> Do we need to let those folks doing testing know to hold off for the time being?

Only if they resize their disks like @bparees did above.

@jlebon
Member

jlebon commented Sep 26, 2018

Are we sure this is related to resizing the image though?

@dustymabe
Member

> Are we sure this is related to resizing the image though?

Not yet; hopefully we'll know soon.

@ashcrow
Member

ashcrow commented Sep 26, 2018

/assign @dustymabe @jlebon

@ashcrow
Member

ashcrow commented Sep 26, 2018

/kind bug

@openshift-ci-robot added the kind/bug label Sep 26, 2018
@jlebon
Member

jlebon commented Sep 26, 2018

Booting the node after virt-resize definitely works. Passing it through the installer now to see if I can reproduce the worker failure.

@dustymabe
Member

> Booting the node after virt-resize definitely works. Passing it through the installer now to see if I can reproduce the worker failure.

Yep, I'm seeing the same. I haven't run the installer yet.

@dustymabe
Member

FYI, I opened a bug for the filesystem not getting resized on boot: #319

@jlebon
Member

jlebon commented Sep 26, 2018

OK, reproduced this. So I think this might be a bug in virt-resize. It looks like it's corrupting the superblock, which is causing sysroot.mount to fail:

Sep 26 20:39:36 worker-dsvgr.mco.testing kernel: XFS (vda2): last sector read failed
Sep 26 20:39:36 worker-dsvgr.mco.testing mount[504]: mount: /dev/vda2: can't read superblock
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: sysroot.mount mount process exited, code=exited status=32
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: Failed to mount /sysroot.
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: Dependency failed for Initrd Root File System.
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: Dependency failed for Reload Configuration from the Real Root.
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: Job initrd-parse-etc.service/start failed with result 'dependency'.
Sep 26 20:39:36 worker-dsvgr.mco.testing systemd[1]: Triggering OnFailure= dependencies of initrd-parse-etc.service.

You can see the same issue when trying to guestmount:

[root@pet /]# LIBGUESTFS_BACKEND=direct guestmount --ro -d worker-dsvgr -m /dev/sda2 /mnt/tmp
libguestfs: error: mount_options: mount exited with status 32: mount: /sysroot: can't read superblock on /dev/sda2.
guestmount: ‘/dev/sda2’ could not be mounted.
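
For reference, one way to double-check the filesystem without booting the guest is to expose the image over NBD and run a read-only XFS check. This is just a sketch, assuming the root filesystem ends up on the second partition at /dev/nbd0p2:

modprobe nbd
qemu-nbd --connect=/dev/nbd0 --read-only rhcos-qemu.new.qcow2
xfs_repair -n /dev/nbd0p2      # -n: check only, make no changes
qemu-nbd --disconnect /dev/nbd0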

@jlebon
Member

jlebon commented Sep 26, 2018

So I'd say for now, let's figure out #319, and I'll see about digging deeper into libguestfs and reporting an issue if there isn't already one.

@bparees
Author

bparees commented Sep 26, 2018

@jlebon is there an alternate way I can grow the filesystem in the meantime?

@jlebon
Member

jlebon commented Sep 26, 2018

Are you able to get the installer to finish with the default size at least? If so, then you can just xfs_growfs /sysroot manually once the workers are up.
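
A rough sketch of that manual step, assuming the worker's root partition already spans the extra space (device and mount point names are the ones from this thread):

sudo xfs_growfs /sysroot      # grow the XFS filesystem on /dev/vda2 to fill its partition
df -h /sysroot                # confirm the new size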

@dustymabe
Member

I wonder if it's because you are overwriting an existing disk image. @jlebon can you try to reproduce with an empty outdisk?

[dustymabe@media images]$ zcat rhcos-4.0.6179-qemu.qcow2.gz > rhcos-4.0.6179-qemu.qcow2
[dustymabe@media images]$ truncate -s 40G outdisk.qcow2
[dustymabe@media images]$ virt-resize --expand /dev/vda2 rhcos-4.0.6179-qemu.qcow2 outdisk.qcow2 | tee
[   0.0] Examining rhcos-4.0.6179-qemu.qcow2
**********

Summary of changes:

/dev/sda1: This partition will be left alone.

/dev/sda2: This partition will be resized from 7.7G to 39.7G.  The
filesystem xfs on /dev/sda2 will be expanded using the ‘xfs_growfs’
method.

**********
[   3.4] Setting up initial partition table on outdisk.qcow2
[   3.8] Copying /dev/sda1
[   4.7] Copying /dev/sda2
[  27.6] Expanding /dev/sda2 using the ‘xfs_growfs’ method

Resize operation completed with no errors.  Before deleting the old disk, 
carefully check that the resized disk boots and works correctly.

@jlebon
Member

jlebon commented Sep 26, 2018

Ah, good point. I did see the truncate guidelines in virt-resize(1), though I didn't think it would affect this. Will try that out.

@ashcrow
Member

ashcrow commented Sep 26, 2018

> So I'd say for now, let's figure out #319, and I'll see about digging deeper into libguestfs and reporting an issue if there isn't already one.

FWIW @jlebon worked on #319. #320 merged and should be in the next compose.

@bparees
Author

bparees commented Sep 26, 2018

The truncate approach left me with a totally hosed install; no VMs would even come up, and terraform just spun on:
module.libvirt_base_volume.libvirt_volume.coreos_base: Still creating... (1m30s elapsed)
module.bootstrap.libvirt_ignition.bootstrap: Still creating... (1m30s elapsed)

@dustymabe
Member

> no VMs would even come up

I bet this is because the installer specifies the disk format as qcow2, and truncate essentially produces a raw image. The pipeline is running now with #320 in it, so we should have a new image sometime soon.
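
A quick way to confirm what format the installer is being handed (file name from the example above; a truncate'd target reports "file format: raw" rather than qcow2):

qemu-img info outdisk.qcow2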

@jlebon
Member

jlebon commented Sep 26, 2018

This seems to work:

$ qemu-img create -f qcow2 -o preallocation=metadata rhcos-4.0.6179-qemu-larger.qcow2 12G
Formatting 'rhcos-4.0.6179-qemu-larger.qcow2', fmt=qcow2 size=12884901888 cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16
$ virt-resize --expand /dev/vda2 rhcos-4.0.6179-qemu.qcow2 rhcos-4.0.6179-qemu-larger.qcow2

module.libvirt_base_volume.libvirt_volume.coreos_base: Still creating... (1m30s elapsed)
module.bootstrap.libvirt_ignition.bootstrap: Still creating... (1m30s elapsed)

Hmm, I did notice it taking longer, but it finished in the end. I think it's from copying over the disk to /var/lib/libvirt/images?

(Though again, now that #319 is merged, there's not much use in doing this, especially since virt-resize leaves you with a fully sized image sitting on your disk, plus a second copy for the base layer in /var/lib/libvirt/images.)
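
If the fully sized copy is a concern, one workaround (a sketch; file names follow the example above) is to rewrite the image with qemu-img convert, which skips unallocated clusters in the output:

qemu-img convert -O qcow2 rhcos-4.0.6179-qemu-larger.qcow2 rhcos-4.0.6179-qemu-larger-sparse.qcow2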

@dustymabe
Member

> The pipeline is running now with #320 in it, so we should have a new image sometime soon.

This will be fixed in 4.0.6185, so look for that image to show up in the output directory in the next 30 minutes to an hour.

@dustymabe
Member

> so look for that image to show up in the output directory

Just landed.

@bparees
Author

bparees commented Sep 26, 2018

The worker comes up and seems to have expanded properly. Well, sort of properly: it's got 16 GB. I had grown the image by 20 GB, and the master VM shows 36 GB as I'd expect.

$ ssh [email protected]
-bash-4.2$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/vda2       37430252 4798312  32631940  13% /
$ ssh [email protected]
-bash-4.2$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/vda2       16972780 4794784  12177996  29% /

I assume that may require enlisting the installer team, but I'm going to leave it with you guys for the moment...

@dustymabe
Member

I think we can close this now, since it seems the corruption was caused by the way the virt-resize was being done (copying over an existing non-empty image).

The fix for issue #319 should help you too, I think.

Fixed in #320.

@bparees
Author

bparees commented Sep 29, 2018

@dustymabe you saw my last #318 (comment), right?

@dustymabe
Member

> @dustymabe you saw my last #318 (comment), right?

I misread it.

This issue has been migrated to openshift/cluster-api-provider-libvirt#28

@bparees
Author

bparees commented Oct 1, 2018

Thanks @dustymabe!
