Provision node fails with fcos image 33.20210301.3.1 - error setting value of extended attribute #566

Closed
msheldyakov opened this issue Mar 17, 2021 · 31 comments

@msheldyakov

msheldyakov commented Mar 17, 2021

Node provisioning fails on an AWS-compatible provider (not AWS itself) with FCOS image 33.20210301.3.1 (the latest in the stable stream).
It works with 33.20210217.3.0.

/run/bin/machine-config-daemon firstboot-complete-machineconfig
W0317 21:12:43.594924 3009 run.go:44] nice failed: running nice -- ionice -c 3 podman cp cd0b412099e22f6a0d227b667c098f38fc8b11047eed222940b179bf4efaf98b:/ /run/mco-machine-os-content/os-content-870423771 failed: Error: 2 errors occurred:
error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc0005bf8f0), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc0005bf900), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1614897854]" on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported

Version
4.7.0-0.okd-2021-03-07-090821
UPI
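
The copier.request dump buries the actual failure, so a small filter can help when reading logs like the one above. This is only an illustrative sketch: the function name summarize_xattr_errors is made up here, and it assumes the log wording matches the message quoted in this report.

```shell
# Reduce podman's long "copier.request{...}" error lines to
# "<file>: <xattr name> (<reason>)". Reads log text on stdin.
summarize_xattr_errors() {
    sed -n 's/.*error setting value of extended attribute "\([^"]*\)" on "\([^"]*\)": \(.*\)$/\2: \1 (\3)/p'
}

# Possible usage on a node (unit name taken from the logs in this thread):
#   journalctl -b -u release-image.service | summarize_xattr_errors
```

Lines that do not contain the xattr error are dropped, so the output is empty when the failure has a different cause.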

@vrutkovs
Member

vrutkovs commented Mar 18, 2021

Hitting the same on GCP using FCOS 33.20210301.3.1
Mar 18 11:03:38 vrutkovs-sc9fr-bootstrap.c.openshift-gce-devel.internal release-image-download.sh[2945]: * error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc00030ed40), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc00030ed50), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1615979305]" on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported

Looks like something regressed in rpm-ostree:

  • 33.20210117.3.2 (currently used in CI): rpm-ostree 2020.10
  • 33.20210201.3.0 - rpm-ostree 2021.1
  • 33.20210217.3.0 - rpm-ostree 2021.1
  • 33.20210301.3.1 - rpm-ostree 2021.2

The MCD remains the same, so I don't think an MCD change is causing this.

cc'ing @cgwalters

@cgwalters

Exciting. So...that xattr comes from librepo; this code was recently changed in rpm-software-management/librepo@193c4fd

I think indeed rpm-ostree should strip these xattrs when generating the extensions. But also the container stack should...well, it should probably ignore failures to set user.* xattrs.

I think the problem here is that the release image download script is trying to write to /tmp which famously does not support user.* xattrs for bad reasons.
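
To check whether a given directory accepts user.* xattrs, a probe along these lines may help. It is a minimal sketch: supports_user_xattrs and the user.probe attribute name are hypothetical, and it assumes setfattr (from the attr package) is installed on the host.

```shell
# Try to set a throwaway user.* xattr on a temp file inside the directory.
# Returns 0 if it worked, 1 if setfattr failed (unsupported filesystem, or
# setfattr missing), 2 if the probe file could not be created at all.
supports_user_xattrs() {
    dir=$1
    probe=$(mktemp "${dir}/xattr-probe.XXXXXX" 2>/dev/null) || return 2
    if setfattr -n user.probe -v 1 "$probe" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    rm -f "$probe"
    return "$rc"
}

# Possible usage:
#   supports_user_xattrs /tmp && echo "user.* xattrs OK" || echo "not supported"
```

Running it against both /tmp and the podman cp destination directory should show whether the filesystem is what rejects the attribute.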

@cgwalters

A quick short term fix is probably a patch to coreos-assembler to strip all xattrs from content we write to the container image - there shouldn't be any.

@cgwalters

I am almost certain this relates to the version of rpm-ostree in coreos-assembler, not the version on the host. I'm not reproducing this locally, but I have a newer librepo. And so should the latest coreos-assembler.

@cgwalters

Hum...in the MCO the podman cp code is a fallback when the oc image extract path doesn't work; could this come down to something like not having oc on the host in OKD?

@vrutkovs
Member

Could this come down to something like not having oc on the host in OKD?

Right, that's expected - during initial install we start with plain FCOS with just podman, so MCD pivot attempts oc first and then falls back to podman
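
For readers following along, the "oc first, then podman" behaviour can be sketched roughly as below. This is not the MCO's actual code (that is Go); it is a simplified shell illustration that reuses the commands visible in the logs in this thread and leaves out nice/ionice and the auth options. The function name extract_os_content is made up.

```shell
# Extract the root of an OS content image into a directory, preferring
# `oc image extract` and falling back to `podman cp` when that fails.
extract_os_content() {
    image=$1
    dest=$2
    if command -v oc >/dev/null 2>&1; then
        oc image extract --path "/:${dest}" "$image" && return 0
    fi
    echo "Falling back to using podman cp to fetch OS image content" >&2
    podman pull -q "$image" >/dev/null || return 1
    cid=$(podman create --net=none "$image") || return 1
    rc=0
    podman cp "${cid}:/" "$dest" || rc=$?
    # Clean up the temporary container regardless of the copy result.
    podman rm "$cid" >/dev/null 2>&1 || true
    return "$rc"
}

# Possible usage (image reference left as a placeholder):
#   extract_os_content "<okd-content image>" /run/mco-machine-os-content/os-content
```

On a plain FCOS node without oc, only the podman branch runs, which is where the xattr error in this issue shows up.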

@cgwalters

OK yeah I'm pretty sure the bug here is around
https://github.com/openshift/okd-machine-os/blob/master/entrypoint.sh#L122-L128

$ sudo yumdownloader usbguard
...                                                                      2.0 MB/s | 512 kB     00:00    
$ ll
total 992K
-rw-r--r--. 1 root root 512K Mar 18 17:21 usbguard-1.0.0-1.fc33.i686.rpm
-rw-r--r--. 1 root root 479K Mar 18 17:21 usbguard-1.0.0-1.fc33.x86_64.rpm
$ getfattr -m . *
# file: usbguard-1.0.0-1.fc33.i686.rpm
security.selinux
user.Zif.MdChecksum[1616088084]

I am pretty sure rpm-ostree compose extensions explicitly drops these xattrs, but here you're using yumdownloader directly, and it uses librepo, which writes them.

So...try something like this in your build script:

for f in *.rpm; do getfattr -m '^user\.' "$f" | grep -E '^user\.' | while read -r attr; do setfattr -x "$attr" "$f"; done; done
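
After stripping, it may be worth confirming that nothing was missed. The following is a hedged sketch rather than anything from the okd-machine-os build script: both function names are hypothetical, and it assumes getfattr is available in the build environment.

```shell
# Print only the user.* attribute names from `getfattr -m .`-style output
# on stdin (the "# file:" header and security.* names are dropped).
user_xattr_names() {
    grep -E '^user\.' || true
}

# List every RPM under a directory that still carries a user.* xattr.
rpms_with_user_xattrs() {
    dir=$1
    find "$dir" -type f -name '*.rpm' | while IFS= read -r f; do
        if getfattr -m . "$f" 2>/dev/null | user_xattr_names | grep -q .; then
            printf '%s\n' "$f"
        fi
    done
}

# Possible usage in a build step:
#   [ -z "$(rpms_with_user_xattrs .)" ] || { echo "user xattrs remain" >&2; exit 1; }
```

An empty result means no user.* attributes were found on any RPM under the directory.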

@cgwalters

There should probably be yumdownloader --no-xattrs or something.

@vrutkovs
Member

Blocked by another issue in this image - podman cp is broken, so bootstrap doesn't pass, see coreos/fedora-coreos-tracker#771

@nate-duke

Is there any hope of a workaround for this?

@vrutkovs
Member

Workaround: start with 33.20210217.3.0. (see https://builds.coreos.fedoraproject.org/browser?stream=stable)
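
If you prefer to script the lookup instead of clicking through the build browser, something like this may work. It is a sketch under an assumption: the URL layout below (prod/streams/<stream>/builds/<version>/<arch>/meta.json) is how the build browser appeared to organise its metadata, and it is not stated anywhere in this thread. The function name fcos_meta_url is made up.

```shell
# Build the (assumed) metadata URL for one specific FCOS build.
fcos_meta_url() {
    stream=$1
    version=$2
    arch=${3:-x86_64}
    printf 'https://builds.coreos.fedoraproject.org/prod/streams/%s/builds/%s/%s/meta.json\n' \
        "$stream" "$version" "$arch"
}

# Print the URL for the known-good build mentioned above.
fcos_meta_url stable 33.20210217.3.0

# To fetch it (requires network access):
#   curl -fsSL "$(fcos_meta_url stable 33.20210217.3.0)"
```

If the layout assumption holds, the metadata file should list the image artifacts for that exact version.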

@acsulli

acsulli commented Mar 20, 2021

I'm still seeing this error with both the stable (33.20210301.3.1) and "next" (33.20210315.1.0) releases.

Mar 20 13:47:13 bootstrap.okd.work.lan release-image-download.sh[1904]: W0320 13:47:13.797667    1904 run.go:44] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-716418937 quay.io/openshift/okd-content@sha256:21a90d62459c272718c77ce5e28cd38026b877cb3461c193f687c37a696db57e failed: ionice: failed to execute oc: No such file or directory

Bootstrap fails to complete during the /usr/local/bin/machine-config-daemon pivot action.

@acsulli

acsulli commented Mar 20, 2021

Manually adding oc to the bootstrap node gets past the error. I do this more or less as soon as the node is available on the network:

scp oc core@bootstrap:
ssh core@bootstrap
sudo mv oc /usr/local/bin

However, this only allows bootstrap to complete; the other nodes have the same issue and do not pivot.

@vrutkovs
Member

failed to execute oc: No such file or directory

This is expected - the node starts with plain FCOS which has no oc. The script should fall back to podman though

@acsulli

acsulli commented Mar 20, 2021

The script should fall back to podman though

🤔 I'm seeing some odd behavior then, very similar to the OP.

These errors:

Mar 20 05:13:34 bootstrap.owv.work.lan release-image-download.sh[1921]: Error: 2 errors occurred:
Mar 20 05:13:34 bootstrap.owv.work.lan release-image-download.sh[1921]:         * error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc000495970), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000495980), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1614897854]" on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported
Mar 20 05:13:34 bootstrap.owv.work.lan release-image-download.sh[1921]:         * error copying from container: error during bulk transfer for copier.request{Request:"GET", Root:"/", preservedRoot:"/var/lib/containers/storage/overlay/06faedd42b3d97ad5fb662546675bc1ebf2c69b6d22912c063a7459c5ea09052/merged", rootPrefix:"/var/lib/containers/storage/overlay/06faedd42b3d97ad5fb662546675bc1ebf2c69b6d22912c063a7459c5ea09052/merged", Directory:"/", preservedDirectory:"/var/lib/containers/storage/overlay/06faedd42b3d97ad5fb662546675bc1ebf2c69b6d22912c063a7459c5ea09052/merged", Globs:[]string{"/"}, preservedGlobs:[]string{"/var/lib/containers/storage/overlay/06faedd42b3d97ad5fb662546675bc1ebf2c69b6d22912c063a7459c5ea09052/merged/."}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(0xc0004af6d0), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc0004af6e0), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: get: "/"("/"): error copying /extensions/okd/checkpolicy-3.1-3.fc33.x86_64.rpm: write bulk-writer: broken pipe

Eventually lead to this:

Mar 20 05:14:55 bootstrap.owv.work.lan release-image-download.sh[1921]: : exit status 1
Mar 20 05:14:55 bootstrap.owv.work.lan systemd[1]: release-image.service: Main process exited, code=exited, status=1/FAILURE
Mar 20 05:14:55 bootstrap.owv.work.lan systemd[1]: release-image.service: Failed with result 'exit-code'.
Mar 20 05:14:55 bootstrap.owv.work.lan systemd[1]: Failed to start Download the OpenShift Release Image.
Mar 20 05:14:55 bootstrap.owv.work.lan systemd[1]: Dependency failed for Bootstrap a Kubernetes cluster.
Mar 20 05:14:55 bootstrap.owv.work.lan systemd[1]: bootkube.service: Job bootkube.service/start failed with result 'dependency'.

I manually stepped through the release-image-download.sh script; it does not make it past the missing-oc error in this comment.

Is there some additional information I can provide which will be helpful?

@vrutkovs
Member

Yup, you'd need to fall back to 33.20210217.3.0

@bohdan-udovenko-cognite

Workaround: start with 33.20210217.3.0. (see https://builds.coreos.fedoraproject.org/browser?stream=stable)

How do I start with a specific FCOS image version? In the machine-config I can see osImageURL, but it points to quay.io. How do I match the version? Or did I go the wrong way?

@vrutkovs
Member

vrutkovs commented Apr 9, 2021

Still no luck on the testing stream with podman 3.1: xattrs are stripped, including from drpms, and yet the same error message comes from podman cp.

@cgwalters any ideas we could try?

@vrutkovs
Member

vrutkovs commented Apr 9, 2021

(Ideally I'd replace OS extensions with a fully-baked OKD image, but CI build system limitations restrict us to the osExtensions solution.)

@bohdan-udovenko-cognite

Workaround: start with 33.20210217.3.0. (see https://builds.coreos.fedoraproject.org/browser?stream=stable)

How do I start with a specific FCOS image version? In the machine-config I can see osImageURL, but it points to quay.io. How do I match the version? Or did I go the wrong way?

Answering my own question: edit the MachineSet and change spec.template.spec.disks.image from okd....rhos-image to projects/fedora-coreos-cloud/global/images/fedora-coreos-33-20210217-3-0-gcp-x86-64
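
For anyone who wants to script that change, a JSON patch could look roughly like this. Treat it as a sketch: the full field path used below (providerSpec.value.disks[0].image) is my assumption about the GCP MachineSet layout, since the path above is abbreviated, and machineset_image_patch is a made-up helper name. Check the path against your own MachineSet before applying anything.

```shell
# Emit a JSON patch that replaces the boot disk image of a GCP MachineSet.
# Assumes the image value contains no characters that need JSON escaping.
machineset_image_patch() {
    image=$1
    printf '[{"op":"replace","path":"/spec/template/spec/providerSpec/value/disks/0/image","value":"%s"}]\n' "$image"
}

# Print the patch for the known-good FCOS image.
machineset_image_patch projects/fedora-coreos-cloud/global/images/fedora-coreos-33-20210217-3-0-gcp-x86-64

# Applying it (MachineSet name left as a placeholder):
#   oc -n openshift-machine-api patch machineset <name> --type=json \
#     -p "$(machineset_image_patch <image>)"
```

Only machines created after the patch pick up the new image; existing machines are not rebuilt by this change alone.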

@davidjsherman

Yup, you'd need to fall back to 33.20210217.3.0

How does one prevent the nodes from pivoting to a later version? I am installing from scratch on UPI bare metal, and I successfully boot the nodes to 33.20210217.3.0, but they always pivot to 33.20210328.3.0 and subsequently fail. The logs show an error in bulk copy, when podman cp tries to set xattrs.

Is there someplace where I can specify that I want the nodes to pivot to 33.20210217.3.0, rather than to the latest stable FCOS33?

@tnozicka

tnozicka commented Apr 16, 2021

One way to avoid the podman issues is to run rsync -avhP --rsync-path="sudo rsync" `which oc` core@<node>:/usr/local/bin/ and then reboot or restart the release-image unit.

@davidjsherman

@tnozicka thanks for the comment, yes, crocking in oc does work as a temporary fix; restarting machine-config-daemon-firstboot.service works for me. I suppose one could automate the hack by adding an openshift manifest.
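
Short of a manifest, the copy-and-restart hack could be scripted per node along these lines. This is a rough sketch: push_oc_commands is a hypothetical helper, the node names are examples, and it only prints the commands so they can be reviewed before being run.

```shell
# Print the rsync and restart commands for each node, without running them.
# First argument: local path to the oc binary; remaining arguments: node names.
push_oc_commands() {
    oc_path=$1
    shift
    for node in "$@"; do
        printf 'rsync -avhP --rsync-path="sudo rsync" %s core@%s:/usr/local/bin/\n' "$oc_path" "$node"
        printf 'ssh core@%s sudo systemctl restart machine-config-daemon-firstboot.service\n' "$node"
    done
}

# Dry run for two example nodes:
push_oc_commands /usr/local/bin/oc wk-1 wk-2

# To actually execute the printed commands:
#   push_oc_commands "$(command -v oc)" wk-1 wk-2 | sh -x
```

Printing first keeps the hack easy to audit, which seems sensible for something that runs sudo on every node.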

But I'm especially asking about @vrutkovs's comment, because it suggests to me that there is a general mechanism for controlling pivots that I don't know about!

@vrutkovs
Member

they always pivot to 33.20210328.3.0

The nodes are expected to update to this version.

and subsequently fail

After the nodes are updated to the expected version, they should no longer use podman and should use oc extract instead (unless your pull secret is invalid, see #578).

@davidjsherman

@vrutkovs thanks for your help. I'm using a pull secret from cloud.openshift.com and openshift-install 4.7.0-0.okd-2021-04-11-124433. My nodes initially boot 33.20210217.3.0 and then pivot to 33.20210328.3.0 as you describe.

According to the logs for machine-config-daemon-firstboot the 3 attempts to use oc image extract fail because there is no oc.

Apr 16 18:15:39 wk-1 machine-config-daemon[2092]: I0416 18:15:39.691368    2092 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-863800975 --registry-config /var/lib/kubelet/config.json quay.io/openshift/okd-content@sha256:16da407404e6cedf64b0b6680ddac2bc2e2a3cc6b761fb8ff92d23ef55c7cfdb
Apr 16 18:15:39 wk-1 machine-config-daemon[2092]: ionice: failed to execute oc: No such file or directory

The pull of quay.io/openshift/okd-content@sha256:16da4074 succeeds but the 6 attempts to use podman cp fail as above.

Apr 16 18:15:54 wk-1 machine-config-daemon[2092]: I0416 18:15:54.702624    2092 update.go:368] Falling back to using podman cp to fetch OS image content
Apr 16 18:15:54 wk-1 machine-config-daemon[2092]: I0416 18:15:54.702647    2092 run.go:18] Running: nice -- ionice -c 3 podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift/okd-content@sha256:16da407404e6cedf64b0b6680ddac2bc2e2a3cc6b761fb8ff92d23ef55c7cfdb
Apr 16 18:17:20 wk-1 podman[2607]: 2021-04-16 18:17:20.582417889 +0000 UTC m=+85.865046843 image pull  
Apr 16 18:17:20 wk-1 machine-config-daemon[2092]: 6452c626a971168f1a0abe2a86c0f65665288be0d6fa82ef713e8a960bf4cc3c
Apr 16 18:17:20 wk-1 machine-config-daemon[2092]: I0416 18:17:20.590721    2092 rpm-ostree.go:261] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-02bbf86e-1383-4772-aee1-6881eda8a8fe quay.io/openshift/okd-content@sha256:16da407404e6cedf64b0b6680ddac2bc2e2a3cc6b761fb8ff92d23ef55c7cfdb
Apr 16 18:17:20 wk-1 podman[2717]: 2021-04-16 18:17:20.805589603 +0000 UTC m=+0.201228558 container create 19933f7431eaa6e344ff8cded9f7356c0814ce1b6e48728e06309076c379ecfb (image=quay.io/openshift/okd-content@sha256:16da407404e6cedf64b0b6680ddac2bc2e2a3cc6b761fb8ff92d23ef55c7cfdb, name=ostree-container-pivot-02bbf86e-1383-4772-aee1-6881eda8a8fe, io.openshift.build.commit.id=570737f3da52045a7082d24effea2f95272524ef, io.openshift.build.commit.author=OpenShift Merge Robot <[email protected]>, io.openshift.build.versions=machine-os=34.20210406.10, io.openshift.release.operator=true, io.openshift.build.namespace=ci-op-3b4279f7, io.openshift.build.version-display-names=machine-os=Fedora CoreOS, id-machine-config-operator-rpms=sha256:3d734a0fc058388e8bdc244c9b96ff22761828532e9bdee71243a1577d9fcd7d, id-artifacts=sha256:291c1695e7c02d77695c75b6ff7999eb35065bcb87b1eac20a680a4f5481b126, io.openshift.build.commit.ref=release-4.7, io.buildah.version=1.16.4, io.openshift.build.commit.message=Merge pull request #112 from vrutkovs/4.7-no-xattrs, io.openshift.build.commit.date=Mon Mar 29 15:36:42 2021 +0000, io.openshift.build.name=machine-os-content, io.openshift.build.source-location=https://github.com/openshift/okd-machine-os, version=34.20210406.10.0)
Apr 16 18:17:20 wk-1 machine-config-daemon[2092]: I0416 18:17:20.810711    2092 run.go:18] Running: nice -- ionice -c 3 podman cp 19933f7431eaa6e344ff8cded9f7356c0814ce1b6e48728e06309076c379ecfb:/ /run/mco-machine-os-content/os-content-863800975
Apr 16 18:17:34 wk-1 machine-config-daemon[2092]: Error: 2 errors occurred:

@nnachefski

Was there a work-around for this issue? I'm hitting this on an AWS GovCloud deployment using the latest FCOS.

@Reamer
Contributor

Reamer commented Apr 21, 2021

Was there a work-around for this issue? I'm hitting this on an AWS GovCloud deployment using the latest FCOS.

Please do not use the latest image. More information in our blog. https://www.okd.io/blog/2021/03/19/please-avoid-using-fcos-33-20210301-3-1.html

@bohdan-udovenko-cognite

Was there a work-around for this issue? I'm hitting this on an AWS GovCloud deployment using the latest FCOS.

You need to edit the MachineSet and change spec.template.spec.disks.[0].image from okd....rhos-image to projects/fedora-coreos-cloud/global/images/fedora-coreos-33-20210217-3-0-gcp-x86-64

@Nurlan199206

Same issue on fedora-coreos-34.20210518.3.0

@vrutkovs
Member

This is resolved in https://github.com/openshift/okd/releases/tag/4.7.0-0.okd-2021-06-13-090745 - machine-os-content no longer ships separate RPMs, so xattrs for these files are no longer breaking release-image-download

vrutkovs unpinned this issue Jun 13, 2021