
daemon: Validate kernel arguments #3105

Merged
merged 1 commit into openshift:master on Apr 29, 2022

Conversation

@cgwalters (Member)

I'm debugging https://bugzilla.redhat.com/show_bug.cgi?id=2075126
and while I haven't verified this is the case, as far as I can tell
from reading the code and thinking about things, if we somehow
fail to apply the expected kernel arguments (which can occur if
`ostree-finalize-staged` fails) then on the next boot we will
drop into `validateOnDiskState()`, which has long checked that all
the expected *files* exist and then marks the update as complete.
But we didn't check the kernel arguments.

That can then cause later problems, because when applying further
updates we'll ask rpm-ostree to remove kernel arguments that
aren't actually present.

Worse, these kernel arguments are often *quite important*
and may even have security-relevant properties (e.g. `nosmt`).

Now... I am increasingly convinced that we really need
to move opinionated kernel argument handling into ostree (and rpm-ostree).

There's ye olde ostreedev/ostree#2217
and the solution may look something like that. Particularly now
with the layering philosophy, it makes sense to support
e.g. customizations dropping content in `/usr/lib` and such.

For now though, validating that we didn't get the expected kargs
should mark the node Degraded, the same as if there was a file conflict.
And *that* in turn should make it easier to debug failures.

As of right now, it will appear that updates are complete, and we'll
only find out much later that the kargs are actually missing.
And because kubelet spams the journal, any error messages
from e.g. `ostree-finalize-staged.service` may be lost.
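The missing check described above amounts to a set comparison between the kernel arguments the rendered config expects and the ones actually present on the booted system. A minimal Go sketch, with illustrative names only (not the MCO's real API):

```go
package main

import (
	"fmt"
	"strings"
)

// missingKargs returns the expected kernel arguments that are absent from
// the actual kernel command line. Hypothetical helper for illustration.
func missingKargs(expected, actual []string) []string {
	present := make(map[string]bool, len(actual))
	for _, k := range actual {
		present[k] = true
	}
	var missing []string
	for _, k := range expected {
		if !present[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// On a real node, "actual" would come from `rpm-ostree kargs`.
	actual := strings.Fields("quiet foo=bar")
	expected := []string{"quiet", "foo=baz"}
	if m := missingKargs(expected, actual); len(m) > 0 {
		// prints: Missing expected kernel arguments: [foo=baz]
		fmt.Printf("Missing expected kernel arguments: %v\n", m)
	}
}
```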

@cgwalters cgwalters marked this pull request as draft April 22, 2022 00:52
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2022
@cgwalters (Member, Author)

/test unit
/test verify
/test e2e-gcp-op

@cgwalters cgwalters marked this pull request as ready for review April 22, 2022 13:24
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2022
@cgwalters (Member, Author)

I0422 02:07:58.058686    1734 daemon.go:1226] Current config: rendered-worker-5629dfd4b317e78efc898af5b5b4f23b
I0422 02:07:58.058786    1734 daemon.go:1227] Desired config: rendered-infra-03ad0a748134c19c3f553beb6388112e
I0422 02:07:58.058826    1734 daemon.go:1229] state: Working
I0422 02:07:58.071906    1734 update.go:1971] Disk currentConfig rendered-infra-03ad0a748134c19c3f553beb6388112e overrides node's currentConfig annotation rendered-worker-5629dfd4b317e78efc898af5b5b4f23b
I0422 02:07:58.077462    1734 daemon.go:1512] Validating against pending config rendered-infra-03ad0a748134c19c3f553beb6388112e
I0422 02:07:58.077548    1734 rpm-ostree.go:324] Running captured: rpm-ostree kargs
E0422 02:07:58.153518    1734 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-infra-03ad0a748134c19c3f553beb6388112e: Missing expected kernel arguments: [foo=baz]

Interesting...

@yuqi-zhang (Contributor) left a comment

I think this may cause more errors to surface, which is probably better than covering our eyes and pretending the kargs on the system are correct.

I don't quite recall if there was any context behind us not doing this initially, though, so hopefully we're not forgetting something.

pkg/daemon/daemon.go (review thread resolved)
@cgwalters (Member, Author)

One thing I want to do more often is not allow nontrivial PRs that I submit to be auto-approved; in some cases they should require another approver. This is definitely one of those, so
/approve cancel

@openshift-ci openshift-ci bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2022
@cgwalters (Member, Author)

E0422 02:07:58.153518 1734 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-infra-03ad0a748134c19c3f553beb6388112e: Missing expected kernel arguments: [foo=baz]

Pretty sure that's because I didn't trim the trailing \n from the output. Should be fixed.
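The implied fix is just trimming the captured output before splitting it into individual arguments; otherwise the final karg carries a trailing `\n` and never matches its expected value. A hedged sketch of the idea (not the actual patch):

```go
package main

import (
	"fmt"
	"strings"
)

// parseKargs splits raw `rpm-ostree kargs` stdout into individual arguments.
// Naively splitting on spaces would leave the last token as e.g. "nosmt\n",
// which can never equal the expected "nosmt"; trimming first avoids that.
func parseKargs(raw string) []string {
	return strings.Fields(strings.TrimSpace(raw))
}

func main() {
	raw := "quiet foo=baz nosmt\n" // captured command output ends in a newline
	fmt.Println(parseKargs(raw))   // prints: [quiet foo=baz nosmt]
}
```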

@cgwalters (Member, Author)

cgwalters commented Apr 22, 2022

OK yeah, I verified that the current 4.11 (i.e. before this PR) will silently claim success on rolling out kargs, even if some fail. I hopped on one worker with `oc debug node`, and ran `unshare -m /bin/sh -c 'mount -o remount,rw /boot && chattr +i /boot'`. Then I created an MC to roll out a karg to all workers:

All nodes claimed completion to the target config (including the node that I injected the service failure on), the worker mcp was healthy etc. But I saw the service fail:

[root@ip-10-0-255-79 /]# journalctl -b -1 -u ostree-finalize-staged
-- Logs begin at Fri 2022-04-22 21:31:03 UTC, end at Fri 2022-04-22 22:15:15 UTC. --
Apr 22 22:03:35 ip-10-0-255-79 systemd[1]: Started OSTree Finalize Staged Deployment.
Apr 22 22:04:01 ip-10-0-255-79 systemd[1]: Stopping OSTree Finalize Staged Deployment...
Apr 22 22:04:01 ip-10-0-255-79 ostree[34184]: Finalizing staged deployment
Apr 22 22:04:02 ip-10-0-255-79 ostree[34184]: Copying /etc changes: 14 modified, 0 removed, 181 added
Apr 22 22:04:02 ip-10-0-255-79 ostree[34184]: Copying /etc changes: 14 modified, 0 removed, 181 added
Apr 22 22:04:03 ip-10-0-255-79 ostree[34184]: error: mkdir(boot/loader.1): Operation not permitted
Apr 22 22:04:03 ip-10-0-255-79 systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited status=1
Apr 22 22:04:03 ip-10-0-255-79 systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.

And indeed, just that node was missing the desired kernel argument.

This is quite a bad bug.

@cgwalters (Member, Author)

Error reading signature from https://registry.redhat.io/containers/sigstore/redhat/redhat-marketplace-index@sha256=e2f659ddaa8c590efc5f11d2862ade42ebb35336b43d6aebabfdf3e996144597/signature-1: status 503 (Service Unavailable)

Argh. It's rather annoying that we fail pull requests when registry.redhat.io is unavailable.
Anyways
/retest

@kikisdeliveryservice (Contributor)

Could we have a test for this? Presuming we didn't before, which is how we ended up with kargs not applied but things being "fine".

@cgwalters (Member, Author)

This one already flaked on our e2e test timeouts once - xref #3039

A test of this form is going to be as expensive as TestKernelArguments or TestKernelType - about 5-7 minutes, and that's going to push right up against our headroom and greatly increase the flake rate unless we bump test time.

I think medium term we're going to need to split our e2e-gcp-op into a "baseline" set of tests that we run on every PR; things like TestKernelType, the config drift monitor tests, etc. are only going to be affected by ~2% of pull requests, and we could reasonably run the rest as an opt-in e2e-gcp-op-all or something.

I'm uncertain about blocking this PR on getting an e2e for it; I did verify this manually today, and anyone can do so after this lands.

@cgwalters cgwalters mentioned this pull request Apr 25, 2022
@cgwalters (Member, Author)

Put the test in a separate PR for now #3115
/test e2e-gcp-op

@cgwalters (Member, Author)

cgwalters commented Apr 25, 2022

Also I do want to clarify - we do have an e2e that verifies that when kernelArguments are set, they get rolled out. And that test has been passing very reliably because we don't have anything in our e2e tests that cause ostree-finalize-staged to fail - if we did, we would have found this a lot earlier!

The problem was failure to detect failures, so to test this we need to inject a failure, which I have done in that other PR.

But to say it another way, we are covering the happy path already with our existing TestKernelArguments.

@yuqi-zhang (Contributor) left a comment

I think generally having more validation is good. This may cause additional failures, but without this we would be either:

  1. missing kargs forever, or
  2. failing on the next update that touches kargs

So in that sense, let's put this in, which will help us surface errors

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2022
@cgwalters (Member, Author)

Anyone want to drop a lgtm?

I think we have a good window to discover any fallout. The most likely fallout I can think of is not in CI, but in "pet" clusters that upgrade to this and we suddenly discover they don't have the expected kernel arguments.

But if that's true, things would fail on their next attempt to change kernel arguments anyways...

@kikisdeliveryservice (Contributor)

sure let's merge. we can sort out #3115 after
/approve

@kikisdeliveryservice (Contributor)

wait! this already has approve 😆

/lgtm

openshift-ci bot commented Apr 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2022
@openshift-bot (Contributor)

/retest-required

Please review the full test history for this PR and help us cut down flakes.


openshift-ci bot commented Apr 29, 2022

@cgwalters: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 561b89a into openshift:master Apr 29, 2022