daemon: Validate kernel arguments #3105
/test unit |
Interesting... |
I think this may cause more errors to surface, which is probably better than covering our eyes and pretending the kargs on the system are correct.
I don't quite recall if there was any context behind us not doing this initially though, so hopefully we're not forgetting something
One thing I want to try to do a bit more is not allow nontrivial PRs that I submit to be auto-approved - feels like in some cases it should require another approver. This is definitely one of those so |
Pretty sure that's because I didn't trim the trailing |
I'm debugging https://bugzilla.redhat.com/show_bug.cgi?id=2075126 and while I haven't verified this is the case, as far as I can tell from looking through the code and thinking about things, if we somehow fail to apply the expected kernel arguments (which can occur if `ostree-finalize-staged` fails) then we will (on the next boot) drop in to `validateOnDiskState()` which has for a long time checked that all the expected *files* exist and mark the update as complete. But we didn't check the kernel arguments.

That can then cause later problems because in trying to apply further updates we'll ask rpm-ostree to try to remove kernel arguments that aren't actually present.

Worse, often these kernel arguments are actually *quite important* and may even have security relevant properties (e.g. `nosmt`).

Now...I am actually increasingly convinced that we *really* need to move opinionated kernel argument handling into ostree (and rpm-ostree). There's ye olde ostreedev/ostree#2217 and the solution may look something like that. Particularly now with the layering philosophy that it makes sense to support e.g. customizations dropping content in `/usr/lib` and such.

For now though, validating that we didn't get the expected kargs should make things go Degraded, the same as if there was a file conflict. And *that* in turn should make it easier to debug failures.

As of right now, it will appear that updates are complete, and then we'll only find out much later that the kargs are actually missing. And in turn, because kubelet spams the journal, any error messages from e.g. `ostree-finalize-staged.service` may be lost.
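To make the idea concrete, here is a minimal sketch of the kind of check described above. It is not the code from this PR: `missingKargs` is a hypothetical helper, and comparing against the contents of `/proc/cmdline` is an assumption about where the booted arguments would come from. It simply reports every expected argument that is not present as a whole token on the kernel command line.

```go
package main

import (
	"fmt"
	"strings"
)

// missingKargs returns each expected kernel argument that does not appear
// as a whole whitespace-separated token in cmdline.
// Hypothetical helper for illustration only; not the function added by this PR.
// Limitation: arguments whose quoted values contain spaces are not handled.
func missingKargs(cmdline string, expected []string) []string {
	present := make(map[string]struct{})
	for _, tok := range strings.Fields(cmdline) {
		present[tok] = struct{}{}
	}
	var missing []string
	for _, want := range expected {
		if _, ok := present[want]; !ok {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Assumption: a real check would read this string from /proc/cmdline.
	cmdline := "root=UUID=abcd rw console=ttyS0"
	expected := []string{"nosmt", "console=ttyS0"}

	if missing := missingKargs(cmdline, expected); len(missing) > 0 {
		// In the daemon, a non-empty result would be the signal to go Degraded.
		fmt.Printf("missing expected kernel arguments: %v\n", missing)
	}
}
```

Matching whole tokens rather than substrings matters here: a substring search for `nosmt` would also match an unrelated argument such as `foo=nosmtbar`.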
OK yeah, I verified that the current 4.11 (i.e. before this PR) will silently claim success on rolling out kargs, even if some fail. I hopped on one worker with All nodes claimed completion to the target config (including the node that I injected the service failure on), the worker mcp was healthy etc. But I saw the service fail:
And indeed, just that node was missing the desired kernel argument. This is quite a bad bug.
Argh. It's rather annoying that we fail pull requests when registry.redhat.io is unavailable.
Could we have a test for this? Presumably we didn't have one before, which is how we ended up with kargs not applied but things being "fine".
This one already flaked on our e2e test timeouts once - xref #3039 A test of this form is going to be as expensive as I think medium term we're going to need to split our e2e-gcp-op into a "baseline" tests that we run on every PR; things like I'm uncertain about blocking this PR on getting an e2e for it; I did verify this manually today, and anyone can do so after this lands. |
Put the test in a separate PR for now: #3115
Also I do want to clarify - we do have an e2e that verifies that when The problem was failure to detect failures, so to test this we need to inject a failure, which I have done in that other PR. But to say it another way, we are covering the happy path already with our existing |
I think generally having more validation is good. This may cause additional failures, but without this we would be either:
- missing kargs forever, or
- failing on the next update that touches kargs
So in that sense, let's put this in, which will help us surface errors
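The second failure mode above can be sketched as follows. This is an illustration under assumptions, not the daemon's actual update logic: `staleDeletes` and `toSet` are hypothetical helpers, and the model of an update as "delete every karg the old config lists and the new config drops" is a simplification. The sketch finds the deletions that would target an argument the node never actually had.

```go
package main

import (
	"fmt"
	"strings"
)

// toSet builds a membership set from a slice of strings.
func toSet(items []string) map[string]bool {
	set := make(map[string]bool, len(items))
	for _, item := range items {
		set[item] = true
	}
	return set
}

// staleDeletes returns kargs that the old config lists and the new config
// drops, but that are absent from the actual command line. Asking to delete
// these is the "remove kernel arguments that aren't actually present" case.
// Hypothetical helper for illustration only.
func staleDeletes(oldKargs, newKargs []string, cmdline string) []string {
	keep := toSet(newKargs)
	present := toSet(strings.Fields(cmdline))
	var stale []string
	for _, karg := range oldKargs {
		if !keep[karg] && !present[karg] {
			stale = append(stale, karg)
		}
	}
	return stale
}

func main() {
	oldKargs := []string{"nosmt", "foo=1"}
	newKargs := []string{"foo=1"}
	// The node silently failed to apply nosmt on the previous update.
	cmdline := "root=UUID=abcd rw foo=1"

	fmt.Printf("deletes targeting absent kargs: %v\n",
		staleDeletes(oldKargs, newKargs, cmdline))
}
```

With validation at boot, the mismatch would be reported when the argument first fails to apply, rather than surfacing later as a confusing failed delete.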
Anyone want to drop a lgtm? I think we have a good window to discover any fallout. The most likely fallout I can think of is not in CI, but in "pet" clusters that upgrade to this and we suddenly discover they don't have the expected kernel arguments. But if that's true, things would fail on their next attempt to change kernel arguments anyways...
sure let's merge. we can sort out #3115 after
wait! this already has approve 😆 /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kikisdeliveryservice, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@cgwalters: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |