-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kola: Do failed units/SELinux checks both before *and* after tests #2067
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -186,6 +186,8 @@ type Options struct { | |
OSContainer string | ||
|
||
SSHOnTestFailure bool | ||
// SuppressDefaultChecks will skip built-in checks for systemd unit failures and SELinux AVC denials, etc. | ||
SuppressDefaultChecks bool | ||
} | ||
|
||
// RuntimeConfig contains cluster-specific configuration. | ||
|
@@ -435,20 +437,46 @@ func CheckMachine(ctx context.Context, m Machine) error { | |
return fmt.Errorf("not a supported instance: %v", string(out)) | ||
} | ||
|
||
if !m.RuntimeConf().AllowFailedUnits { | ||
// ensure no systemd units failed during boot | ||
out, stderr, err = m.SSH("systemctl --no-legend --state failed list-units") | ||
if err != nil { | ||
return fmt.Errorf("systemctl: %s: %v: %s", out, err, stderr) | ||
} | ||
if len(out) > 0 { | ||
return fmt.Errorf("some systemd units failed:\n%s", out) | ||
} | ||
// Validate that nothing is wrong before we even run the test | ||
if err := CheckMachineBasics(ctx, m); err != nil { | ||
return err | ||
} | ||
|
||
return ctx.Err() | ||
} | ||
|
||
// CheckMachineBasics validates that no systemd units have | ||
// failed and no SELinux AVC denials were logged. It may be | ||
// extended in the future - the idea is these are "baseline" | ||
// errors that shouldn't be seen across any tests. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm concerned this might not be a safe assumption. The tests themselves could be testing these errors. I don't really imagine us testing the SELinux policy really, but failed systemd units sounds plausible. Though I guess... the test then could fix the error condition and restart the unit so that it's not in a failed state? Anyway, willing to try! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Right, but anything intentionally causing a systemd unit failure could do e.g. Perhaps simplest is a per-test flag to turn this off. |
||
func CheckMachineBasics(ctx context.Context, m Machine) error { | ||
if m.RuntimeConf().AllowFailedUnits { | ||
plog.Debug("Allowing failed units") | ||
cgwalters marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return nil | ||
} | ||
|
||
// ensure no systemd units failed during boot | ||
out, stderr, err := m.SSH("systemctl --no-legend --state failed list-units") | ||
if err != nil { | ||
return fmt.Errorf("systemctl: %s: %v: %s", out, err, stderr) | ||
} | ||
if len(out) > 0 { | ||
return fmt.Errorf("some systemd units failed:\n%s", out) | ||
} | ||
// Also no SELinux denials; RHCOS currently ships auditd, FCOS doesn't, so handle | ||
// both places. | ||
out, stderr, err = m.SSH("if test -f /var/log/audit/audit.log; then grep 'avc.*denied' /var/log/audit/audit.log || true; else journalctl -q --no-pager --grep='avc.*denied' _TRANSPORT=audit || true; fi") | ||
if err != nil { | ||
return fmt.Errorf("failed to query audit logs: %s: %v: %s", out, err, stderr) | ||
} | ||
if len(out) > 0 { | ||
return fmt.Errorf("Found SELinux AVC denials:\n%s", out) | ||
} | ||
plog.Debug("Validated machine") | ||
|
||
return nil | ||
} | ||
|
||
type machineInfo struct { | ||
Id string `json:"id"` | ||
PublicIp string `json:"public_ip"` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
#!/bin/bash | ||
# Intentionally cause a systemd unit failure that should be skipped with --no-default-checks | ||
set -xeuo pipefail | ||
systemd-run --wait --unit should-fail.service -q false || true | ||
echo "ok started failed service" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more on this; WDYT about splitting this into two separate switches? One for systemd units and one for SELinux denials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate: I think eventually we should have precise control over what checks we want to allow to fail. Refining that can come later (because it requires more kola work), but at least right now we already have two broad categories we can split on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Agree that unit failures are usually much worse. But making it more granular feels tricky because we're likely to add more here. A good example is systemd generator failures and systemd ordering failures are also missed today.
And another good one is coredumps (which usually result in a unit failure, but not always).
Maybe what we really want is a syntax to ignore specific AVC denials or unit failures? Something like:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. But while having a single switch is useful for dev, it's too blunt an instrument for CI/prod because it makes it too easy to miss out on regressions because of e.g. one AVC we're working around. (This was part of the motivation for #2064.)
Yes, 100%. (This is what I mean with #1920 (comment); though yeah it'd be cleaner to first make the top-level object a dict rather than a list).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And probably before that, move this denylist directly into kola.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're willing to blanket-ignore denials/failures in specific tests, we have the existing test flags mechanism for this. It doesn't currently support more granular filters, but in principle it could.