[21.05] agent backport, part 3 #832

Merged: 12 commits, Nov 20, 2023

@dpausp dpausp commented Nov 17, 2023

Backport of #818 to NixOS 21.05, with a small additional fix for a bug in request comment merging and a sudo rule that allows admins to request a scheduled reboot. In combination with `fc-maintenance run --run-all-now`, this makes it possible to reboot a system immediately in a safe way.

PL-131813

@flyingcircusio/release-managers

Release process

Impact:

Changelog: (internal)

Security implications

  • Security requirements defined? (WHERE)
  • Security requirements tested? (EVIDENCE)
    • checked sudo rules on a test VM: admin users can request only reboots without a password
    • tried out non-root fc-maintenance commands on a test VM
    • checked automated maintenance with a script activity on a dev KVM host

Scheduling requests that are already running or are in a state
to be archived isn't useful and may cause unnecessary errors.
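
For illustration, a minimal sketch of such a filter. The state names, the `Request` class and `schedulable()` are hypothetical and not taken from the agent code; the sketch only shows the idea of skipping requests that are running or ready to be archived.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen for illustration only.
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    ERROR = "error"


# Assumption: running requests and finished ones (to be archived) are skipped.
UNSCHEDULABLE = {State.RUNNING, State.SUCCESS, State.ERROR}


@dataclass
class Request:
    id: str
    state: State


def schedulable(requests):
    """Return only the requests that still make sense to schedule."""
    return [r for r in requests if r.state not in UNSCHEDULABLE]


reqs = [
    Request("a", State.PENDING),
    Request("b", State.RUNNING),
    Request("c", State.SUCCESS),
]
print([r.id for r in schedulable(reqs)])
```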

PL-131813

(cherry picked from commit 91466ec)
PL-131813

(cherry picked from commit 528be4f)
… permission errors

Output of nix commands in UpdateActivity.run() is now logged. A typical
problem is services failing on system activation, which we want to be
logged to a separate file, as is done for `fc-manage switch`.

Permission problems with the main log file are now ignored.
There is no need to abort when the main log file cannot be used.
We still have journal logging and console output.
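
A minimal sketch of this behavior using only the standard `logging` module. The function name `setup_logging` and the logger name are assumptions for illustration; the agent's real logging setup is not shown in this PR description.

```python
import logging
import sys


def setup_logging(main_log_path):
    """Configure console logging and, if permitted, a main log file.

    Returns True when the file handler could be attached.
    """
    logger = logging.getLogger("agent-sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.StreamHandler(sys.stderr))
    try:
        # FileHandler opens the file immediately, so a permission
        # problem shows up right here.
        handler = logging.FileHandler(main_log_path)
    except PermissionError:
        # Not fatal: console output (and, in the real agent, the journal) remain.
        logger.debug("cannot open %s, skipping main log file", main_log_path)
        return False
    logger.addHandler(handler)
    return True
```

Only `PermissionError` is swallowed here; other errors, such as a missing directory, still propagate.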

PL-131813

(cherry picked from commit 4bea2c5)
Before, exceptions just bubbled up to Typer's excepthook, which
pretty-prints them. This is nice for interactive use but not for agent
tasks run by systemd units. The exceptions ended up in stdout while other
log messages are sent to the journal directly. This means that exceptions
had a different SYSLOG_IDENTIFIER than log messages, which is annoying
for debugging.

Also, we didn't log unhandled exceptions that occurred in
interactive use at all.

This change adds a new FCTyperApp which is used for fc-manage and
fc-maintenance commands. The class extends Typer's app class and adds
exception logging. It still passes exceptions to Typer's excepthook when
used interactively.
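
The idea can be sketched without Typer as a plain wrapper around a command callable. This is not the FCTyperApp implementation: `LoggingApp`, the logger name, the TTY check for "interactive" and the exit code are all assumptions made for illustration.

```python
import logging
import sys

log = logging.getLogger("fc-agent-sketch")  # hypothetical logger name


class LoggingApp:
    """Wrap a command callable and log unhandled exceptions.

    Stand-in for an app class; it does not use Typer.
    """

    def __init__(self, command, interactive=None):
        self.command = command
        # Assumption: "interactive" means stdout is attached to a terminal.
        if interactive is None:
            interactive = sys.stdout.isatty()
        self.interactive = interactive

    def __call__(self, *args, **kwargs):
        try:
            return self.command(*args, **kwargs)
        except Exception:
            # Always log, so the traceback goes through the same logger
            # as all other messages.
            log.exception("unhandled exception in %s", self.command.__name__)
            if self.interactive:
                # Re-raise so the usual excepthook can pretty-print it.
                raise
            # Non-interactive (e.g. a systemd unit): exit non-zero.
            raise SystemExit(1)
```

With this shape, both interactive and unit-driven runs leave a log record, and only interactive runs also get the pretty-printed traceback.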

PL-131813

(cherry picked from commit 9db289d)
We now separate non-invasive and invasive code paths better.
Existing methods are moved into a more logical order, grouping
invasive and non-invasive methods.

Some read-only commands for showing requests, metrics and the Sensu
check can now be called by non-root users without the need for sudo.

Clean up some uses of rm.scan() which are now handled by
__enter__, which must be called for all invasive methods. This also
takes care of loading requests and creating missing directories now.

Add missing @require_lock decorators for invasive methods and give
internal methods an underscore prefix. The latter don't have the
decorator but it should be fairly obvious how to use them.
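
A minimal sketch of the pattern, assuming the lock state is tracked as a boolean set by `__enter__`. The `Manager` class and its methods are hypothetical stand-ins; the real decorator and request manager may differ in detail.

```python
import functools


def require_lock(method):
    """Refuse to run an invasive method unless the manager holds its lock."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.locked:
            raise RuntimeError(
                f"{method.__name__} requires the lock; use 'with manager:'"
            )
        return method(self, *args, **kwargs)

    return wrapper


class Manager:
    """Hypothetical stand-in for the request manager."""

    def __init__(self):
        self.locked = False
        self.requests = {}

    def __enter__(self):
        # The real code would take a lock, create missing directories
        # and load requests here.
        self.locked = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.locked = False
        return False

    def list(self):
        # Non-invasive: usable without the lock.
        return sorted(self.requests)

    @require_lock
    def add(self, request_id, comment):
        self._store(request_id, comment)

    def _store(self, request_id, comment):
        # Internal helper: no decorator, only called from locked methods.
        self.requests[request_id] = comment
```

Read-only callers use `Manager().list()` directly, while invasive callers go through `with Manager() as m: m.add(...)`.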

PL-131813

(cherry picked from commit 12abf01)
We need to use timezone-aware objects here.
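
As a reminder of why this matters, a short standard-library example: naive and aware `datetime` objects cannot be ordered against each other, while aware objects compare correctly across offsets. The dates and the fixed +01:00 offset are arbitrary example values.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2023, 11, 17, 12, 0)  # no tzinfo attached
aware = datetime(2023, 11, 17, 12, 0, tzinfo=timezone.utc)

try:
    naive < aware
except TypeError as e:
    # Ordering naive against aware datetimes is rejected by Python.
    print("cannot compare:", e)

# Aware objects refer to an absolute point in time, so the same
# instant compares equal regardless of the offset used to express it.
plus_one = timezone(timedelta(hours=1))
local = datetime(2023, 11, 17, 13, 0, tzinfo=plus_one)
print(local == aware)

# Preferred way to get the current time as an aware object.
now = datetime.now(timezone.utc)
print(now.tzinfo is not None)
```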

PL-131813

(cherry picked from commit 85859de)
Use the `stamina` library for automatic retries with
integrated logging, exponential backoff and jitter.
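
To show what exponential backoff with jitter means in practice, here is a hand-rolled standard-library sketch. It is not `stamina`'s API and not the code from this change; the `retry` helper, its parameter names and defaults are assumptions for illustration.

```python
import random
import time


def retry(func, on, attempts=5, wait_initial=0.1, wait_max=5.0,
          jitter=0.1, sleep=time.sleep):
    """Call func, retrying on the given exception type(s) with growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except on as exc:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Exponential backoff capped at wait_max, plus random jitter
            # so that many clients don't retry in lockstep.
            backoff = min(wait_initial * 2 ** (attempt - 1), wait_max)
            delay = backoff + random.uniform(0, jitter)
            print(f"attempt {attempt} failed ({exc!r}), retrying in {delay:.2f}s")
            sleep(delay)
```

The `sleep` parameter is injectable only to keep the sketch testable without real waiting.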

PL-131813

(cherry picked from commit f07eb19)
This has some advantages:

* Dmidecode, an external tool, is called less often: only when really
  needed (init, just before reboot). Before, this happened on every
  agent run.
* Fewer debug log messages.
* Activity doesn't change just by loading it.
* Non-privileged users can show the activity now.

PL-131813

(cherry picked from commit ad0e262)
Exceptions from _enter_maintenance don't bubble up anymore but are
logged and treated like temporary failures now.

Output from enter commands is now shown directly at the trace log level.
It is added to the exception if the command fails, or logged when
postpone/tempfail is requested.

PL-131813

(cherry picked from commit e4593f7)
Now, when merging requests, the comment of the new request
is only concatenated to the old one when it's not already contained in
the old comment, to avoid repeating content. We saw this with
RebootActivity, which is created on every fc-agent run when a reboot for
the kernel is needed. When the kernel version changed again, the new
comment was concatenated over and over again.
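
The merge rule can be sketched as a small function. `merge_comment`, the separator and the example comments are hypothetical; only the "skip if already contained" check reflects the description above.

```python
def merge_comment(old, new, sep=" "):
    """Append new to old unless old already contains it."""
    if not new or new in old:
        # Nothing to add: avoids repeating the same text on every run.
        return old
    if not old:
        return new
    return old + sep + new


comment = "Reboot to activate kernel 5.10.1."
# Same comment again (e.g. from the next agent run): unchanged.
comment = merge_comment(comment, "Reboot to activate kernel 5.10.1.")
# A genuinely new comment is appended once.
comment = merge_comment(comment, "Reboot to activate kernel 5.10.2.")
comment = merge_comment(comment, "Reboot to activate kernel 5.10.2.")
print(comment)
```
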
@dpausp dpausp marked this pull request as ready for review November 18, 2023 22:39
@dpausp dpausp changed the title 21.05 agent backport, part 3 [21.05] agent backport, part 3 Nov 18, 2023
@ctheune ctheune left a comment

LGTM

@ctheune ctheune merged commit b863d7c into fc-21.05-dev Nov 20, 2023
1 check passed
@ctheune ctheune deleted the PL-131813-21.05-agent-backport-3 branch November 20, 2023 14:37