agent: reduce directory errors, easier debugging, various improvements #818

Merged
merged 11 commits into from
Nov 9, 2023

Conversation

dpausp
Member

@dpausp dpausp commented Oct 23, 2023

A collection of commits that reduce the need for sudo for non-invasive fc-maintenance commands, improve error handling and logging, avoid unnecessary directory API calls, and retry failed calls. See the commit messages for details.

@flyingcircusio/release-managers

Release process

Impact:

Changelog:

  • fc-maintenance commands for viewing requests (show and list) can now be used without sudo by all users. The same applies to the commands used for monitoring and telemetry (check and metrics).
  • fc-maintenance show <request ID prefix>: now has improved output and also works for archived maintenance requests. Active requests are preferred when there are multiple matches.
  • Command output generated during execution of an update activity is now logged to separate files in /var/log/fc-agent, as is already done for fc-manage calls.
  • Unhandled errors in fc-manage and fc-maintenance are now logged properly to the system journal and can be viewed with journalctl SYSLOG_IDENTIFIER=fc-agent.
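
The prefix lookup described for `fc-maintenance show` can be pictured with a minimal sketch. The function name `find_request` and the plain sequences of ID strings are hypothetical and not taken from the agent code; only the preference for active requests over archived ones comes from the changelog entry.

```python
from typing import Optional, Sequence


def find_request(
    prefix: str, active: Sequence[str], archived: Sequence[str]
) -> Optional[str]:
    """Return the first request ID starting with `prefix`, or None.

    Hypothetical helper: active requests are searched before archived
    ones, so an active request wins when both collections match.
    """
    for request_id in list(active) + list(archived):
        if request_id.startswith(prefix):
            return request_id
    return None
```

With this ordering, an archived request is only returned when no active request shares the prefix.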

Security implications

This change affects sudo rules for fc-maintenance commands but the set of commands that can be run and the allowed groups have not been extended.

  • Security requirements defined? (WHERE)
    • reduce need to use sudo for commands that only access data readable by any user
    • make sure that invasive commands cannot be run by normal users without sudo and that sudo permissions are limited properly (admin, sudo-srv and service may only run fc-maintenance (-v) delete without password)
  • Security requirements tested? (EVIDENCE)
    • checked on a test VM that fc-maintenance commands work as intended, checked sudo rules, checked logs
    • automated tests have been extended

Member

@ctheune ctheune left a comment


Looks good in general, see the small comment about making retry behaviour configurable.

@dpausp dpausp force-pushed the PL-131813-agent-reduce-directory-errors branch 3 times, most recently from 0426d27 to addacc1 Compare October 30, 2023 23:11
Member

@osnyx osnyx left a comment


LGTM overall, but with a few minor changes.
Also, this is still a draft: the pre-commit failures need fixing and the usual PR documentation is missing.

pkgs/fc/agent/fc/maintenance/tests/test_cli.py (outdated, resolved)
pkgs/fc/agent/default.nix (resolved)
pkgs/fc/agent/fc/maintenance/reqmanager.py (outdated, resolved)
@dpausp dpausp force-pushed the PL-131813-agent-reduce-directory-errors branch from addacc1 to 3f2e4e3 Compare November 2, 2023 09:15
dpausp added 11 commits November 9, 2023 11:46
Scheduling requests that are already running or are in a state
to be archived isn't useful and may cause unnecessary errors.

PL-131813
… permission errors

Output of nix commands in UpdateActivity.run() is now logged. A typical
problem is failing services on system activation, which we want to be
logged to a separate file, as for `fc-manage switch`.

Permission problems with the main log file are now ignored.
There is no need to abort when the main log file cannot be used:
we still have journal logging and console output.

PL-131813
We cannot/don't want to build pyslurm on macOS.

PL-131813
Before, exceptions just bubbled up to Typer's excepthook, which
pretty-prints them. This is nice for interactive use but not for agent
tasks run by systemd units. They ended up in stdout while other log
messages are sent to the journal directly. This means that exceptions had a
different SYSLOG_IDENTIFIER than log messages, which is annoying for
debugging.

Also, we didn't log unhandled exceptions that occurred in
interactive use at all.

This change adds a new FCTyperApp which is used for fc-manage and
fc-maintenance commands. The class extends Typer's app class and adds
exception logging. It still passes exceptions to Typer's excepthook when
used interactively.

PL-131813
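
The idea behind the exception logging can be sketched with the standard library alone. This is not the actual FCTyperApp and does not use Typer: `LoggingApp`, the logger name, and the choice to exit with status 1 in non-interactive use are assumptions made for illustration.

```python
import logging
import sys
from typing import Any, Callable

# Hypothetical logger name; the real agent logs under its own identifier.
log = logging.getLogger("fc-agent-sketch")


class LoggingApp:
    """Hypothetical stand-in for an app class that logs unhandled errors."""

    def __init__(self, command: Callable[..., Any]) -> None:
        self.command = command

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        try:
            return self.command(*args, **kwargs)
        except Exception:
            # Log the traceback first, so it goes through the same channel
            # as ordinary log messages.
            log.exception("unhandled error in %s", self.command.__name__)
            if sys.stdout.isatty():
                # Interactive use: re-raise so an excepthook can still
                # pretty-print the exception.
                raise
            # Assumed behaviour for non-interactive use (e.g. a systemd
            # unit): exit with an error status instead of printing again.
            raise SystemExit(1)
```

Logging before re-raising means the error is recorded in both modes, which matches the stated goal that interactive failures are no longer lost.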
We now separate non-invasive and invasive code paths better.
Moves around existing methods to a more logical order, grouping
invasive and non-invasive methods.

Some read-only commands for showing requests, metrics and the Sensu
check can now be called by non-root users without the need for sudo.

Clean up some uses of rm.scan(), which are now handled by
__enter__, which must be called for all invasive methods. This also
takes care of loading requests and creating a missing directory now.

Add missing @require_lock decorators for invasive methods and give
internal methods an underscore prefix. The latter don't have the
decorator but it should be fairly obvious how to use them.

PL-131813
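
The split between invasive and non-invasive methods can be illustrated with a small sketch. Only the names `require_lock` and `__enter__` come from the commit message; the `RequestManager` class, its methods, and the boolean flag standing in for a real lock are hypothetical.

```python
import functools


def require_lock(method):
    """Refuse to run `method` unless the manager currently holds its lock."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.locked:
            raise RuntimeError(
                f"{method.__name__} requires the lock; use a 'with' block"
            )
        return method(self, *args, **kwargs)

    return wrapper


class RequestManager:
    """Hypothetical manager; only the locking pattern is of interest."""

    def __init__(self):
        self.locked = False  # a plain flag stands in for a real lock
        self.requests = {}

    def __enter__(self):
        # Per the commit message, the real __enter__ also loads requests
        # and creates a missing directory; this sketch only sets the flag.
        self.locked = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.locked = False
        return False  # do not swallow exceptions

    def list_requests(self):
        # Non-invasive: usable without entering the context manager.
        return sorted(self.requests)

    @require_lock
    def delete(self, request_id):
        # Invasive: changes state, so it must run inside 'with'.
        self.requests.pop(request_id, None)
```

Calling `delete()` outside a `with` block fails immediately, while `list_requests()` keeps working for read-only callers.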
We need to use timezone-aware objects here.

PL-131813
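
The commit message does not say which code was affected, so the following is only a general illustration of why timezone-aware datetime objects matter in Python. The `is_due` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2023, 11, 9, 12, 0)                       # no tzinfo attached
aware = datetime(2023, 11, 9, 12, 0, tzinfo=timezone.utc)  # explicit UTC


def is_due(scheduled: datetime, now: datetime) -> bool:
    """Return True when the scheduled time has been reached."""
    return scheduled <= now


# Aware values compare by instant, even across different UTC offsets.
cet = timezone(timedelta(hours=1))
same_instant = datetime(2023, 11, 9, 13, 0, tzinfo=cet)
print(is_due(aware, same_instant))

# Ordering a naive value against an aware one raises TypeError.
try:
    is_due(naive, aware)
except TypeError as err:
    print(f"cannot compare: {err}")
```

Creating timestamps with `datetime.now(timezone.utc)` instead of a bare `datetime.now()` avoids this class of error.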
Use the `stamina` library for automatic retries with
integrated logging, exponential backoff and jitter.

PL-131813
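
What a retry policy with exponential backoff and jitter does can be sketched with the standard library. This is not stamina's API and not the agent's code: the `retry` function, its parameter names, and the default values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    call: Callable[[], T],
    attempts: int = 5,
    wait_initial: float = 0.1,
    wait_max: float = 5.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            # Exponential backoff: double the wait each time, capped at
            # wait_max, plus random jitter so that many clients do not
            # retry in lockstep.
            delay = min(wait_initial * 2 ** (attempt - 1), wait_max)
            sleep(delay + random.uniform(0, jitter))
    raise AssertionError("unreachable")
```

Keeping the number of attempts and the wait times as parameters is one way to make the retry behaviour configurable, as suggested in the review.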
This has some advantages:

* dmidecode, an external tool, is called less often: only when really
  needed (init, just before reboot). Before, this happened on every
  agent run.
* Fewer debug log messages.
* Activity doesn't change just by loading it.
* Non-privileged users can show the activity now.

PL-131813
Exceptions from _enter_maintenance don't bubble up anymore but are
logged and treated like temporary failures now.

Output from enter commands is now shown directly on the trace log level
and added to the exception if the command fails or logged when
postpone/tempfail is requested.

PL-131813
Also logs the number of error lines to parse and the last 25 lines
if parsing fails.

PL-131813
@dpausp dpausp force-pushed the PL-131813-agent-reduce-directory-errors branch from 3f2e4e3 to 0b8432a Compare November 9, 2023 12:41
@dpausp dpausp changed the title from "agent: reduce directory errors, easier debugging" to "agent: reduce directory errors, easier debugging, various improvements" Nov 9, 2023
@dpausp dpausp requested a review from osnyx November 9, 2023 13:33
@dpausp dpausp marked this pull request as ready for review November 9, 2023 13:33
@dpausp dpausp merged commit 6a6b2e7 into fc-23.05-dev Nov 9, 2023
1 check passed
@dpausp dpausp deleted the PL-131813-agent-reduce-directory-errors branch November 9, 2023 16:13
@dpausp dpausp mentioned this pull request Nov 17, 2023
2 tasks
3 participants