agent: reduce directory errors, easier debugging, various improvements #818
Looks good in general, see the small comment about making retry behaviour configurable.
0426d27 to addacc1
LGTM overall, but with a few minor changes.
Also, this is still a draft: the pre-commit failures need fixing and the usual PR documentation is missing.
addacc1 to 3f2e4e3
Scheduling requests that are already running or are in a state to be archived isn't useful and may cause unnecessary errors. PL-131813
… permission errors
Output of nix commands in UpdateActivity.run() is now logged. A typical problem is failing services on system activation, which we want logged to a separate file, as for `fc-manage switch`. Permission problems with the main log file are now ignored: there is no need to abort when the main log file cannot be used, because we still have journal logging and console output. PL-131813
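The "ignore main log file problems" behaviour can be illustrated with a minimal sketch using only the standard `logging` module. The function name `setup_logging` and the logger name are assumptions made for illustration, not the agent's actual code; the agent's real logging setup is not shown in this PR text.

```python
import logging
import sys


def setup_logging(main_log_path: str) -> logging.Logger:
    """Configure logging; a main log file that cannot be used is not fatal.

    Hypothetical helper, not the agent's real setup function.
    """
    logger = logging.getLogger("fc-agent-sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    # Console output is always available.
    logger.addHandler(logging.StreamHandler(sys.stderr))
    try:
        # FileHandler opens the file immediately and raises OSError
        # (e.g. PermissionError) if that fails.
        logger.addHandler(logging.FileHandler(main_log_path))
    except OSError as e:
        # Keep going with the remaining handlers instead of aborting.
        logger.warning("Cannot use main log file %s: %s", main_log_path, e)
    return logger
```

The design choice is to treat the file handler as optional: the remaining handlers still receive every message.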
We cannot/don't want to build pyslurm on macOS. PL-131813
Before, exceptions just bubbled up to Typer's excepthook, which pretty-prints them. This is nice for interactive use but not for agent tasks run by systemd units. They ended up in stdout while other log messages are sent to the journal directly. This means that exceptions had a different SYSLOG_IDENTIFIER than log messages, which is annoying for debugging. Also, we didn't log unhandled exceptions that occurred in interactive use at all. This change adds a new FCTyperApp which is used for fc-manage and fc-maintenance commands. The class extends Typer's app class and adds exception logging. It still passes exceptions to Typer's excepthook when used interactively. PL-131813
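The general pattern described here (log unhandled exceptions first, then let the interactive pretty-printer handle them) can be sketched without Typer. `LoggingApp`, its `interactive` flag and the exit status used for non-interactive runs are assumptions for illustration; the real FCTyperApp subclasses Typer's app class and may behave differently in detail.

```python
import logging

log = logging.getLogger("fc-agent-sketch")


class LoggingApp:
    """Hypothetical stand-in for an app class that logs unhandled exceptions."""

    def __init__(self, command, interactive: bool):
        self.command = command
        self.interactive = interactive

    def __call__(self, *args, **kwargs):
        try:
            return self.command(*args, **kwargs)
        except Exception:
            # Log with traceback through the normal logging setup so the
            # exception ends up next to the other log messages.
            log.exception("unhandled exception in %s", self.command.__name__)
            if self.interactive:
                # Re-raise so an interactive excepthook can still pretty-print it.
                raise
            # Assumption: non-interactive runs just exit with a failure status.
            raise SystemExit(1)
```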
We now separate non-invasive and invasive code paths better. Moves existing methods into a more logical order, grouping invasive and non-invasive methods. Some read-only commands for showing requests, metrics and the Sensu check can now be called by non-root users without the need for sudo. Clean up some uses of rm.scan(), which are now handled by __enter__, which must be called for all invasive methods. This also takes care of loading requests and creating a missing directory now. Add missing @require_lock decorators for invasive methods and give internal methods an underscore prefix. The latter don't have the decorator but it should be fairly obvious how to use them. PL-131813
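The split between lock-free read-only methods and decorated invasive methods might look roughly like the sketch below. Only the names `require_lock` and `__enter__` come from the commit message; the class `RequestManagerSketch`, its attributes and its methods are hypothetical, and a boolean flag stands in for whatever real locking the agent uses.

```python
import functools


def require_lock(method):
    """Refuse to run an invasive method outside the 'with' block."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.locked:
            raise RuntimeError(f"{method.__name__} requires the lock; use 'with'")
        return method(self, *args, **kwargs)

    return wrapper


class RequestManagerSketch:
    """Hypothetical manager, not the agent's real class."""

    def __init__(self):
        self.locked = False
        self.requests = {}

    def __enter__(self):
        # Real code would acquire a lock, create a missing directory and
        # load requests here; the flag is a stand-in for that.
        self.locked = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.locked = False
        return False

    def list_ids(self):
        # Non-invasive: callable without the lock.
        return sorted(self.requests)

    @require_lock
    def add(self, request_id):
        # Invasive: public entry point, guarded by the decorator.
        self._store(request_id)

    def _store(self, request_id):
        # Internal helper: no decorator, callers must already hold the lock.
        self.requests[request_id] = {"id": request_id}
```

Used as `with RequestManagerSketch() as rm: rm.add("abc")`, the invasive call succeeds; calling `rm.add` outside the `with` block raises `RuntimeError`, while `rm.list_ids()` works either way.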
We need to use timezone-aware objects here. PL-131813
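The commit message does not say which code was affected, so the following is only a generic illustration of why mixing naive and timezone-aware `datetime` objects is a problem. The function `is_due` is a hypothetical example, not taken from the agent.

```python
from datetime import datetime, timezone
from typing import Optional


def is_due(due: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if 'due' is not in the future; expects aware datetimes."""
    now = now or datetime.now(timezone.utc)
    # Ordering comparisons between naive and aware datetimes raise TypeError.
    return due <= now


aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
naive = datetime(2024, 1, 1, 12, 0)

print(is_due(aware))  # compares fine: both sides are timezone-aware

try:
    is_due(naive)
except TypeError as e:
    print("naive datetime rejected:", e)
```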
Use the `stamina` library for automatic retries with integrated logging, exponential backoff and jitter. PL-131813
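To make "exponential backoff and jitter" concrete, here is a small standard-library sketch of the technique. It is a stand-in for illustration and deliberately does not use or imitate the `stamina` API; the function name `retry` and all parameter names and defaults are assumptions.

```python
import random
import time


def retry(func, on=Exception, attempts=3, wait_initial=0.1, wait_max=5.0,
          wait_jitter=1.0, sleep=time.sleep):
    """Call func() until it succeeds or 'attempts' calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except on:
            if attempt == attempts:
                # Out of attempts: let the last exception propagate.
                raise
            # Exponential backoff (doubling per attempt) plus random jitter,
            # capped at wait_max.
            backoff = wait_initial * 2 ** (attempt - 1)
            delay = min(wait_max, backoff + random.uniform(0, wait_jitter))
            sleep(delay)
```

The jitter spreads out retries from many machines that failed at the same moment, so they don't all hit the directory API again at once.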
This has some advantages:
* Dmidecode as external tool is called less often, only when really needed (init, just before reboot). Before, this happened on every agent run.
* Less debug log messages.
* Activity doesn't change just by loading it.
* Non-privileged users can show the activity now.
PL-131813
Exceptions from _enter_maintenance don't bubble up anymore but are logged and treated like temporary failures now. Output from enter commands is now shown directly on the trace log level and added to the exception if the command fails or logged when postpone/tempfail is requested. PL-131813
Also logs the number of error lines to parse and the last 25 lines if parsing fails. PL-131813
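A rough sketch of this kind of diagnostic logging follows. The function name `parse_with_diagnostics`, the `parse` callback and the re-raise after logging are assumptions; the commit message only states that the line count and the last 25 lines are logged.

```python
import logging

log = logging.getLogger("fc-agent-sketch")

TAIL_LINES = 25  # number of trailing lines to log, as in the commit message


def parse_with_diagnostics(error_lines, parse):
    """Run parse(error_lines); on failure, log the tail of the input."""
    log.debug("parsing %d error lines", len(error_lines))
    try:
        return parse(error_lines)
    except Exception:
        tail = error_lines[-TAIL_LINES:]
        log.error(
            "parsing failed, last %d lines:\n%s", len(tail), "\n".join(tail)
        )
        # Assumption: the caller still gets to see the failure.
        raise
```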
3f2e4e3 to 0b8432a
A collection of commits to reduce the need for sudo for non-invasive fc-maintenance commands, improve error handling and logging, avoid unnecessary directory API calls, and retry failed calls. See commit messages for details.
@flyingcircusio/release-managers
Release process
Impact:
Changelog:
* `fc-maintenance` commands for viewing requests (`show` and `list`) can now be used without sudo by all users. The same applies to the commands used for monitoring and telemetry (`check` and `metrics`).
* `fc-maintenance show <request ID prefix>`: now has improved output and also works for archived maintenance requests, while active requests are preferred when there are multiple matches.
* Output of nix commands run by update activities is now logged to `/var/log/fc-agent`, like for `fc-manage` calls.
* Unhandled exceptions in `fc-manage` and `fc-maintenance` are now logged properly to the system journal and can be viewed with `journalctl SYSLOG_IDENTIFIER=fc-agent`.
Security implications
This change affects sudo rules for fc-maintenance commands but the set of commands that can be run and the allowed groups have not been extended.