Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update_agent: delay reboot if ongoing interactive sessions #485

Merged
merged 3 commits into from
Mar 15, 2021
Merged

update_agent: delay reboot if ongoing interactive sessions #485

merged 3 commits into from
Mar 15, 2021

Conversation

kelvinfan001
Copy link
Member

@kelvinfan001 kelvinfan001 commented Mar 1, 2021

Allow update finalizations (reboots) to be postponed a number of times for an amount of time if it is detected that there are users (with a tty) currently logged in on the system. Add a counter for number of postponements to the UpdateStaged state of the update agent; if the maximum number of allowed postponements has been reached, proceed with the finalization, disregarding active users.

Closes #115

@kelvinfan001 kelvinfan001 added this to the v0.0.18 milestone Mar 1, 2021
@kelvinfan001 kelvinfan001 requested a review from lucab March 1, 2021 14:39
@kelvinfan001 kelvinfan001 changed the title Delay reboot update_agent: delay reboot if users logged in Mar 1, 2021
@kelvinfan001 kelvinfan001 changed the title update_agent: delay reboot if users logged in update_agent: delay reboot if ongoing interactive sessions Mar 1, 2021
@kelvinfan001 kelvinfan001 removed this from the v0.0.18 milestone Mar 1, 2021
@lucab lucab added this to the vNext milestone Mar 2, 2021
src/update_agent/mod.rs Outdated Show resolved Hide resolved
src/update_agent/mod.rs Show resolved Hide resolved
/// Maximum failed deploy attempts in a row in `UpdateAvailable` state
/// before abandoning a target update.
const MAX_DEPLOY_ATTEMPTS: u8 = 12;

/// Maximum number of postponements to finalizing an update in the
/// `UpdateStaged` state before forcing an update finalization and reboot.
const MAX_FINALIZE_POSTPONEMENTS: u8 = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-note: this was 5 minutes in locksmith but I think that we are doing here (splitting into 10 pauses, each long 1 minute) is nicer both for the user and for the system.

// Set up dummy interactive sessions.
let foo_session = InteractiveSession {
user: String::from("fakeuser"),
tty_dev: String::from("/dev/tty1"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use a tempfile here. As a bonus, you can check its content afterwards too.

for session in sessions.iter() {
// Write message to tty device.
let user = &session.user;
let tty = &session.tty_dev;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to drop these two and just use &session.<FIELD> everywhere.

@@ -38,16 +46,27 @@ lazy_static::lazy_static! {
"zincati_update_agent_updates_enabled",
"Whether auto-updates logic is enabled."
)).unwrap();
static ref UPDATE_FINALIZATION_POSTPONEMENTS: IntCounter = register_int_counter!(opts!(
"zincati_update_agent_finalization_postponements_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postponed_finalization_total maybe?

@@ -381,27 +394,39 @@ impl UpdateAgent {
release: Release,
) -> ResponseActFuture<Self, Result<Release, ()>> {
if !can_finalize {
// Reset number of postponements if finalization attempt failed due to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are at least two state transitions nested very deeply inside this function, and I think it would better to split the concerns so that in the top-level tick_finalize_update it is clearly visible that there three cases:

  • strategy_can_finalize: false -> reset the number of attempts
  • strategy_can_finalize: true -> handle interactive users, then:
    • usersessions_can_finalize: false -> update the number of attempts
    • usersessions_can_finalize: true -> call finalize_deployment(true)

Copy link
Member Author

@kelvinfan001 kelvinfan001 Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to address this in the latest WIP commit. I've made it a WIP commit because I feel like I actually made the code more difficult to read. It'd be great if you can look at the diff of just that commit and see if I'm going in the right general direction. Thanks!

Update: I think I've fixed it up. Squashed everything into the original commit now.

let warning_msg;
if postponements == 0 {
let max_reboot_delay_sec =
(MAX_FINALIZE_POSTPONEMENTS as u64).saturating_mul(DEFAULT_POSTPONEMENT_TIME_SECS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit odd to read. Did you try inverting the direction of postponements counter, going from MAX to 0, so that it gives you directly the correct number?

Copy link
Member Author

@kelvinfan001 kelvinfan001 Mar 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah that would spare us from having to check postponements == MAX_FINALIZE_POSTPONEMENTS.saturating_sub(1) and instead just check of whether postponements_left == 1.

@kelvinfan001 kelvinfan001 changed the title update_agent: delay reboot if ongoing interactive sessions [WIP] update_agent: delay reboot if ongoing interactive sessions Mar 4, 2021
@kelvinfan001
Copy link
Member Author

Restructured a lot of the original code and addressed comments. Ready for another round of review :)

/cc @lucab

@kelvinfan001 kelvinfan001 changed the title [WIP] update_agent: delay reboot if ongoing interactive sessions update_agent: delay reboot if ongoing interactive sessions Mar 5, 2021
assert_eq!("1 minute and 1 second", format_seconds(61));
assert_eq!("1 minute and 30 seconds", format_seconds(90));
assert_eq!("2 minutes", format_seconds(120));
assert_eq!("42 minutes and 23 seconds", format_seconds(2543));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easier to believe these are correctly written if you write e.g. 2543 as 42*60 + 23. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, right, just 2543 really isn't too useful there :)

"Total number of update finalization postponements due to active users."
)).unwrap();
static ref DETECTED_ACTIVE_USERS: IntGauge = register_int_gauge!(opts!(
"zincati_update_agent_detected_active_users",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: something like finalization_detected_active_users makes it easier to understand when this is relevant.


let (release, postponements_remaining) = match self.clone() {
UpdateAgentState::UpdateStaged((r, p)) => (r, p),
_ => unreachable!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-note: later on we should probably try to re-arrange the internals of the FSM data-structure so that it doesn't require this matching.

kelvinfan001 and others added 3 commits March 14, 2021 21:12
Allow update finalizations (reboots) to be postponed a number of times
for an amount of time if it is detected that there are users (with a tty)
currently logged in on the system.
Add a counter for number of postponements remaining to the UpdateStaged
state of the update agent; if there are no allowed postponements remaining,
proceed with the finalization, disregarding active users.
The FSM may stay in the UpdateStaged state if it cannot finalize
either due to strategy constraints or detection of logged in users
during a finalization attempt. Update zincati-fsm to reflect this.
Add `zincati_update_agent_finalization_detected_active_users` and
`zincati_update_agent_postponed_finalizations_total` metrics.
We should expect that the postponed finalizations total should only
increase when there are active users detected.
These metrics could also be useful for investigating whether certain
admin maintenance patterns and update strategy combinations may be
unexpectedly causing frequent occurences of reboot delays/postponements.
@lucab lucab enabled auto-merge March 15, 2021 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: delay reboot if ongoing interactive sessions
3 participants