-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update_agent: delay reboot if ongoing interactive sessions #485
Conversation
/// Maximum failed deploy attempts in a row in `UpdateAvailable` state | ||
/// before abandoning a target update. | ||
const MAX_DEPLOY_ATTEMPTS: u8 = 12; | ||
|
||
/// Maximum number of postponements to finalizing an update in the | ||
/// `UpdateStaged` state before forcing an update finalization and reboot. | ||
const MAX_FINALIZE_POSTPONEMENTS: u8 = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-note: this was 5 minutes
in locksmith but I think that we are doing here (splitting into 10 pauses, each long 1 minute) is nicer both for the user and for the system.
src/update_agent/mod.rs
Outdated
// Set up dummy interactive sessions. | ||
let foo_session = InteractiveSession { | ||
user: String::from("fakeuser"), | ||
tty_dev: String::from("/dev/tty1"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use a tempfile here. As a bonus, you can check its content afterwards too.
src/update_agent/mod.rs
Outdated
for session in sessions.iter() { | ||
// Write message to tty device. | ||
let user = &session.user; | ||
let tty = &session.tty_dev; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be able to drop these two and just use &session.<FIELD>
everywhere.
src/update_agent/mod.rs
Outdated
@@ -38,16 +46,27 @@ lazy_static::lazy_static! { | |||
"zincati_update_agent_updates_enabled", | |||
"Whether auto-updates logic is enabled." | |||
)).unwrap(); | |||
static ref UPDATE_FINALIZATION_POSTPONEMENTS: IntCounter = register_int_counter!(opts!( | |||
"zincati_update_agent_finalization_postponements_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postponed_finalization_total
maybe?
src/update_agent/actor.rs
Outdated
@@ -381,27 +394,39 @@ impl UpdateAgent { | |||
release: Release, | |||
) -> ResponseActFuture<Self, Result<Release, ()>> { | |||
if !can_finalize { | |||
// Reset number of postponements if finalization attempt failed due to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are at least two state transitions nested very deeply inside this function, and I think it would better to split the concerns so that in the top-level tick_finalize_update
it is clearly visible that there three cases:
strategy_can_finalize: false
-> reset the number of attemptsstrategy_can_finalize: true
-> handle interactive users, then:usersessions_can_finalize: false
-> update the number of attemptsusersessions_can_finalize: true
-> callfinalize_deployment(true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to address this in the latest WIP commit. I've made it a WIP commit because I feel like I actually made the code more difficult to read. It'd be great if you can look at the diff of just that commit and see if I'm going in the right general direction. Thanks!
Update: I think I've fixed it up. Squashed everything into the original commit now.
src/update_agent/mod.rs
Outdated
let warning_msg; | ||
if postponements == 0 { | ||
let max_reboot_delay_sec = | ||
(MAX_FINALIZE_POSTPONEMENTS as u64).saturating_mul(DEFAULT_POSTPONEMENT_TIME_SECS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit odd to read. Did you try inverting the direction of postponements
counter, going from MAX to 0, so that it gives you directly the correct number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah that would spare us from having to check postponements == MAX_FINALIZE_POSTPONEMENTS.saturating_sub(1)
and instead just check of whether postponements_left == 1
.
Restructured a lot of the original code and addressed comments. Ready for another round of review :) /cc @lucab |
src/update_agent/mod.rs
Outdated
assert_eq!("1 minute and 1 second", format_seconds(61)); | ||
assert_eq!("1 minute and 30 seconds", format_seconds(90)); | ||
assert_eq!("2 minutes", format_seconds(120)); | ||
assert_eq!("42 minutes and 23 seconds", format_seconds(2543)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easier to believe these are correctly written if you write e.g. 2543 as 42*60 + 23
. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, right, just 2543
really isn't too useful there :)
src/update_agent/mod.rs
Outdated
"Total number of update finalization postponements due to active users." | ||
)).unwrap(); | ||
static ref DETECTED_ACTIVE_USERS: IntGauge = register_int_gauge!(opts!( | ||
"zincati_update_agent_detected_active_users", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion: something like finalization_detected_active_users
makes it easier to understand when this is relevant.
|
||
let (release, postponements_remaining) = match self.clone() { | ||
UpdateAgentState::UpdateStaged((r, p)) => (r, p), | ||
_ => unreachable!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-note: later on we should probably try to re-arrange the internals of the FSM data-structure so that it doesn't require this matching.
Allow update finalizations (reboots) to be postponed a number of times for an amount of time if it is detected that there are users (with a tty) currently logged in on the system. Add a counter for number of postponements remaining to the UpdateStaged state of the update agent; if there are no allowed postponements remaining, proceed with the finalization, disregarding active users.
The FSM may stay in the UpdateStaged state if it cannot finalize either due to strategy constraints or detection of logged in users during a finalization attempt. Update zincati-fsm to reflect this.
Add `zincati_update_agent_finalization_detected_active_users` and `zincati_update_agent_postponed_finalizations_total` metrics. We should expect that the postponed finalizations total should only increase when there are active users detected. These metrics could also be useful for investigating whether certain admin maintenance patterns and update strategy combinations may be unexpectedly causing frequent occurences of reboot delays/postponements.
Allow update finalizations (reboots) to be postponed a number of times for an amount of time if it is detected that there are users (with a tty) currently logged in on the system. Add a counter for number of postponements to the UpdateStaged state of the update agent; if the maximum number of allowed postponements has been reached, proceed with the finalization, disregarding active users.
Closes #115