update_agent: delay reboot if ongoing interactive sessions #485

kelvinfan001 · 2021-03-01T14:39:51Z

Allow update finalizations (reboots) to be postponed a number of times for an amount of time if it is detected that there are users (with a tty) currently logged in on the system. Add a counter for number of postponements to the UpdateStaged state of the update agent; if the maximum number of allowed postponements has been reached, proceed with the finalization, disregarding active users.

Closes #115

src/update_agent/actor.rs

src/update_agent/mod.rs

lucab · 2021-03-02T13:01:54Z

src/update_agent/mod.rs

 /// Maximum failed deploy attempts in a row in `UpdateAvailable` state
 /// before abandoning a target update.
 const MAX_DEPLOY_ATTEMPTS: u8 = 12;

+/// Maximum number of postponements to finalizing an update in the
+/// `UpdateStaged` state before forcing an update finalization and reboot.
+const MAX_FINALIZE_POSTPONEMENTS: u8 = 10;


Self-note: this was 5 minutes in locksmith, but I think that what we are doing here (splitting into 10 pauses, each 1 minute long) is nicer both for the user and for the system.

lucab · 2021-03-02T13:15:26Z

src/update_agent/mod.rs

+        // Set up dummy interactive sessions.
+        let foo_session = InteractiveSession {
+            user: String::from("fakeuser"),
+            tty_dev: String::from("/dev/tty1"),


It would be better to use a tempfile here. As a bonus, you can check its content afterwards too.
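A minimal sketch of the idea in this comment, using only the standard library: point `tty_dev` at a scratch file instead of `/dev/tty1`, then read the file back to check what was written. The `InteractiveSession` struct here is a stand-in with just the two fields shown in the diff, and `warn_session`, `scratch_tty`, and `warn_and_read_back` are hypothetical helpers, not functions from this PR. Real test code would more likely use the `tempfile` crate, which also handles unique naming and cleanup.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::PathBuf;

/// Stand-in for the PR's `InteractiveSession` (only the fields seen in the diff).
struct InteractiveSession {
    user: String,
    tty_dev: String,
}

/// Hypothetical helper: write a message to the session's tty device path.
fn warn_session(session: &InteractiveSession, msg: &str) -> io::Result<()> {
    let mut tty = OpenOptions::new().write(true).open(&session.tty_dev)?;
    writeln!(tty, "{}: {}", session.user, msg)
}

/// Create an empty scratch file that plays the role of a tty device.
fn scratch_tty(name: &str) -> io::Result<PathBuf> {
    let path = std::env::temp_dir().join(format!("{}-{}", name, std::process::id()));
    fs::write(&path, b"")?;
    Ok(path)
}

/// Write a warning to a fake tty, then return what ended up in the file.
fn warn_and_read_back(user: &str, msg: &str) -> io::Result<String> {
    let path = scratch_tty("fake-tty")?;
    let session = InteractiveSession {
        user: user.to_string(),
        tty_dev: path.to_string_lossy().into_owned(),
    };
    let written = warn_session(&session, msg);
    let content = fs::read_to_string(&path);
    // Clean up the scratch file before reporting any earlier error.
    fs::remove_file(&path)?;
    written?;
    content
}
```

Because the fake device is a regular file, the test can assert on the exact message instead of only checking that the write did not fail.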

lucab · 2021-03-02T13:16:20Z

src/update_agent/mod.rs

+    for session in sessions.iter() {
+        // Write message to tty device.
+        let user = &session.user;
+        let tty = &session.tty_dev;


I think you should be able to drop these two and just use &session.<FIELD> everywhere.

lucab · 2021-03-02T13:39:08Z

src/update_agent/mod.rs

@@ -38,16 +46,27 @@ lazy_static::lazy_static! {
        "zincati_update_agent_updates_enabled",
        "Whether auto-updates logic is enabled."
    )).unwrap();
+    static ref UPDATE_FINALIZATION_POSTPONEMENTS: IntCounter = register_int_counter!(opts!(
+        "zincati_update_agent_finalization_postponements_total",


postponed_finalization_total maybe?

lucab · 2021-03-02T14:02:35Z

src/update_agent/actor.rs

@@ -381,27 +394,39 @@ impl UpdateAgent {
        release: Release,
    ) -> ResponseActFuture<Self, Result<Release, ()>> {
        if !can_finalize {
+            // Reset number of postponements if finalization attempt failed due to


There are at least two state transitions nested very deeply inside this function, and I think it would be better to split the concerns so that in the top-level tick_finalize_update it is clearly visible that there are three cases:

- strategy_can_finalize: false -> reset the number of attempts
- strategy_can_finalize: true -> handle interactive users, then:
  - usersessions_can_finalize: false -> update the number of attempts
  - usersessions_can_finalize: true -> call finalize_deployment(true)

I've tried to address this in the latest WIP commit. I've made it a WIP commit because I feel like I actually made the code more difficult to read. It'd be great if you can look at the diff of just that commit and see if I'm going in the right general direction. Thanks!

Update: I think I've fixed it up. Squashed everything into the original commit now.
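The three cases from this thread can be sketched as a single pure decision function. This is an illustration only: `FinalizeDecision` and `decide` are hypothetical names, the inputs are simplified to plain values, and the counter is modelled as "postponements remaining" (the direction adopted later in the review) rather than as a number of attempts.

```rust
/// Outcome of one finalization tick (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum FinalizeDecision {
    /// The strategy does not allow finalizing: reset the postponement counter.
    ResetPostponements,
    /// The strategy allows it, but interactive users were detected:
    /// use up one postponement and stay in the staged state.
    Postpone { remaining: u8 },
    /// Nothing blocks finalization: finalize the deployment and reboot.
    Finalize,
}

/// Decide what to do, keeping the strategy check and the user-session
/// check as two clearly separate steps.
fn decide(
    strategy_can_finalize: bool,
    active_sessions: usize,
    postponements_remaining: u8,
) -> FinalizeDecision {
    if !strategy_can_finalize {
        return FinalizeDecision::ResetPostponements;
    }
    // Users only block finalization while postponements are left.
    let usersessions_can_finalize = active_sessions == 0 || postponements_remaining == 0;
    if usersessions_can_finalize {
        FinalizeDecision::Finalize
    } else {
        FinalizeDecision::Postpone {
            remaining: postponements_remaining - 1,
        }
    }
}
```

Keeping the decision free of side effects means the state transition and the metrics update can each happen in one place at the top level.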

lucab · 2021-03-02T14:08:36Z

src/update_agent/mod.rs

+        let warning_msg;
+        if postponements == 0 {
+            let max_reboot_delay_sec =
+                (MAX_FINALIZE_POSTPONEMENTS as u64).saturating_mul(DEFAULT_POSTPONEMENT_TIME_SECS);


This feels a bit odd to read. Did you try inverting the direction of the postponements counter, going from MAX to 0, so that it directly gives you the correct number?

Hmm yeah, that would spare us from having to check postponements == MAX_FINALIZE_POSTPONEMENTS.saturating_sub(1); instead we could just check whether postponements_left == 1.
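A small sketch of what the countdown direction buys. With a "remaining" counter, the longest delay still possible and the last-postponement check both fall out directly. `MAX_FINALIZE_POSTPONEMENTS` is taken from the diff; the value of `DEFAULT_POSTPONEMENT_TIME_SECS` is an assumption based on the one-minute pauses mentioned earlier in the review, and the two helper functions are hypothetical.

```rust
/// From the diff under review.
const MAX_FINALIZE_POSTPONEMENTS: u8 = 10;
/// Assumed value: the review mentions pauses of one minute each.
const DEFAULT_POSTPONEMENT_TIME_SECS: u64 = 60;

/// Longest delay that can still happen before the reboot is forced.
fn max_remaining_delay_secs(postponements_remaining: u8) -> u64 {
    u64::from(postponements_remaining).saturating_mul(DEFAULT_POSTPONEMENT_TIME_SECS)
}

/// True when the current postponement is the final one.
fn is_last_postponement(postponements_remaining: u8) -> bool {
    postponements_remaining == 1
}
```

Under these assumptions the first warning would announce a delay of at most 10 * 60 = 600 seconds, and no comparison against `MAX_FINALIZE_POSTPONEMENTS.saturating_sub(1)` is needed.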

src/update_agent/actor.rs

kelvinfan001 · 2021-03-05T15:25:10Z

Restructured a lot of the original code and addressed comments. Ready for another round of review :)

/cc @lucab

jlebon · 2021-03-05T16:39:10Z

src/update_agent/mod.rs

+        assert_eq!("1 minute and 1 second", format_seconds(61));
+        assert_eq!("1 minute and 30 seconds", format_seconds(90));
+        assert_eq!("2 minutes", format_seconds(120));
+        assert_eq!("42 minutes and 23 seconds", format_seconds(2543));


It's easier to believe these are correctly written if you write e.g. 2543 as 42*60 + 23. :)

haha, right, just 2543 really isn't too useful there :)
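For illustration, here is one way a `format_seconds` helper could behave so that the assertions in the diff hold, with the test inputs written in the `minutes * 60 + seconds` form suggested above. The body is a guess at the behaviour, not the PR's implementation, and `pluralize` is a hypothetical helper.

```rust
/// Hypothetical helper: "1 minute", "2 minutes", "1 second", ...
fn pluralize(n: u64, unit: &str) -> String {
    if n == 1 {
        format!("{} {}", n, unit)
    } else {
        format!("{} {}s", n, unit)
    }
}

/// Render a duration in seconds as minutes and seconds.
/// Sketch only; the PR's actual function may differ in edge cases.
fn format_seconds(total: u64) -> String {
    let (minutes, seconds) = (total / 60, total % 60);
    match (minutes, seconds) {
        (0, s) => pluralize(s, "second"),
        (m, 0) => pluralize(m, "minute"),
        (m, s) => format!("{} and {}", pluralize(m, "minute"), pluralize(s, "second")),
    }
}
```

Writing `format_seconds(42 * 60 + 23)` in a test makes the expected string "42 minutes and 23 seconds" checkable at a glance.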

lucab · 2021-03-12T13:37:23Z

src/update_agent/mod.rs

+        "Total number of update finalization postponements due to active users."
+    )).unwrap();
+    static ref DETECTED_ACTIVE_USERS: IntGauge = register_int_gauge!(opts!(
+        "zincati_update_agent_detected_active_users",


Minor suggestion: something like finalization_detected_active_users makes it easier to understand when this is relevant.

lucab · 2021-03-12T13:42:21Z

src/update_agent/mod.rs

+
+        let (release, postponements_remaining) = match self.clone() {
+            UpdateAgentState::UpdateStaged((r, p)) => (r, p),
+            _ => unreachable!(


Self-note: later on we should probably try to re-arrange the internals of the FSM data-structure so that it doesn't require this matching.
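One possible direction for that rearrangement, sketched under heavy assumptions: an accessor on the state enum returns the staged payload as an `Option`, so callers handle the "wrong state" case explicitly instead of matching with an `unreachable!` arm. The `Release` struct and the `NoNewUpdate` variant below are simplified stand-ins, and `staged` is a hypothetical method; this is not necessarily what the maintainers had in mind.

```rust
/// Simplified stand-in for the release type.
#[derive(Clone, Debug, PartialEq)]
struct Release {
    version: String,
}

/// Reduced version of the agent state, keeping the tuple payload from the diff.
#[derive(Clone, Debug, PartialEq)]
enum UpdateAgentState {
    NoNewUpdate,
    UpdateStaged((Release, u8)),
}

impl UpdateAgentState {
    /// Borrow the staged release and the postponements remaining, if any.
    /// Hypothetical accessor: it needs neither `self.clone()` nor `unreachable!`.
    fn staged(&self) -> Option<(&Release, u8)> {
        match self {
            UpdateAgentState::UpdateStaged((release, remaining)) => Some((release, *remaining)),
            _ => None,
        }
    }
}
```

A caller could then write `if let Some((release, remaining)) = state.staged() { ... }` and treat `None` as a logic error in a single place.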

Allow update finalizations (reboots) to be postponed a number of times for an amount of time if it is detected that there are users (with a tty) currently logged in on the system. Add a counter for number of postponements remaining to the UpdateStaged state of the update agent; if there are no allowed postponements remaining, proceed with the finalization, disregarding active users.

The FSM may stay in the UpdateStaged state if it cannot finalize either due to strategy constraints or detection of logged in users during a finalization attempt. Update zincati-fsm to reflect this.

Add `zincati_update_agent_finalization_detected_active_users` and `zincati_update_agent_postponed_finalizations_total` metrics. We expect the postponed finalizations total to increase only when active users are detected. These metrics could also be useful for investigating whether certain combinations of admin maintenance patterns and update strategies unexpectedly cause frequent occurrences of reboot delays/postponements.

kelvinfan001 added kind/new-feature area/updates labels Mar 1, 2021

kelvinfan001 added this to the v0.0.18 milestone Mar 1, 2021

kelvinfan001 requested a review from lucab March 1, 2021 14:39

kelvinfan001 changed the title ~~Delay reboot~~ update_agent: delay reboot if users logged in Mar 1, 2021

kelvinfan001 changed the title ~~update_agent: delay reboot if users logged in~~ update_agent: delay reboot if ongoing interactive sessions Mar 1, 2021

kelvinfan001 removed this from the v0.0.18 milestone Mar 1, 2021

lucab added this to the vNext milestone Mar 2, 2021

lucab reviewed Mar 2, 2021

kelvinfan001 changed the title ~~update_agent: delay reboot if ongoing interactive sessions~~ [WIP] update_agent: delay reboot if ongoing interactive sessions Mar 4, 2021

kelvinfan001 changed the title ~~[WIP] update_agent: delay reboot if ongoing interactive sessions~~ update_agent: delay reboot if ongoing interactive sessions Mar 5, 2021

jlebon reviewed Mar 5, 2021

lucab reviewed Mar 12, 2021

kelvinfan001 and others added 3 commits March 14, 2021 21:12

docs/images: add new edge from UpdateStaged to itself

a11faf5

The FSM may stay in the UpdateStaged state if it cannot finalize either due to strategy constraints or detection of logged in users during a finalization attempt. Update zincati-fsm to reflect this.

lucab enabled auto-merge March 15, 2021 08:49

lucab approved these changes Mar 15, 2021

lucab merged commit b739ce0 into coreos:master Mar 15, 2021

jlebon mentioned this pull request Jun 1, 2021

daemon: Respect systemd inhibitor locks coreos/rpm-ostree#2862

Merged
