This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix last_active_ts Presence bug #15859

Closed

Conversation

realtyem
Contributor

@realtyem realtyem commented Jul 1, 2023

Essentially, what was happening was that during an incremental sync, online is set repeatedly by the system (which is correct behavior, roughly once per timeout). The fix is therefore as simple as comparing the 'before' and 'after' state of online: if they match, don't bump last_active_ts. I spent hours checking permutations and I'm pretty sure I caught them all, but I need more eyes to check me.
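To illustrate the idea, here is a minimal, hypothetical sketch (the helper name is made up; the actual change is to the condition inside set_state(), shown in the diff further down this conversation):

# Illustrative only: not Synapse code.
def should_bump_last_active(prev_presence: str, new_presence: str) -> bool:
    # A genuine transition to ONLINE counts as activity; an incremental sync
    # that merely re-asserts ONLINE should not refresh last_active_ts.
    return new_presence == "online" and prev_presence != "online"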

I did write a basic test to check the behavior (it fails on develop), but I am open to ideas about what else to check for regression.

Consequences:

  • By not updating the timestamp every sync:
    • this avoids setting a staggered set of wheel timers. It appears they deduplicate into the same time bucket (5 seconds), as was probably intended; see the toy sketch after this list. Less processing of unnecessary timers is a small win.
    • this allows the last_active_change_online and last_active_change_not_online metrics to actually update correctly (it will be interesting to see how that plays out)
    • this may cause slightly more presence activity, as users will now be marked UNAVAILABLE (idle) correctly where they weren't before (it will probably be a wash, though).
  • One fewer call into the _update_states() machinery, which does the heavy lifting of sending notifications and persisting changes (and setting more timers).
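A toy illustration of the "same time bucket" point above (this is not Synapse's WheelTimer API, just the general idea):

BUCKET_MS = 5_000  # assumed 5-second bucket, per the note above

# bucket index -> set of user_ids whose timer fires in that bucket
buckets = {}

def schedule(user_id, fire_at_ms):
    # Re-arming a timer that lands in the same 5-second bucket is effectively
    # a no-op: bumping the timer on every sync just re-adds the same entry
    # instead of creating a staggered pile of timers.
    buckets.setdefault(fire_at_ms // BUCKET_MS, set()).add(user_id)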
A quick, hackish (and incomplete) primer on how presence does timeouts (a simplified code sketch follows this list):
  • A local user that is online and currently syncing:
    • if they are currently sending messages, they have LAST_ACTIVE_GRANULARITY (60 seconds) before they are marked not 'currently_active'
    • if they are just sitting there and watching, they will be marked UNAVAILABLE (idle) after IDLE_TIMER (5 minutes) if they don't send a message or create a new 'pro-active' event (this was often not happening correctly; however, typing, sending read receipts, and sending read markers will bump last_active_ts, even though I don't believe they are expressly 'pro-active')
    • when they stop syncing:
      • after UPDATE_SYNCING_USERS_MS (10 seconds) we notify any workers that the user has stopped syncing
      • after SYNC_ONLINE_TIMEOUT (30 seconds) we assume the client has gone OFFLINE and mark them as such (unless they somehow create a 'pro-active' event). This ignores currently_active
  • if the user is remote:
    • if we haven't heard from their server in FEDERATION_TIMEOUT (30 minutes), we mark them OFFLINE
  • if the user is local but in a room with remote users:
    • we send presence every FEDERATION_PING_INTERVAL (25 minutes) so they don't mark us OFFLINE incorrectly.
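A much-simplified sketch of those rules (illustrative only, not the real handle_timeout(); the constants and behaviour are taken from the list above, and the worker notification step is omitted):

from dataclasses import dataclass
from typing import Optional

LAST_ACTIVE_GRANULARITY = 60 * 1000        # 60 seconds
IDLE_TIMER = 5 * 60 * 1000                 # 5 minutes
SYNC_ONLINE_TIMEOUT = 30 * 1000            # 30 seconds
FEDERATION_TIMEOUT = 30 * 60 * 1000        # 30 minutes
FEDERATION_PING_INTERVAL = 25 * 60 * 1000  # 25 minutes (outbound pings, not checked here)


@dataclass
class SketchState:
    presence: str                   # "online", "unavailable", "offline", ...
    currently_active: bool
    last_active_ts: int             # ms
    last_user_sync_ts: int          # ms
    last_federation_update_ts: int  # ms


def sketch_handle_timeout(
    s: SketchState, is_mine: bool, syncing: bool, now: int
) -> Optional[SketchState]:
    """Return an updated state if a timer has elapsed, else None."""
    if not is_mine:
        # Remote user: if their server has gone quiet for too long, mark OFFLINE.
        if now - s.last_federation_update_ts > FEDERATION_TIMEOUT:
            s.presence = "offline"
            return s
        return None

    if s.presence == "online":
        if s.currently_active and now - s.last_active_ts > LAST_ACTIVE_GRANULARITY:
            # No pro-active events for a minute: drop currently_active.
            s.currently_active = False
            return s
        if now - s.last_active_ts > IDLE_TIMER:
            # Still syncing but idle for five minutes: mark UNAVAILABLE (idle).
            s.presence = "unavailable"
            return s
        if not syncing and now - s.last_user_sync_ts > SYNC_ONLINE_TIMEOUT:
            # Stopped syncing half a minute ago: assume the client is gone.
            s.presence = "offline"
            return s
    return None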

Pull Request Checklist

Signed-off-by: Jason Little [email protected]

Comment on lines -1020 to -1031
# To keep the single process behaviour consistent with worker mode, run the
# same logic as `update_external_syncs_row`, even though it looks weird.
if prev_state.state == PresenceState.OFFLINE:
    await self._update_states(
        [
            prev_state.copy_and_replace(
                state=PresenceState.ONLINE,
                last_active_ts=self.clock.time_msec(),
                last_user_sync_ts=self.clock.time_msec(),
            )
        ]
    )
Contributor Author

@realtyem realtyem Jul 1, 2023


This code block performed two functions: updating last_active_ts and last_user_sync_ts. last_active_ts is now handled by set_state(), and last_user_sync_ts will be set when the context manager finishes, so it is unneeded here. (As is actually commented on just below this.)
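A hypothetical sketch of the pattern being described (not the actual Synapse code): the syncing context manager already stamps last_user_sync_ts on the way out, which is why the removed block did not need to set it as well.

import time
from contextlib import asynccontextmanager


@asynccontextmanager
async def user_syncing(user_id, last_user_sync_ts):
    # Hypothetical shape only: the body runs while the user's sync is open.
    try:
        yield
    finally:
        # The "_end" of the context manager records when we last saw a sync
        # from this user; handle_timeout() later compares against this value.
        last_user_sync_ts[user_id] = int(time.time() * 1000)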

@@ -1866,7 +1862,7 @@ def handle_timeout(
"""Checks the presence of the user to see if any of the timers have elapsed

Args:
state
state: the current state of the user, not the state at the setting of the timer
Contributor Author


This was just a drive-by fix.

Comment on lines -1258 to +1256
- if presence == PresenceState.ONLINE or (
-     presence == PresenceState.BUSY and self._busy_presence_enabled
- ):
+ if (
+     prev_state.state != PresenceState.ONLINE
+     and presence == PresenceState.ONLINE
+ ) or (presence == PresenceState.BUSY and self._busy_presence_enabled):
Contributor Author


This is the meat and potatoes of the fix: if we are already online, we don't have to bump last_active_ts.

    self.presence_handler.get_state(UserID.from_string(user_id))
)
# last_active_ts should not have changed as no pro-active event has occurred
self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
Contributor Author


FYI: on develop this fails with

twisted.trial.unittest.FailTest: 2000 != 4000
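For reference, here is a rough sketch of the shape such a regression test can take (the user ID, clock advance, and set_state call shape are assumptions for illustration, not the exact test added here):

def test_last_active_ts_not_bumped_by_repeated_online(self) -> None:
    user_id = "@apple:test"

    # Put the user online and capture their initial presence state.
    self.get_success(
        self.presence_handler.set_state(
            UserID.from_string(user_id), {"presence": PresenceState.ONLINE}
        )
    )
    prev_state = self.get_success(
        self.presence_handler.get_state(UserID.from_string(user_id))
    )

    # Advance the clock, then re-assert ONLINE as an incremental sync would.
    self.reactor.advance(2)
    self.get_success(
        self.presence_handler.set_state(
            UserID.from_string(user_id), {"presence": PresenceState.ONLINE}
        )
    )

    state = self.get_success(
        self.presence_handler.get_state(UserID.from_string(user_id))
    )
    # No pro-active event occurred, so last_active_ts must not have moved.
    self.assertEqual(prev_state.last_active_ts, state.last_active_ts)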

@realtyem realtyem marked this pull request as ready for review July 2, 2023 21:04
@realtyem realtyem requested a review from a team as a code owner July 2, 2023 21:04
Member

@erikjohnston erikjohnston left a comment


Otherwise seems good

# TODO: This will be removed in future work as it's
# literally done again at the '_end' of this context manager and the
# only time last_user_sync_ts matters is when processing timeouts in
# handle_timeout().
Member


Then can't we just remove it now rather than leaving a comment?

Contributor Author


Not yet. And I'm on the fence, based on further experiments, about putting the other code block back in. Let's put this on hold for the moment. I think I have something to show you (out-of-band, though).

@realtyem
Contributor Author

I'm going to close this for now, as there were unforeseen side effects I'm not happy about. I will roll this into another pull request with more in it to abate those side effects.

(Mainly, the side effects were that federation sending would ramp up and stay up for over an hour. I couldn't pin down exactly what was causing this, but fixing 'presence spam' seemed to make it stop, and that requires something broader than this one PR was scoped for.)

@realtyem realtyem closed this Jul 22, 2023
@realtyem
Contributor Author

Replaced with #15989

Development

Successfully merging this pull request may close these issues.

last_active_ago is updated on sync when it shouldn't be