Fix new scheduled tasks jumping the queue #17962
Conversation
# We need to wait for the next run of the scheduler loop
self.reactor.advance((TaskScheduler.SCHEDULE_INTERVAL_MS / 1000))
this hasn't actually changed, I'm just tightening up the test a bit. We didn't need to wait for the next run of the loop, we just had to wait 0.1 seconds. Looks like this changed in matrix-org/synapse#16313 and matrix-org/synapse#16660 without the test being updated.
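(For anyone puzzling over the timing: a tiny self-contained illustration of how advancing the fake clock interacts with a short delay versus the scheduler interval. The 0.1s and 60s values are just the ones discussed here, not pulled from the test itself.)

```python
from twisted.internet.task import Clock

# Minimal sketch, not Synapse test code: a fake reactor clock only fires
# delayed calls whose deadline has been reached by advance().
clock = Clock()
fired = []
clock.callLater(0.1, lambda: fired.append("task launch delay"))
clock.callLater(60.0, lambda: fired.append("next scheduler pass"))

clock.advance(0.1)   # enough for the short launch delay...
assert fired == ["task launch delay"]

clock.advance(59.9)  # ...but the scheduler pass only happens a minute in
assert fired == ["task launch delay", "next scheduler pass"]
```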
Currently, when a new scheduled task is added and its scheduled time has already passed, we set it to ACTIVE. This is problematic, because it means it will jump the queue ahead of all other SCHEDULED tasks; furthermore, if the Synapse process gets restarted, it will jump ahead of any ACTIVE tasks which have been started but are taking a while to run. Instead, we leave it set to SCHEDULED, but kick off a call to `_launch_scheduled_tasks`, which will decide if we actually have capacity to start a new task, and start the newly-added task if so.
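To make the before/after concrete, here is a self-contained toy model of the new behavior described above (class names, the concurrency limit of 10, and the status strings are illustrative only, not Synapse's actual implementation):

```python
from dataclasses import dataclass, field
from itertools import count

MAX_CONCURRENT_RUNNING_TASKS = 10  # illustrative limit, not Synapse's real value


@dataclass
class Task:
    id: int
    timestamp: int  # when the task is due to run
    status: str = "SCHEDULED"


@dataclass
class Scheduler:
    now: int = 0
    tasks: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def add_task(self, timestamp: int) -> Task:
        task = Task(next(self._ids), timestamp)
        self.tasks.append(task)
        # Even if the task is already due, leave it SCHEDULED and just run a
        # scheduling pass, rather than marking it ACTIVE directly.
        if timestamp <= self.now:
            self.launch_scheduled_tasks()
        return task

    def launch_scheduled_tasks(self) -> None:
        running = [t for t in self.tasks if t.status == "ACTIVE"]
        if len(running) >= MAX_CONCURRENT_RUNNING_TASKS:
            return  # already at capacity
        # Oldest due task first, so a newcomer cannot jump the queue.
        due = sorted(
            (t for t in self.tasks if t.status == "SCHEDULED" and t.timestamp <= self.now),
            key=lambda t: t.timestamp,
        )
        for task in due[: MAX_CONCURRENT_RUNNING_TASKS - len(running)]:
            task.status = "ACTIVE"


scheduler = Scheduler(now=100)
for ts in range(10):
    scheduler.add_task(ts)         # ten older due tasks fill every slot
late = scheduler.add_task(50)      # also due, but arrives last
assert late.status == "SCHEDULED"  # it waits its turn instead of going ACTIVE
```

The old behavior corresponds to `add_task` setting `status = "ACTIVE"` directly whenever the timestamp has passed, which is exactly what let a newcomer overtake the ten older tasks.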
2865a43 to afb7d08
@@ -495,7 +495,7 @@ def to_line(self) -> str:

class NewActiveTaskCommand(_SimpleCommand):
Should we change the `NAME` to `TASK_READY` (and related methods) to match the new behavior?
Pro: it's cleaner; con: we may lose a message when doing a rolling restart. Not sure it's a big deal since the scheduler will go over all the tasks in the DB when starting up.
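For reference, the rename being discussed would look roughly like this (a sketch only, with `_SimpleCommand` stubbed out; the real base class lives in synapse/replication/tcp/commands.py, and the related handler/sender method renames are omitted):

```python
class _SimpleCommand:
    """Stub standing in for the real base class."""

    def __init__(self, data: str) -> None:
        self.data = data


class NewActiveTaskCommand(_SimpleCommand):
    """Current command, sent when a task whose scheduled time has passed is added."""

    NAME = "NEW_ACTIVE_TASK"


class TaskReadyCommand(_SimpleCommand):
    """Suggested replacement, matching the new 'this task is ready' semantics."""

    NAME = "TASK_READY"
```

The rolling-restart risk mentioned above is that a not-yet-upgraded worker would drop the new command name it doesn't recognise; the startup scan of the task table limits the damage to a delay.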
No strong opinions from me on this. Will await insight from @element-hq/synapse-core
I don't think it's particularly useful for now, and I think we can always add it in later if it is useful (e.g. for metrics, maybe?)
so what exactly are you suggesting, Erik? We remove the command altogether?
Ooops, sorry, misunderstood what was going on 🤦
Eh, I don't mind. No one really ever sees what's on the wire anyway 🤷
# Don't bother trying to launch new tasks if we're already at capacity.
if len(self._running_tasks) >= TaskScheduler.MAX_CONCURRENT_RUNNING_TASKS:
    return
Any reason to remove this? It does avoid spinning up a background process, even if it's redundant?
I don't quite follow.

This method was previously (only) called when a `NEW_ACTIVE_TASK` replication command was received, which (previously) meant that another worker had added a new scheduled task and set it to ACTIVE immediately, so we should start running it if possible.

The problem is that now, the other worker doesn't make the decision about whether to set the task to ACTIVE or not -- we first have to check how many tasks are already running, so the other worker has to delegate the SCHEDULED -> ACTIVE transition to the background-tasks worker. So `NEW_ACTIVE_TASK` now just means "there is a new task, and its scheduled time has already passed", which is a hint to the background-tasks worker that it might want to run its scheduler now rather than waiting 60s for it to come round again.

All of which is a long way of saying: `launch_task_by_id` isn't very useful any more.
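A tiny self-contained illustration of that "hint" semantic: the periodic pass happens every 60s regardless, and the notification just runs the same pass immediately instead of waiting for the next tick (plain Twisted `Clock`/`LoopingCall` here, standing in for Synapse's own clock wrapper):

```python
from twisted.internet.task import Clock, LoopingCall

passes = []
clock = Clock()

def scheduling_pass():
    # Stand-in for _launch_scheduled_tasks: record when a pass runs.
    passes.append(clock.seconds())

loop = LoopingCall(scheduling_pass)
loop.clock = clock
loop.start(60.0, now=False)   # the regular scheduler loop

clock.advance(5)              # an already-due task is added at t=5...
scheduling_pass()             # ...so run a pass now rather than waiting
clock.advance(55)             # the ordinary tick still fires at t=60

assert passes == [5.0, 60.0]
```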
ohh, it was just the `if len(self._running_tasks) >= TaskScheduler.MAX_CONCURRENT_RUNNING_TASKS` you were talking about?

Right, yes, we could reinstate that. Bear with me.
065b11b changes this for both the 60s timer and for the `NEW_ACTIVE_TASK` command handler.
synapse/util/task_scheduler.py
Outdated
def on_new_task(self, task_id: str) -> None:
    """Handle a notification that a new ready-to-run task has been added to the queue"""
    # Just run the scheduler
    self._launch_scheduled_tasks()

@wrap_as_background_process("launch_scheduled_tasks")
can I just say: I am not a fan of this decorator. I keep forgetting that `_launch_scheduled_tasks` runs as a background process.
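For anyone else who trips over it: the decorator makes the method fire-and-forget, which is why synchronous callers like `on_new_task` can just call it. A rough sketch of what it amounts to, spelled out at the call site instead (using the `run_as_background_process` helper; treat the exact names and import path here as something to double-check rather than gospel):

```python
from synapse.metrics.background_process_metrics import run_as_background_process

def on_new_task(self, task_id: str) -> None:
    """Handle a notification that a new ready-to-run task has been added."""
    # Explicitly kick the (async) scheduling pass off as a background process,
    # rather than relying on the wrap_as_background_process decorator to do it.
    run_as_background_process("launch_scheduled_tasks", self._launch_scheduled_tasks)
```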