Fix DISCONNECTED agent status leak #830
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
After moving to the
await_submit
semantics of1.0.0
, we successfully closed a number ofUnit
-Agent
status mismatches. This one however snuck by: timed out agents weren't properly being marked as disconnected. Downstream this led to a few disconnects slowing down and eventually fully stopping the system from running.Nitty-gritty details
When someone actually times out on their task without submitting, Mephisto would release their agent's
TaskRunner
thread, but theAgent
's status would still bein_task
(as anAgentTimeoutError
doesn't implicitly disconnect the agent). This would lead to an incomplete cleanup, and future attempts to accept the task would hang, eventually clogging up the system entirely.The fix is rather simple: if Mephisto exits on an
AbsentAgentError
of any type (therefore ending theUnit
), before finishing that cleanup we ensure the disconnecting agent is updated toSTATUS_DISCONNECTED
.Testing
Can no longer reproduce locally, would like to test under load before merging though.