Fix DISCONNECTED agent status leak #830

Merged
merged 2 commits into from
Jul 15, 2022

JackUrb
Contributor

@JackUrb JackUrb commented Jul 15, 2022

Overview

After moving to the await_submit semantics of 1.0.0, we closed a number of Unit-Agent status mismatches. This one, however, slipped through: timed-out agents weren't being marked as disconnected. Downstream, even a few such disconnects would slow the system down and eventually stop it from running entirely.

Nitty-gritty details

When a worker times out on their task without submitting, Mephisto would release their agent's TaskRunner thread, but the Agent's status would remain in_task (an AgentTimeoutError doesn't implicitly disconnect the agent). This left the cleanup incomplete, and future attempts to accept the task would hang, eventually clogging up the system entirely.

The fix is rather simple: if Mephisto exits on an AbsentAgentError of any type (therefore ending the Unit), before finishing that cleanup we ensure the disconnecting agent is updated to STATUS_DISCONNECTED.
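
A minimal sketch of the pattern described above, not the actual diff in this PR. The classes, status strings, and the `run_unit` and `cleanup_unit` helpers are hypothetical stand-ins that borrow the names used in this description; Mephisto's real `Agent`, `TaskRunner`, and error classes are more involved.

```python
# Illustrative stand-ins only: these mimic names from this PR description
# but are NOT Mephisto's real implementations.
STATUS_IN_TASK = "in_task"
STATUS_DISCONNECTED = "disconnected"
STATUS_COMPLETED = "completed"  # assumed status for the happy path


class AbsentAgentError(Exception):
    """Base class for errors raised when an agent is no longer present."""


class AgentTimeoutError(AbsentAgentError):
    """Raised when an agent runs out of time without submitting."""


class Agent:
    def __init__(self, agent_id: str) -> None:
        self.agent_id = agent_id
        self.status = STATUS_IN_TASK

    def update_status(self, new_status: str) -> None:
        self.status = new_status


def cleanup_unit(agent: Agent) -> None:
    """Placeholder for releasing the thread and other per-unit resources."""


def run_unit(agent: Agent, task) -> None:
    """Run one unit; on any AbsentAgentError, mark the agent disconnected
    before cleanup so its status cannot stay stuck at in_task."""
    try:
        task(agent)
        agent.update_status(STATUS_COMPLETED)
    except AbsentAgentError:
        # A timeout does not implicitly disconnect the agent,
        # so the status is updated explicitly here.
        agent.update_status(STATUS_DISCONNECTED)
    finally:
        cleanup_unit(agent)


def timed_out_task(agent: Agent) -> None:
    raise AgentTimeoutError(f"agent {agent.agent_id} timed out")


agent = Agent("agent-1")
run_unit(agent, timed_out_task)
print(agent.status)  # disconnected
```

Catching the base `AbsentAgentError` rather than only `AgentTimeoutError` mirrors the "of any type" wording above: every way an agent can go absent ends the Unit, so every one of them should leave the agent in a terminal status.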

Testing

I can no longer reproduce the issue locally, but I would like to test under load before merging.

@JackUrb JackUrb requested a review from spencerp July 15, 2022 15:34
@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 15, 2022
@codecov-commenter

Codecov Report

Merging #830 (ae01e44) into main (dc8f72d) will increase coverage by 0.10%.
The diff coverage is 25.00%.

@@            Coverage Diff             @@
##             main     #830      +/-   ##
==========================================
+ Coverage   64.57%   64.67%   +0.10%     
==========================================
  Files         107      107              
  Lines        9259     9263       +4     
==========================================
+ Hits         5979     5991      +12     
+ Misses       3280     3272       -8     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...ephisto/abstractions/_subcomponents/task_runner.py | 78.57% <25.00%> (-1.05%) ⬇️ |
| ...tractions/architects/channels/websocket_channel.py | 76.56% <0.00%> (+8.59%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc8f72d...ae01e44.

@JackUrb
Contributor Author

JackUrb commented Jul 15, 2022

Experimental evidence from @spencerp's most recent run:
[Images: workercount, taskrunnerthreads]

The critical thing to note is the difference between the two runs before 10:30 and the one after. In the earlier runs, active agents and active threads appeared to be out of sync, with the agent count leaking upwards. After this change, the issue no longer occurs.

@JackUrb JackUrb merged commit ca18821 into main Jul 15, 2022
@JackUrb JackUrb deleted the disconnect-leak-patch branch July 15, 2022 17:52
3 participants