Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ping-pong disconnect to node router #489

Merged
merged 1 commit into from
Jul 6, 2021

Conversation

JackUrb
Copy link
Contributor

@JackUrb JackUrb commented Jun 30, 2021

Overview

The node server running on Heroku hasn't been able to detect disconnects in the same way that it does when running locally. This leads to delays on marking tasks as disconnected, expiring tasks, and generally a mismatch between Agent and Unit status on MTurk compared to Mephisto.

This PR moves to record ping times on the router to allow it to determine without a hard disconnect whether an agent has disconnected, currently set to not having received a heartbeat in 15 seconds.

Resolves #485.

Implementation details

  • Added last_ping to the LocalAgentState of the node router, and update it whenever a HEARTBEAT is received.
  • Check liveliness on the router during the periodic status checks from the main Mephisto server, updating relevant agents to STATUS_DISCONNECT when relevant.

Additionally, found and resolved the following bugs:

  • Disconnect events which led to expiring tasks would prevent Mephisto from shutting down, even if there were no tasks left (as the task had expired). This is cleaned up by ensuring that, after a Unit transitions to EXPIRED it cannot move back to ASSIGNED.
  • Fixed reconnecting to the node router after a disconnect and receiving the correct status

Testing

Launched a server on Heroku with the parlai chat task, connected to it as two different agents, and then disconnected one. Ensured that, while this disconnect wasn't detected before, it is now.

Launched a static react task to ensure connection functionality on this hadn't changed (as no heartbeats are sent). Connected, and found myself still connected after 15 seconds.

Ensured both tasks shut down properly when completed.

@JackUrb JackUrb requested a review from pringshia June 30, 2021 21:13
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 30, 2021
@pringshia
Copy link
Contributor

pringshia commented Jul 1, 2021

Looks good to me. Thanks for the thorough PR description. I see quite a few new statuses introduced (and in the diff I don't see some of them being used) - is that an intentional omission?

@JackUrb JackUrb merged commit c4c0706 into master Jul 6, 2021
@JackUrb JackUrb deleted the live-disconnects-on-heroku branch July 6, 2021 19:14
@JackUrb JackUrb added this to the 🚀 Mephisto 1.0 milestone Jan 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Probably getting null Worker ID in the frontend
3 participants