[Question] Automatically 'close' assignments with multiple units after some time if missing agents #483

Closed
federicoruggeri opened this issue Jun 17, 2021 · 6 comments
Labels
question Further information is requested

Comments

@federicoruggeri

Let's consider a task where each assignment has multiple units (e.g. a dialogue task).
I was wondering whether it is possible (and, if so, what the best way is) to automatically mark an assignment as incomplete (or expired) after some time if not all of its units have been assigned to an agent.

An example to clarify:
task -> dialogue task
assignments -> [1, 2]
units -> [1, 2, 3, 4]

agent_1 connects and gets assigned to assignment_1, unit_1. After X minutes, no one else has connected to the application and agent_1 is still the only agent in assignment_1.

I would like to have a timeout that, once expired, marks unit_1 and unit_2 (assignment_1) as expired.

Q1: is it possible?
Q2: what's the best way?

My attempt so far (dialogue task template): I've set a timeout on the front end (React) that marks episode_done (using onMessageSend). Everything seems fine on the front end (the client gets disconnected, the interface gets updated, etc.). However, nothing seems to update the database state of the unit. It looks like something is not running because some worker has yet to connect to the assignment.

Am I missing something?

I hope that the description is sufficiently clear.
Thanks in advance!

@JackUrb
Contributor

JackUrb commented Jun 17, 2021

Hi @federicoruggeri, I'm not fully sure under what circumstances this feature would be used, but the backend doesn't update because there is no world running: we only launch the world for a dialogue task once everyone has connected:

# See if the concurrent unit is ready to launch
assignment = unit.get_assignment()
agents = assignment.get_agents()
if None in agents:
    agent.update_status(AgentState.STATUS_WAITING)
    return  # need to wait for all agents to be here to launch
# Launch the backend for this assignment
agent_infos = [self.agents[a.db_id] for a in agents if a is not None]
assign_thread = threading.Thread(
    target=self._launch_and_run_assignment,
    args=(assignment, agent_infos, channel_info.job.task_runner),
    name=f"Assignment-thread-{assignment.db_id}",
)
for agent_info in agent_infos:
    agent_info.agent.update_status(AgentState.STATUS_IN_TASK)
    agent_info.assignment_thread = assign_thread
assign_thread.start()

As for an implementation that would allow this functionality, I'm unclear on how to do it cleanly.
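One possible direction, shown only as a minimal sketch and not as anything Mephisto provides: start a timer when the first agent is seated, cancel it once every unit has an agent, and otherwise expire the whole assignment when the timer fires. Every name below (`PendingAssignment`, `WaitingRoom`, `register_agent`, `on_expire`) is hypothetical and stands in for whatever the real supervisor and database layer would use.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PendingAssignment:
    """Hypothetical stand-in for an assignment that is waiting for agents."""

    assignment_id: str
    # One slot per unit; None means no agent has claimed that unit yet.
    agents: List[Optional[str]]
    expired: bool = False
    _timer: Optional[threading.Timer] = field(default=None, repr=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)


class WaitingRoom:
    """Expires an assignment if it is not fully staffed within a timeout."""

    def __init__(
        self,
        timeout_seconds: float,
        on_expire: Callable[[PendingAssignment], None],
    ) -> None:
        self.timeout_seconds = timeout_seconds
        # Assumed callback: this is where units would be marked expired.
        self.on_expire = on_expire

    def register_agent(
        self, pending: PendingAssignment, unit_index: int, agent_id: str
    ) -> bool:
        """Seat an agent; return True once every unit has an agent."""
        with pending._lock:
            if pending.expired:
                raise RuntimeError("assignment already expired")
            pending.agents[unit_index] = agent_id
            if None not in pending.agents:
                # Everyone is here: stop the countdown so the task can launch.
                if pending._timer is not None:
                    pending._timer.cancel()
                return True
            if pending._timer is None:
                # First agent to arrive starts the countdown.
                pending._timer = threading.Timer(
                    self.timeout_seconds, self._expire, args=(pending,)
                )
                pending._timer.daemon = True
                pending._timer.start()
            return False

    def _expire(self, pending: PendingAssignment) -> None:
        with pending._lock:
            # The assignment may have filled up just before the timer fired.
            if pending.expired or None not in pending.agents:
                return
            pending.expired = True
        self.on_expire(pending)
```

In this sketch the callback runs on the timer thread, so a real version would need to make sure whatever it does to the units is safe to call from there.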

@federicoruggeri
Author

Many thanks for the very quick reply! It makes sense.
The motivation behind this request is that, as far as I can understand, there's no way to 'exit' a unit once you are in it (to free it again).
For instance, suppose agent_1 connects to an assignment with 2 units at time X, and agent_2 connects at time Y (Y >> X). Agent_1 is still counted as present, but it is unlikely that agent_1 is still active because a lot of time has passed. Basically, assignments with more than 1 unit inherently require synchronization, so I was wondering whether there's a good strategy for handling the case where synchronization is not achieved.

Please, correct me if I'm saying something wrong :D

@JackUrb
Contributor

JackUrb commented Jun 18, 2021

Hm, after a certain timeout, for synchronized (live) tasks, the first Agent should be issued a disconnect and Unit 1 should be put back into the pool (once the person leaves the page). It's possible this isn't happening, though: I've heard from others that, over the last few months, the disconnect event may not have been registered by Heroku servers.

Likely, we're not triggering this function correctly:

function handle_possible_disconnect(agent) {

If the socket isn't disconnecting and sending the close error, we'd need to find a way to catch that here instead (as a ping to a disconnected agent would fail):

Unfortunately, until I get a chance to add a local view for Heroku logs, I don't expect this to be an easy thing to debug.
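To illustrate the general idea of catching a disconnect without relying on the socket's close event, here is a minimal sketch of heartbeat-based detection: record when each agent was last heard from, and treat any agent that has been silent longer than a timeout as disconnected so its unit can be returned to the pool. This is not the router's actual code; `HeartbeatMonitor`, `record_ping`, and `reap_disconnected` are hypothetical names, and the timeout is a parameter rather than the project's real setting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last time each agent was heard from (hypothetical helper)."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        # The clock is injectable so the logic can be tested without sleeping.
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record_ping(self, agent_id: str) -> None:
        """Call whenever a ping or any other message arrives from the agent."""
        self._last_seen[agent_id] = self._clock()

    def reap_disconnected(self) -> List[str]:
        """Return and forget agents silent for longer than the timeout."""
        now = self._clock()
        stale = [
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        ]
        for agent_id in stale:
            del self._last_seen[agent_id]
        return stale
```

A periodic job could call `reap_disconnected()` and hand each returned agent to the existing disconnect handling, which keeps the detection independent of whether the hosting platform reports the socket close.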

@federicoruggeri
Author

Many thanks for the clear explanation! Do you remember where the timeout is defined? I would like to check how many seconds the system waits before returning the unit to the pool.

Thanks in advance.

@JackUrb
Contributor

JackUrb commented Jul 8, 2021

I believe the timeout should be around 15 seconds. I added a change in #489 that should cause this to trigger more consistently.

pringshia added the "question" (Further information is requested) label Aug 19, 2021
@pringshia
Contributor

Closing for now, please feel free to reopen the issue if there are any further questions.


3 participants