
🐛[bug] Master refuses to accept agents connection #8856

Closed
skynewborn opened this issue Feb 19, 2024 · 4 comments
Labels: bug

@skynewborn

Describe the bug

The agent sometimes disappears from the Master's agent list under high IO pressure. The tasks on that agent keep running, but they cannot be seen from the WebUI/CLI. When trying to restart the agent service, the following errors show up in the Master's logs:

<info> [2024-02-19 05:09:17] adding agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:09:17] failed to update agent start stats  component="agent" error="error: 0 rows affected on query \nINSERT INTO agent_stats (resource_pool, agent_id, slots, start_time)\nSELECT :resource_pool, :agent_id, :slots, CURRENT_TIMESTAMP\nWHERE NOT EXISTS (\n\tSELECT * FROM agent_stats WHERE agent_id = :agent_id AND end_time IS NULL\n)\n \narg &{a100-40g-pcie a1002 8}" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:09:33] agent crashed  address="192.168.1.219" component="agent" error="agent failed to reconnect by deadline" id="a1002" resource-pool="a100-40g-pcie" started="true"
<info> [2024-02-19 05:09:33] removing agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] websocket closed gracefully, awaiting reconnect: master-agent-ws-a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] draining agent: a1002  component="agent-state-state" id="a1002"
<warning> [2024-02-19 05:10:09] failed to get agent state for agent a1002  component="agent" error="agent state is not available: agent not started" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] agent connected ip: 192.168.1.219 resource pool: a100-40g-pcie slots: 8  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] adding device: cuda0 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda1 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda2 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda3 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda4 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda5 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda6 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda7 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:10:34] agent crashed  address="192.168.1.219" component="agent" error="agent failed to reconnect by deadline" id="a1002" resource-pool="a100-40g-pcie" started="true"
<info> [2024-02-19 05:10:34] removing agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
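
Note on the "0 rows affected" error above: the logged INSERT only runs when no agent_stats row for that agent is still open (end_time IS NULL), so a stale open row left over from the earlier crash would make it a no-op. A hypothetical diagnostic query, using the table and column names from the error message and the agent id a1002 from this log:

    -- Check for a stale, still-open stats row for the agent; this appears to be
    -- the row the NOT EXISTS guard in the logged INSERT checks for (assumption).
    SELECT resource_pool, agent_id, slots, start_time, end_time
    FROM agent_stats
    WHERE agent_id = 'a1002'
      AND end_time IS NULL;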

Reproduction Steps

Expected Behavior

The Master should accept the agent's reconnection without issue.

Screenshot

None

Environment

  • Version 0.28.0

Additional Context

No response

skynewborn added the bug label Feb 19, 2024
ioga (Contributor) commented Feb 19, 2024

Hello, I don't see any reconnect attempts in your log snippet at all. However, I'd say the agent is not supposed to fail in the first place "under high IO pressure".

What happens with the agent process? Is there anything in the agent logs? Can you please share them? Does it crash due to OOM, or is it a networking problem?

@skynewborn (Author)

Sorry for not saving the agent logs during the recovery. What also confused me is that there was actually no error info on the agent side: no reconnection attempts, the determined-agent service status was still active, and the task containers were still running. Restarting the agent did not fix it; only after the master was restarted was the agent able to connect to the master successfully.
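
A speculative aside on the agent_stats error in the master log: if a stale open row is what keeps the conditional INSERT at 0 rows affected (not confirmed by this thread), a hypothetical, untested workaround would be to close that row by hand; table and column names are taken from the error message, and a1002 is simply the agent id from this log.

    -- Hypothetical, untested sketch: close any stale open stats row so the
    -- NOT EXISTS guard in the logged INSERT can pass again. Not confirmed as the
    -- root cause of the reconnect failure, nor as a supported fix.
    UPDATE agent_stats
    SET end_time = CURRENT_TIMESTAMP
    WHERE agent_id = 'a1002'
      AND end_time IS NULL;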

ioga (Contributor) commented Feb 20, 2024

These symptoms don't match anything we've seen before, sorry.

Do you have a way to reproduce this issue? I'd like to see the logs from both master and agent around the time of the issue; otherwise there's not much we can do.

As a radical solution, you could consider switching to Kubernetes, which would eliminate the master<->agent interaction.

@skynewborn (Author)

Sure. We will try to collect more information if the issue happens again. Thanks for your reply and suggestion.
