
🐛[bug] Master refuses to accept agents connection #8856

Closed
skynewborn opened this issue Feb 19, 2024 · 4 comments
Labels: bug

@skynewborn

Describe the bug

The agent sometimes disappears from the Master's agent list under high IO pressure. The tasks on that agent keep running, but they cannot be seen from the WebUI/CLI. When trying to restart the agent service, the following errors show up in the Master's logs:

<info> [2024-02-19 05:09:17] adding agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:09:17] failed to update agent start stats  component="agent" error="error: 0 rows affected on query \nINSERT INTO agent_stats (resource_pool, agent_id, slots, start_time)\nSELECT :resource_pool, :agent_id, :slots, CURRENT_TIMESTAMP\nWHERE NOT EXISTS (\n\tSELECT * FROM agent_stats WHERE agent_id = :agent_id AND end_time IS NULL\n)\n \narg &{a100-40g-pcie a1002 8}" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:09:33] agent crashed  address="192.168.1.219" component="agent" error="agent failed to reconnect by deadline" id="a1002" resource-pool="a100-40g-pcie" started="true"
<info> [2024-02-19 05:09:33] removing agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] websocket closed gracefully, awaiting reconnect: master-agent-ws-a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] draining agent: a1002  component="agent-state-state" id="a1002"
<warning> [2024-02-19 05:10:09] failed to get agent state for agent a1002  component="agent" error="agent state is not available: agent not started" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] agent connected ip: 192.168.1.219 resource pool: a100-40g-pcie slots: 8  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<info> [2024-02-19 05:10:09] adding device: cuda0 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda1 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda2 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda3 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda4 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda5 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda6 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding device: cuda7 (NVIDIA A100-PCIE-40GB) on a1002  component="agent-state-state" id="a1002"
<info> [2024-02-19 05:10:09] adding agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
<error> [2024-02-19 05:10:34] agent crashed  address="192.168.1.219" component="agent" error="agent failed to reconnect by deadline" id="a1002" resource-pool="a100-40g-pcie" started="true"
<info> [2024-02-19 05:10:34] removing agent: a1002  component="agent" id="a1002" resource-pool="a100-40g-pcie"
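
Note on the "0 rows affected" error above: the logged INSERT only runs when no agent_stats row for that agent is still open (end_time IS NULL), so a stale open row left over from the earlier crash would make it a no-op. A hypothetical diagnostic query, using the table and column names from the error message and the agent id a1002 from this log:

    -- Check for a stale, still-open stats row for the agent; this appears to be
    -- the row the NOT EXISTS guard in the logged INSERT checks for (assumption).
    SELECT resource_pool, agent_id, slots, start_time, end_time
    FROM agent_stats
    WHERE agent_id = 'a1002'
      AND end_time IS NULL;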

Reproduction Steps

Expected Behavior

The Master should accept the agent's reconnection without issue.

Screenshot

None

Environment

  • Version 0.28.0

Additional Context

No response

skynewborn added the bug label Feb 19, 2024
ioga (Contributor) commented Feb 19, 2024

Hello, I don't see any reconnect attempts in your log snippet at all. However, I'd say the agent is not supposed to fail in the first place "under high IO pressure".

What happens with the agent process? Is there anything in the agent logs? Can you please share them? Does it crash due to OOM, or is it a networking problem?

@skynewborn (Author)

Sorry for not saving the agent logs during the recovery. What also confused me is that there was actually no error info on the agent side: no reconnection attempts, the determined-agent service status was still active, and the task containers were still running. Restarting the agent did not fix it; only after the master was restarted was the agent able to connect to the master successfully.
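
A speculative aside on the agent_stats error in the master log: if a stale open row is what keeps the conditional INSERT at 0 rows affected (not confirmed by this thread), a hypothetical, untested workaround would be to close that row by hand; table and column names are taken from the error message, and a1002 is simply the agent id from this log.

    -- Hypothetical, untested sketch: close any stale open stats row so the
    -- NOT EXISTS guard in the logged INSERT can pass again. Not confirmed as the
    -- root cause of the reconnect failure, nor as a supported fix.
    UPDATE agent_stats
    SET end_time = CURRENT_TIMESTAMP
    WHERE agent_id = 'a1002'
      AND end_time IS NULL;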

ioga (Contributor) commented Feb 20, 2024

These symptoms don't match anything we've seen before, sorry.

Do you have a way to reproduce this issue? I'd like to see the logs from both master and agent around the time of the issue; otherwise there's not much we can do.

As a radical solution, you could consider switching to Kubernetes, which would eliminate the master<->agent interaction.

@skynewborn (Author)

Sure. We will try to collect more information if the issue happens again. Thanks for your reply and suggestion.
