🐛[bug] Master refuses to accept agents connection #8856
Comments
Hello, I don't see any reconnect attempts in your log snippet at all. However, I'd say the agent is not supposed to fail "under high IO pressure" in the first place. What happens with the agent process? Is there anything in the agent logs, and can you please share them? Does it crash with an OOM, or is it a networking problem?
Sorry for not saving the agent logs during the recovery. What also confused me is that there was no error info on the agent side: no reconnection attempts, the determined-agent service status was still active, and the task containers were still running. Restarting the agent did not fix it; only after the master was restarted was the agent able to connect successfully.
These symptoms don't match anything we've seen before, sorry. Do you have a way to reproduce this issue? I'd like to see the logs from both master and agent around the time of the issue; otherwise there's not much we can do. As a radical solution, you can consider switching to Kubernetes, which will eliminate the master<->agent interaction.
Sure. We will try to collect more information if the issue happens again. Thanks for your reply and suggestion.
Describe the bug
The agent sometimes disappears from the master's agent list under high IO pressure. The tasks on this agent keep running, but cannot be seen from the WebUI or CLI. When trying to restart the agent service, the following error shows in the master's logs:
Reproduction Steps
Expected Behavior
The master should accept the agent's reconnection without errors.
Screenshot
None
Environment
Additional Context
No response