[Core] Ray failed to start dashboard when ray start is called by sky launch #2054

Closed
Michaelvll opened this issue Jun 9, 2023 · 1 comment · Fixed by #2055

Michaelvll commented Jun 9, 2023

Two users reported that on clusters launched by sky launch, the Ray dashboard process does not exist. Even after I manually ran ray stop and ray start again, the dashboard still failed to launch. This is a very serious issue.

A suspicious error in raylet.err:

[2023-06-09 01:52:53,319 E 9906 9971] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).

dashboard_agent.log

Michaelvll added the P0 label Jun 9, 2023
@Michaelvll

It seems both users' remote VMs had grpcio==1.48.0 installed, which breaks the Ray dashboard. After upgrading grpcio to 1.51.1, the problem goes away.
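
For anyone who wants to check a VM quickly, here is a minimal sketch of the same check that `pip freeze | grep grpcio` does, written in Python. It only knows the two data points from this thread (1.48.0 broke the dashboard, 1.51.1 worked). It does not encode Ray's actual grpcio requirement, and the helper names (`parse_version`, `classify_grpcio`, `installed_grpcio_version`) are hypothetical, not part of Ray or SkyPilot.

```python
from importlib import metadata

# The only two data points reported in this thread (not Ray's official bounds).
KNOWN_BAD = (1, 48, 0)   # dashboard agent failed with this grpcio version
KNOWN_GOOD = (1, 51, 1)  # dashboard started after upgrading to this version


def parse_version(version):
    """Turn a string like '1.51.1' into (1, 51, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def classify_grpcio(version):
    """Classify a grpcio version against the two versions mentioned in this issue."""
    parsed = parse_version(version)
    if parsed == KNOWN_BAD:
        return "known-bad"
    if parsed == KNOWN_GOOD:
        return "known-good"
    return "unknown"


def installed_grpcio_version():
    """Return the installed grpcio version string, or None if grpcio is missing."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_grpcio_version()
    if found is None:
        print("grpcio is not installed in this environment")
    else:
        print(f"grpcio=={found}: {classify_grpcio(found)}")
```

Run it with the same Python interpreter that Ray uses on the remote VM. A result of "unknown" means only that the version was not discussed here, so dashboard_agent.log is still the place to look.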
