Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Launch] Avoid prlimit error fail the ray up #1236

Merged
merged 3 commits into from
Oct 13, 2022
Merged

Conversation

Michaelvll
Copy link
Collaborator

The current prlimit will fail sometimes due to the pid does not exist. We add || true to avoid the error, as the prlimit setting is not critical.

Tested:

  • sky cpunode --cloud gcp


# Worker commands are needed for TPU VM Pods
{%- if num_nodes > 1 or tpu_vm %}
worker_start_ray_commands:
- SKY_NUM_GPUS=0 && which nvidia-smi > /dev/null && SKY_NUM_GPUS=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | wc -l);
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id; done;
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS || exit 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: what case have you seen requires || exit 1 -- i.e., when would ray start fail?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a case ray start would fail, but it is a safeguard for any possible situation when ray start fails, especially when custom image is specified, as this start_commands will fail before we add the last prlimit line (bash script will exit with the return value of the last command).

ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id; done;
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS || exit 1;
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: do we know why is raylet's PID not found? Seems like || exit 1 above would catch the issue of raylet not being started. OK as a safeguard.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems PID not found issue happens a bit frequently. I am guessing the reason could be that the string raylet/raylet can appear in some other short-running process as an argument.

@Michaelvll Michaelvll merged commit f3d506a into master Oct 13, 2022
@Michaelvll Michaelvll deleted the robust-template branch October 13, 2022 19:06
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
* avoid prlimit error fail the ray up

* Fail the ray up if the ray up fails

* surface the failed payload_str
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
* avoid prlimit error fail the ray up

* Fail the ray up if the ray up fails

* surface the failed payload_str
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants