-
Notifications
You must be signed in to change notification settings - Fork 554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Launch] Avoid prlimit error fail the ray up #1236
Conversation
|
||
# Worker commands are needed for TPU VM Pods | ||
{%- if num_nodes > 1 or tpu_vm %} | ||
worker_start_ray_commands: | ||
- SKY_NUM_GPUS=0 && which nvidia-smi > /dev/null && SKY_NUM_GPUS=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | wc -l); | ||
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS; | ||
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id; done; | ||
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS || exit 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: what case have you seen requires || exit 1
-- i.e., when would ray start
fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a case ray start
would fail, but it is a safeguard for any possible situation when ray start
fails, especially when custom image is specified, as this start_commands
will fail before we add the last prlimit
line (bash script will exit with the return value of the last command).
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS; | ||
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id; done; | ||
ray stop; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --num-gpus=$SKY_NUM_GPUS || exit 1; | ||
which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: do we know why is raylet's PID not found? Seems like || exit 1
above would catch the issue of raylet not being started. OK as a safeguard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems PID not found issue
happens a bit frequently. I am guessing the reason could be that the string raylet/raylet
can appear in some other short-running process as an argument.
* avoid prlimit error fail the ray up * Fail the ray up if the ray up fails * surface the failed payload_str
* avoid prlimit error fail the ray up * Fail the ray up if the ray up fails * surface the failed payload_str
The current prlimit will fail sometimes due to the pid does not exist. We add
|| true
to avoid the error, as theprlimit
setting is not critical.Tested:
sky cpunode --cloud gcp