[spot] Fix multiprocessing signal handling in spot controller #1745
Conversation
This is a fascinating finding @suquark! Does that mean that if the teardown of the existing cluster takes longer than 10s, the spot cluster will be leaked because the controller process will have crashed? Can we check whether adding a time.sleep before the cluster termination causes the leakage in the previous implementation? I will submit the review soon.
Thanks for the excellent finding @suquark! I am testing it with pytest tests/test_smoke.py --managed-spot as well. Left two comments.
Yes, it leaks the spot instance in the new provisioner implementation. I can reproduce it (in the new provisioner) consistently by adding a time.sleep before the cluster termination.
It seems the current PR does not work in my test: the spot job was not canceled after more than an hour. By checking the
Interesting, it indeed works on my laptop with AWS. Let me test GCP manually. Something may be broken deeply, and we need to understand why it does/does not work.
Context: I found that the previous commit works when the spot controller is running in AWS, but it stops working in GCP. It is a total mystery to me. The only difference should be the operating system (GCP uses an older Debian distribution). It seems I have fixed this issue. The key idea is to terminate the cluster in the controller daemon instead of in the controller, since signal handling gets messy in this case.
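A minimal sketch of that idea (illustrative only; run, _run_controller, and terminate_cluster are hypothetical names, not the actual sky/spot/controller.py code): run the controller logic in a child process, and do the cluster cleanup in the parent after killing it, rather than relying on signal handlers inside the controller process.

import multiprocessing


def _run_controller(job_id: int):
    ...  # launch and monitor the spot cluster


def terminate_cluster(cluster_name: str):
    ...  # hypothetical: tear down the spot cluster


def run(job_id: int, cluster_name: str):
    controller_process = multiprocessing.Process(target=_run_controller,
                                                 args=(job_id,))
    controller_process.start()
    try:
        controller_process.join()
    except (KeyboardInterrupt, SystemExit):
        # Signal delivery into the controller is unreliable across clouds/OSes,
        # so kill it directly instead of asking it to clean up after itself.
        controller_process.kill()
    finally:
        # Clean up the cluster from the parent, outside the controller process.
        terminate_cluster(cluster_name)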
Left some early questions : )
@Michaelvll This PR is ready for review again. I have tested it against both AWS & GCP. Thank you!
Thanks for catching this important issue @suquark! The PR looks mostly good to me. Left several final questions.
# So we can only enable daemon after we no longer need to
# start daemon processes like Ray.
Where do we use Ray? I thought most of the places where we use Ray go through subprocess.run.
sky launch uses Ray as a subprocess
Oh, do you mean that the subprocess.Popen in run_with_log will make it fail?
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 1443 to 1464 in 77e6f31
returncode, stdout, stderr = log_lib.run_with_log(
    [sys.executable, script_path],
    log_abs_path,
    stream_logs=False,
    start_streaming_at='Shared connection to',
    line_processor=log_utils.RayUpLineProcessor(),
    # Reduce BOTO_MAX_RETRIES from 12 to 5 to avoid long hanging
    # time during 'ray up' if insufficient capacity occurs.
    env=dict(
        os.environ,
        BOTO_MAX_RETRIES='5',
        # Use environment variables to disable the ray usage collection
        # (to avoid overheads and potential issues with the usage)
        # as sdk does not take the argument for disabling the usage
        # collection.
        RAY_USAGE_STATS_ENABLED='0'),
    require_outputs=True,
    # Disable stdin to avoid ray outputs mess up the terminal with
    # misaligned output when multithreading/multiprocessing are used
    # Refer to: https://github.com/ray-project/ray/blob/d462172be7c5779abf37609aed08af112a533e1e/python/ray/autoscaler/_private/subprocess_output_util.py#L264 # pylint: disable=line-too-long
    stdin=subprocess.DEVNULL)
return returncode, stdout, stderr
Yes, daemon=True + subprocess.Popen => process leakage.
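A minimal repro sketch of that leakage (my assumption about the mechanism, not code from this PR): a daemonic multiprocessing.Process that launches a subprocess gets terminated abruptly when the parent exits, so the subprocess it spawned keeps running.

import multiprocessing
import subprocess
import time


def worker():
    # Stand-in for run_with_log launching `ray up` via subprocess.Popen.
    subprocess.Popen(['sleep', '1000'])
    time.sleep(1000)


if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, daemon=True)
    p.start()
    time.sleep(1)
    # The parent exits here; the daemonic worker is terminated automatically,
    # but the `sleep 1000` it spawned is orphaned and keeps running
    # (check with `pgrep -f "sleep 1000"`).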
That's worth a comment. : )
sky/spot/controller.py
Outdated
# NOTE: it is ok to kill or join a killed process.
controller_process.kill()
# Kill the children processes launched by log_lib.run_with_log.
subprocess_utils.kill_children_processes()
Question: Will this work when the daemon is not enabled for the controller_process?
It will work. It was never a daemon process in the original PR, so it does not make sense that it would not work here.
Well, I think previously the children processes were spawned with subprocess.Popen directly from the current process, instead of having an additional multiprocessing.Process in between, i.e. previously kill_children_processes was called directly within the controller_process instead of by the outer caller. I am not sure if that will make any difference.
From my observation it works; also, as the docstring below suggests, it works for general *nix processes:
"""Return the children of this process as a list of Process
instances, pre-emptively checking whether PID has been reused.
If *recursive* is True return all the parent descendants.
Example (A == this process):
 A ─┐
    │
    ├─ B (child) ─┐
    │             └─ X (grandchild) ─┐
    │                                └─ Y (great grandchild)
    ├─ C (child)
    └─ D (child)
>>> import psutil
>>> p = psutil.Process()
>>> p.children()
B, C, D
>>> p.children(recursive=True)
B, X, Y, C, D
Note that in the example above if process X disappears
process Y won't be listed as the reference to process A
is lost.
"""
Ahh, got it. Thanks! I was worried that the process would not appear as a child process when daemon is not set, but it seems that is not a problem. Tested it with the following code, and it seems to work well:
import multiprocessing
import time
import os
import subprocess

from sky.utils import subprocess_utils


def worker():
    subprocess.run('sleep 1000', shell=True)
    time.sleep(10000)


if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    p = multiprocessing.Process(target=worker)
    p.start()
    print(os.getpid())
    x = input('input to continue\n')
    print('kill')
    p.kill()
    subprocess_utils.kill_children_processes()
    time.sleep(10)
    p.join()
Thanks for catching the issue @suquark! I just tested it out manually and it works well for me. I am now testing it with the smoke test:
- pytest tests/test_smoke.py --managed-spot
The changes look good to me! Left some comments about the sleep time.
All related tests on both AWS + GCP passed.
Awesome find @suquark! This may be unrelated to this PR; posting to check with you. A UX thing. Repro:
On 322ffad
On 464b5db (this PR)
There's an extra
Yes, this is caused by us now setting the CANCELLED state after the cluster is cleaned up, which makes the spot log streaming misclassify the job as preempted. It will be fixed by #1785.
Previously, we sent an interruption signal to the controller process and the controller process handled cleanup. However, we found that the behavior differs from cloud to cloud (e.g., GCP ignores 'SIGINT'). A possible reason is https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint. In any case, a clean solution is to kill the controller process directly and then clean up the cluster state.
Tested (run the relevant ones):
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py --managed-spot (both AWS and GCP as spot controllers)
- bash tests/backward_comaptibility_tests.sh