
[spot] Fix multiprocessing signal handling in spot controller #1745

Merged: 3 commits into master from fix_multiprocessing on Mar 16, 2023

Conversation

@suquark (Collaborator) commented Mar 5, 2023

Previously, we sent an interrupt signal to the controller process, and the controller process handled the cleanup. However, we found that the behavior differs from cloud to cloud (e.g., GCP ignores SIGINT). A possible reason is https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint. In any case, a cleaner solution is to kill the controller process directly and then clean up the cluster state.
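
A minimal sketch of this approach, with hypothetical helper names (_monitor_job, _terminate_cluster) standing in for the actual functions in sky/spot/controller.py:

import multiprocessing
import time


def _monitor_job(job_id: int) -> None:
    """Hypothetical controller loop that watches the spot job status."""
    while True:
        time.sleep(20)  # poll the job status periodically


def _terminate_cluster(job_id: int) -> None:
    """Hypothetical cleanup that tears down the spot cluster and updates its state."""
    print(f'Terminating cluster for spot job {job_id}')


def cancel(controller_process: multiprocessing.Process, job_id: int) -> None:
    # Old approach: send SIGINT and let the controller's KeyboardInterrupt
    # handler clean up; that turned out to be unreliable across clouds.
    # New approach: kill the controller process outright, then clean up the
    # cluster state from the calling process, where no signal-handling race
    # can interfere.
    controller_process.kill()
    controller_process.join()
    _terminate_cluster(job_id)


if __name__ == '__main__':
    proc = multiprocessing.Process(target=_monitor_job, args=(1,))
    proc.start()
    time.sleep(2)
    cancel(proc, 1)

The point is that cancellation no longer depends on a signal handler running inside the controller process at all.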

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py --managed-spot (both AWS and GCP as spot controllers)
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@suquark requested a review from Michaelvll on March 5, 2023 06:22
@Michaelvll added the P0 label on Mar 5, 2023
@Michaelvll (Collaborator) commented Mar 5, 2023

This is a fascinating finding @suquark! Does that mean that if the teardown of the existing cluster takes longer than 10s, the spot cluster will be leaked because the controller process will have crashed? Can we check whether adding a time.sleep before the cluster termination causes the leakage in the previous implementation? I will submit the review soon.

@Michaelvll (Collaborator) left a review:
Thanks for the excellent finding @suquark! I am testing it with pytest tests/test_smoke.py --managed-spot as well. Left two comments.

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@suquark (Collaborator, Author) commented Mar 5, 2023

> This is a fascinating finding @suquark! Does that mean that if the teardown of the existing cluster takes longer than 10s, the spot cluster will be leaked because the controller process will have crashed? Can we check whether adding a time.sleep before the cluster termination causes the leakage in the previous implementation? I will submit the review soon.

Yes, it leaks the spot instance in the new provisioner implementation. I can reproduce it (in the new provisioner) consistently by adding time.sleep(60) (actually less than 60s usually works; sometimes sleeping for 10s is enough).
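
For illustration, a standalone sketch of the failure mode discussed here (not SkyPilot code; the 10-second grace window and all names are assumptions): if cancellation sends SIGINT and then stops waiting for the controller after a fixed window, a teardown that outlives that window is cut short and the cluster it was removing is leaked.

import multiprocessing
import os
import signal
import time

GRACE_PERIOD = 10  # seconds; assumed, mirroring the ~10s window mentioned above


def controller() -> None:
    try:
        while True:
            time.sleep(1)  # pretend to monitor the spot job
    except KeyboardInterrupt:
        print('cleanup started')
        time.sleep(60)  # stand-in for a slow cluster teardown
        print('cleanup finished')  # never reached if killed mid-teardown


if __name__ == '__main__':
    p = multiprocessing.Process(target=controller)
    p.start()
    time.sleep(2)
    os.kill(p.pid, signal.SIGINT)  # old-style cancellation via SIGINT
    p.join(GRACE_PERIOD)           # only wait for the grace window
    if p.is_alive():
        p.kill()                   # teardown still running: cluster leaked
    p.join()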

@Michaelvll (Collaborator) commented Mar 6, 2023

It seems the current PR does not work, based on my test:

  1. sky spot launch --cloud gcp --cpus 2 "echo hi; sleep 10000"
  2. sky spot cancel -a
  3. watch -n 10 "sky spot queue"

[screenshot of the sky spot queue output]

It seems the spot job was not cancelled after more than an hour.

By checking the sky spot logs --controller, it seems the SIGINT was sent, but did not trigger the KeyboardInterrupt exception:

(sky-2a29-gcpuser, pid=16315) I 03-05 21:37:49 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:09 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:11 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:11 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:31 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:33 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:33 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:53 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:55 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:55 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:15 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:16 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:16 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:37 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:38 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:38 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:43 controller.py:310] Cancelling spot job 360...
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:43 controller.py:312] Sending SIGINT to controller process 16400
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:58 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:00 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:00 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:20 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:22 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:22 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:42 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:44 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:44 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:04 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:06 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:06 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:26 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:28 spot_utils.py:76] Job status: JobStatus.RUNNING

@suquark (Collaborator, Author) commented Mar 6, 2023

Interesting, it does work on my laptop with AWS. Let me test GCP manually. Something may be broken at a deeper level, and we need to understand why it does or does not work.

> It seems the current PR does not work, based on my test:
>
>   1. sky spot launch --cloud gcp --cpus 2 "echo hi; sleep 10000"
>   2. sky spot cancel -a
>   3. watch -n 10 "sky spot queue"
>
> It seems the spot job was not cancelled after more than an hour.

@suquark (Collaborator, Author) commented Mar 6, 2023

Context: I found that the previous commit works when the spot controller runs on AWS, but it stops working on GCP. It is a total mystery to me. The only difference should be the operating system (GCP uses an older Debian distribution).

It seems I have fixed this issue. The key idea is to terminate the cluster in the controller daemon instead of in the controller, since signal handling gets messy in that case.

@suquark requested a review from Michaelvll on March 6, 2023 21:01
@Michaelvll (Collaborator) left a review:

Left some early questions : )

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@suquark requested a review from Michaelvll on March 12, 2023 01:57
@suquark (Collaborator, Author) commented Mar 12, 2023

@Michaelvll This PR is ready for review again. I have tested it against both AWS & GCP. Thank you!

@Michaelvll (Collaborator) left a review:

Thanks for catching this important issue @suquark! The PR looks mostly good to me. Left several final questions.

[Review comment on sky/spot/controller.py (outdated, resolved)]
Comment on lines +302 to +290
# So we can only enable daemon after we no longer need to
# start daemon processes like Ray.
@Michaelvll (Collaborator):

Where do we use Ray? I thought most of the places where we use Ray go through subprocess.run.

@suquark (Collaborator, Author):

sky launch uses Ray as a subprocess

@Michaelvll (Collaborator):

Oh, do you mean that the subprocess.Popen in run_with_log will make it fail?

returncode, stdout, stderr = log_lib.run_with_log(
    [sys.executable, script_path],
    log_abs_path,
    stream_logs=False,
    start_streaming_at='Shared connection to',
    line_processor=log_utils.RayUpLineProcessor(),
    # Reduce BOTO_MAX_RETRIES from 12 to 5 to avoid long hanging
    # time during 'ray up' if insufficient capacity occurs.
    env=dict(
        os.environ,
        BOTO_MAX_RETRIES='5',
        # Use environment variables to disable the ray usage collection
        # (to avoid overheads and potential issues with the usage)
        # as sdk does not take the argument for disabling the usage
        # collection.
        RAY_USAGE_STATS_ENABLED='0'),
    require_outputs=True,
    # Disable stdin to avoid ray outputs mess up the terminal with
    # misaligned output when multithreading/multiprocessing are used
    # Refer to: https://github.com/ray-project/ray/blob/d462172be7c5779abf37609aed08af112a533e1e/python/ray/autoscaler/_private/subprocess_output_util.py#L264 # pylint: disable=line-too-long
    stdin=subprocess.DEVNULL)
return returncode, stdout, stderr

@suquark (Collaborator, Author):

Yes, daemon=True + subprocess.Popen => process leakage.
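
To make that concrete, here is a standalone demo (not SkyPilot code) of what daemon=True + subprocess.Popen => process leakage means: when the daemonic worker is terminated as the main process exits, the grandchild started with Popen is left running.

import multiprocessing
import subprocess
import time


def worker() -> None:
    # Stand-in for log_lib.run_with_log launching e.g. `ray up` via Popen.
    subprocess.Popen(['sleep', '1000'])
    time.sleep(1000)


if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, daemon=True)
    p.start()
    time.sleep(1)
    # The main process exits here: multiprocessing terminates the daemonic
    # worker, but the `sleep 1000` grandchild is not cleaned up and keeps
    # running (check with `pgrep -f "sleep 1000"` afterwards).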

@Michaelvll (Collaborator):

That's worth a comment. : )

[Review comment on sky/spot/controller.py (outdated, resolved)]
# NOTE: it is ok to kill or join a killed process.
controller_process.kill()
# Kill the children processes launched by log_lib.run_with_log.
subprocess_utils.kill_children_processes()
@Michaelvll (Collaborator):

Question: Will this work when the daemon is not enabled for the controller_process?

@suquark (Collaborator, Author):

It will work. It was never a daemon process in the original PR, so it does not make sense that it would not work here.

@Michaelvll (Collaborator):

Well, I think previously the child processes were spawned with subprocess.Popen directly from the current process, without an additional multiprocessing.Process in between; i.e., previously kill_children_processes was called directly within the controller_process instead of by the outer caller. I am not sure whether that makes any difference.

@suquark (Collaborator, Author):

From my observation it works. Also, as the psutil docstring suggests, it handles general *nix process trees:

        """Return the children of this process as a list of Process
        instances, pre-emptively checking whether PID has been reused.
        If *recursive* is True return all the parent descendants.

        Example (A == this process):

         A ─┐
            │
            ├─ B (child) ─┐
            │             └─ X (grandchild) ─┐
            │                                └─ Y (great grandchild)
            ├─ C (child)
            └─ D (child)

        >>> import psutil
        >>> p = psutil.Process()
        >>> p.children()
        B, C, D
        >>> p.children(recursive=True)
        B, X, Y, C, D

        Note that in the example above if process X disappears
        process Y won't be listed as the reference to process A
        is lost.
        """

@Michaelvll (Collaborator):

Ahh, got it. Thanks! I was worried that the process would not appear as a child process when daemon is not set, but it seems that is not a problem. Tested it with the following code and it seems to work well:

import multiprocessing
import time
import os
import subprocess

from sky.utils import subprocess_utils

def worker():
    subprocess.run('sleep 1000', shell=True)
    time.sleep(10000)

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    p = multiprocessing.Process(target=worker)
    p.start()
    print(os.getpid())
    x = input('input to continue\n')
    print('kill')
    p.kill()
    subprocess_utils.kill_children_processes()
    time.sleep(10)
    p.join()

@suquark requested a review from Michaelvll on March 14, 2023 23:22
@Michaelvll (Collaborator) left a review:

Thanks for catching the issue @suquark! I just tested it out manually and it works well for me. I am now testing it with the smoke test:

  • pytest tests/test_smoke.py --managed-spot

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@Michaelvll (Collaborator) left a review:

The changes look good to me! Left some comments about the sleep time.

[Review comments on tests/test_smoke.py and sky/spot/controller.py (resolved)]
@suquark force-pushed the fix_multiprocessing branch from 14aa49e to 02b7690 on March 16, 2023 06:12
@suquark force-pushed the fix_multiprocessing branch from 02b7690 to 39d3dee on March 16, 2023 08:44
@suquark changed the title from "[spot] Fix multiprocessing signal handling" to "[spot] Fix multiprocessing signal handling in spot controller" on Mar 16, 2023
@suquark (Collaborator, Author) commented Mar 16, 2023

All related tests on both AWS and GCP passed.

@suquark merged commit 464b5db into master on Mar 16, 2023
@suquark deleted the fix_multiprocessing branch on March 16, 2023 08:59
@concretevitamin (Member) commented:
Awesome find @suquark! This may be unrelated to this PR; posting to check with you. A UX thing:

Repro

  • sky spot launch sleep 600 --cloud gcp
  • Wait until the spinner indicates the job has actually started, then run sky spot cancel -ay in another window

On 322ffad

I 03-16 08:10:29 cloud_vm_ray_backend.py:2597] Job submitted with Job ID: 18
I 03-16 15:12:53 log_lib.py:395] Start streaming logs for spot job 18.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.182.0.47']
Shared connection to 34.125.208.137 closed.
I 03-16 15:13:46 spot_utils.py:341] Logs finished for job 18 (status: CANCELLED).

On 464b5db (this PR)

I 03-16 07:59:08 cloud_vm_ray_backend.py:2597] Job submitted with Job ID: 16
I 03-16 15:01:56 log_lib.py:395] Start streaming logs for spot job 16.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.182.0.43']
Shared connection to 34.125.215.201 closed.
I 03-16 15:02:07 spot_utils.py:314] The job is preempted.
I 03-16 15:03:07 spot_utils.py:341] Logs finished for job 16 (status: CANCELLED).

There is an extra "The job is preempted." message.

@Michaelvll (Collaborator) commented:
Yes, this is caused by us now setting the CANCELLED state only after the cluster is cleaned up, which makes the spot log streaming misclassify the job as preempted. It will be fixed by #1785.
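
For context, a rough illustration (hypothetical names, not the actual spot_utils.py code) of the misclassification: the log streamer notices the cluster is gone before the job record has been marked CANCELLED, so it first assumes a preemption and only afterwards reports the final status.

def report_after_cluster_gone(job_id: int, job_is_terminal: bool,
                              final_status: str) -> None:
    # With this PR the CANCELLED state is written only after the cluster
    # cleanup finishes, so at this point the job still looks non-terminal
    # and the preemption message is printed.
    if not job_is_terminal:
        print('The job is preempted.')
    print(f'Logs finished for job {job_id} (status: {final_status}).')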
