
[spot] Fix multiprocessing signal handling in spot controller #1745

Merged: 3 commits into master from fix_multiprocessing on Mar 16, 2023

Conversation

@suquark (Collaborator) commented Mar 5, 2023

Previously, we sent an interrupt signal to the controller process, and the controller process handled the cleanup. However, we found that the behavior differs from cloud to cloud (e.g., GCP ignores SIGINT). A possible reason is https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint. In any case, a cleaner solution is to kill the controller process directly and then clean up the cluster state.
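
A minimal sketch of this approach, with hypothetical helper names (_monitor_job, _terminate_cluster) standing in for the actual functions in sky/spot/controller.py:

import multiprocessing
import time


def _monitor_job(job_id: int) -> None:
    """Hypothetical controller loop that watches the spot job status."""
    while True:
        time.sleep(20)  # poll the job status periodically


def _terminate_cluster(job_id: int) -> None:
    """Hypothetical cleanup that tears down the spot cluster and updates its state."""
    print(f'Terminating cluster for spot job {job_id}')


def cancel(controller_process: multiprocessing.Process, job_id: int) -> None:
    # Old approach: send SIGINT and let the controller's KeyboardInterrupt
    # handler clean up; that turned out to be unreliable across clouds.
    # New approach: kill the controller process outright, then clean up the
    # cluster state from the calling process, where no signal-handling race
    # can interfere.
    controller_process.kill()
    controller_process.join()
    _terminate_cluster(job_id)


if __name__ == '__main__':
    proc = multiprocessing.Process(target=_monitor_job, args=(1,))
    proc.start()
    time.sleep(2)
    cancel(proc, 1)

The point is that cancellation no longer depends on a signal handler running inside the controller process at all.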

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py --managed-spot (both AWS and GCP as spot controllers)
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@suquark requested a review from Michaelvll on March 5, 2023 06:22
@Michaelvll added the P0 label on Mar 5, 2023
@Michaelvll (Collaborator) commented Mar 5, 2023

This is a fascinating finding @suquark! Does that mean that if the teardown of the existing cluster takes longer than 10s, the spot cluster will be leaked because the controller process will have crashed? Can we check whether adding a time.sleep before the cluster termination causes the leakage in the previous implementation? I will submit the review soon.

@Michaelvll (Collaborator) left a review:
Thanks for the excellent finding @suquark! I am testing it with pytest tests/test_smoke.py --managed-spot as well. Left two comments.

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@suquark (Collaborator, Author) commented Mar 5, 2023

> This is a fascinating finding @suquark! Does that mean that if the teardown of the existing cluster takes longer than 10s, the spot cluster will be leaked because the controller process will have crashed? Can we check whether adding a time.sleep before the cluster termination causes the leakage in the previous implementation? I will submit the review soon.

Yes, it leaks the spot instance in the new provisioner implementation. I can reproduce it (in the new provisioner) consistently by adding time.sleep(60) (actually less than 60s usually works; sometimes sleeping for 10s is enough).
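
For illustration, a standalone sketch of the failure mode discussed here (not SkyPilot code; the 10-second grace window and all names are assumptions): if cancellation sends SIGINT and then stops waiting for the controller after a fixed window, a teardown that outlives that window is cut short and the cluster it was removing is leaked.

import multiprocessing
import os
import signal
import time

GRACE_PERIOD = 10  # seconds; assumed, mirroring the ~10s window mentioned above


def controller() -> None:
    try:
        while True:
            time.sleep(1)  # pretend to monitor the spot job
    except KeyboardInterrupt:
        print('cleanup started')
        time.sleep(60)  # stand-in for a slow cluster teardown
        print('cleanup finished')  # never reached if killed mid-teardown


if __name__ == '__main__':
    p = multiprocessing.Process(target=controller)
    p.start()
    time.sleep(2)
    os.kill(p.pid, signal.SIGINT)  # old-style cancellation via SIGINT
    p.join(GRACE_PERIOD)           # only wait for the grace window
    if p.is_alive():
        p.kill()                   # teardown still running: cluster leaked
    p.join()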

@Michaelvll (Collaborator) commented Mar 6, 2023

It seems the current PR does not work, based on my test:

  1. sky spot launch --cloud gcp --cpus 2 "echo hi; sleep 10000"
  2. sky spot cancel -a
  3. watch -n 10 "sky spot queue"

[screenshot of the sky spot queue output]

It seems the spot job was not cancelled after more than an hour.

By checking the sky spot logs --controller, it seems the SIGINT was sent, but did not trigger the KeyboardInterrupt exception:

(sky-2a29-gcpuser, pid=16315) I 03-05 21:37:49 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:09 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:11 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:11 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:31 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:33 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:33 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:53 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:55 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:38:55 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:15 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:16 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:16 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:37 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:38 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:38 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:43 controller.py:310] Cancelling spot job 360...
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:43 controller.py:312] Sending SIGINT to controller process 16400
(sky-2a29-gcpuser, pid=16315) I 03-05 21:39:58 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:00 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:00 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:20 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:22 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:22 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:42 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:44 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:40:44 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:04 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:06 spot_utils.py:76] Job status: JobStatus.RUNNING
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:06 spot_utils.py:79] ==================================
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:26 spot_utils.py:70] === Checking the job status... ===
(sky-2a29-gcpuser, pid=16315) I 03-05 21:41:28 spot_utils.py:76] Job status: JobStatus.RUNNING

@suquark (Collaborator, Author) commented Mar 6, 2023

Interesting, it does work on my laptop with AWS. Let me test GCP manually. Something may be broken at a deeper level, and we need to understand why it does or does not work.

> It seems the current PR does not work, based on my test:
>
>   1. sky spot launch --cloud gcp --cpus 2 "echo hi; sleep 10000"
>   2. sky spot cancel -a
>   3. watch -n 10 "sky spot queue"
>
> It seems the spot job was not cancelled after more than an hour.

@suquark (Collaborator, Author) commented Mar 6, 2023

Context: I found that the previous commit works when the spot controller runs on AWS, but it stops working on GCP. It is a total mystery to me. The only difference should be the operating system (GCP uses an older Debian distribution).

It seems I have fixed this issue. The key idea is to terminate the cluster in the controller daemon instead of in the controller, since signal handling gets messy in that case.

@suquark requested a review from Michaelvll on March 6, 2023 21:01
@Michaelvll (Collaborator) left a review:

Left some early questions : )

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@suquark requested a review from Michaelvll on March 12, 2023 01:57
@suquark (Collaborator, Author) commented Mar 12, 2023

@Michaelvll This PR is ready for review again. I have tested it against both AWS & GCP. Thank you!

@Michaelvll (Collaborator) left a review:

Thanks for catching this important issue @suquark! The PR looks mostly good to me. Left several final questions.

[Review comment on sky/spot/controller.py (outdated, resolved)]
Comment on lines +302 to +290
# So we can only enable daemon after we no longer need to
# start daemon processes like Ray.
@Michaelvll (Collaborator):

Where do we use Ray? I thought most of the places where we use Ray go through subprocess.run.

@suquark (Collaborator, Author):

sky launch uses Ray as a subprocess

@Michaelvll (Collaborator):

Oh, do you mean that the subprocess.Popen in run_with_log will make it fail?

returncode, stdout, stderr = log_lib.run_with_log(
    [sys.executable, script_path],
    log_abs_path,
    stream_logs=False,
    start_streaming_at='Shared connection to',
    line_processor=log_utils.RayUpLineProcessor(),
    # Reduce BOTO_MAX_RETRIES from 12 to 5 to avoid long hanging
    # time during 'ray up' if insufficient capacity occurs.
    env=dict(
        os.environ,
        BOTO_MAX_RETRIES='5',
        # Use environment variables to disable the ray usage collection
        # (to avoid overheads and potential issues with the usage)
        # as sdk does not take the argument for disabling the usage
        # collection.
        RAY_USAGE_STATS_ENABLED='0'),
    require_outputs=True,
    # Disable stdin to avoid ray outputs mess up the terminal with
    # misaligned output when multithreading/multiprocessing are used
    # Refer to: https://github.com/ray-project/ray/blob/d462172be7c5779abf37609aed08af112a533e1e/python/ray/autoscaler/_private/subprocess_output_util.py#L264 # pylint: disable=line-too-long
    stdin=subprocess.DEVNULL)
return returncode, stdout, stderr

@suquark (Collaborator, Author):

Yes, daemon=True + subprocess.Popen => process leakage.
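
To make that concrete, here is a standalone demo (not SkyPilot code) of what daemon=True + subprocess.Popen => process leakage means: when the daemonic worker is terminated as the main process exits, the grandchild started with Popen is left running.

import multiprocessing
import subprocess
import time


def worker() -> None:
    # Stand-in for log_lib.run_with_log launching e.g. `ray up` via Popen.
    subprocess.Popen(['sleep', '1000'])
    time.sleep(1000)


if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, daemon=True)
    p.start()
    time.sleep(1)
    # The main process exits here: multiprocessing terminates the daemonic
    # worker, but the `sleep 1000` grandchild is not cleaned up and keeps
    # running (check with `pgrep -f "sleep 1000"` afterwards).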

@Michaelvll (Collaborator):

That's worth a comment. : )

[Review comment on sky/spot/controller.py (outdated, resolved)]
# NOTE: it is ok to kill or join a killed process.
controller_process.kill()
# Kill the children processes launched by log_lib.run_with_log.
subprocess_utils.kill_children_processes()
@Michaelvll (Collaborator):

Question: Will this work when the daemon is not enabled for the controller_process?

@suquark (Collaborator, Author):

It will work. It was never a daemon process in the original PR, so it does not make sense that it would not work here.

@Michaelvll (Collaborator):

Well, I think previously the child processes were spawned with subprocess.Popen directly from the current process, without an additional multiprocessing.Process in between; i.e., previously kill_children_processes was called directly within the controller_process instead of by the outer caller. I am not sure whether that makes any difference.

@suquark (Collaborator, Author):

From my observation it works. Also, as the psutil docstring suggests, it handles general *nix process trees:

        """Return the children of this process as a list of Process
        instances, pre-emptively checking whether PID has been reused.
        If *recursive* is True return all the parent descendants.

        Example (A == this process):

         A ─┐
            │
            ├─ B (child) ─┐
            │             └─ X (grandchild) ─┐
            │                                └─ Y (great grandchild)
            ├─ C (child)
            └─ D (child)

        >>> import psutil
        >>> p = psutil.Process()
        >>> p.children()
        B, C, D
        >>> p.children(recursive=True)
        B, X, Y, C, D

        Note that in the example above if process X disappears
        process Y won't be listed as the reference to process A
        is lost.
        """

@Michaelvll (Collaborator):

Ahh, got it. Thanks! I was worried that the process would not appear as a child process when daemon is not set, but it seems that is not a problem. Tested it with the following code and it seems to work well:

import multiprocessing
import time
import os
import subprocess

from sky.utils import subprocess_utils

def worker():
    subprocess.run('sleep 1000', shell=True)
    time.sleep(10000)

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    p = multiprocessing.Process(target=worker)
    p.start()
    print(os.getpid())
    x = input('input to continue\n')
    print('kill')
    p.kill()
    subprocess_utils.kill_children_processes()
    time.sleep(10)
    p.join()

@suquark requested a review from Michaelvll on March 14, 2023 23:22
@Michaelvll (Collaborator) left a review:

Thanks for catching the issue @suquark! I just tested it out manually and it works well for me. I am now testing it with the smoke test:

  • pytest tests/test_smoke.py --managed-spot

[2 review comments on sky/spot/controller.py (outdated, resolved)]
@Michaelvll (Collaborator) left a review:

The changes look good to me! Left some comments about the sleep time.

[Review comments on tests/test_smoke.py and sky/spot/controller.py (resolved)]
@suquark force-pushed the fix_multiprocessing branch from 14aa49e to 02b7690 on March 16, 2023 06:12
@suquark force-pushed the fix_multiprocessing branch from 02b7690 to 39d3dee on March 16, 2023 08:44
@suquark changed the title from "[spot] Fix multiprocessing signal handling" to "[spot] Fix multiprocessing signal handling in spot controller" on Mar 16, 2023
@suquark (Collaborator, Author) commented Mar 16, 2023

All related tests on both AWS and GCP passed.

@suquark merged commit 464b5db into master on Mar 16, 2023
@suquark deleted the fix_multiprocessing branch on March 16, 2023 08:59
@concretevitamin (Member) commented:
Awesome find @suquark! This may be unrelated to this PR; posting to check with you. A UX thing:

Repro

  • sky spot launch sleep 600 --cloud gcp
  • Wait until the spinner indicates the job has actually started, then run sky spot cancel -ay in another window

On 322ffad

I 03-16 08:10:29 cloud_vm_ray_backend.py:2597] Job submitted with Job ID: 18
I 03-16 15:12:53 log_lib.py:395] Start streaming logs for spot job 18.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.182.0.47']
Shared connection to 34.125.208.137 closed.
I 03-16 15:13:46 spot_utils.py:341] Logs finished for job 18 (status: CANCELLED).

On 464b5db (this PR)

I 03-16 07:59:08 cloud_vm_ray_backend.py:2597] Job submitted with Job ID: 16
I 03-16 15:01:56 log_lib.py:395] Start streaming logs for spot job 16.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.182.0.43']
Shared connection to 34.125.215.201 closed.
I 03-16 15:02:07 spot_utils.py:314] The job is preempted.
I 03-16 15:03:07 spot_utils.py:341] Logs finished for job 16 (status: CANCELLED).

There is an extra "The job is preempted." message.

@Michaelvll (Collaborator) commented:
Yes, this is caused by us now setting the CANCELLED state only after the cluster is cleaned up, which makes the spot log streaming misclassify the job as preempted. It will be fixed by #1785.
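
For context, a rough illustration (hypothetical names, not the actual spot_utils.py code) of the misclassification: the log streamer notices the cluster is gone before the job record has been marked CANCELLED, so it first assumes a preemption and only afterwards reports the final status.

def report_after_cluster_gone(job_id: int, job_is_terminal: bool,
                              final_status: str) -> None:
    # With this PR the CANCELLED state is written only after the cluster
    # cleanup finishes, so at this point the job still looks non-terminal
    # and the preemption message is printed.
    if not job_is_terminal:
        print('The job is preempted.')
    print(f'Logs finished for job {job_id} (status: {final_status}).')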
