Expanded Job Queue #1636

Merged
merged 48 commits into from
Jun 6, 2023

Conversation

Collaborator

@mraheja mraheja commented Jan 26, 2023

Implements the expanded job queue as described here:
https://drive.google.com/file/d/1jBjaStavimgh6YWJpuKEgGyL2st65G0A/view?usp=sharing

Tested (run the relevant ones):

  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@mraheja mraheja mentioned this pull request Jan 26, 2023
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for submitting the job queue PR again @mraheja! I haven't finished reading the PR yet, but I have several questions about the design below. : )

Collaborator

@Michaelvll Michaelvll left a comment

Thank you for the fixes @mraheja! The code looks very clean. Left several comments.

sky/backends/cloud_vm_ray_backend.py
self._code += [
    textwrap.dedent(f"""\
        job_lib.set_job_started({self.job_id!r})
        job_lib.scheduler.schedule_step()
Collaborator

Should this line be moved to after L316, as we may want to schedule the next job as soon as the current job is fulfilled?

Collaborator

Bump @mraheja: I am not sure why this is marked as resolved. We may want to start scheduling the next job as soon as the placement group is fulfilled for the current job, rather than after its setup is done.

Collaborator

Since we schedule the next job immediately after the placement group (pg) is fulfilled, could this line be removed to reduce the overhead?
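
To illustrate the ordering discussed in this thread, here is a minimal sketch of a strict FIFO scheduler that lets the next pending job try to start as soon as the current job's resources are reserved, before the current job's setup runs. The class ToyScheduler, its fields, and the event names are hypothetical and only for illustration; this is not the job_lib.scheduler implementation from the PR, and real jobs run in separate processes rather than nested calls.

```python
import collections
from typing import Deque, List, Tuple


class ToyScheduler:
    """Hypothetical strict-FIFO scheduler, for illustration only."""

    def __init__(self, total_cpus: int) -> None:
        self.free_cpus = total_cpus
        self.pending: Deque[Tuple[int, int]] = collections.deque()  # (job_id, cpus)
        self.events: List[str] = []

    def submit(self, job_id: int, cpus: int) -> None:
        self.pending.append((job_id, cpus))

    def schedule_step(self) -> None:
        """Start the job at the head of the queue if it fits."""
        if not self.pending:
            return
        job_id, cpus = self.pending[0]
        if cpus > self.free_cpus:
            return  # Strict FIFO: the head of the queue blocks later jobs.
        self.pending.popleft()
        self._run(job_id, cpus)

    def _run(self, job_id: int, cpus: int) -> None:
        # Stand-in for the placement group being fulfilled.
        self.free_cpus -= cpus
        self.events.append(f'reserved:{job_id}')
        # Resources are held, so the next pending job can be considered
        # now instead of waiting for this job's setup to finish.
        self.schedule_step()
        self.events.append(f'setup:{job_id}')


sched = ToyScheduler(total_cpus=4)
for job_id in (1, 2, 3):
    sched.submit(job_id, cpus=2)
sched.schedule_step()
print(sched.events)
print(list(sched.pending))
```

In this sketch job 2 reserves its CPUs before job 1 finishes setup, while job 3 stays pending because no CPUs remain.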

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the updates @mraheja! The PR looks quite good to me now. I will test it after the comments are fixed.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the fix @mraheja! Let's try to get this PR in soon. : )
Left several comments. Will test it as well.

Collaborator

@Michaelvll Michaelvll left a comment

Some quick comments.

Comment on lines 110 to 116
# Restart skylet when the version does not match to keep the skylet up-to-date.
_MAYBE_SKYLET_RESTART_CMD = (
    f'[[ $(cat {constants.SKYLET_VERSION_FILE}) = "{constants.SKYLET_VERSION}"'
    ' ]] || (pkill -f "python3 -m sky.skylet.skylet";'
    f' echo {constants.SKYLET_VERSION} > {constants.SKYLET_VERSION_FILE};'
    'nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);')

Collaborator

Did we test this manually, just to make sure it does not update the skylet every time?
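
One way to reason about the question above is to model the shell one-liner in plain Python: restart only when the recorded version differs from the expected one, then record the new version. The sketch below assumes a hypothetical maybe_restart helper and a restart callback standing in for the pkill and nohup steps; it is not code from the PR.

```python
import pathlib
import tempfile
from typing import Callable


def needs_restart(version_file: pathlib.Path, expected: str) -> bool:
    """True when the recorded version is missing or differs from expected."""
    try:
        return version_file.read_text().strip() != expected
    except FileNotFoundError:
        return True


def maybe_restart(version_file: pathlib.Path, expected: str,
                  restart: Callable[[], None]) -> bool:
    """Restart and record the version only on a mismatch."""
    if not needs_restart(version_file, expected):
        return False
    restart()  # Stand-in for killing and relaunching the skylet process.
    version_file.write_text(expected + '\n')
    return True


with tempfile.TemporaryDirectory() as tmp:
    version_file = pathlib.Path(tmp) / 'skylet_version'
    restarts = []
    results = [
        maybe_restart(version_file, '1', lambda: restarts.append('1')),
        maybe_restart(version_file, '1', lambda: restarts.append('1')),
        maybe_restart(version_file, '2', lambda: restarts.append('2')),
    ]
print(results, restarts)
```

The second call with an unchanged version does not restart, which is the behavior the manual test is meant to confirm for the real command.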

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for addressing the comments @mraheja! The current version looks quite good to me. I am running pytest tests/test_smoke.py but got a bunch of failures due to timeouts. Could we try running those tests and figure out the problem?

Collaborator

Michaelvll commented Jun 6, 2023

Tested:

  • pytest tests/test_smoke.py
  • Submit 300 jobs to a CPU node. (Original master branch fails after 70 jobs due to OOM issue)
    sky launch -c test-job-cancellation -y --cpus 4 --cloud gcp
    for i in `seq 1 1000`; do sky exec test-job-cancellation -d "echo hi; sleep 1000000000000"; done
    sky cancel test-job-cancellation -y 94 95 96 97 98 99 100 125 126 127 # Cancelling PENDING jobs
    sky cancel test-job-cancellation -y 2 5 7 # Cancelling running job and testing FIFO
    
  • Pending jobs are not scheduled after manually rebooting the cluster from the cloud console.
  • tests/backward_compatibility_test.sh 1 (launch is required after updating to this PR)

Collaborator

@Michaelvll Michaelvll left a comment

With several additional fixes above, the PR looks quite good to me now! Thanks for the great effort @mraheja! This should help a lot with the usability of the job queue.

@Michaelvll Michaelvll force-pushed the expanded-job-queue branch from 375bb20 to 505f9e3 Compare June 6, 2023 08:33
Comment on lines +2759 to +2766
if 'has no attribute' in stdout:
    # Happens when someone calls `sky exec` but remote is outdated
    # necessitating calling `sky launch`
    with ux_utils.print_exception_no_traceback():
        raise RuntimeError(
            f'{colorama.Fore.RED}SkyPilot runtime is stale on the '
            'remote cluster. To update, run: sky launch -c '
            f'{handle.cluster_name}{colorama.Style.RESET_ALL}')
Member

Ran into this when testing #1890 smoke tests (commit b020963) using a smoke VM launched by sky. I added code to print out stdout:

RuntimeError: SkyPilot runtime is stale on the remote cluster. To update, run: sky launch -c t-env-check-3917-b8
output:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'sky.skylet.job_lib' has no attribute 'scheduler'

This test cluster was freshly launched from the smoke VM, so not sure why the initial launch followed by an exec would trigger this staleness error?

cc @Michaelvll

Member

Btw, is this check necessary for the PR to function? Matching on 'has no attribute' anywhere in stdout can be brittle, since unrelated output may contain the same phrase.
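
As a rough illustration of the brittleness concern, the sketch below narrows the match to a full AttributeError line raised for a module under a given package prefix, instead of a bare substring test. The helpers find_missing_attribute and is_stale_runtime are hypothetical names, not part of SkyPilot, and this is still string matching; an explicit version check would likely be more robust.

```python
import re
from typing import Optional, Tuple

# Matches a whole line of the form printed by Python for a missing
# module attribute, e.g. the traceback quoted earlier in this thread.
_ATTR_ERR = re.compile(
    r"^AttributeError: module '(?P<module>[\w.]+)' "
    r"has no attribute '(?P<attr>\w+)'$", re.MULTILINE)


def find_missing_attribute(output: str) -> Optional[Tuple[str, str]]:
    """Return (module, attribute) if output holds such an error line."""
    match = _ATTR_ERR.search(output)
    if match is None:
        return None
    return match['module'], match['attr']


def is_stale_runtime(output: str, package: str = 'sky.') -> bool:
    """Hypothetical check: only errors from the given package count."""
    missing = find_missing_attribute(output)
    return missing is not None and missing[0].startswith(package)


stale_output = (
    'Traceback (most recent call last):\n'
    '  File "<string>", line 1, in <module>\n'
    "AttributeError: module 'sky.skylet.job_lib' has no attribute 'scheduler'\n")
user_output = "warning: config object has no attribute 'foo'\n"
other_module = "AttributeError: module 'numpy' has no attribute 'foo'\n"

print(is_stale_runtime(stale_output))
print(is_stale_runtime(user_output))
print(is_stale_runtime(other_module))
```

With the bare substring test, all three sample outputs would be flagged as a stale runtime; the narrower pattern flags only the first.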

Successfully merging this pull request may close these issues.

Job submission feature requests
3 participants