-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expanded Job Queue #1636
Expanded Job Queue #1636
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting the job queue PR again @mraheja! I haven't finished reading the PR yet, but I have several questions about the design below. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fixes @mraheja! The code looks very clean. Left several comments.
sky/backends/cloud_vm_ray_backend.py
Outdated
self._code += [ | ||
textwrap.dedent(f"""\ | ||
job_lib.set_job_started({self.job_id!r}) | ||
job_lib.scheduler.schedule_step() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this line be moved to after L316, as we may want to schedule the next job as soon as the current job is fulfilled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump @mraheja, I am not sure why this is marked as resolved, as we may want to start scheduling the next job as soon as the placement group is fulfilled for the job, instead of after the setup is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we schedule the next job immediately after the pg is fulfilled, this line should be able to be removed to reduce the overhead?
…o expanded-job-queue
…o expanded-job-queue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @mraheja! The PR looks quite good to me now. I will test it after the comments are fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @mraheja! Let's try to get this PR in soon. : )
Left several comments. Will test it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick comments.
sky/backends/cloud_vm_ray_backend.py
Outdated
# Restart skylet when the version does not match to keep the skylet up-to-date. | ||
_MAYBE_SKYLET_RESTART_CMD = ( | ||
f'[[ $(cat {constants.SKYLET_VERSION_FILE}) = "{constants.SKYLET_VERSION}"' | ||
' ]] || (pkill -f "python3 -m sky.skylet.skylet";' | ||
f' echo {constants.SKYLET_VERSION} > {constants.SKYLET_VERSION_FILE};' | ||
'nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we test this manually, just to make sure it does not update the skylet every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments @mraheja! The current one looks quite good to me. I am running the pytest tests/test_smoke.py
but got a bunch of failures due to the timeout. Could we try to run those tests and see figure the problem out?
Co-authored-by: Zhanghao Wu <[email protected]>
…o expanded-job-queue
…skypilot into expanded-job-queue
…o expanded-job-queue
Tested:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With several additional fixes above, the PR looks quite good to me now! Thanks for the great effort @mraheja! This should help a lot with the usability of the job queue.
375bb20
to
505f9e3
Compare
if 'has no attribute' in stdout: | ||
# Happens when someone calls `sky exec` but remote is outdated | ||
# necessicating calling `sky launch` | ||
with ux_utils.print_exception_no_traceback(): | ||
raise RuntimeError( | ||
f'{colorama.Fore.RED}SkyPilot runtime is stale on the ' | ||
'remote cluster. To update, run: sky launch -c ' | ||
f'{handle.cluster_name}{colorama.Style.RESET_ALL}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran into this when testing #1890 smoke tests (commit b020963) using a smoke VM launched by sky
. I added code to print out stdout:
RuntimeError: SkyPilot runtime is stale on the remote cluster. To update, run: sky launch -c t-env-check-3917-b8
output:
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'sky.skylet.job_lib' has no attribute 'scheduler'
This test cluster was freshly launched from the smoke VM, so not sure why the initial launch
followed by an exec
would trigger this staleness error?
cc @Michaelvll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, is this check necessary for the PR to function? The 'has no attribute' in stdout
can be brittle.
Implements the expanded job queue as described here:
https://drive.google.com/file/d/1jBjaStavimgh6YWJpuKEgGyL2st65G0A/view?usp=sharing
Tested (run the relevant ones):
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh