
Expanded Job Queue #1218

Closed
wants to merge 10 commits

Conversation

Collaborator

@mraheja mraheja commented Oct 10, 2022

Job Queue Design

In the past, Ray jobs were dispatched every time a job was submitted, causing many resources to be used just to idle. In this job queue, we maintain a pending jobs table that dispatches a job only when the resources for the previous job have been fulfilled. The pending jobs table does this by running a scheduling step which:
Checks if the job has failed or been cancelled; if so, removes it from the table
Checks if the job has already been submitted, in which case it returns and waits for another call of the scheduling step
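The scheduling step described above can be sketched roughly as follows. The table schema, status strings, and the `get_status`/`run_cmd_fn` callables are illustrative assumptions, not the PR's actual code:

```python
import sqlite3

# Illustrative sketch of the scheduling step. Schema and helper names are
# assumptions for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pending_jobs(job_id INTEGER, run_cmd TEXT, submit INTEGER)')

def schedule_step(get_status, run_cmd_fn):
    """One pass over the pending-jobs table, FIFO by job_id."""
    rows = conn.execute(
        'SELECT job_id, run_cmd, submit FROM pending_jobs '
        'ORDER BY job_id').fetchall()
    for job_id, run_cmd, submitted in rows:
        status = get_status(job_id)
        if status in ('FAILED', 'CANCELLED'):
            # Step 1: drop failed/cancelled jobs from the table.
            conn.execute('DELETE FROM pending_jobs WHERE job_id=?', (job_id,))
            conn.commit()
            continue
        if submitted:
            # Step 2: already submitted; wait for the next scheduling step.
            return
        # Otherwise dispatch this job and mark it as submitted.
        conn.execute('UPDATE pending_jobs SET submit=1 WHERE job_id=?', (job_id,))
        conn.commit()
        run_cmd_fn(run_cmd)
        return
```

Each completed (or failed) job triggers another scheduling step, which is what lets the next queued job dispatch.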

However, with the detached setup, there is some extra complication to this.

Option 1 - Current Implementation

The current implementation will run the setup script only when the ray job for the full

Statuses in order:

INIT: Job has been submitted
PENDING: Job is waiting for resources
SETTING_UP: Job is running setup command and previous job is completed
RUNNING: Job is running and current job’s resources are acquired
SUCCEEDED: Job is completed
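The update rules below compare statuses with max(), so the statuses need a total order. A minimal sketch, where the integer values are an illustrative assumption (they only need to increase in the order a job progresses):

```python
import enum

class JobStatus(enum.IntEnum):
    # Ordering follows the status list above; the concrete values are an
    # assumption so that max() picks the later stage.
    INIT = 0
    PENDING = 1
    SETTING_UP = 2
    RUNNING = 3
    SUCCEEDED = 4
```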

Keep in mind that there needs to be an update function that can just look at the state and calculate where each job is in the process. Here is the current update system:

Check if the job is in the pending table; if so, mark the status as PENDING
Check the status of the ray job:
If it’s running, mark it as max(current_status, SETTING_UP)
If it’s a terminal status (i.e., SUCCEEDED/FAILED), mark the job as that status
If it doesn’t exist, mark it as failed
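The update system above can be sketched as a pure function over the job's observed state. The inputs `in_pending_table` and `ray_status`, and the string-based status order, are assumptions for illustration:

```python
# Illustrative sketch of the update rules listed above. Status names are
# ordered strings here; real code would use an ordered enum.
_ORDER = ['INIT', 'PENDING', 'SETTING_UP', 'RUNNING', 'SUCCEEDED', 'FAILED']

def update_status(current_status, in_pending_table, ray_status):
    """ray_status is the Ray job's status, or None if the job doesn't exist."""
    if in_pending_table:
        # Rule 1: job is still queued.
        return 'PENDING'
    if ray_status == 'RUNNING':
        # Rule 2a: at least SETTING_UP; never move a status backwards.
        return max(current_status, 'SETTING_UP', key=_ORDER.index)
    if ray_status in ('SUCCEEDED', 'FAILED'):
        # Rule 2b: terminal Ray statuses map through directly.
        return ray_status
    # Rule 2c: the Ray job doesn't exist, so treat it as failed.
    return 'FAILED'
```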

Option 2 - Alternative Implementation

A pitfall of this implementation is that it cannot run the setup job (which requires no resources) before the actual job’s resources are freed. An alternative implementation would be:

INIT: Job has been submitted
SETTING_UP: Setup command running (without any sort of other condition on resources)
PENDING: Job is waiting for resources
RUNNING: Actual Job is running
SUCCEEDED: Job is completed

Implementing this requires submitting a separate job for setup. Updating, however, becomes less clear: if a ray job doesn’t exist, you don’t know whether setup completed, setup failed, or the entire job failed.

This could also cause some problems (e.g., setup running earlier than intended), but for the most part it will be faster because setup is run beforehand.

@Michaelvll Michaelvll left a comment

Thanks for adding this job queue @mraheja! Great work! The code can be a proof-of-concept for extracting out the job queue. We should consider whether to place that pending information in the original jobs table, to avoid issues caused by concurrent modification of the two tables.

Comment on lines 2061 to 2062
code = job_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd)
mkdir_code = f'{cd} && mkdir -p {remote_log_dir} && echo START > {remote_log_path} 2>&1'

We can combine these two commands into one run_on_head to reduce the overhead for ssh.
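As a sketch of this suggestion, the two shell commands can be joined with `&&` so a single `run_on_head` (one ssh round trip) covers both. The variable contents below are placeholders, and `run_on_head` itself is not shown in this excerpt:

```python
# Placeholder values standing in for the real codegen variables.
cd = 'cd ~/sky_workdir'
remote_log_dir = '~/sky_logs/job-1'
remote_log_path = f'{remote_log_dir}/run.log'
job_submit_cmd = 'python job.py'

# Same mkdir command as in the snippet above, then the job submission,
# joined into one shell string so one ssh call suffices.
mkdir_code = f'{cd} && mkdir -p {remote_log_dir} && echo START > {remote_log_path} 2>&1'
combined_cmd = f'{mkdir_code} && {job_submit_cmd}'
```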

Comment on lines 28 to 42
_DB_PATH = os.path.expanduser('~/.sky/jobs.db')
os.makedirs(pathlib.Path(_DB_PATH).parents[0], exist_ok=True)
CONN = sqlite3.connect(_DB_PATH)
CURSOR = CONN.cursor()
# CREATE TABLE
try:
    CURSOR.execute('SELECT * FROM pending_jobs LIMIT 0')
except sqlite3.OperationalError:
    # Tables do not exist, create them.
    CURSOR.execute('''CREATE TABLE pending_jobs(
        job_name TEXT,
        run_string TEXT,
        submit INTEGER)''')
CONN.commit()

Can we place these lines in the create_table in L108, so that we get rid of some of the issues with multi-threading + sqlite?

# Go through queue and delete anything which is marked as anything but pending
# If previous job has a pg that has already been acquired, then run the command for the top job
# Give it one second then try the next one -> if not assume that resources aren't free and wait for a job to finish which will trigger this again
query = CURSOR.execute('SELECT * FROM pending_jobs ORDER BY job_name')

Use the _CURSOR and _CONN created in L126 and L127

Comment on lines 54 to 59
status = list(
    CURSOR.execute(f'SELECT status FROM jobs WHERE job_id={name}'))
if len(status) == 0:
    remove_job(name)
    continue
status = status[0][0]

nit: follow the style for retrieving values from the SQLite returns in other parts of our code.

_CURSOR.execute(f'SELECT status FROM jobs WHERE job_id={name}')
status = _CURSOR.fetchone()
if status is None:
    remove_job(name)
    continue



def run_top_if_possible() -> None:
    # Go through queue and delete anything which is marked as anything but pending

Should we update the comments? pending -> init or pending

f'UPDATE pending_jobs SET submit=1 WHERE job_name={name!r}')
CONN.commit()
os.system(run_cmd)
time.sleep(1)

Why do we continue to the next job after this one is submitted? I thought the logic here is to submit the first job in the queue and break.

Also, this makes me wonder why we don't call this function in sky/skylet/events.py, so that we call it every several seconds, submitting a job whenever there is no submitted one in the queue.
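The periodic-call idea above could be sketched as a simple background loop. The step function, interval, and stop mechanism here are illustrative assumptions rather than the actual skylet event API:

```python
import threading
import time

# Sketch: instead of triggering the scheduler only on job completion, run
# the scheduling step every few seconds from a background loop, the way a
# periodic skylet event would.
def run_periodically(step_fn, interval_seconds, stop_event):
    while not stop_event.is_set():
        step_fn()  # e.g. the job scheduler's scheduling step
        stop_event.wait(interval_seconds)
```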

except sqlite3.OperationalError:
    # Tables do not exist, create them.
    CURSOR.execute(''' CREATE TABLE pending_jobs(
        job_name TEXT,

Why do we use job_name instead of job_id as in L111? The job_name can be the same for different jobs.

# Tables do not exist, create them.
CURSOR.execute(''' CREATE TABLE pending_jobs(
    job_name TEXT,
    run_string TEXT,

nit: run_cmd?

    CURSOR.execute('SELECT * FROM pending_jobs LIMIT 0')
except sqlite3.OperationalError:
    # Tables do not exist, create them.
    CURSOR.execute(''' CREATE TABLE pending_jobs(

Should we create another pending_jobs table? I think we can add the run_string and submit columns to our original jobs table, so that we can get the job status and submit status together with:

_CURSOR.execute(
    "SELECT submit FROM jobs WHERE status='PENDING' ORDER BY job_id LIMIT 1")
submit = _CURSOR.fetchone()
if not submit:
    # submit the job
else:
    return

CONN.commit()


def run_top_if_possible() -> None:

We can refactor it a bit to support different strategy, but this might be fine for now.

@Michaelvll

A correctness note: we call set_pending for the spot tasks in the generated ray program.

if spot_task is not None:
    # Add the spot job to spot queue table.
    resources_str = backend_utils.get_task_resources_str(spot_task)
    self._code += [
        'from sky.spot import spot_state',
        f'spot_state.set_pending('
        f'{job_id}, {spot_task.name!r}, {resources_str!r})',
    ]

Since we don't immediately run the generated ray program when the job is submitted, the pending spot jobs will not appear in the spot queue. We need to move this check and the set_pending call before the following line.

codegen.add_prologue(job_id,

@Michaelvll Michaelvll self-requested a review November 10, 2022 00:58
@Michaelvll Michaelvll left a comment

Thank you for the fix @mraheja! We may need to be careful about the detached setup, which adds a SETTING_UP status for the job (#1379). For that, we may want to run the setup before the job goes to PENDING.

@@ -8,4 +8,3 @@

resources:
cloud: aws
accelerators: K80

remnant?

@@ -9,6 +9,5 @@

resources:
cloud: aws
accelerators: K80

We need the accelerators to check that job pending works correctly. Otherwise, all the jobs in the test will be submitted without staying in the pending queue.

Comment on lines 11 to 13
resources:
accelerators: K80:0.5


revert it?

@@ -262,6 +262,8 @@ def add_gang_scheduling_placement_group(
# it is waiting for other task to finish. We should hide the
# error message.
ray.get(pg.ready())
job_lib.scheduler.remove_job({self.job_id!r})

rename it? For example, how about job_lib.scheduler.set_scheduled({self.job_id!r})?

f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
'--no-wait '
'--no-wait -- '

Any reason we need this --?

def _get_jobs(self) -> Tuple:
    return _CURSOR.execute('SELECT * FROM pending_jobs ORDER BY job_id')

def run_next_if_possible(self) -> None:

How about we rename it to schedule_step(self)?

def _get_pending_jobs():
    rows = _CURSOR.execute('SELECT job_id, created_time FROM pending_jobs')
    rows = list(rows)
    return [int(row[0]) for row in rows], [int(row[1]) for row in rows]

How about we return a dict instead?

{
  job_id: created_time
  for job_id, created_time in rows
}

Comment on lines 443 to 444
idx = pending_jobs.index(job_id)
if start_times[idx] < psutil.boot_time():

Using the dict to get the created_time as mentioned above?
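A sketch of that dict-based lookup; the helper name `is_stale` and the explicit `boot_time` parameter (standing in for `psutil.boot_time()`) are assumptions for illustration:

```python
# With _get_pending_jobs returning {job_id: created_time}, the index-based
# lookup becomes a single dict access.
def is_stale(job_id, pending_jobs, boot_time):
    created_time = pending_jobs.get(job_id)
    # A pending job created before the last boot is stale.
    return created_time is not None and created_time < boot_time
```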

# job server fails.
logger.warning(str(e))
continue
with filelock.FileLock(_get_lock_path(job_id)):

Good catch!

Comment on lines 348 to 351
f'sky queue {name} | grep {name}-15 | grep RUNNING',
f'sky queue {name} | grep {name}-32 | grep RUNNING',
f'sky queue {name} | grep {name}-33 | grep PENDING',
f'sky queue {name} | grep {name}-50 | grep PENDING',

Nice! It is great to see the FIFO order works.

@Michaelvll Michaelvll self-requested a review January 16, 2023 03:06
@mraheja mraheja closed this Jan 26, 2023

mraheja commented Jan 26, 2023

moved to #1636
