
queue_cc for multi-gpu setup #2

Open
kmyi opened this issue Dec 7, 2018 · 3 comments

kmyi commented Dec 7, 2018

Implement a new version of queue_cc that runs multiple jobs on a node, dedicating each to a GPU. Requires on-the-fly creation of job scripts.

@kmyi kmyi assigned kmyi and wsunid Dec 7, 2018

kmyi commented Dec 7, 2018

Hey @weiweisun2018, not urgent, but if you can do this it would be great. Basically, the system currently prefers jobs that can take full nodes. Since our jobs can be batched together, it'd be a good idea to

  1. grab N jobs (4 for cedar, 2 for graham)
  2. create a new bash file for the job that has
#!/bin/bash
CUDA_VISIBLE_DEVICES=XXX jobscript1.sh &
CUDA_VISIBLE_DEVICES=YYY jobscript2.sh &
...
wait

which would be a meta job that consumes a full node.
  3. queue these jobs.
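The meta-script generation in step 2 can be sketched in a few lines of Python, assuming the per-job scripts already exist on disk (function and file names here are illustrative, not part of queue_cc):

```python
def make_meta_script(job_scripts):
    """Build a bash meta-script that pins each job to one GPU and waits."""
    lines = ["#!/bin/bash"]
    for gpu_id, script in enumerate(job_scripts):
        # Each sub-job sees only its own GPU and runs in the background.
        lines.append(f"CUDA_VISIBLE_DEVICES={gpu_id} bash {script} &")
    lines.append("wait")  # the meta job ends when every sub-job finishes
    return "\n".join(lines) + "\n"
```

For example, `make_meta_script(["jobscript1.sh", "jobscript2.sh"])` yields a two-job file like the sketch above, with GPUs 0 and 1.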


wsunid commented Dec 7, 2018

Sure, my pleasure to do it. But I have a couple of questions:
1. Do you mean submitting jobs in array style? Could you point me to documentation on this way of arranging jobs?
2. From my understanding, taking cedar as an example:

def check_ready_for_next_batch_job_and_return_batch_job(job_id):
    if there_is_no_batch_job_running(job_id):
        return next_batch_job(job_id)

batch_jobs = []

def schedule_batch_jobs():
    while True:
        for job_id in job_ids:
            batch_jobs.append(check_ready_for_next_batch_job_and_return_batch_job(job_id))
            if len(batch_jobs) == 4:
                notify_grabber()

def grab_4_batch_jobs():
    while True:
        wait_for_scheduler()
        yield batch_jobs[:4]
        del batch_jobs[:4]

def submit_full_node_job():
    thread_scheduler = threading.Thread(target=schedule_batch_jobs)
    for four_batch_jobs in grab_4_batch_jobs():
        send_4_batch_jobs_to_cedar(four_batch_jobs)  # My question here about how to submit such a full-node job: should I request 4 GPUs and more memory?


kmyi commented Dec 8, 2018

So, for example, it will be a single job containing multiple job executions inside. We want to do it this way so that we can assign each job a specific GPU; array submission probably can't support that.

In short, we submit a fake job that uses all four GPUs (e.g. cedar), which internally is just running four jobs in parallel, one for each GPU.
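On the resource question: the fake job would request all of the node's GPUs up front. A minimal sketch of how the generated submission header might look, assuming the cluster uses SLURM (the `--mem=0` all-node-memory request and the time limit are placeholder assumptions to verify against the cluster docs):

```python
def make_full_node_header(n_gpus=4, time_limit="24:00:00"):
    """Build SLURM directives for a meta job claiming a node's GPUs.

    `--mem=0` asks SLURM for all memory on the node; whether that is
    the right choice for cedar/graham is an assumption, not confirmed.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --gres=gpu:{n_gpus}",
        "#SBATCH --mem=0",
        f"#SBATCH --time={time_limit}",
    ]) + "\n"
```

The body of the generated script would then be the per-GPU `CUDA_VISIBLE_DEVICES=... &` lines followed by `wait`, as in the sketch earlier in the thread.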
