
configure some iterations to run in the same job? #32

Open
tdhock opened this issue Oct 25, 2024 · 11 comments

tdhock (Contributor) commented Oct 25, 2024

Hi @sebffischer
I was wondering if it is currently possible to run different benchmark iterations in the same cluster job?

In particular, I would like to tell mlr3batchmark to create a new job for every data set and cross-validation fold, but have all the different algorithms run one after another in the same job. Is that possible?

sebffischer (Member) commented:

If I understand you correctly, this can be achieved by chunking jobs: https://mllg.github.io/batchtools/reference/submitJobs.html#chunking-of-jobs
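
For reference, a minimal sketch of that with plain batchtools (chunk.size = 10 is an arbitrary choice, and reg is assumed to be your existing registry); all jobs sharing a chunk number run sequentially inside one batch job:

library(batchtools)
ids <- findNotSubmitted(reg = reg)               # data.table with a job.id column
ids$chunk <- chunk(ids$job.id, chunk.size = 10)  # e.g. 10 batchtools jobs per cluster job
submitJobs(ids, reg = reg)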

tdhock (Contributor, Author) commented Oct 25, 2024

Hi Seb,
Thanks for the quick response.
That man page does seem to answer my question: "To chunk jobs together, job ids must be provided as data.frame with columns “job.id” and “chunk” (integer). All jobs with the same chunk number will be executed sequentially inside the same batch job."

But I need to use array jobs at the same time; is that compatible?

Usually, to create a single job array (with many tasks that each do similar calculations), I do something like this:

job.table <- batchtools::getJobTable(reg = reg)
chunks <- data.frame(job.table, chunk = 1)
batchtools::submitJobs(chunks, resources = list(
  walltime = 24*60*60,        # seconds
  memory = 2000,              # megabytes per cpu
  ncpus = 1,                  # >1 for multicore/parallel jobs
  ntasks = 1,                 # >1 for MPI jobs
  chunks.as.arrayjobs = TRUE), reg = reg)

So chunks is a table with all the job IDs assigned to a single chunk = 1.
In this setup, the chunk column tells batchtools to create a single cluster job with many tasks in its array.

So, at least from the docs (sections "Chunking of Jobs" and "Array Jobs"), it seems that there are two different uses for the chunk column:

  • to define which tasks are part of which cluster job in the Array Jobs usage (should this usage be called cluster_job instead of chunk?)
  • to define which batchtools jobs are part of which cluster job (each of which then has only one task / cannot be a job array)

I would like to do both at the same time, so I wonder if we could specify cluster_job and chunk at the same time?

tdhock (Contributor, Author) commented Oct 25, 2024

Probably the best option for me would be to specify this as an argument to batchmark, for example:

(bench.grid <- mlr3::benchmark_grid(
  tasks = task.list,
  learners = learner.list,
  resamplings = train.test.cv))
mlr3batchmark::batchmark(
  bench.grid, store_models = TRUE, reg = reg,
  job.for.each = c("resampling", "task"))  # could also specify "learner" here

Does that seem reasonable to you, or would you suggest another approach? If you don't have time to write this functionality, I could give it a try, but it would be useful to have some guidance about what you think the interface should look like and where the best place to edit the current code would be.
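
In the meantime, one way to approximate this grouping without changing batchmark() might be to chunk the registry's jobs by task and resampling iteration after calling batchmark(); a sketch, assuming getJobTable() on the registry exposes problem and repl columns (the resources are placeholders):

jt <- batchtools::getJobTable(reg = reg)
# one chunk per (task, resampling iteration): all learners for a fold run sequentially in one job
jt$chunk <- as.integer(interaction(jt$problem, jt$repl, drop = TRUE))
batchtools::submitJobs(
  jt[, c("job.id", "chunk")],
  resources = list(walltime = 24*60*60, memory = 2000, ncpus = 1),
  reg = reg)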

be-marc (Member) commented Oct 28, 2024

Why do you want to run multiple resampling iterations in one job?

tdhock (Contributor, Author) commented Oct 28, 2024

Some clusters have limits on the number of jobs/tasks that can be running/queued simultaneously.
For example, most clusters in Compute Canada have a limit of 1000.
Say I run the featureless baseline learner plus one other algorithm, and that makes 1200 tasks (over all the data sets and resampling iterations), which exceeds the limit.
If I could combine the two learners into a single task, the total number of tasks would be 600, under the maximum of 1000, and then it could run on the cluster.
Does that make sense to you?

be-marc (Member) commented Oct 29, 2024

Yes, I understand the problem. In this case, you would normally set max.concurrent.jobs = 1000 in batchtools.conf.R.
submitJobs() would then pause after submitting 1000 jobs to the queue; when one job finishes, it sends job 1001 to the queue. If you can't stay active on the login node, I would write a small script that executes submitJobs() inside a job.
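
A minimal sketch of such a batchtools.conf.R (the Slurm template name and default resources are assumptions):

# batchtools.conf.R
cluster.functions <- batchtools::makeClusterFunctionsSlurm(template = "slurm")
default.resources <- list(walltime = 3600, memory = 2000, ncpus = 1)
max.concurrent.jobs <- 1000  # submitJobs() keeps at most 1000 jobs queued or running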

be-marc (Member) commented Oct 29, 2024

Or another option: why not merge more jobs into one array job? As far as I know, an array job is then counted as one job on a Slurm cluster. If you use chunks.as.arrayjobs = TRUE, you should already have aggregated the jobs in some form.
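
A sketch of that merging (n.chunks = 2 is an arbitrary example):

ids <- batchtools::findNotSubmitted(reg = reg)
ids$chunk <- batchtools::chunk(ids$job.id, n.chunks = 2)  # merge everything into two array jobs
batchtools::submitJobs(ids, resources = list(chunks.as.arrayjobs = TRUE), reg = reg)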

tdhock (Contributor, Author) commented Oct 29, 2024

I did not know about max.concurrent.jobs; that is helpful.
I'm not sure I can stay active on the login node, so that solution is less than ideal.

Why don't you merge more jobs into one array job? As far as I know, they are then counted as one job on a Slurm cluster.

If I understand your suggestion correctly, you think I could assign all 1200 batchtools jobs to a single Slurm job with 1200 array tasks? Actually, on our cluster I think each task in the job array counts toward the limit of 1000, so that solution would not work.

I believe that limit is implemented via MaxSubmitJobs (https://slurm.schedmd.com/resource_limits.html#assoc_maxsubmitjobs), but that page does not specify how job arrays are counted.

be-marc (Member) commented Oct 29, 2024

Okay, then use max.concurrent.jobs = 1000 and start submitJobs() inside a long-running job itself.
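
A sketch of such a long-running submit script (the file name, registry path, and resources are assumptions); it can itself be submitted as a single cluster job so the login session can be closed:

# submit_all.R -- run inside one long-lived cluster job
library(batchtools)
reg <- loadRegistry("registry", writeable = TRUE)
ids <- findNotSubmitted(reg = reg)
submitJobs(ids, resources = list(walltime = 24*60*60, memory = 2000, ncpus = 1), reg = reg)
waitForJobs(reg = reg)  # keep this process alive until all jobs have finished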

tdhock (Contributor, Author) commented Oct 29, 2024

Thanks for the quick feedback! I guess that could be a work-around in the short term.
But since that solution is less convenient for my use case, I would still like to fork mlr3batchmark and try to implement my proposal. Could you please share any guidance about how/where to implement it? Or would that feature not be an acceptable addition to mlr3batchmark?

be-marc (Member) commented Oct 29, 2024

I don't think this requires a change to mlr3batchmark but rather to batchtools. @mllg (maintainer), do you think this could be useful?
