
configure some iterations to run in the same job? #32

Open
tdhock opened this issue Oct 25, 2024 · 11 comments

tdhock (Contributor) commented Oct 25, 2024

Hi @sebffischer
I was wondering if it is currently possible to run different benchmark iterations in the same cluster job?

In particular, I would like to tell mlr3batchmark to create a new job for every data set and cross-validation fold, but have all the different algorithms run one after another in the same job. Is that possible?

sebffischer (Member) commented:

If I understand you correctly, this can be achieved by chunking jobs: https://mllg.github.io/batchtools/reference/submitJobs.html#chunking-of-jobs
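
For reference, a minimal sketch of that with plain batchtools (chunk.size = 10 is an arbitrary choice, and reg is assumed to be your existing registry); all jobs sharing a chunk number run sequentially inside one batch job:

library(batchtools)
ids <- findNotSubmitted(reg = reg)               # data.table with a job.id column
ids$chunk <- chunk(ids$job.id, chunk.size = 10)  # e.g. 10 batchtools jobs per cluster job
submitJobs(ids, reg = reg)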

tdhock (Contributor, Author) commented Oct 25, 2024

Hi Seb,
Thanks for the quick response.
That man page does seem to answer my question: "To chunk jobs together, job ids must be provided as data.frame with columns “job.id” and “chunk” (integer). All jobs with the same chunk number will be executed sequentially inside the same batch job."

But I need to use array jobs at the same time; is that compatible?

Usually, to create a single job array (with many tasks that each do similar calculations), I do something like this:

job.table <- batchtools::getJobTable(reg = reg)
chunks <- data.frame(job.table, chunk = 1)
batchtools::submitJobs(chunks, resources = list(
  walltime = 24*60*60,        # seconds
  memory = 2000,              # megabytes per cpu
  ncpus = 1,                  # >1 for multicore/parallel jobs
  ntasks = 1,                 # >1 for MPI jobs
  chunks.as.arrayjobs = TRUE), reg = reg)

So chunks is a table with all the job IDs assigned to a single chunk = 1.
In this setup, the chunk column tells batchtools to create a single cluster job with many tasks in its array.

So, at least from the docs (sections "Chunking of Jobs" and "Array Jobs"), it seems that there are two different uses for the chunk column:

  • to define which tasks are part of which cluster job in the Array Jobs usage (should this usage be called cluster_job instead of chunk?)
  • to define which batchtools jobs are part of which cluster job (each of which then has only one task / cannot be a job array)

I would like to do both at the same time, so I wonder if we could specify cluster_job and chunk at the same time?

tdhock (Contributor, Author) commented Oct 25, 2024

Probably the best option for me would be to specify this as an argument to batchmark, for example:

(bench.grid <- mlr3::benchmark_grid(
  tasks = task.list,
  learners = learner.list,
  resamplings = train.test.cv))
mlr3batchmark::batchmark(
  bench.grid, store_models = TRUE, reg = reg,
  job.for.each = c("resampling", "task"))  # could also specify "learner" here

Does that seem reasonable to you, or would you suggest another approach? If you don't have time to write this functionality, I could give it a try, but it would be useful to have some guidance about what you think the interface should look like and where the best place to edit the current code would be.
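
In the meantime, one way to approximate this grouping without changing batchmark() might be to chunk the registry's jobs by task and resampling iteration after calling batchmark(); a sketch, assuming getJobTable() on the registry exposes problem and repl columns (the resources are placeholders):

jt <- batchtools::getJobTable(reg = reg)
# one chunk per (task, resampling iteration): all learners for a fold run sequentially in one job
jt$chunk <- as.integer(interaction(jt$problem, jt$repl, drop = TRUE))
batchtools::submitJobs(
  jt[, c("job.id", "chunk")],
  resources = list(walltime = 24*60*60, memory = 2000, ncpus = 1),
  reg = reg)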

be-marc (Member) commented Oct 28, 2024

Why do you want to run multiple resampling iterations in one job?

tdhock (Contributor, Author) commented Oct 28, 2024

Some clusters have limits on the number of jobs/tasks that can be running/queued simultaneously.
For example, most clusters in Compute Canada have a limit of 1000.
Say I run the featureless baseline learner plus one other algorithm, and that makes 1200 tasks (over all the data sets and resampling iterations), which exceeds the limit.
If I could combine the two learners into a single task, the total number of tasks would be 600, under the maximum of 1000, and then it could run on the cluster.
Does that make sense to you?

be-marc (Member) commented Oct 29, 2024

Yes, I understand the problem. In this case, you would normally set max.concurrent.jobs = 1000 in batchtools.conf.R.
submitJobs() would then pause after submitting 1000 jobs to the queue; when one job finishes, it sends job 1001 to the queue. If you can't stay active on the login node, I would write a small script that executes submitJobs() inside a job.
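
A minimal sketch of such a batchtools.conf.R (the Slurm template name and default resources are assumptions):

# batchtools.conf.R
cluster.functions <- batchtools::makeClusterFunctionsSlurm(template = "slurm")
default.resources <- list(walltime = 3600, memory = 2000, ncpus = 1)
max.concurrent.jobs <- 1000  # submitJobs() keeps at most 1000 jobs queued or running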

be-marc (Member) commented Oct 29, 2024

Or another option: why not merge more jobs into one array job? As far as I know, an array job is then counted as one job on a Slurm cluster. If you use chunks.as.arrayjobs = TRUE, you should already have aggregated the jobs in some form.
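
A sketch of that merging (n.chunks = 2 is an arbitrary example):

ids <- batchtools::findNotSubmitted(reg = reg)
ids$chunk <- batchtools::chunk(ids$job.id, n.chunks = 2)  # merge everything into two array jobs
batchtools::submitJobs(ids, resources = list(chunks.as.arrayjobs = TRUE), reg = reg)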

tdhock (Contributor, Author) commented Oct 29, 2024

I did not know about max.concurrent.jobs; that is helpful.
I'm not sure I can stay active on the login node, so that solution is less than ideal.

Why don't you merge more jobs into one array job? As far as I know, they are then counted as one job on a Slurm cluster.

If I understand your suggestion correctly, you think I could assign all 1200 batchtools jobs to a single Slurm job with 1200 array tasks? Actually, on our cluster I think each task in the job array counts toward the limit of 1000, so that solution would not work.

I believe that limit is implemented via MaxSubmitJobs (https://slurm.schedmd.com/resource_limits.html#assoc_maxsubmitjobs), but that page does not specify how job arrays are counted.

be-marc (Member) commented Oct 29, 2024

Okay, then use max.concurrent.jobs = 1000 and start submitJobs() inside a long-running job itself.
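
A sketch of such a long-running submit script (the file name, registry path, and resources are assumptions); it can itself be submitted as a single cluster job so the login session can be closed:

# submit_all.R -- run inside one long-lived cluster job
library(batchtools)
reg <- loadRegistry("registry", writeable = TRUE)
ids <- findNotSubmitted(reg = reg)
submitJobs(ids, resources = list(walltime = 24*60*60, memory = 2000, ncpus = 1), reg = reg)
waitForJobs(reg = reg)  # keep this process alive until all jobs have finished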

tdhock (Contributor, Author) commented Oct 29, 2024

Thanks for the quick feedback! I guess that could be a work-around in the short term.
But since that solution is less convenient for my use case, I would still like to fork mlr3batchmark and try to implement my proposal. Could you please share any guidance about how/where to implement it? Or would that feature not be an acceptable addition to mlr3batchmark?

be-marc (Member) commented Oct 29, 2024

I don't think this requires a change to mlr3batchmark but rather to batchtools. @mllg (maintainer), do you think this could be useful?
