I'm a big fan of snakemake, which allows for automatic resubmission of failed jobs with increased resources (e.g., `mem = lambda wildcards, threads, attempt: attempt * 8`, which requests memory in GB and grows by 8 GB on each attempt). It would be really awesome to have that feature in clustermq. For example, one could provide a function instead of a value for the template.
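To make the idea concrete, here is a minimal sketch in plain Python (matching the snakemake snippet above) of what "a function instead of a value" could mean. It is not clustermq code: `resolve_template` and the `memory` and `queue` keys are hypothetical names chosen for illustration, and the sketch only assumes that a template is a set of named values.

```python
from typing import Callable, Dict, Union

Scalar = Union[int, str]
# A template entry is either a fixed value or a function of the attempt number.
TemplateValue = Union[Scalar, Callable[[int], Scalar]]


def resolve_template(template: Dict[str, TemplateValue], attempt: int) -> Dict[str, Scalar]:
    """Evaluate callable entries for a 1-based attempt number; copy fixed values."""
    return {
        key: value(attempt) if callable(value) else value
        for key, value in template.items()
    }


# Hypothetical template: memory grows with each attempt, the queue stays fixed.
template = {
    "memory": lambda attempt: attempt * 8,  # GB requested: 8, 16, 24, ...
    "queue": "short",
}

first_try = resolve_template(template, attempt=1)
third_try = resolve_template(template, attempt=3)
```

With something like this, a scheduler would resolve the template once per submission and pass the resulting plain values on unchanged.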
As a simpler approach, clustermq could just keep a log of all jobs that have completed successfully (or just those that failed), and the user could then set a parameter in `Q()` to run only the previously failed jobs (e.g., `Q(just.failed=TRUE)`). The user could then wrap `Q()` in a loop in which the resources in the template are increased on each iteration.
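The loop described here might look roughly like the following Python sketch. It is library-agnostic and makes several assumptions: `run_batch` is a hypothetical stand-in for whatever submits a batch of jobs and reports per-job success, and `base_memory_gb` and `max_attempts` are illustrative parameter names, not existing clustermq options.

```python
from typing import Any, Callable, Dict, Hashable, List, Tuple

# Hypothetical batch runner: takes the job ids and a memory request in GB,
# returns {job_id: (succeeded, value)} for every job it was given.
BatchRunner = Callable[[List[Hashable], int], Dict[Hashable, Tuple[bool, Any]]]


def run_with_retries(
    run_batch: BatchRunner,
    jobs: List[Hashable],
    base_memory_gb: int = 8,
    max_attempts: int = 3,
) -> Tuple[Dict[Hashable, Any], List[Hashable]]:
    """Re-run only the failed jobs, asking for more memory on each attempt."""
    results: Dict[Hashable, Any] = {}
    pending = list(jobs)
    for attempt in range(1, max_attempts + 1):
        if not pending:
            break
        memory_gb = base_memory_gb * attempt  # 8, 16, 24, ... GB
        outcome = run_batch(pending, memory_gb)
        still_failed = []
        for job in pending:
            succeeded, value = outcome[job]
            if succeeded:
                results[job] = value
            else:
                still_failed.append(job)
        pending = still_failed  # only these are resubmitted
    return results, pending


# Simulated cluster: each job succeeds only when given enough memory.
needed_gb = {"a": 8, "b": 16, "c": 100}


def simulated_batch(job_ids, memory_gb):
    return {j: (memory_gb >= needed_gb[j], f"result-{j}") for j in job_ids}


done, failed = run_with_retries(simulated_batch, ["a", "b", "c"])
```

In the simulated run, job `a` finishes on the first attempt, `b` on the second, and `c` is still reported as failed after three attempts, so the caller knows exactly which jobs remain.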
It appears that clustermq cluster array jobs have a related problem: if one of the N parallel cluster jobs (e.g., `n_jobs=20`) dies, the others continue, but clustermq doesn't bring the total number of running jobs back up to 20 (unlike snakemake).
In my case, this means that I currently have only 4 of 20 (`n_jobs=20`) jobs running, since 16 of the cluster jobs have died for one reason or another. Running 4 jobs isn't really efficient, or what I intended. These 4 remaining jobs have been running for >1 day, so I don't want to kill them and lose all of that computation.
What do others do when some of their hundreds or thousands of jobs fail? Do they always have to figure out which ones failed and then re-run just those? That's potentially a lot of extra code just to identify the failed jobs and re-run only those (or else re-run everything).
originally posted in #153 by @nick-youngblut
For the function-valued template approach, one would also need a `max_attempts` parameter for `Q()`.