Clustermq timeout at high number of targets workers #229
Oops, `workers = 2` failed again when I tried to scale up (using the real dataset instead of a truncated one). Common data is 8736.4 Mb... I read from the NEWS.md that this was improved, but I can't read much into those issues...
You can change the worker timeout: https://github.com/mschubert/clustermq/blob/master/vignettes/userguide.Rmd#L343-L344

Does that help?
Thanks! I think I have adjusted that to 6000 and will let you know! On a side note, I raised an issue here: ropensci/targets#251. I am curious what your take is on this? In the clustermq logs (one is produced for each target worker), I do not see signs of further jobs being submitted as a result of using foreach. Is there any place I can verify that?
Are you using foreach to submit more clustermq workers within targets that are already running within clustermq workers? Or am I misunderstanding, and you are using multicore parallelism within targets (recommended)?
@wlandau Mine is like this:

```r
test_foreach <- function(list_1, list_2) {
  df_2 <- list_2 %>% pluck(1)
  x <- foreach(df_1 = list_1, .combine = "rbind") %dopar% {
    # do some test
    tibble()
  }
  x
}

options(
  clustermq.scheduler = "lsf",
  clustermq.template = "_targets_lsf.tmpl",
  clustermq.worker.timeout = 6000
)
clustermq::register_dopar_cmq(n_jobs = 32, memory = 20000)

target <- list(
  tar_target(test, test_foreach(list_1, list_2), map(list_2)),
  tar_target(list_1, function_1),
  tar_target(list_2, function_2)
)
```

So I think I am using multicore parallelism within targets? P.S. I sourced the function from a script.
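For contrast with `register_dopar_cmq()` (which registers clustermq itself as the foreach backend and therefore submits a second layer of scheduler jobs), a minimal untested sketch of what multicore parallelism *inside* a worker would look like, using the doParallel package instead. This is an illustrative substitution, not what the thread's code does:

```r
# Local multicore foreach backend: the work is forked on the node
# running this R session, and no extra scheduler jobs are submitted.
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

x <- foreach(i = 1:8, .combine = "c") %dopar% {
  i^2  # placeholder for the real per-chunk work
}

print(x)
```

With `register_dopar_cmq()`, by contrast, each `%dopar%` loop inside a target asks the scheduler for up to `n_jobs` new workers of its own.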
@mschubert unfortunately, after I set that, here is an example output:

To add on: most jobs show a timeout after 600 seconds instead of 6000, suggesting that the option is not taking effect?
> options([...], clustermq.worker.timeout = 6000)

Your timeout is set for the master, but the worker accesses it via an option again (and hence doesn't know about it). You need to either:

- set the option in an `.Rprofile` that the worker R sessions also read, or
- pass the timeout to the worker call in your template file.

See here: https://github.com/mschubert/clustermq/blob/master/R/worker.r#L10-L11
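For the `.Rprofile` route, a minimal sketch, assuming the worker R sessions read the default profile (i.e. the template's R invocation does not pass `--vanilla` or `--no-init-file`) and that `~/.Rprofile` lives on a filesystem shared with the compute nodes:

```r
# ~/.Rprofile -- read by the master *and* worker R sessions
# (assumption: no --vanilla / --no-init-file in the template).
# Seconds a worker waits for the master before shutting down:
options(clustermq.worker.timeout = 6000)
```

Note that `--no-save --no-restore`, which the stock templates pass, does not suppress `.Rprofile`; only `--vanilla`/`--no-init-file` would.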
Thanks so much! I will add that to my .Rprofile for now. How can I do it via the template? I am not familiar with the template file, so I would really appreciate some resources on this so I do not have to bother you that much ><
Do you add it on this line?

But I am not too sure where to add it, re: the curly brackets.
It would be (untested):

```sh
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'
```
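For context, a sketch of where that line sits in a full LSF template, loosely based on the stock `lsf.tmpl` shipped with clustermq. The `{{ name | default }}` fields are placeholders filled in at submission time; the exact `#BSUB` directives here are illustrative and not verified against any particular cluster:

```
#BSUB-J {{ job_name }}[1-{{ n_jobs }}]
#BSUB-o {{ log_file | /dev/null }}
#BSUB-M {{ memory | 4096 }}
#BSUB-R rusage[mem={{ memory | 4096 }}]

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'
```

The `-J` line creates a job array of `n_jobs` workers, `-o` is the per-worker log (used when `log_worker = TRUE`), and `-M`/`-R` carry the memory request in MB.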
Thanks! I will test that later. For now though, after setting it in .Rprofile, the timeout issue might have been fixed, but there is a memory allocation error:

This is strange because in my tmpl file

and my clustermq register option:

both requested memory of 20000 MB, so that should have met the requirement and there should not be a memory issue? The master has 30 GB of memory and the error does not come from there (targets still appear to be running despite the workers having failed), so I do not think that is where the issue is...

I am adding

to .Rprofile as a test.
This error means that the worker cannot allocate 5.2 Gb on top of what it is already using. So most likely this brings you over 20 Gb total, and then it crashes.
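The arithmetic can be sketched in R. The 8736.4 Mb figure is from the earlier comment; the matrix below is a toy stand-in for the shipped data, not the thread's actual objects, and `object.size()` is base R (utils):

```r
# Toy stand-in for the common data shipped to each worker.
common <- matrix(0, nrow = 1000, ncol = 1000)
print(object.size(common), units = "Mb")  # roughly 7.6 Mb for this toy

# Back-of-envelope for the real case:
common_mb      <- 8736.4        # common data reported by clustermq
failed_alloc_mb <- 5.2 * 1024   # the allocation that failed
limit_mb       <- 20000         # memory requested per worker

headroom_mb <- limit_mb - common_mb  # ~11264 Mb left after the common data
# One 5.2 Gb allocation fits in that headroom, but intermediate copies
# (foreach results, rbind, etc.) on top of it can push past the limit.
print(headroom_mb - failed_alloc_mb)
```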
It's worth noting here that
Thank you both @mschubert @wlandau! I have put the following config

in both main

However, the jobs have been submitted for a few hours and I think something has gone wrong: one target usually only takes about 30 min, so I had to reduce the timeout again and see what the cause is...
To sum up, this does not seem to be a timeout issue but rather related to nesting. I will close this, but please reopen if you have a reproducible example that illustrates a problem on
I have a target that generates around 1000 branches, each of which contains a foreach loop using `%dopar%`. The default number of cores is 32; `n_jobs` is 32.

When submitting `tar_make_clustermq(workers = 32, log_worker = TRUE)`, all logs have timeout errors. When submitting the same targets script with `tar_make_clustermq(workers = 2, log_worker = TRUE)`, the jobs are done successfully.

I am wondering whether it is the overhead of targets with more workers that potentially leads to the timeout. I am running it on a cluster with a fair-use policy (so some jobs will be at the PEND stage at the start).

@wlandau, I think this is the issue you talked about before, where clustermq fails silently? Targets were still running in the interactive terminal but the jobs had been killed. I know @mschubert is working on this and I really appreciate that!