Clustermq timeout with a high number of targets workers #229

Closed
liutiming opened this issue Dec 30, 2020 · 17 comments

liutiming commented Dec 30, 2020

I have a target that generates around 1000 branches, each of which contains a foreach loop using %dopar%.

The default number of cores is 32 and n_jobs is 32.

When I submit tar_make_clustermq(workers = 32, log_worker = TRUE), every worker log shows a timeout error.

When I submit the same targets script with tar_make_clustermq(workers = 2, log_worker = TRUE), the jobs finish successfully. I am wondering whether it is the overhead of targets with more workers that leads to the timeout. I am running this on a cluster with a fair-use policy, so some jobs sit at the PEND stage at the start.

@wlandau, I think this is the issue you mentioned before, where clustermq fails silently? targets was still running in the interactive terminal, but the jobs had been killed. I know @mschubert is working on this and I really appreciate that!

liutiming commented Dec 30, 2020

Oops, workers = 2 failed again when I tried to scale up (using the real dataset instead of a truncated one). The common data is 8736.4 Mb... I read in NEWS.md that this was improved, but I will look further into those issues...

@mschubert

You can change the worker timeout: https://github.com/mschubert/clustermq/blob/master/vignettes/userguide.Rmd#L343-L344

Does that help?
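
For reference, a minimal sketch of raising the timeout in the session that launches the pipeline (the 6000 seconds below is only an example value):

options(clustermq.worker.timeout = 6000)  # timeout in seconds; example value only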

@liutiming

Thanks! I have adjusted it to 6000 and will let you know!

On a side note, I raised an issue here: ropensci/targets#251

I am curious what your take is on this. In the clustermq logs (one is produced for each targets worker), I do not see any sign of further jobs being submitted as a result of using foreach. Is there anywhere I can verify that?

wlandau commented Dec 30, 2020

Are you using foreach to submit more clustermq workers within targets that are already running within clustermq workers? Or am I misunderstanding and you are using multicore parallelism within targets (recommended)?

liutiming commented Dec 30, 2020

@wlandau My setup is like this:

library(targets)
library(foreach)
library(magrittr)  # for %>%
library(purrr)     # for pluck()
library(tibble)    # for tibble()

test_foreach <- function(list_1, list_2) {
    df_2 <- list_2 %>% pluck(1)
    x <- foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # do some test
        tibble()
    }
    x
}

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs = 32, memory = 20000)

target <- list(
    tar_target(test, test_foreach(list_1, list_2), pattern = map(list_2))
    , tar_target(list_1, function_1())  # function_1() and function_2() come from a sourced script
    , tar_target(list_2, function_2())
)

So I think I am using multicore parallelism within targets?

P.S. I sourced the function from a script

liutiming commented Dec 30, 2020

@mschubert Unfortunately, after I set clustermq.worker.timeout = 6000, most workers still exit with a timeout (some finished beforehand).

FYI, bc-30-2-12 is the node where I needed to start R; otherwise there would not be enough memory to send the jobs. The work is then submitted to other nodes. The R session was submitted with bsub -q yesterday -R"select[mem>30000] rusage[mem=30000]" -M30000 -Is R, and I am now trying to submit it to a normal queue to see if there is any change. (Update: submitting from another R session on a different queue did not help.) It is very strange, because previously there were no such timeout errors (maybe because the targets were generated quite quickly, unlike this time?) even though a lot of the submitted jobs were pending. This time the jobs are not pending, but they still time out.

Here is an example worker log:

> clustermq:::worker("tcp://bc-30-2-12:9332")
2020-12-30 22:11:15.917879 | Master: tcp://bc-30-2-12:9332
2020-12-30 22:11:15.940585 | WORKER_UP to: tcp://bc-30-2-12:9332
2020-12-30 22:12:14.044555 | > DO_SETUP (18.293s wait)
2020-12-30 22:12:14.046469 | token from msg: set_common_data_token
2020-12-30 22:18:58.082127 | > DO_CALL (58.606s wait)
2020-12-30 22:42:39.818437 | eval'd: target_run_workertarget
Error in clustermq:::worker("tcp://bc-30-2-12:9332") : 
  Timeout reached, terminating
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

@liutiming

To add on: most jobs time out after 600 seconds instead of 6000, which suggests the option is not being picked up?

@mschubert

options([...], clustermq.worker.timeout = 6000)

Your timeout is set for the master, but the worker reads the option again in its own R session (and hence doesn't know about it).

You need to either:

  • Set the clustermq.worker.timeout option in your ~/.Rprofile
  • Pass the argument via the template

See here: https://github.com/mschubert/clustermq/blob/master/R/worker.r#L10-L11
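
A minimal sketch of the first approach, assuming the compute nodes share your home directory so the worker R sessions read the same file:

# in ~/.Rprofile, read by both the master and the worker R sessions
options(clustermq.worker.timeout = 6000)  # seconds; example value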

@liutiming

Thanks so much! I will add that to my .Rprofile for now. How can I do it via the template? I am not familiar with the template file, so I would really appreciate some pointers to resources on this so that I do not have to bother you that much ><

liutiming commented Dec 31, 2020

Do you add it on this line?

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

but I am not too sure where to add it with respect to the curly brackets.

mschubert commented Dec 31, 2020

It would be (untested):

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'
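
For context, this would replace the final line of the scheduler template; a sketch of where it sits, based on the LSF template bundled with clustermq (the surrounding #BSUB lines are illustrative and may differ from your _targets_lsf.tmpl):

#BSUB-J {{ job_name }}[1-{{ n_jobs }}]   # job array, one task per worker
#BSUB-o {{ log_file | /dev/null }}       # worker log
#BSUB-M {{ memory | 4096 }}              # memory limit in Mbytes
#BSUB-R rusage[mem={{ memory | 4096 }}]  # memory reservation in Mbytes

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'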

@liutiming

Thanks! I will test that later.

For now, though, after setting it in .Rprofile, the timeout issue might have been fixed, but there is a memory allocation error:

> clustermq:::worker("tcp://node-12-2-3:7199")
2020-12-31 09:56:41.175863 | Master: tcp://node-12-2-3:7199
2020-12-31 09:56:41.194812 | WORKER_UP to: tcp://node-12-2-3:7199
2020-12-31 09:57:08.695212 | > DO_SETUP (17.643s wait)
2020-12-31 09:57:08.697145 | token from msg: set_common_data_token
2020-12-31 09:59:48.006753 | > DO_CALL (57.754s wait)
2020-12-31 10:23:31.016150 | eval'd: target_run_workertarget
Error: cannot allocate vector of size 5.2 Gb
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

This is strange because in my tmpl file

#BSUB-R "span[hosts=1] select[mem>20000] rusage[mem=20000]"  # Memory requirements in Mbytes

and clustermq register option:

clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

both request 20000 MB of memory, so that requirement should have been met and there should not be a memory issue?

The master has 30 GB of memory, and the error does not come from there (targets still appears to be running even though the workers have failed), so I do not think that is where the issue is...

@liutiming

I am adding

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

to .Rprofile as a test

mschubert commented Dec 31, 2020

Error: cannot allocate vector of size 5.2 Gb

This error means that the worker cannot allocate 5.2 Gb on top of what it is already using. So most likely this brings you over 20 Gb in total, and then it crashes.
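
A sketch of one way to give each worker more headroom (the 30000 Mb figure is only an example; the matching values in the template would need to be raised too):

# raise the per-worker memory request passed to the scheduler template (example value)
clustermq::register_dopar_cmq(n_jobs = 32, memory = 30000)
# ...and raise the corresponding mem values in the #BSUB-R select/rusage line of _targets_lsf.tmpl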

wlandau commented Dec 31, 2020

It's worth noting here that targets runs tar_make() and friends in a callr process (callr::r() by default). callr::r() ignores your personal global .Rprofile file, so the option would need to be set either in your project-specific .Rprofile or in _targets.R.
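
A minimal sketch of what the top of _targets.R could look like with the options set there (values taken from this thread, for illustration only):

# _targets.R -- read by the callr process that tar_make_clustermq() starts
library(targets)
options(
    clustermq.scheduler = "lsf",
    clustermq.template = "_targets_lsf.tmpl",
    clustermq.worker.timeout = 6000
)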

@liutiming

Thank you both @mschubert @wlandau!

I have put the following config

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

in both my main ~/.Rprofile and _targets.R. The log file now shows jobs being submitted like this, which I had never seen before, so I am not sure whether this is the right way to do it...

Submitting 32 worker jobs (ID: cmq7002) ...
Running 29,695 calculations (2 objs/0.1 Mb common; 1 calls/chunk) ...
Master: [3.9s 62.4% CPU]; Worker: [avg 10.2% CPU, max 248.6 Mb]

However, the jobs have been running for a few hours and I think something has gone wrong - one target usually takes only about 30 min, so I will have to reduce the timeout again and see what the cause is...

@mschubert

To sum up, this does not seem to be a timeout issue but rather one related to nesting: targets jobs that run an in-target foreach which itself submits jobs.

I will close this, but please reopen if you have a reproducible example that illustrates a problem on clustermq's side (aside from worker crashes not yet being tracked; that is #223).
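
For completeness, a sketch of what in-target multicore parallelism could look like, using doParallel instead of registering clustermq as the foreach backend; this is an assumed workaround, not something prescribed in this thread:

library(foreach)
library(doParallel)

# use the cores of the node each targets/clustermq worker runs on,
# rather than submitting a second layer of cluster jobs from inside the target
registerDoParallel(cores = 4)  # example core count

test_foreach <- function(list_1, list_2) {
    foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # do some test
        tibble::tibble()
    }
}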
