Clustermq timeout with a high number of targets workers #229

Closed
liutiming opened this issue Dec 30, 2020 · 17 comments

liutiming commented Dec 30, 2020

I have a target that generates around 1000 branches, each of which contains a foreach loop using %dopar%.

The default number of cores is 32 and n_jobs is 32.

When I submit tar_make_clustermq(workers = 32, log_worker = TRUE), every worker log shows a timeout error.

When I submit the same targets script with tar_make_clustermq(workers = 2, log_worker = TRUE), the jobs finish successfully. I am wondering whether it is the overhead of targets with more workers that leads to the timeout. I am running this on a cluster with a fair-use policy, so some jobs sit at the PEND stage at the start.

@wlandau, I think this is the issue you mentioned before, where clustermq fails silently? targets was still running in the interactive terminal, but the jobs had been killed. I know @mschubert is working on this and I really appreciate that!

liutiming commented Dec 30, 2020

Oops, workers = 2 failed again when I tried to scale up (using the real dataset instead of a truncated one). The common data is 8736.4 Mb... I read in NEWS.md that this was improved, but I will look further into those issues...

@mschubert

You can change the worker timeout: https://github.com/mschubert/clustermq/blob/master/vignettes/userguide.Rmd#L343-L344

Does that help?
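
For reference, a minimal sketch of raising the timeout in the session that launches the pipeline (the 6000 seconds below is only an example value):

options(clustermq.worker.timeout = 6000)  # timeout in seconds; example value only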

@liutiming

Thanks! I have adjusted it to 6000 and will let you know!

On a side note, I raised an issue here: ropensci/targets#251

I am curious what your take is on this. In the clustermq logs (one is produced for each targets worker), I do not see any sign of further jobs being submitted as a result of using foreach. Is there anywhere I can verify that?

wlandau commented Dec 30, 2020

Are you using foreach to submit more clustermq workers within targets that are already running within clustermq workers? Or am I misunderstanding and you are using multicore parallelism within targets (recommended)?

liutiming commented Dec 30, 2020

@wlandau My setup is like this:

library(targets)
library(foreach)
library(magrittr)  # for %>%
library(purrr)     # for pluck()
library(tibble)    # for tibble()

test_foreach <- function(list_1, list_2) {
    df_2 <- list_2 %>% pluck(1)
    x <- foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # do some test
        tibble()
    }
    x
}

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs = 32, memory = 20000)

target <- list(
    tar_target(test, test_foreach(list_1, list_2), pattern = map(list_2))
    , tar_target(list_1, function_1())  # function_1() and function_2() come from a sourced script
    , tar_target(list_2, function_2())
)

So I think I am using multicore parallelism within targets?

P.S. I sourced the function from a script

liutiming commented Dec 30, 2020

@mschubert Unfortunately, after I set clustermq.worker.timeout = 6000, most workers still exit with a timeout (some finished beforehand).

FYI, bc-30-2-12 is the node where I needed to start R; otherwise there would not be enough memory to send the jobs. The work is then submitted to other nodes. The R session was submitted with bsub -q yesterday -R"select[mem>30000] rusage[mem=30000]" -M30000 -Is R, and I am now trying to submit it to a normal queue to see if there is any change. (Update: submitting from another R session on a different queue did not help.) It is very strange, because previously there were no such timeout errors (maybe because the targets were generated quite quickly, unlike this time?) even though a lot of the submitted jobs were pending. This time the jobs are not pending, but they still time out.

Here is an example worker log:

> clustermq:::worker("tcp://bc-30-2-12:9332")
2020-12-30 22:11:15.917879 | Master: tcp://bc-30-2-12:9332
2020-12-30 22:11:15.940585 | WORKER_UP to: tcp://bc-30-2-12:9332
2020-12-30 22:12:14.044555 | > DO_SETUP (18.293s wait)
2020-12-30 22:12:14.046469 | token from msg: set_common_data_token
2020-12-30 22:18:58.082127 | > DO_CALL (58.606s wait)
2020-12-30 22:42:39.818437 | eval'd: target_run_workertarget
Error in clustermq:::worker("tcp://bc-30-2-12:9332") : 
  Timeout reached, terminating
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

@liutiming

To add on: most jobs time out after 600 seconds instead of 6000, which suggests the option is not being picked up?

@mschubert

options([...], clustermq.worker.timeout = 6000)

Your timeout is set for the master, but the worker reads the option again in its own R session (and hence doesn't know about it).

You need to either:

  • Set the clustermq.worker.timeout option in your ~/.Rprofile
  • Pass the argument via the template

See here: https://github.com/mschubert/clustermq/blob/master/R/worker.r#L10-L11
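
A minimal sketch of the first approach, assuming the compute nodes share your home directory so the worker R sessions read the same file:

# in ~/.Rprofile, read by both the master and the worker R sessions
options(clustermq.worker.timeout = 6000)  # seconds; example value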

@liutiming

Thanks so much! I will add that to my .Rprofile for now. How can I do it via the template? I am not familiar with the template file, so I would really appreciate some pointers to resources on this so that I do not have to bother you that much ><

liutiming commented Dec 31, 2020

Do you add it on this line?

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

but I am not too sure where to add it with respect to the curly brackets.

mschubert commented Dec 31, 2020

It would be (untested):

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'
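
For context, this would replace the final line of the scheduler template; a sketch of where it sits, based on the LSF template bundled with clustermq (the surrounding #BSUB lines are illustrative and may differ from your _targets_lsf.tmpl):

#BSUB-J {{ job_name }}[1-{{ n_jobs }}]   # job array, one task per worker
#BSUB-o {{ log_file | /dev/null }}       # worker log
#BSUB-M {{ memory | 4096 }}              # memory limit in Mbytes
#BSUB-R rusage[mem={{ memory | 4096 }}]  # memory reservation in Mbytes

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=getOption("clustermq.worker.timeout", 600))'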

@liutiming

Thanks! I will test that later.

For now, though, after setting it in .Rprofile, the timeout issue might have been fixed, but there is a memory allocation error:

> clustermq:::worker("tcp://node-12-2-3:7199")
2020-12-31 09:56:41.175863 | Master: tcp://node-12-2-3:7199
2020-12-31 09:56:41.194812 | WORKER_UP to: tcp://node-12-2-3:7199
2020-12-31 09:57:08.695212 | > DO_SETUP (17.643s wait)
2020-12-31 09:57:08.697145 | token from msg: set_common_data_token
2020-12-31 09:59:48.006753 | > DO_CALL (57.754s wait)
2020-12-31 10:23:31.016150 | eval'd: target_run_workertarget
Error: cannot allocate vector of size 5.2 Gb
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

This is strange because in my tmpl file

#BSUB-R "span[hosts=1] select[mem>20000] rusage[mem=20000]"  # Memory requirements in Mbytes

and clustermq register option:

clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

both request 20000 MB of memory, so that requirement should have been met and there should not be a memory issue?

The master has 30 GB of memory, and the error does not come from there (targets still appears to be running even though the workers have failed), so I do not think that is where the issue is...

@liutiming

I am adding

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

to .Rprofile as a test

mschubert commented Dec 31, 2020

Error: cannot allocate vector of size 5.2 Gb

This error means that the worker cannot allocate 5.2 Gb on top of what it is already using. So most likely this brings you over 20 Gb in total, and then it crashes.
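
A sketch of one way to give each worker more headroom (the 30000 Mb figure is only an example; the matching values in the template would need to be raised too):

# raise the per-worker memory request passed to the scheduler template (example value)
clustermq::register_dopar_cmq(n_jobs = 32, memory = 30000)
# ...and raise the corresponding mem values in the #BSUB-R select/rusage line of _targets_lsf.tmpl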

wlandau commented Dec 31, 2020

It's worth noting here that targets runs tar_make() and friends in a callr process (callr::r() by default). callr::r() ignores your personal global .Rprofile file, so the option would need to be set either in your project-specific .Rprofile or in _targets.R.
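
A minimal sketch of what the top of _targets.R could look like with the options set there (values taken from this thread, for illustration only):

# _targets.R -- read by the callr process that tar_make_clustermq() starts
library(targets)
options(
    clustermq.scheduler = "lsf",
    clustermq.template = "_targets_lsf.tmpl",
    clustermq.worker.timeout = 6000
)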

@liutiming

Thank you both @mschubert @wlandau!

I have put the following config

options(clustermq.scheduler = "lsf", clustermq.template = "_targets_lsf.tmpl", clustermq.worker.timeout = 6000)
clustermq::register_dopar_cmq(n_jobs=32, memory=20000)

in both my main ~/.Rprofile and _targets.R. The log file now shows jobs being submitted like this, which I had never seen before, so I am not sure whether this is the right way to do it...

Submitting 32 worker jobs (ID: cmq7002) ...
Running 29,695 calculations (2 objs/0.1 Mb common; 1 calls/chunk) ...
Master: [3.9s 62.4% CPU]; Worker: [avg 10.2% CPU, max 248.6 Mb]

However, the jobs have been running for a few hours and I think something has gone wrong - one target usually takes only about 30 min, so I will have to reduce the timeout again and see what the cause is...

@mschubert

To sum up, this does not seem to be a timeout issue but rather one related to nesting: targets jobs that run an in-target foreach which itself submits jobs.

I will close this, but please reopen if you have a reproducible example that illustrates a problem on clustermq's side (aside from worker crashes not yet being tracked; that is #223).
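
For completeness, a sketch of what in-target multicore parallelism could look like, using doParallel instead of registering clustermq as the foreach backend; this is an assumed workaround, not something prescribed in this thread:

library(foreach)
library(doParallel)

# use the cores of the node each targets/clustermq worker runs on,
# rather than submitting a second layer of cluster jobs from inside the target
registerDoParallel(cores = 4)  # example core count

test_foreach <- function(list_1, list_2) {
    foreach(df_1 = list_1, .combine = "rbind") %dopar% {
        # do some test
        tibble::tibble()
    }
}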
