Parallelise tasks with parallel #84

Open
dfalster opened this issue Apr 12, 2016 · 7 comments

Comments

@dfalster
Contributor

I know proper parallelisation is a pretty big, longer-term challenge for remake. But I wonder about this as a relatively easy implementation for the use case of iterating over a list. The idea came to me via @kunstler.

Let's say you have a target of the form

ret:
    command: my_wrapper(my_list)

where

my_wrapper <- function(input) {
    lapply(input, my_fun)
}

If this is slow, we might want to do some parallel compute. At present, we can force (trick) remake into doing this with a function like the one below (based on the parallel package):

my_wrapper_parallel <- function(input, ncores = detectCores() - 2) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  # NB: if my_fun uses other global objects, those would need to be sent
  # to the workers with clusterExport() first.
  parLapply(cl, input, my_fun)
}

Then the remake target would be

ret:
    command: my_wrapper_parallel(my_list)
    packages: parallel

So we can already do this. But then we'll be writing lots of wrapper functions, so why not make it a remake option?

Would probably need to come as part of a general list target (#8). E.g.

ret:
    command: remake_list_target(my_list, my_fun)
    parallel: parallel 

Eventually one might add other (more complicated) backends (like farming out to AWS), but starting with parallel seems like at least a potentially manageable shorter-term goal.
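
For concreteness, here's a rough sketch of what such a helper might do internally. remake_list_target() and its parallel argument are hypothetical names for illustration, not existing remake features; presumably remake itself would wire the backend choice through from the target's parallel: field.

library(parallel)

# Hypothetical helper: apply fun over input, optionally via a local cluster.
remake_list_target <- function(input, fun, parallel = c("none", "parallel"),
                               ncores = max(1L, detectCores() - 2L)) {
  parallel <- match.arg(parallel)
  if (parallel == "none") {
    return(lapply(input, fun))
  }
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  parLapply(cl, input, fun)
}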

@wlandau

wlandau commented May 20, 2016

Cool idea. Unfortunately, though, in a distributed computing scenario, parallel in a single R session only reaches a single node. Like you said, proper parallelization is a long-term challenge. What you really want is something like make with the -j flag. So for the short term, I wrote a package called parallelRemake. It generates multiple remake/YAML files and then creates an overarching Makefile to arrange them into parallelizable stages. That way, you can call make -j <whatever> to run parallel instances of remake. Clunky, I know, but it seems to work. There's also a function that helps automate the generation of remake/YAML files by producing them from named lists.

@wlandau

wlandau commented Jun 3, 2016

By the way, I also built workflowHelper on top of parallelRemake to handle certain kinds of common workflows in parallel without having to go through YAML (maybe a short-term special case of #20). Update: remakeGenerator is the successor to workflowHelper.

@dfalster
Contributor Author

dfalster commented Jun 5, 2016

Thanks for the suggestions @wlandau. As I see it, there are at least two levels of parallelisation remake needs:

  1. Local machine (single node)
  2. Remote (cluster) machines (multiple nodes)

The solutions to these might be different. Your suggestion was aimed at No 2 but would be a bit heavy for No 1.

I know No 2 has been on @richfitz's todo list, but it's good to see you have some ideas on this and have made a start via parallelRemake. One thing I was wondering was whether, instead of writing multiple remake files, you could write a Makefile that executes parts of the one master remake file.

So lets say your remake.yml file has:

targets:
  all:
    depends:
      - target1
      - target2

  target1:
    command: f1()
  target2:
    command: f2()

You could write a Makefile like this:

all: target1 target2

target1:  
    Rscript -e "remake::make('target1')"

target2:  
    Rscript -e "remake::make('target2')"

This would still enable you to exploit the make -j option, or alternatively submit jobs via a queuing system, while still working with a single remake file. To get it to work, you'd want to make sure you symlinked the .remake folder on each local node back to the master, so that it accessed any dependencies and also wrote results into the correct place.
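
On each node, that might be as little as the following (a hedged sketch; the master path is a placeholder):

# Point this node's .remake store at the master copy, so that dependencies
# are read from, and results written to, one shared location.
# "/path/to/master/.remake" stands in for the real master store.
if (!file.exists(".remake")) {
  file.symlink("/path/to/master/.remake", ".remake")
}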

@wlandau

wlandau commented Jun 6, 2016

Thanks for the great idea, @dfalster! I just implemented it in the single_yaml_file branch of parallelRemake, which I'll merge to master along with a major revision of workflowHelper. It's really gratifying how the whole structure cleaned up instantly.

Right now, I need to parse commands a bit more intelligently to figure out dependencies for the master Makefile, but the guts of remake should take care of that.

Edit: Now using remake's parse_command function to resolve dependencies for the master Makefile.

@wlandau

wlandau commented Jun 6, 2016

Both parallelRemake and workflowHelper now implement the suggestion by @dfalster on master. That was a quicker update than I thought it would be.

@wlandau

wlandau commented Nov 26, 2016

Regarding @dfalster's suggestion for a single-node solution, how hard would it be to resolve parallelizable groups of commands within the existing topological sort? With that accomplished, it would seem easy to iterate sequentially over groups and use parallel::mclapply() within groups.
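
A minimal sketch of that idea, assuming the graph is available as a named list mapping each target to its direct dependencies. topo_levels(), make_parallel(), and build_target() are hypothetical names, not remake internals:

library(parallel)

# Group targets into "levels": a target's level is 1 + the maximum level of
# its dependencies, so all targets within a level are mutually independent.
topo_levels <- function(deps) {
  level <- integer(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    ready <- remaining[vapply(remaining, function(t) {
      all(deps[[t]] %in% names(level))
    }, logical(1))]
    stopifnot(length(ready) > 0)  # a dependency cycle would leave nothing ready
    lvl <- vapply(ready, function(t) {
      if (length(deps[[t]]) == 0) 1L else max(level[deps[[t]]]) + 1L
    }, integer(1))
    level <- c(level, lvl)
    remaining <- setdiff(remaining, ready)
  }
  split(names(level), level)
}

# Run levels sequentially, targets within a level in parallel.
# (mclapply() forks, so this is for Unix-alikes; Windows would need a cluster.)
make_parallel <- function(deps, build_target, ncores = 2L) {
  for (group in topo_levels(deps)) {
    mclapply(group, build_target, mc.cores = ncores)
  }
  invisible(NULL)
}

With deps = list(a = character(0), b = character(0), c = c("a", "b")), targets a and b would build in parallel, followed by c.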

@richfitz
Owner

Hi Will - with the current interfaces available to us in the parallel package, the scope for using it for this is pretty narrow; it will work in a few use cases where the tree has a very particular shape, but in general you'd be lucky to get a 2x speed-up.

This problem was the motivation for some queuing packages that I wrote (rrqueue and rrq), but it's possible that Henrik's amazing-looking future package might be a better interface.
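
For reference, the wrapper from the top of this thread might look like the following with future. This is a sketch only, using future_lapply() (now in the future.apply package), with my_fun as the same placeholder as above:

library(future)
library(future.apply)  # provides future_lapply()

my_wrapper_future <- function(input) {
  # plan() selects the backend: multisession uses background R sessions on
  # the local machine; plan(cluster, workers = ...) can reach remote nodes.
  plan(multisession)
  future_lapply(input, my_fun)
}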
