vignettes/remakeGenerator.Rmd

---
title: "remakeGenerator"
author: "William Michael Landau"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{remakeGenerator}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Drake, the successor to remakeGenerator

[Drake](https://github.com/wlandau-lilly/drake) is a newer, standalone, [CRAN-published](https://CRAN.R-project.org/package=drake) [Make](https://www.gnu.org/software/make/)-like build system. It has the convenience of [remakeGenerator](https://github.com/wlandau/remakeGenerator), the reproducibility of [remake](https://github.com/richfitz/remake), and more comprehensive built-in parallel computing functionality than [parallelRemake](https://github.com/wlandau/parallelRemake).

# remakeGenerator

```{r, echo = F}
library(remakeGenerator)
```

The `remakeGenerator` package is a helper add-on for [`remake`](https://github.com/richfitz/remake), a [Makefile](https://www.gnu.org/software/make/)-like reproducible build system for R. If you haven't done so already, go learn [`remake`](https://github.com/richfitz/remake)! Once you do that, you will be ready to use `remakeGenerator`. With `remakeGenerator`, your long and cumbersome workflows will be

- **Quick to set up**. You can plan a large workflow with a small amount of code.
- **Reproducible**. Reproduce computation with `remake::make()` or [GNU Make](https://www.gnu.org/software/make/).
- **Development-friendly**. Thanks to [`remake`](https://github.com/richfitz/remake), whenever you change your code, your next computation will only run the parts that are new or out of date.
- **Parallelizable**. Distribute your workflow over multiple parallel processes with a single flag in [GNU Make](https://www.gnu.org/software/make/).

The `remakeGenerator` package accomplishes this by generating [YAML](http://yaml.org/) files for [`remake`](https://github.com/richfitz/remake) that would be too big to type manually.

# Rtools for Windows users

Windows users may need [`Rtools`](https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows) to take full advantage of `remakeGenerator`'s features, specifically to run [Makefiles](https://www.gnu.org/software/make/) with `system("make")`.


# Help and troubleshooting

Use the `help_remakeGenerator()` function to obtain a collection of helpful links. For troubleshooting, please refer to [TROUBLESHOOTING.md](https://github.com/wlandau/remakeGenerator/blob/master/TROUBLESHOOTING.md) on the [GitHub page](https://github.com/wlandau/remakeGenerator) for instructions.


# Basic example

Write the files for the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) using

```{r, eval = F}
library(remakeGenerator)
example_remakeGenerator("basic")
# list_examples_remakeGenerator() # Shows the names of available examples.
```

Run [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) to produce the [`remake`](https://github.com/richfitz/remake) file `remake.yml`, an overarching [Makefile](https://www.gnu.org/software/make/), and run the workflow using 2 parallel processes.

```{r, eval = F}
source("workflow.R")
```

To use [`remake`](https://github.com/richfitz/remake) directly in a single process, use

```{r, eval = F}
worflow(..., run = FALSE)
remake::make()
```

**Do not call the [Makefile](https://www.gnu.org/software/make/) directly in the Linux command line.** As explained in the [parallelRemake](https://github.com/wlandau/parallelRemake) vignette, you must use `workflow(..., command = "make", args = "--jobs=2")` or `parallelRemake::makefile(..., command = "make", args = "--jobs=4")`, etc. [parallelRemake](https://github.com/wlandau/parallelRemake) uses a quick overhead step to configure hidden files for the [Makefile](https://www.gnu.org/software/make/) before running it.


Notice how [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) and `remake.yml` rely on the functions defined in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R). To see how `remakeGenerator` saves you time, change the body of one of these functions (something more significant than whitespace or comments) and then run `remake::make()` again. Only the targets that depend on that function and downstream output are recomputed. If you only change whitespace or comments in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R), the next call to `remake::make()` will change nothing, so you can tidy and document your code without triggering unnecessary rebuilds.


# A walk through [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R)

[`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) is the master plan of the analysis. It arranges the helper functions in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R) to

1. Generate some datasets.

    ```{r, eval = F}
    library(remakeGenerator)
    datasets = commands(
      normal16 = normal_dataset(n = 16),
      poisson32 = poisson_dataset(n = 32),
      poisson64 = poisson_dataset(n = 64)
    )
    ```
    
2. Analyze each dataset with each of two methods of analysis.

    ```{r, eval = F}
    analyses = analyses(
      commands = commands(
        linear = linear_analysis(..dataset..),
        quadratic = quadratic_analysis(..dataset..)), 
      datasets = datasets)
    ```
    
3. Summarize each analysis of each dataset and gather the summaries into manageable objects.

    ```{r, eval = F}
    summaries = summaries(
      commands = commands(
        mse = mse_summary(..dataset.., ..analysis..),
        coef = coefficients_summary(..analysis..)), 
      analyses = analyses, datasets = datasets, gather = strings(c, rbind))
    ```
    
4. Compute output on the summaries.

    ```{r, eval = F}
    output = commands(coef.csv = write.csv(coef, target_name))
    ```
    
5. Generate plots.

    ```{r, eval = F}
    plots = commands(mse.pdf = hist(mse, col = I("black")))
    plots$plot = TRUE
    ```
    
6. Compile [`knitr`](http://yihui.name/knitr/) reports.

    ```{r, eval = F}
    reports = data.frame(target = strings(markdown.md, latex.tex),
      depends = c("poisson32, coef, coef.csv", ""))
    reports$knitr = TRUE
    ```
    
With these stages of the workflow planned, [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) collects all the 
[`remake`](https://github.com/richfitz/remake) targets into one [YAML](http://yaml.org/)-like list.

```{r, eval = F}
targets = targets(datasets = datasets, analyses = analyses, 
  summaries = summaries, output = output, plots = plots, reports = reports)
```

Finally, it generates the [`remake.yml`](https://github.com/richfitz/remake) file and an overarching [Makefile](https://www.gnu.org/software/make/). Then, unless `run = FALSE`, it runs the  [Makefile](https://www.gnu.org/software/make/) to deploy your workflow. In this case, from the `command` argument, you can see that the work is distributed over at most 2 parallel jobs. 

```{r, eval = F}
workflow(targets, sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make",
  args = "--jobs=2")
```

# Running intermediate stages

You can run each intermediate stages by themselves with the `make_these` argument in `workflow(...)`.

```{r, eval = F}
workflow(targets, make_these = "summaries", 
  sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
```

Bypassing the [Makefile](https://www.gnu.org/software/make/) using `run = FALSE` and running [remake.yml](https://github.com/richfitz/remake) directly does the same thing in a single R process.

```{r, eval = F}
workflow(targets, make_these = "summaries", run = FALSE
  sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
remake::make("summaries")
```

To remove the intermediate files and final results, run

```{r, eval = F}
remake::make("clean")
```

# The framework

At each stage (`datasets`, `analyses`, `summaries`, `mse`, etc.), the user supplies named R commands. The commands are then arranged into a data frame, such as the `datasets` data frame from the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic).

```{r, eval = F}
> datasets
     target                 command
1  normal16  normal_dataset(n = 16)
2 poisson32 poisson_dataset(n = 32)
3 poisson64 poisson_dataset(n = 64)
```

Above, each row stands for an individual [`remake`](https://github.com/richfitz/remake) target, and the `target` column contains the name of the target. Each command is the R function call that produces its respective target. With the exception of "`target`", each column of each  data frame represents a target-specific field in the [`remake.yml`](https://github.com/richfitz/remake) file. If additional fields are needed, just append the appropriate columns to the data frame. In [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R), the `plot` and `knitr` fields were added this way to the `plots` and `reports` data frames, respectively. Recall from [`remake`](https://github.com/richfitz/remake) that setting `plot` to `TRUE` automatically sends the output of the command to a file so you do not have to bother writing the code to save it.

```{r, eval = F}
> plots
   target                            command plot
1 mse.pdf hist(mse_vector, col = I("black")) TRUE
```

In addition, setting `knitr` to `TRUE` knits `.md` and `.tex` target files from `.Rmd` and `.Rnw` files, respectively.

```{r, eval = F}
> reports
       target                         depends knitr
1 markdown.md poisson32, coef, coef.csv  TRUE
2   latex.tex   
```

Above, and in the general case, each `depends` field is a character string of comma-separated [`remake`](https://github.com/richfitz/remake) dependencies. Dependencies that are arguments to commands are automatically resolved and should not be restated in `depends`. However, for [`knitr`](http://yihui.name/knitr/) reports, every dependency must be explicitly given in the `depends` field.

In generating the `analyses` and `summaries` data frames, you may have noticed the `..dataset..` and `..analysis..` symbols. Those are wildcard placeholders indicating that the respective commands will iterate over each dataset and each analysis of each dataset. The `analyses()` function turns 

```{r, eval = F}
> commands(linear = linear_analysis(..dataset..), quadratic = quadratic_analysis(..dataset..))
     target                         command
1    linear    linear_analysis(..dataset..)
2 quadratic quadratic_analysis(..dataset..)
```

into

```{r, eval = F}
               target                       command
1     linear_normal16     linear_analysis(normal16)
2    linear_poisson32    linear_analysis(poisson32)
3    linear_poisson64    linear_analysis(poisson64)
4  quadratic_normal16  quadratic_analysis(normal16)
5 quadratic_poisson32 quadratic_analysis(poisson32)
6 quadratic_poisson64 quadratic_analysis(poisson64)
```

and `summaries(..., gather = NULL)` turns

```{r, eval = F}
> commands(mse = mse_summary(..dataset.., ..analysis..), coef = coefficients_summary(..analysis..))
  target                                command
1    mse mse_summary(..dataset.., ..analysis..)
2   coef     coefficients_summary(..analysis..)
```

into

```{r, eval = F}
                     target                                     command
1       mse_linear_normal16      mse_summary(normal16, linear_normal16)
2      mse_linear_poisson32    mse_summary(poisson32, linear_poisson32)
3      mse_linear_poisson64    mse_summary(poisson64, linear_poisson64)
4    mse_quadratic_normal16   mse_summary(normal16, quadratic_normal16)
5   mse_quadratic_poisson32 mse_summary(poisson32, quadratic_poisson32)
6   mse_quadratic_poisson64 mse_summary(poisson64, quadratic_poisson64)
7      coef_linear_normal16       coefficients_summary(linear_normal16)
8     coef_linear_poisson32      coefficients_summary(linear_poisson32)
9     coef_linear_poisson64      coefficients_summary(linear_poisson64)
10  coef_quadratic_normal16    coefficients_summary(quadratic_normal16)
11 coef_quadratic_poisson32   coefficients_summary(quadratic_poisson32)
12 coef_quadratic_poisson64   coefficients_summary(quadratic_poisson64)
```

Setting the `gather` argument in `summaries()` to `c("c", "rbind")` prepends the following two rows to the above data frame.

```{r, eval = F}
    target
1   coef
2   mse

    command
1   rbind(coef_linear_normal16 = coef_linear_normal16, coef_linear_poisson32 = coef_linear_poisson32, coef_linear_poisson64 = coef_linear_poisson64, coef_quadratic_normal16 = coef_quadratic_normal16, coef_quadratic_poisson32 = coef_quadratic_poisson32, coef_quadratic_poisson64 = coef_quadratic_poisson64)
2   c(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)
```

These top two rows contain instructions to gather the summaries together into manageable objects. The default value of `gather` is a character vector with entries `"list"`.

# The "commands" functions

Functions `commands()`, `commands_string()`, and `commands_batch()` help create datasets as in previous section. 

```{r}
commands(x = f(1), y = g(2))
a = "f(1)"
b = "g(2)"
commands_string(x = a, y = b)
batch = c(x = a, y = b)
commands_batch(batch)
```

# Where is my output?

When your worflow runs, intermediate objects such as datasets, analyses, and summaries are maintained in [`remake`](https://github.com/richfitz/remake)'s hidden [`storr`](https://github.com/richfitz/storr) cache, located in the hidden `.remake/objects/` folder. To inspect your workflow, you can list the generated objects using `parallelRemake::recallable()` and load objects using `parallelRemake::recall()`. After running the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic), we see the following.

```{r, eval = F}
> library(parallelRemake)
> recallable()
 [1] "coef"                     "coef_linear_normal16"    
 [3] "coef_linear_poisson32"    "coef_linear_poisson64"   
 [5] "coef_quadratic_normal16"  "coef_quadratic_poisson32"
 [7] "coef_quadratic_poisson64" "linear_normal16"         
 [9] "linear_poisson32"         "linear_poisson64"        
[11] "mse"                      "mse_linear_normal16"     
[13] "mse_linear_poisson32"     "mse_linear_poisson64"    
[15] "mse_quadratic_normal16"   "mse_quadratic_poisson32" 
[17] "mse_quadratic_poisson64"  "normal16"                
[19] "poisson32"                "poisson64"               
[21] "quadratic_normal16"       "quadratic_poisson32"     
[23] "quadratic_poisson64"   
> recall("normal16")
            x        y
1   1.5500328 4.226192
2   1.4714371 4.374820
3   0.4906371 6.228053
4   1.0086720 4.945609
5   1.3360642 5.619259
6   1.4899272 4.920836
7   0.7046544 4.926668
8   1.4092923 4.030779
9   2.5636956 6.026149
10 -0.5202316 4.368160
11  0.5540340 4.760691
12  1.6256007 4.722436
13  1.3210316 3.838017
14  0.8247446 2.708511
15  2.7262725 5.878415
16  2.3565342 4.445811
> out = recall("normal16", "poisson32")
> str(out)
List of 2
 $ normal16 :'data.frame':	16 obs. of  2 variables:
  ..$ x: num [1:16] 0.9728 1.0688 1.4152 -0.4313 0.0912 ...
  ..$ y: num [1:16] 6.76 6.48 5.59 5.03 3.01 ...
 $ poisson32:'data.frame':	32 obs. of  2 variables:
  ..$ x: int [1:32] 0 2 1 0 0 2 1 0 0 1 ...
  ..$ y: int [1:32] 4 4 5 4 3 7 4 4 5 2 ...
```

The functions `create_bindings()` and `make_environment()` are alternatives from [`remake`](https://github.com/richfitz/remake) itself. Just be careful with `create_bindings()` if your project has a lot of data.

**Do not use `recall()` or `recallable()` in serious production-level workflows because operations on the [`storr`](https://github.com/richfitz/storr) cache are not reproducibly tracked.**

# High-performance computing

If you want to run Make to distribute tasks over multiple nodes of a [Slurm](http://slurm.schedmd.com/) cluster, you should generate a Makefile like the one in [this post](http://plindenbaum.blogspot.com/2014/09/parallelizing-gnu-make-4-in-slurm.html).
To do this, add the following to an R script (say, `my_script.R`)

```{r, eval = F}
workflow(..., command = "make", args = "--jobs=8",
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS= <ARGS> bash -c"))
```

where `<ARGS>` stands for additional arguments to `srun`. Then, deploy your parallelized workflow to the cluster using the following [Linux command](http://linuxcommand.org/).

```r
nohup nice -19 R CMD BATCH my_script.R &
```

For other task managers such as [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System), you  may have to create a custom stand-in for a shell. 
For example, suppose we are using the Univa Grid Engine. In `my_script.R`, call

```r
workflow(.., command = "make", args = "--jobs=8",
  begin = "SHELL = ./shell.sh")
```

where the file `shell.sh` contains

```r
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
```

Now, in the Linux command line, enable execution with

```r
chmod +x shell.sh
```

and then distribute the work over `[N]` simultaneous jobs with

```r
nohup nice -19 R CMD BATCH my_script.R &
```

The same approach should work for [LSF systems](https://en.wikipedia.org/wiki/Platform_LSF), where `make` replaced by [lsmake](https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_admin/lsmake_how_works_lsf.html) and the [Makefile](https://www.gnu.org/software/make/) is compatible.

Regardless of the system, be sure that all nodes point to the same working directory so that they share the same `.remake` [storr](https://github.com/richfitz/storr) cache. For the Univa Grid Engine, the `-cwd` flag for `qsub` accomplishes this.


# downsize

You can use the [`downsize`](https://CRAN.R-project.org/package=downsize) package in conjunction with `remakeGenerator`. First, write an R script (say, `downsize.R`) to set test or production mode.

```r
# downsize::test_mode()
downsize::production_mode()
```

Load `downsize.R` into  [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) to make your analysis plan respond to `downsize()`.

```{r, eval = F}
library(remakeGenerator)
source("downsize.R")
datasets = commands_string(
  target = "data1",
  command = paste0("long_job(number_of_samples = ", downsize(1000, 2), ")")
)
```

If your custom [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R) functions call `downsize()` internally, [`remake`](https://github.com/richfitz/remake) needs to know.

```r
workflow(sources = c("downsize.R", "code.R", ...), packages = c("downsize", ...))
```

Unfortunately, [`remake`](https://github.com/richfitz/remake) does not rebuild targets in response to changes to global options, so you should manually run `remake::make("clean")` to start from scratch whenever you change `downsize.R`.


# Flexibility

Some workflows do not fit the rigid structure of the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) but could still benefit from the automated generation of [`remake.yml`](https://github.com/richfitz/remake) files and [Makefiles](https://www.gnu.org/software/make/). If you supply the appropriate data frames to the `targets()` function, you can customize your own analyses. Here, the `expand()` and `evaluate()` functions are essential to flexibility. The `expand()` function replicates targets generated by the same commands, and the `evaluate()` function lets you create and evaluate your own wildcard placeholders. With the `rules` argument, the `evaluate()` funcion is also capable of evaluating multiple wildcards in a single function call. (In this case, `rules` takes precedence, and the `wildcard` and `values` arguments are ignored.) Here are some examples.

```{r, echo = F}
library(remakeGenerator)
```

```{r}
df = commands(data = simulate(center = MU, scale = SIGMA))
df
df = expand(df, values = c("rep1", "rep2"))
df
evaluate(df, wildcard = "MU", values = 1:2)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
```

For another demonstration, see the [flexible example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/flexible), which almost the same as the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) except that it uses `expand()` and `evaluate()` explicitly.