---
title: "remakeGenerator"
author: "William Michael Landau"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{remakeGenerator}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
# Drake, the successor to remakeGenerator
[Drake](https://github.com/wlandau-lilly/drake) is a newer, standalone, [CRAN-published](https://CRAN.R-project.org/package=drake) [Make](https://www.gnu.org/software/make/)-like build system. It has the convenience of [remakeGenerator](https://github.com/wlandau/remakeGenerator), the reproducibility of [remake](https://github.com/richfitz/remake), and more comprehensive built-in parallel computing functionality than [parallelRemake](https://github.com/wlandau/parallelRemake).
# remakeGenerator
```{r, echo = F}
library(remakeGenerator)
```
The `remakeGenerator` package is a helper add-on for [`remake`](https://github.com/richfitz/remake), a [Makefile](https://www.gnu.org/software/make/)-like reproducible build system for R. If you haven't done so already, go learn [`remake`](https://github.com/richfitz/remake)! Once you do that, you will be ready to use `remakeGenerator`. With `remakeGenerator`, your long and cumbersome workflows will be
- **Quick to set up**. You can plan a large workflow with a small amount of code.
- **Reproducible**. Reproduce computation with `remake::make()` or [GNU Make](https://www.gnu.org/software/make/).
- **Development-friendly**. Thanks to [`remake`](https://github.com/richfitz/remake), whenever you change your code, your next computation will only run the parts that are new or out of date.
- **Parallelizable**. Distribute your workflow over multiple parallel processes with a single flag in [GNU Make](https://www.gnu.org/software/make/).
The `remakeGenerator` package accomplishes this by generating [YAML](http://yaml.org/) files for [`remake`](https://github.com/richfitz/remake) that would be too big to type manually.
# Rtools for Windows users
Windows users may need [`Rtools`](https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows) to take full advantage of `remakeGenerator`'s features, specifically to run [Makefiles](https://www.gnu.org/software/make/) with `system("make")`.
# Help and troubleshooting
Use the `help_remakeGenerator()` function to obtain a collection of helpful links. For troubleshooting, please refer to [TROUBLESHOOTING.md](https://github.com/wlandau/remakeGenerator/blob/master/TROUBLESHOOTING.md) on the [GitHub page](https://github.com/wlandau/remakeGenerator) for instructions.
# Basic example
Write the files for the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) using
```{r, eval = F}
library(remakeGenerator)
example_remakeGenerator("basic")
# list_examples_remakeGenerator() # Shows the names of available examples.
```
Run [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) to produce the [`remake`](https://github.com/richfitz/remake) file `remake.yml` and an overarching [Makefile](https://www.gnu.org/software/make/), and then run the workflow using 2 parallel processes.
```{r, eval = F}
source("workflow.R")
```
To use [`remake`](https://github.com/richfitz/remake) directly in a single process, use
```{r, eval = F}
workflow(..., run = FALSE)
remake::make()
```
**Do not call the [Makefile](https://www.gnu.org/software/make/) directly in the Linux command line.** As explained in the [parallelRemake](https://github.com/wlandau/parallelRemake) vignette, you must use `workflow(..., command = "make", args = "--jobs=2")` or `parallelRemake::makefile(..., command = "make", args = "--jobs=4")`, etc. [parallelRemake](https://github.com/wlandau/parallelRemake) uses a quick overhead step to configure hidden files for the [Makefile](https://www.gnu.org/software/make/) before running it.
Notice how [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) and `remake.yml` rely on the functions defined in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R). To see how `remakeGenerator` saves you time, change the body of one of these functions (something more significant than whitespace or comments) and then run `remake::make()` again. Only the targets that depend on that function and downstream output are recomputed. If you only change whitespace or comments in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R), the next call to `remake::make()` will change nothing, so you can tidy and document your code without triggering unnecessary rebuilds.
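One way to picture why comment and whitespace edits do not trigger rebuilds is to compare functions by their parsed code rather than by their source text. The sketch below uses only base R. The helper `same_code()` and the three toy functions are hypothetical, and this illustrates the idea only; it is not a description of how `remake` fingerprints functions internally.

```r
# Three versions of the same toy function.
f_original <- function(x) {
  x + 1
}

f_tidied <- function(x) {
  # Add one to x (comment and spacing changes only).
  x   +   1
}

f_changed <- function(x) {
  x + 2
}

# Compare the parsed bodies; comments and spacing are not part of the parse tree.
same_code <- function(f, g) {
  identical(deparse(body(f)), deparse(body(g)))
}

tidied_is_same <- same_code(f_original, f_tidied)    # TRUE
changed_is_same <- same_code(f_original, f_changed)  # FALSE
```

Under this view, only `f_changed` would count as a real change to the code.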
# A walk through [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R)
[`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) is the master plan of the analysis. It arranges the helper functions in [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R) to
1. Generate some datasets.
```{r, eval = F}
library(remakeGenerator)
datasets = commands(
normal16 = normal_dataset(n = 16),
poisson32 = poisson_dataset(n = 32),
poisson64 = poisson_dataset(n = 64)
)
```
2. Analyze each dataset with each of two methods of analysis.
```{r, eval = F}
analyses = analyses(
commands = commands(
linear = linear_analysis(..dataset..),
quadratic = quadratic_analysis(..dataset..)),
datasets = datasets)
```
3. Summarize each analysis of each dataset and gather the summaries into manageable objects.
```{r, eval = F}
summaries = summaries(
commands = commands(
mse = mse_summary(..dataset.., ..analysis..),
coef = coefficients_summary(..analysis..)),
analyses = analyses, datasets = datasets, gather = strings(c, rbind))
```
4. Compute output on the summaries.
```{r, eval = F}
output = commands(coef.csv = write.csv(coef, target_name))
```
5. Generate plots.
```{r, eval = F}
plots = commands(mse.pdf = hist(mse, col = I("black")))
plots$plot = TRUE
```
6. Compile [`knitr`](http://yihui.name/knitr/) reports.
```{r, eval = F}
reports = data.frame(target = strings(markdown.md, latex.tex),
depends = c("poisson32, coef, coef.csv", ""))
reports$knitr = TRUE
```
With these stages of the workflow planned, [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) collects all the
[`remake`](https://github.com/richfitz/remake) targets into one [YAML](http://yaml.org/)-like list.
```{r, eval = F}
targets = targets(datasets = datasets, analyses = analyses,
summaries = summaries, output = output, plots = plots, reports = reports)
```
Finally, it generates the [`remake.yml`](https://github.com/richfitz/remake) file and an overarching [Makefile](https://www.gnu.org/software/make/). Then, unless `run = FALSE`, it runs the [Makefile](https://www.gnu.org/software/make/) to deploy your workflow. In this case, from the `command` argument, you can see that the work is distributed over at most 2 parallel jobs.
```{r, eval = F}
workflow(targets, sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make",
args = "--jobs=2")
```
# Running intermediate stages
You can run each intermediate stage by itself with the `make_these` argument in `workflow(...)`.
```{r, eval = F}
workflow(targets, make_these = "summaries",
sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
```
Bypassing the [Makefile](https://www.gnu.org/software/make/) using `run = FALSE` and running [remake.yml](https://github.com/richfitz/remake) directly does the same thing in a single R process.
```{r, eval = F}
workflow(targets, make_these = "summaries", run = FALSE,
sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
remake::make("summaries")
```
To remove the intermediate files and final results, run
```{r, eval = F}
remake::make("clean")
```
# The framework
At each stage (`datasets`, `analyses`, `summaries`, `mse`, etc.), the user supplies named R commands. The commands are then arranged into a data frame, such as the `datasets` data frame from the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic).
```{r, eval = F}
> datasets
target command
1 normal16 normal_dataset(n = 16)
2 poisson32 poisson_dataset(n = 32)
3 poisson64 poisson_dataset(n = 64)
```
Above, each row stands for an individual [`remake`](https://github.com/richfitz/remake) target, and the `target` column contains the name of the target. Each command is the R function call that produces its respective target. With the exception of "`target`", each column of each data frame represents a target-specific field in the [`remake.yml`](https://github.com/richfitz/remake) file. If additional fields are needed, just append the appropriate columns to the data frame. In [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R), the `plot` and `knitr` fields were added this way to the `plots` and `reports` data frames, respectively. Recall from [`remake`](https://github.com/richfitz/remake) that setting `plot` to `TRUE` automatically sends the output of the command to a file so you do not have to bother writing the code to save it.
```{r, eval = F}
> plots
target command plot
1 mse.pdf hist(mse, col = I("black")) TRUE
```
In addition, setting `knitr` to `TRUE` knits `.md` and `.tex` target files from `.Rmd` and `.Rnw` files, respectively.
```{r, eval = F}
> reports
target depends knitr
1 markdown.md poisson32, coef, coef.csv TRUE
2 latex.tex TRUE
```
Above, and in the general case, each `depends` field is a character string of comma-separated [`remake`](https://github.com/richfitz/remake) dependencies. Dependencies that are arguments to commands are automatically resolved and should not be restated in `depends`. However, for [`knitr`](http://yihui.name/knitr/) reports, every dependency must be explicitly given in the `depends` field.
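To make the mapping from data frame columns to target-specific fields more concrete, here is a minimal base R sketch that prints a YAML-like fragment for the `plots` data frame. The helper `to_yaml_lines()` is hypothetical and is not part of `remakeGenerator`; the real `remake.yml` contains more than this fragment (for example, sources and packages).

```r
# The plots data frame from the basic example, rebuilt by hand.
plots <- data.frame(
  target = "mse.pdf",
  command = "hist(mse, col = I(\"black\"))",
  plot = TRUE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: every column except "target" becomes a field of its target.
to_yaml_lines <- function(df) {
  fields <- setdiff(names(df), "target")
  lines <- "targets:"
  for (i in seq_len(nrow(df))) {
    lines <- c(lines, paste0("  ", df$target[i], ":"))
    for (field in fields) {
      value <- df[[field]][i]
      # Skip empty fields so they do not appear in the output.
      if (is.na(value) || identical(as.character(value), "")) next
      lines <- c(lines, paste0("    ", field, ": ", value))
    }
  }
  lines
}

yaml_lines <- to_yaml_lines(plots)
writeLines(yaml_lines)
```

Appending another column to the data frame would add another field under each target in the same way.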
In generating the `analyses` and `summaries` data frames, you may have noticed the `..dataset..` and `..analysis..` symbols. Those are wildcard placeholders indicating that the respective commands will iterate over each dataset and each analysis of each dataset. The `analyses()` function turns
```{r, eval = F}
> commands(linear = linear_analysis(..dataset..), quadratic = quadratic_analysis(..dataset..))
target command
1 linear linear_analysis(..dataset..)
2 quadratic quadratic_analysis(..dataset..)
```
into
```{r, eval = F}
target command
1 linear_normal16 linear_analysis(normal16)
2 linear_poisson32 linear_analysis(poisson32)
3 linear_poisson64 linear_analysis(poisson64)
4 quadratic_normal16 quadratic_analysis(normal16)
5 quadratic_poisson32 quadratic_analysis(poisson32)
6 quadratic_poisson64 quadratic_analysis(poisson64)
```
and `summaries(..., gather = NULL)` turns
```{r, eval = F}
> commands(mse = mse_summary(..dataset.., ..analysis..), coef = coefficients_summary(..analysis..))
target command
1 mse mse_summary(..dataset.., ..analysis..)
2 coef coefficients_summary(..analysis..)
```
into
```{r, eval = F}
target command
1 mse_linear_normal16 mse_summary(normal16, linear_normal16)
2 mse_linear_poisson32 mse_summary(poisson32, linear_poisson32)
3 mse_linear_poisson64 mse_summary(poisson64, linear_poisson64)
4 mse_quadratic_normal16 mse_summary(normal16, quadratic_normal16)
5 mse_quadratic_poisson32 mse_summary(poisson32, quadratic_poisson32)
6 mse_quadratic_poisson64 mse_summary(poisson64, quadratic_poisson64)
7 coef_linear_normal16 coefficients_summary(linear_normal16)
8 coef_linear_poisson32 coefficients_summary(linear_poisson32)
9 coef_linear_poisson64 coefficients_summary(linear_poisson64)
10 coef_quadratic_normal16 coefficients_summary(quadratic_normal16)
11 coef_quadratic_poisson32 coefficients_summary(quadratic_poisson32)
12 coef_quadratic_poisson64 coefficients_summary(quadratic_poisson64)
```
Setting the `gather` argument in `summaries()` to `c("c", "rbind")` prepends the following two rows to the above data frame.
```{r, eval = F}
target
1 coef
2 mse
command
1 rbind(coef_linear_normal16 = coef_linear_normal16, coef_linear_poisson32 = coef_linear_poisson32, coef_linear_poisson64 = coef_linear_poisson64, coef_quadratic_normal16 = coef_quadratic_normal16, coef_quadratic_poisson32 = coef_quadratic_poisson32, coef_quadratic_poisson64 = coef_quadratic_poisson64)
2 c(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)
```
These top two rows contain instructions to gather the summaries together into manageable objects. The default value of `gather` is a character vector with entries `"list"`.
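The wildcard expansion and the gathering step can be imitated in a few lines of base R. This is only a sketch of the behavior shown in the tables above, not the package's implementation; `gather_command()` and the intermediate objects are hypothetical names.

```r
datasets <- c("normal16", "poisson32", "poisson64")
analysis_commands <- c(
  linear = "linear_analysis(..dataset..)",
  quadratic = "quadratic_analysis(..dataset..)"
)

# Cross every analysis with every dataset (the dataset varies fastest).
grid <- expand.grid(
  dataset = datasets,
  analysis = names(analysis_commands),
  stringsAsFactors = FALSE
)

# Replace the ..dataset.. placeholder and build the expanded target names.
analyses <- data.frame(
  target = paste(grid$analysis, grid$dataset, sep = "_"),
  command = mapply(
    function(analysis, dataset) {
      gsub("..dataset..", dataset, analysis_commands[[analysis]], fixed = TRUE)
    },
    grid$analysis, grid$dataset, USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a command such as c(a = a, b = b) from target names.
gather_command <- function(fun, targets) {
  args <- paste(targets, targets, sep = " = ", collapse = ", ")
  paste0(fun, "(", args, ")")
}

mse_targets <- paste("mse", analyses$target, sep = "_")
mse_gather <- gather_command("c", mse_targets)
```

Here `analyses` has the same six rows as the expanded table above, and `mse_gather` has the same form as the `mse` gathering command.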
# The "commands" functions
The functions `commands()`, `commands_string()`, and `commands_batch()` help create data frames of targets and commands like the ones in the previous section.
```{r}
commands(x = f(1), y = g(2))
a = "f(1)"
b = "g(2)"
commands_string(x = a, y = b)
batch = c(x = a, y = b)
commands_batch(batch)
```
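The difference between the unquoted and string interfaces comes down to whether the commands are captured as unevaluated expressions or passed as character strings. The base R sketch below shows one common way to capture named expressions without evaluating them. `capture_commands()` is a hypothetical stand-in and may not match how `commands()` is written.

```r
# Hypothetical stand-in: turn named, unevaluated expressions into a data frame.
capture_commands <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]  # drop the leading `list` symbol
  data.frame(
    target = names(exprs),
    command = vapply(
      exprs,
      function(e) paste(deparse(e), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# f() and g() never need to exist, because the calls are not evaluated.
captured <- capture_commands(x = f(1), y = g(2))

# The same table built directly from a named character vector.
batch <- c(x = "f(1)", y = "g(2)")
from_strings <- data.frame(
  target = names(batch),
  command = unname(batch),
  stringsAsFactors = FALSE
)
```

Both objects hold the same targets and commands, which is why the string versions are convenient when commands are assembled programmatically.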
# Where is my output?
When your workflow runs, intermediate objects such as datasets, analyses, and summaries are maintained in [`remake`](https://github.com/richfitz/remake)'s hidden [`storr`](https://github.com/richfitz/storr) cache, located in the hidden `.remake/objects/` folder. To inspect your workflow, you can list the generated objects using `parallelRemake::recallable()` and load objects using `parallelRemake::recall()`. After running the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic), we see the following.
```{r, eval = F}
> library(parallelRemake)
> recallable()
[1] "coef" "coef_linear_normal16"
[3] "coef_linear_poisson32" "coef_linear_poisson64"
[5] "coef_quadratic_normal16" "coef_quadratic_poisson32"
[7] "coef_quadratic_poisson64" "linear_normal16"
[9] "linear_poisson32" "linear_poisson64"
[11] "mse" "mse_linear_normal16"
[13] "mse_linear_poisson32" "mse_linear_poisson64"
[15] "mse_quadratic_normal16" "mse_quadratic_poisson32"
[17] "mse_quadratic_poisson64" "normal16"
[19] "poisson32" "poisson64"
[21] "quadratic_normal16" "quadratic_poisson32"
[23] "quadratic_poisson64"
> recall("normal16")
x y
1 1.5500328 4.226192
2 1.4714371 4.374820
3 0.4906371 6.228053
4 1.0086720 4.945609
5 1.3360642 5.619259
6 1.4899272 4.920836
7 0.7046544 4.926668
8 1.4092923 4.030779
9 2.5636956 6.026149
10 -0.5202316 4.368160
11 0.5540340 4.760691
12 1.6256007 4.722436
13 1.3210316 3.838017
14 0.8247446 2.708511
15 2.7262725 5.878415
16 2.3565342 4.445811
> out = recall("normal16", "poisson32")
> str(out)
List of 2
$ normal16 :'data.frame': 16 obs. of 2 variables:
..$ x: num [1:16] 0.9728 1.0688 1.4152 -0.4313 0.0912 ...
..$ y: num [1:16] 6.76 6.48 5.59 5.03 3.01 ...
$ poisson32:'data.frame': 32 obs. of 2 variables:
..$ x: int [1:32] 0 2 1 0 0 2 1 0 0 1 ...
..$ y: int [1:32] 4 4 5 4 3 7 4 4 5 2 ...
```
The functions `create_bindings()` and `make_environment()` are alternatives from [`remake`](https://github.com/richfitz/remake) itself. Just be careful with `create_bindings()` if your project has a lot of data.
**Do not use `recall()` or `recallable()` in serious production-level workflows because operations on the [`storr`](https://github.com/richfitz/storr) cache are not reproducibly tracked.**
# High-performance computing
If you want to run Make to distribute tasks over multiple nodes of a [Slurm](http://slurm.schedmd.com/) cluster, you should generate a Makefile like the one in [this post](http://plindenbaum.blogspot.com/2014/09/parallelizing-gnu-make-4-in-slurm.html).
To do this, add the following to an R script (say, `my_script.R`)
```{r, eval = F}
workflow(..., command = "make", args = "--jobs=8",
prepend = c(
"SHELL=srun",
".SHELLFLAGS= <ARGS> bash -c"))
```
where `<ARGS>` stands for additional arguments to `srun`. Then, deploy your parallelized workflow to the cluster using the following [Linux command](http://linuxcommand.org/).
```bash
nohup nice -19 R CMD BATCH my_script.R &
```
For other task managers such as [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System), you may have to create a custom stand-in for a shell.
For example, suppose we are using the Univa Grid Engine. In `my_script.R`, call
```r
workflow(..., command = "make", args = "--jobs=8",
prepend = "SHELL = ./shell.sh")
```
where the file `shell.sh` contains
```bash
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
```
Now, in the Linux command line, enable execution with
```bash
chmod +x shell.sh
```
and then distribute the work over at most 8 simultaneous jobs (the `--jobs=8` setting in `my_script.R`) with
```bash
nohup nice -19 R CMD BATCH my_script.R &
```
The same approach should work for [LSF systems](https://en.wikipedia.org/wiki/Platform_LSF), where `make` is replaced by [lsmake](https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_admin/lsmake_how_works_lsf.html) and the [Makefile](https://www.gnu.org/software/make/) is compatible.
Regardless of the system, be sure that all nodes point to the same working directory so that they share the same `.remake` [storr](https://github.com/richfitz/storr) cache. For the Univa Grid Engine, the `-cwd` flag for `qsub` accomplishes this.
# downsize
You can use the [`downsize`](https://CRAN.R-project.org/package=downsize) package in conjunction with `remakeGenerator`. First, write an R script (say, `downsize.R`) to set test or production mode.
```r
# downsize::test_mode()
downsize::production_mode()
```
Load `downsize.R` into [`workflow.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/workflow.R) to make your analysis plan respond to `downsize()`.
```{r, eval = F}
library(remakeGenerator)
library(downsize) # Provides downsize().
source("downsize.R")
# Each argument name is a target, and its value is the command as a string.
datasets = commands_string(
data1 = paste0("long_job(number_of_samples = ", downsize(1000, 2), ")")
)
```
If your custom [`code.R`](https://github.com/wlandau/remakeGenerator/blob/master/inst/examples/basic/code.R) functions call `downsize()` internally, [`remake`](https://github.com/richfitz/remake) needs to know.
```r
workflow(sources = c("downsize.R", "code.R", ...), packages = c("downsize", ...))
```
Unfortunately, [`remake`](https://github.com/richfitz/remake) does not rebuild targets in response to changes to global options, so you should manually run `remake::make("clean")` to start from scratch whenever you change `downsize.R`.
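The base R sketch below illustrates the general idea of a global test/production switch that is consulted while the plan is built, so the chosen value ends up inside the command string. It does not use the `downsize` package: `scale_down()`, `build_command()`, and the option name `sketch_test_mode` are hypothetical stand-ins, and the real package may work differently.

```r
# Hypothetical stand-in for downsize(): return `small` in test mode, else `big`.
scale_down <- function(big, small) {
  if (isTRUE(getOption("sketch_test_mode"))) small else big
}

# Build the command string for a target; the number is fixed at plan time.
build_command <- function() {
  paste0("long_job(number_of_samples = ", scale_down(1000, 2), ")")
}

options(sketch_test_mode = FALSE)
production_command <- build_command()  # "long_job(number_of_samples = 1000)"

options(sketch_test_mode = TRUE)
test_command <- build_command()        # "long_job(number_of_samples = 2)"

options(sketch_test_mode = NULL)       # Reset the option.
```

A switch that is read inside a function body at run time, by contrast, leaves the command and the function code unchanged, which is the situation in which a manual `remake::make("clean")` is needed.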
# Flexibility
Some workflows do not fit the rigid structure of the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) but could still benefit from the automated generation of [`remake.yml`](https://github.com/richfitz/remake) files and [Makefiles](https://www.gnu.org/software/make/). If you supply the appropriate data frames to the `targets()` function, you can customize your own analyses. Here, the `expand()` and `evaluate()` functions are essential to flexibility. The `expand()` function replicates targets generated by the same commands, and the `evaluate()` function lets you create and evaluate your own wildcard placeholders. With the `rules` argument, the `evaluate()` function is also capable of evaluating multiple wildcards in a single function call. (In this case, `rules` takes precedence, and the `wildcard` and `values` arguments are ignored.) Here are some examples.
```{r, echo = F}
library(remakeGenerator)
```
```{r}
df = commands(data = simulate(center = MU, scale = SIGMA))
df
df = expand(df, values = c("rep1", "rep2"))
df
evaluate(df, wildcard = "MU", values = 1:2)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
```
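For readers without the package installed, the following base R sketch mimics the general idea of replicating targets and then substituting several wildcards one after another. The function names (`replicate_targets()`, `substitute_wildcard()`, `apply_rules()`) are hypothetical, and the target naming scheme is an assumption; the output of `expand()` and `evaluate()` may differ in naming and ordering.

```r
# Replicate every target once per value, appending the value to the target name.
replicate_targets <- function(df, values) {
  data.frame(
    target = paste(rep(df$target, each = length(values)), values, sep = "_"),
    command = rep(df$command, each = length(values)),
    stringsAsFactors = FALSE
  )
}

# Cross the targets with the wildcard values and fill the wildcard in each command.
substitute_wildcard <- function(df, wildcard, values) {
  out <- replicate_targets(df, values)
  filled <- rep(as.character(values), times = nrow(df))
  out$command <- mapply(
    function(command, value) gsub(wildcard, value, command, fixed = TRUE),
    out$command, filled, USE.NAMES = FALSE
  )
  out
}

# Apply a named list of wildcard rules one wildcard at a time.
apply_rules <- function(df, rules) {
  for (wildcard in names(rules)) {
    df <- substitute_wildcard(df, wildcard, rules[[wildcard]])
  }
  df
}

df <- data.frame(
  target = "data",
  command = "simulate(center = MU, scale = SIGMA)",
  stringsAsFactors = FALSE
)
df <- replicate_targets(df, c("rep1", "rep2"))
plan <- apply_rules(df, list(MU = 1:2, SIGMA = c(0.1, 1)))
```

With two replicates, two values of `MU`, and two values of `SIGMA`, `plan` has eight rows, each with a fully specified command.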
For another demonstration, see the [flexible example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/flexible), which is almost the same as the [basic example](https://github.com/wlandau/remakeGenerator/tree/master/inst/examples/basic) except that it uses `expand()` and `evaluate()` explicitly.