---
title: 'Parallel Processing'
teaching: 10
exercises: 2
---
:::::::::::::::::::::::::::::::::::::: questions
- How can we build targets in parallel?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Be able to build targets in parallel
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: instructor
Episode summary: Show how to use parallel processing
:::::::::::::::::::::::::::::::::::::
```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(tarchetypes)
library(broom)
source("files/lesson_functions.R")
# Increase width for printing tibbles
options(width = 140)
```
Once a pipeline starts to include many targets, you may want to think about parallel processing.
This takes advantage of multiple processors in your computer to build multiple targets at the same time.
::::::::::::::::::::::::::::::::::::: {.callout}
## When to use parallel processing
Parallel processing should only be used if your workflow has independent tasks---if your workflow only consists of a linear sequence of targets, then there is nothing to parallelize.
Most workflows that use branching can benefit from parallelism.
:::::::::::::::::::::::::::::::::::::
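One way to check whether your workflow has independent tasks is to inspect the dependency graph: targets that do not depend on one another can be built at the same time.

```r
# Display an interactive dependency graph of the pipeline.
# Targets that do not depend on one another can run in parallel.
targets::tar_visnetwork()
```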
`targets` includes support for high-performance computing, cloud computing, and various parallel backends.
Here, we assume you are running this analysis on a laptop and so will use a relatively simple backend.
If you are interested in high-performance computing, [see the `targets` manual](https://books.ropensci.org/targets/hpc.html).
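For example, moving from a laptop to a cluster can be as simple as swapping the controller. As a purely illustrative sketch (not needed for this lesson), the [`crew.cluster`](https://wlandau.github.io/crew.cluster/) package provides controllers for schedulers such as SLURM:

```r
# Hypothetical sketch: launch workers as SLURM jobs instead of local
# processes (requires access to a SLURM cluster; not run in this lesson)
library(crew.cluster)
tar_option_set(
  controller = crew_controller_slurm(workers = 2)
)
```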
### Install R packages for parallel computing
For this demo, we will use the new [`crew` backend](https://wlandau.github.io/crew/).
::::::::::::::::::::::::::::::::::::: {.prereq}
### Install required packages
You will need to install several packages to use the `crew` backend:
```{r}
#| label: install-crew
#| eval: false
install.packages("nanonext", repos = "https://shikokuchuo.r-universe.dev")
install.packages("mirai", repos = "https://shikokuchuo.r-universe.dev")
install.packages("crew", type = "source")
```
:::::::::::::::::::::::::::::::::::::
### Set up workflow
To enable parallel processing with `crew`, you only need to load the `crew` package, then tell `targets` to use it with `tar_option_set()`.
Specifically, the following lines enable `crew` and tell it to use 2 parallel workers.
You can increase this number on more powerful machines:
```r
library(crew)
tar_option_set(
  controller = crew_controller_local(workers = 2)
)
```
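If you are not sure how many workers your machine can handle, you can check the number of available cores with base R; a common rule of thumb is to leave at least one core free for other work:

```r
# Count the processor cores available on this machine
parallel::detectCores()
```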
Make these changes to the penguins analysis.
It should now look like this:
```{r}
#| label = "example-model-show-setup",
#| eval = FALSE,
#| code = readLines("files/plans/plan_9.R")[3:42]
```
There is still one more thing we need to modify, though only for the purposes of this demo: if we ran the analysis in parallel now, you wouldn't notice any difference in compute time because the functions run so quickly.
So let's make "slow" versions of `glance_with_mod_name()` and `augment_with_mod_name()` using `Sys.sleep()`, which simply tells the computer to wait a given number of seconds.
This will simulate a long-running computation and enable us to see the difference between running sequentially and in parallel.
Add these functions to `functions.R` (you can copy-paste the original ones, then modify them):
```{r}
#| label: slow-funcs
#| eval: false
#| file:
#| - files/tar_functions/glance_with_mod_name_slow.R
#| - files/tar_functions/augment_with_mod_name_slow.R
```
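If you have not used `Sys.sleep()` before, you can see its effect directly in the R console; for example, the following call does nothing except block R for about 4 seconds:

```r
# Pause for 4 seconds, then report how long the call took
system.time(Sys.sleep(4))
```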
Then, change the plan to use the "slow" version of the functions:
```{r}
#| label = "example-model-show-9",
#| eval = FALSE,
#| code = readLines("files/plans/plan_10.R")[3:42]
```
Finally, run the pipeline with `tar_make()` as normal.
```{r}
#| label: example-model-hide-9
#| warning: false
#| message: false
#| echo: false
# FIXME: parallel code uses all available CPUs and hangs when rendering website
# with sandpaper::build_lesson(), even though it only uses 2 when run
# interactively
#
# plan_10_dir <- make_tempdir()
# pushd(plan_10_dir)
# write_example_plan("plan_9.R")
# tar_make(reporter = "silent")
# write_example_plan("plan_10.R")
# tar_make()
# popd()
# Solution for now is to hard-code output
cat("✔ skip target penguins_data_raw_file
✔ skip target penguins_data_raw
✔ skip target penguins_data
✔ skip target models
• start branch model_predictions_5ad4cec5
• start branch model_predictions_c73912d5
• start branch model_predictions_91696941
• start branch model_summaries_5ad4cec5
• start branch model_summaries_c73912d5
• start branch model_summaries_91696941
• built branch model_predictions_5ad4cec5 [4.884 seconds]
• built branch model_predictions_c73912d5 [4.896 seconds]
• built branch model_predictions_91696941 [4.006 seconds]
• built pattern model_predictions
• built branch model_summaries_5ad4cec5 [4.011 seconds]
• built branch model_summaries_c73912d5 [4.011 seconds]
• built branch model_summaries_91696941 [4.011 seconds]
• built pattern model_summaries
• end pipeline [15.153 seconds]")
```
Notice that although each individual target took about 4 seconds to build, the total time to run the entire workflow (about 15 seconds) is much less than the sum of the individual target times (about 26 seconds)! That is proof that targets are being built in parallel **and saving you time**.
The unique and powerful thing about `targets` is that **we did not need to change our custom functions to run them in parallel**. We only adjusted *the workflow*. This means it is relatively easy to refactor (modify) a workflow to run sequentially on your laptop or in parallel in a high-performance context.
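If you want to see how the work was distributed, `targets` provides `tar_crew()`, which summarizes the `crew` workers used by the pipeline:

```r
# Summarize the crew workers from the most recent run:
# one row per worker, with seconds used and targets built
tar_crew()
```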
Now that we have demonstrated how this works, you can change your analysis plan back to the original versions of the functions you wrote.
::::::::::::::::::::::::::::::::::::: keypoints
- Dynamic branching creates multiple targets with a single command
- You usually need to write custom functions so that the output of the branches includes necessary metadata
- Parallel computing works at the level of the workflow, not the function
::::::::::::::::::::::::::::::::::::::::::::::::