---
title: 'The Workflow Lifecycle'
teaching: 10
exercises: 2
---
:::::::::::::::::::::::::::::::::::::: questions
- What happens if we re-run a workflow?
- How does `targets` know what steps to re-run?
- How can we inspect the state of the workflow?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Explain how `targets` helps increase efficiency
- Be able to inspect a workflow to see what parts are outdated
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: {.instructor}
Episode summary: Demonstrate typical cycle of running `targets`: make, inspect, adjust, make...
:::::::::::::::::::::::::::::::::::::
```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(visNetwork)
source("files/lesson_functions.R")
```
## Re-running the workflow
One of the features of `targets` is that it maximizes efficiency by only running the parts of the workflow that need to be run.
This is easiest to understand by trying it yourself. Let's try running the workflow again:
```{r}
#| label: targets-run
#| echo: [5]
# Each tar_script is fresh, so need to run once to catch up to learners
pushd(make_tempdir())
write_example_plan("plan_1.R")
tar_make(reporter = "silent")
tar_make()
popd()
```
Remember how the first time we ran the pipeline, `targets` printed out a list of each target as it was being built?
This time, it tells us it is skipping those targets; they have already been built, so there's no need to run that code again.
Remember, the fastest code is the code you don't have to run!
## Re-running the workflow after modification
What happens when we change one part of the workflow then run it again?
Say that we decide the species names should be shorter.
Right now they include the common name and the scientific name, but we really only need the first part of the common name to distinguish them.
Edit `_targets.R` so that the `clean_penguin_data()` function looks like this:
```{r}
#| label: new-func
#| eval: FALSE
#| file: files/tar_functions/clean_penguin_data.R
```
Then run it again.
```{r}
#| label: targets-run-2
#| echo: [6]
plan_2_dir <- make_tempdir()
pushd(plan_2_dir)
write_example_plan("plan_1.R")
tar_make(reporter = "silent")
write_example_plan("plan_2.R")
tar_make()
popd()
```
What happened?
This time, it skipped `penguins_csv_file` and `penguins_data_raw` and only ran `penguins_data`.
Of course, since our example workflow is so short we don't even notice the amount of time saved.
But imagine using this in a series of computationally intensive analysis steps.
The ability to automatically skip steps results in a massive increase in efficiency.
::::::::::::::::::::::::::::::::::::: challenge
## Challenge 1: Inspect the output
How can you inspect the contents of `penguins_data`?
:::::::::::::::::::::::::::::::::: solution
With `tar_read(penguins_data)` or by running `tar_load(penguins_data)` followed by `penguins_data`.
::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::
## Under the hood
How does `targets` keep track of which targets are up-to-date vs. outdated?
For each target in the workflow (the items in the list at the end of the `_targets.R` file) and each custom function used in the workflow, `targets` calculates a **hash value**: a unique combination of letters and digits computed from the object's contents.
You can think of the hash value (or "hash" for short) as **a unique fingerprint** for a target or function.
The first time you run `tar_make()`, `targets` calculates the hashes for each target and function as it runs the code and stores them in the targets cache (the `_targets` folder).
Then, for each subsequent call of `tar_make()`, it calculates the hashes again and compares them to the stored values.
It detects which have changed, and this is how it knows which targets are out of date.
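To build intuition, the idea of change detection via hashing can be sketched in a few lines of base R. This is only an illustration of the concept, not how `targets` computes its hashes internally; `hash_object()` is a hypothetical helper that fingerprints an object by serializing it to disk and taking an MD5 checksum.

```r
# Illustration only: fingerprint an R object by hashing its serialized bytes.
# (targets uses its own internal hashing, not this helper.)
hash_object <- function(x) {
  f <- tempfile()
  saveRDS(x, f)             # serialize the object to a temporary file
  unname(tools::md5sum(f))  # hash the serialized bytes
}

old_hash <- hash_object(c(1, 2, 3))
new_hash <- hash_object(c(1, 2, 3))
identical(old_hash, new_hash)  # TRUE: same contents, so the step can be skipped

changed_hash <- hash_object(c(1, 2, 4))
identical(old_hash, changed_hash)  # FALSE: contents changed, so re-build
```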
:::::::::::::::::::::::::::::::::::::::: callout
## Where the hashes live
If you are curious about what the hashes look like, you can see them in the file `_targets/meta/meta`, but **do not edit this file by hand**---that would ruin your workflow!
::::::::::::::::::::::::::::::::::::::::
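A safer way to peek at this metadata is `tar_meta()`, which returns it as a data frame. The sketch below assumes it is run inside a project that already has a `_targets` store (such as the one from this lesson); the `data` column holds each target's hash.

```r
# Inspect the stored metadata as a data frame instead of opening
# _targets/meta/meta by hand. Run inside a project with a _targets store.
library(targets)

meta <- tar_meta()
meta[, c("name", "data")]  # target names alongside their hashes
```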
This information is used in combination with the dependency relationships (in other words, how each target depends on the others) to re-run the workflow in the most efficient way possible: code is only run for targets that need to be re-built, and others are skipped.
## Visualizing the workflow
Typically, you will be making edits to various places in your code, adding new targets, and running the workflow periodically.
It is good to be able to visualize the state of the workflow.
This can be done with `tar_visnetwork()`
```{r}
#| label: targets-run-hide-3
#| echo: [5]
#| results: "asis"
#| eval: FALSE
# TODO: Change #| eval to TRUE when
# https://github.com/carpentries/sandpaper/issues/443
# is resolved
pushd(plan_2_dir)
tar_visnetwork()
popd()
```
![](fig/lifecycle-visnetwork.png){alt="Visualization of the targets workflow, showing 'penguins_data' connected by lines to 'penguins_data_raw', 'penguins_csv_file' and 'clean_penguin_data'"}
You should see the network show up in the plot area of RStudio.
It is an HTML widget, so you can zoom in and out (this isn't important for the current example since it is so small, but is useful for larger, "real-life" workflows).
Here, we see that all of the targets are dark green, indicating that they are up-to-date and would be skipped if we were to run the workflow again.
::::::::::::::::::::::::::::::::::::: prereq
## Installing visNetwork
You may encounter an error message `The package "visNetwork" is required.`
In this case, install it first with `install.packages("visNetwork")`.
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: challenge
## Challenge 2: What else can the visualization tell us?
Modify the workflow in `_targets.R`, then run `tar_visnetwork()` again **without** running `tar_make()`.
What color indicates that a target is out of date?
:::::::::::::::::::::::::::::::::: solution
Light blue indicates the target is out of date.
Depending on how you modified the code, any or all of the targets may now be light blue.
::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: callout
## 'Outdated' does not always mean 'will be run'
Just because a target appears as light blue (is "outdated") in the network visualization, this does not guarantee that it will be re-built during the next run. Rather, it means that **at least one of the targets it depends on has changed**.
For example, if the workflow state looked like this:
`A -> B* -> C -> D`
where the `*` indicates that `B` has changed compared to the last time the workflow was run, the network visualization will show `B`, `C`, and `D` all as light blue.
But if re-running the workflow results in the exact same value for `C` as before, `D` will not be re-run (will be "skipped").
Most of the time, a single change will cascade to the rest of the downstream targets and cause them to be re-built, but this is not always the case. `targets` has no way of knowing ahead of time what the actual output will be, so it cannot provide a network visualization that completely predicts the future!
:::::::::::::::::::::::::::::::::::::::::::::::
## Other ways to check workflow status
The visualization is very useful, but sometimes you may be working on a server that doesn't provide graphical output, or you just want a quick textual summary of the workflow.
There are some other useful functions that can do that.
`tar_outdated()` lists only the outdated targets: those that will be checked (and possibly re-built) during the next run, or that depend on such a target.
If everything is up to date, it will return a zero-length character vector (`character(0)`).
```{r}
#| label: targets-outdated
#| echo: [2]
pushd(plan_2_dir)
tar_outdated()
popd()
```
`tar_progress()` shows the current status of the workflow as a dataframe.
You may find it helpful to further manipulate the dataframe to obtain useful summaries of the workflow, for example using `dplyr` (such data manipulation is beyond the scope of this lesson but the instructor may demonstrate its use).
```{r}
#| label: targets-progress
#| echo: [2]
pushd(plan_2_dir)
tar_progress()
popd()
```
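For example, a quick summary of how many targets fall into each status could look like the sketch below, assuming the `dplyr` package is installed (`tar_progress()` returns a data frame with `name` and `progress` columns).

```r
# Summarize workflow status by counting targets in each progress state.
# Run inside a project with a _targets store; requires dplyr.
library(targets)
library(dplyr)

tar_progress() |>
  count(progress)  # e.g. how many targets were "built" vs. "skipped"
```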
## Granular control of targets
It is possible to only make a particular target instead of running the entire workflow.
To do this, pass the name of the target you wish to build to `tar_make()` (note that any targets required by the one you specify will also be built).
For example, `tar_make(penguins_data_raw)` would **only** build `penguins_data_raw`, not `penguins_data`.
Furthermore, if you want to manually "reset" a target and make it appear out-of-date, you can do so with `tar_invalidate()`. This means that target (and any that depend on it) will be re-run next time.
Let's give this a try. Remember that our pipeline is currently up to date, so `tar_make()` will skip everything:
```{r}
#| label: targets-progress-show-2
#| eval: true
#| echo: [2]
pushd(plan_2_dir)
tar_make()
popd()
```
Let's invalidate `penguins_data` and run it again:
```{r}
#| label: targets-progress-show-3
#| eval: true
#| echo: [2, 3]
pushd(plan_2_dir)
tar_invalidate(penguins_data)
tar_make()
popd()
```
If you want to reset **everything** and start fresh, you can use `tar_invalidate(everything())` (`tar_invalidate()` [accepts `tidyselect` expressions](https://docs.ropensci.org/targets/reference/tar_invalidate.html) to specify target names).
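For instance, `tidyselect` helpers let you invalidate groups of targets by name pattern. The sketch below uses target names from this lesson's pipeline; run it inside the lesson project.

```r
# Invalidate targets by name pattern using tidyselect helpers.
library(targets)

tar_invalidate(starts_with("penguins"))  # all targets whose names start with "penguins"
tar_invalidate(everything())             # reset the entire pipeline
```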
**Caution should be exercised** when using granular methods like this, though, since you may end up with your workflow in an unexpected state. The surest way to maintain an up-to-date workflow is to run `tar_make()` frequently.
## How this all works in practice
In practice, you will likely be switching between running the workflow with `tar_make()`, loading the targets you built with `tar_load()`, and editing your custom functions by running code in an interactive R session. It takes some time to get used to it, but soon you will feel that your code isn't "real" until it is embedded in a `targets` workflow.
::::::::::::::::::::::::::::::::::::: keypoints
- `targets` only runs the steps that have been affected by a change to the code
- `tar_visnetwork()` shows the current state of the workflow as a network
- `tar_progress()` shows the current state of the workflow as a data frame
- `tar_outdated()` lists outdated targets
- `tar_invalidate()` can be used to invalidate (re-run) specific targets
::::::::::::::::::::::::::::::::::::::::::::::::