---
title: 'Working with External Files'
teaching: 10
exercises: 2
---
:::::::::::::::::::::::::::::::::::::: questions
- How can we load external data?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Be able to load external data into a workflow
- Configure the workflow to rerun if the contents of the external data change
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: instructor
Episode summary: Show how to read and write external files
:::::::::::::::::::::::::::::::::::::
```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(tarchetypes)
source("https://raw.githubusercontent.com/joelnitta/targets-workshop/main/episodes/files/functions.R?token=$(date%20+%s)") # nolint
```
## Treating external files as a dependency
Almost all workflows will start by importing data, which is typically stored as an external file.
As a simple example, let's create an external data file in RStudio with the "New File" menu option. Enter a single line of text, "Hello World", and save it as a text file named "hello.txt" in `_targets/user/data/`.
We will read in the contents of this file and store it as `some_data` in the workflow by writing the following plan and running `tar_make()`:
::::::::::::::::::::::::::::::::::::: {.callout}
## Save your progress
You can only have one active `_targets.R` file at a time in a given project.
We are about to create a new `_targets.R` file, but you probably don't want to lose your progress in the one we have been working on so far (the penguins bill analysis). You can temporarily rename that one to something like `_targets_old.R` so that you don't overwrite it with the new example `_targets.R` file below. Then rename it back to `_targets.R` when you are ready to work on it again.
:::::::::::::::::::::::::::::::::::::
```{r}
#| label: example-file-show-1
#| eval: FALSE
library(targets)
library(tarchetypes)
tar_plan(
  some_data = readLines("_targets/user/data/hello.txt")
)
```
```{r}
#| label: example-file-hide-1
#| echo: FALSE
tar_dir({
  fs::dir_create("_targets/user/data")
  writeLines("Hello World", "_targets/user/data/hello.txt")
  tar_script({
    library(targets)
    library(tarchetypes)
    tar_plan(
      some_data = readLines("_targets/user/data/hello.txt")
    )
  })
  tar_make()
})
```
If we inspect the contents of `some_data` with `tar_read(some_data)`, it will contain the string `"Hello World"` as expected.
Now suppose we edit "hello.txt" so that it reads "Hello World. How are you?". Make this change in the RStudio text editor, save the file, and run the pipeline again.
```{r}
#| label: example-file-show-2
#| eval: FALSE
library(targets)
library(tarchetypes)
tar_plan(
  some_data = readLines("_targets/user/data/hello.txt")
)
```
```{r}
#| label: example-file-hide-2
#| echo: FALSE
tar_dir({
  fs::dir_create("_targets/user/data")
  writeLines("Hello World", "_targets/user/data/hello.txt")
  tar_script({
    library(targets)
    library(tarchetypes)
    tar_plan(
      some_data = readLines("_targets/user/data/hello.txt")
    )
  })
  tar_make(reporter = "silent")
  writeLines("Hello World. How are you?", "_targets/user/data/hello.txt")
  tar_make()
})
```
The target `some_data` was skipped, even though the contents of the file changed.
That is because right now, `targets` is only tracking the **name** of the file, not its contents. To track the contents, we need a special function: `tar_file()` from the `tarchetypes` package. `tar_file()` will calculate the "hash" of a file, a digital signature that is determined by the file's contents. If the contents change, the hash will change, and this will be detected by `targets`.
```{r}
#| label: example-file-show-3
#| eval: FALSE
library(targets)
library(tarchetypes)
tar_plan(
  tar_file(data_file, "_targets/user/data/hello.txt"),
  some_data = readLines(data_file)
)
```
```{r}
#| label: example-file-hide-3
#| echo: FALSE
tar_dir({
  fs::dir_create("_targets/user/data")
  writeLines("Hello World", "_targets/user/data/hello.txt")
  tar_script({
    library(targets)
    library(tarchetypes)
    tar_plan(
      tar_file(data_file, "_targets/user/data/hello.txt"),
      some_data = readLines(data_file)
    )
  })
  tar_make(reporter = "silent")
  writeLines("Hello World. How are you?", "_targets/user/data/hello.txt")
  tar_make()
})
```
This time we see that `targets` does successfully re-build `some_data` as expected.
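To build some intuition for why a content hash catches the edit, here is a minimal sketch that uses `tools::md5sum()` from base R on a temporary file. This is only an illustration of the idea: the hashing algorithm that `targets` uses internally may differ, and the temporary path is an assumption made so the example does not touch your project.

```r
# Write a small file to a temporary location (illustrative path only)
path <- tempfile(fileext = ".txt")
writeLines("Hello World", path)
hash_before <- unname(tools::md5sum(path))

# Change the contents but keep the same file name
writeLines("Hello World. How are you?", path)
hash_after <- unname(tools::md5sum(path))

# The path is identical, but the hashes are compared on contents
same_hash <- identical(hash_before, hash_after)
same_hash

# Clean up the temporary file
unlink(path)
```

Here `same_hash` should be `FALSE`: the file name never changed, but the hash did, and that is the kind of signal a file-tracking target relies on.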
## A shortcut (or, About target factories)
However, also notice that this means we need to write two targets instead of one: one target to track the contents of the file (`data_file`), and one target to store what we load from the file (`some_data`).
It turns out that this is a common pattern in `targets` workflows, so `tarchetypes` provides a shortcut to express this more concisely, `tar_file_read()`.
```{r}
#| label: example-file-show-4
#| eval: FALSE
library(targets)
library(tarchetypes)
tar_plan(
  tar_file_read(
    hello,
    "_targets/user/data/hello.txt",
    readLines(!!.x)
  )
)
```
Let's inspect this pipeline with `tar_manifest()`:
```{r}
#| label: example-file-show-5
#| eval: FALSE
tar_manifest()
```
```{r}
#| label: example-file-hide-5
#| echo: FALSE
tar_dir({
  # Emulate what the learner is doing
  fs::dir_create("_targets/user/data")
  # Old (longer) version:
  writeLines("Hello World. How are you?", "_targets/user/data/hello.txt")
  # Make it again with the shorter version
  tar_script({
    library(targets)
    library(tarchetypes)
    tar_file_read(
      hello,
      "_targets/user/data/hello.txt",
      readLines(!!.x)
    )
  })
  tar_manifest()
})
```
Notice that even though we only specified one target in the pipeline (`hello`, with `tar_file_read()`), the pipeline actually includes **two** targets, `hello_file` and `hello`.
That is because `tar_file_read()` is a special function called a **target factory**, so-called because it makes **multiple** targets at once. One of the main purposes of the `tarchetypes` package is to provide target factories to make writing pipelines easier and less error-prone.
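As a rough sketch of what the factory expands to, the plan below writes out the two targets by hand, reusing the `tar_file()` pattern from earlier in this episode. It is meant as an approximation for intuition only: the real `tar_file_read()` may set additional options on the targets it creates.

```r
library(targets)
library(tarchetypes)

# Approximately what tar_file_read(hello, <path>, readLines(!!.x)) produces:
tar_plan(
  # Target 1: tracks the contents of the file and returns its path
  tar_file(hello_file, "_targets/user/data/hello.txt"),
  # Target 2: reads the file, depending on the file target above
  hello = readLines(hello_file)
)
```

Comparing this with the output of `tar_manifest()` can help confirm that the target names line up.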
## Non-standard evaluation
What is the deal with the `!!.x`? That may look unfamiliar even if you are used to using R. It is known as "non-standard evaluation," and gets used in some special contexts. We don't have time to go into the details now, but just remember that you will need to use this special notation with `tar_file_read()`. If you forget how to write it (this happens frequently!) look at the examples in the help file by running `?tar_file_read`.
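For the curious, here is a small base R analogy. The `!!` operator comes from the tidy evaluation framework (the `rlang` package) and injects a value into an expression before that expression is run. Base R's `bquote()` does something similar with `.( )`. This is an analogy under that assumption, not a description of the code inside `tarchetypes`; the symbol name `hello_file` is borrowed from the manifest above.

```r
# Stand-in for `.x`: a symbol naming the file target
x <- as.symbol("hello_file")

# bquote() fills in .(x) while building the expression, without running it
cmd <- bquote(readLines(.(x)))

# Show the resulting command as text
deparse(cmd)
```

The resulting expression is `readLines(hello_file)`, which is the shape of command that ends up in the second target.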
## Other data loading functions
Although we used `readLines()` as an example here, you can use the same pattern for other functions that load data from external files, such as `readr::read_csv()`, `readxl::read_excel()`, and others (for example, `read_csv(!!.x)`, `read_excel(!!.x)`, etc.).
This is generally recommended so that your pipeline stays up to date with your input data.
::::::::::::::::::::::::::::::::::::: {.challenge}
## Challenge: Use `tar_file_read()` with the penguins example
We didn't know about `tar_file_read()` yet when we started on the penguins bill analysis.
How can you use `tar_file_read()` to load the CSV file while tracking its contents?
:::::::::::::::::::::::::::::::::: {.solution}
```{r}
#| label: tar-file-read-answer-show
#| eval: FALSE
source("R/packages.R")
source("R/functions.R")
tar_plan(
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  penguins_data = clean_penguin_data(penguins_data_raw)
)
```
```{r}
#| label: tar-file-read-answer-hide
#| echo: FALSE
tar_dir({
  # New workflow
  write_example_plan(3)
  # Run it
  tar_make()
})
```
::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::
## Writing out data
Writing to files is similar to loading in files: we will use the `tar_file()` function. There is one important caveat: in this case, the second argument of `tar_file()` (the command to build the target) **must return the path to the file**. Not all functions that write files do this (some return nothing and treat the output file as a side effect of running the function), so you may need to define a custom function that writes out the file and then returns its path.
Let's do this for `writeLines()`, the R function that writes character data to a file. Normally, its output would be `NULL` (nothing), as we can see here:
```{r}
#| label: write-data-show-1
#| eval: false
x <- writeLines("some text", "test.txt")
x
```
```{r}
#| label: write-data-hide-1
#| echo: false
x <- writeLines("some text", "test.txt")
x
fs::file_delete("test.txt")
```
Here is our modified function that writes character data to a file and returns the name of the file (the `...` means "pass the rest of these arguments to `writeLines()`"):
```{r}
#| label: write-data-func
write_lines_file <- function(text, file, ...) {
  writeLines(text = text, con = file, ...)
  file
}
```
Let's try it out:
```{r}
#| label: write-data-show-2
#| eval: false
x <- write_lines_file("some text", "test.txt")
x
```
```{r}
#| label: write-data-hide-2
#| echo: false
x <- write_lines_file("some text", "test.txt")
x
fs::file_delete("test.txt")
```
We can now use this in a pipeline. For example, let's change the text to upper case and then write it out again:
```{r}
#| label: example-file-show-6
#| eval: false
library(targets)
library(tarchetypes)
tar_plan(
  tar_file_read(
    hello,
    "_targets/user/data/hello.txt",
    readLines(!!.x)
  ),
  hello_caps = toupper(hello),
  tar_file(
    hello_caps_out,
    write_lines_file(hello_caps, "_targets/user/results/hello_caps.txt")
  )
)
```
```{r}
#| label: example-file-hide-6
#| echo: false
tar_dir({
  fs::dir_create("_targets/user/data")
  fs::dir_create("_targets/user/results")
  writeLines("Hello World. How are you?", "_targets/user/data/hello.txt")
  tar_script({
    library(targets)
    library(tarchetypes)
    write_lines_file <- function(text, file, ...) {
      writeLines(text = text, con = file, ...)
      file
    }
    tar_plan(
      tar_file_read(
        hello,
        "_targets/user/data/hello.txt",
        readLines(!!.x)
      ),
      hello_caps = toupper(hello),
      tar_file(
        hello_caps_out,
        write_lines_file(hello_caps, "_targets/user/results/hello_caps.txt")
      )
    )
  })
  tar_make()
})
```
Take a look at `hello_caps.txt` in the `results` folder and verify it is as you expect.
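The same wrapper pattern should carry over to other writing functions that do not return a path. As a sketch, here is a hypothetical helper, `write_csv_file()` (a name invented for this example), built around base R's `write.csv()`. It is tried out on the built-in `mtcars` data and a temporary file so that nothing in your project is touched.

```r
# Hypothetical helper: write a data frame to CSV, then return the path
write_csv_file <- function(x, file, ...) {
  utils::write.csv(x, file = file, ...)
  file
}

# Try it outside of a pipeline with a temporary file
out_path <- write_csv_file(
  head(mtcars),
  tempfile(fileext = ".csv"),
  row.names = FALSE
)

# The return value is the path, which is what tar_file() needs
file.exists(out_path)
```

In a plan, a helper like this would be used the same way as `write_lines_file()`: as the command inside `tar_file()`, with the output path as its second argument.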
::::::::::::::::::::::::::::::::::::: {.challenge}
## Challenge: What happens to file output if it's modified?
Delete or change the contents of `hello_caps.txt` in the `results` folder.
What do you think will happen when you run `tar_make()` again?
Try it and see.
:::::::::::::::::::::::::::::::::: {.solution}
`targets` detects that `hello_caps_out` has changed (is "invalidated"), and re-runs the code to make it, thus writing out `hello_caps.txt` to `results` again.
So this way of writing out results makes your pipeline more robust: we have a guarantee that the contents of the file in `results` are generated solely by the code in your plan.
::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: keypoints
- `tarchetypes::tar_file()` tracks the contents of a file
- Use `tarchetypes::tar_file_read()` in combination with data loading functions like `read_csv()` to keep the pipeline in sync with your input data
- Use `tarchetypes::tar_file()` in combination with a function that writes to a file and returns its path to write out data
::::::::::::::::::::::::::::::::::::::::::::::::