Intro
writing a function in R
Basic function structure
name <- function(arguments (inputs)) {body, output}
return() isn't always necessary, last thing is returned
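A minimal sketch of the structure above. The names add_one and describe_sign are made up for illustration. The last evaluated expression is what comes back, so return() is mostly useful for leaving early.

```r
# Hypothetical example: the last expression is the return value
add_one <- function(x) {
  x + 1
}

# return() is handy for an early exit
describe_sign <- function(x) {
  if (x < 0) {
    return("negative")
  }
  "non-negative"
}

add_one(2)          # 3
describe_sign(-5)   # "negative"
describe_sign(4)    # "non-negative"
```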
Environments... might skip this... It's important generally, but I think it's too much.
two types of vectors
*Atomic
lgl, int, dbl, chr, complex?, raw?
factor is an INT vector with a levels attribute (labels for the integer codes)... WTF!?
"augmented vectors" with added attributes
"complex data structures"
"Homogeneous"
Restricted types. An atomic vector holds only one type of content. You cannot mix them; if you try, R coerces everything to a single type.
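A quick console sketch of the atomic types and of what "cannot mix" means in practice (the values are arbitrary and only there for illustration):

```r
typeof(TRUE)        # "logical"
typeof(1L)          # "integer"
typeof(1.5)         # "double"
typeof("a")         # "character"

# Mixing types does not error: c() coerces to the most flexible type
mixed <- c(1L, 2.5, "three")
typeof(mixed)       # "character"

# A factor is an integer vector with a "levels" attribute
f <- factor(c("low", "high", "low"))
typeof(f)           # "integer"
levels(f)           # "high" "low"
as.integer(f)       # 2 1 2
```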
NULL means the absence of an entire vector
NA means the absence of a singular value. (Each atomic type has its own NA)
from ?NA
NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_ (dbl), NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
Plain NA is the logical NA... there is no separate NA_logical_ constant for it.
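The NULL vs. NA distinction is easier to see in the console. A small sketch (the vector x is just an illustration):

```r
length(NULL)              # 0: no vector at all
length(NA)                # 1: one missing value

typeof(NA)                # "logical"
typeof(NA_integer_)       # "integer"
typeof(NA_real_)          # "double"
typeof(NA_character_)     # "character"

# NA is coerced to match the vector it is placed in
x <- c(1.5, NA)
typeof(x)                 # "double"
is.na(x)                  # FALSE TRUE

# NULL simply vanishes when combined
c(1, NULL, 2)             # 1 2
```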
*Lists
Can be "heterogeneous"
Can contain anything... Even more lists.
subsetting
`[[`
`[`
`$`
Interesting... maybe skip this too
Except the image of a list could be useful in explaining what a data.frame/tibble is
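If subsetting stays in, a short sketch like this might be enough. The list x and the data frame df are invented for illustration; the point is that a data.frame is a list of equal-length columns, so the same three operators apply.

```r
x <- list(a = 1:3, b = "text", c = list(d = TRUE))

x["a"]        # `[` returns a smaller list (still a list, length 1)
x[["a"]]      # `[[` pulls out the element itself: 1 2 3
x$a           # `$` is shorthand for x[["a"]]
x$c$d         # nested lists can be chained: TRUE

# A data.frame is a list of columns, so the same tools work
df <- data.frame(a = 1:3, b = c(4, 5, 6))
is.list(df)   # TRUE
df[["b"]]     # the column as a plain vector: 4 5 6
df["b"]       # a one-column data.frame
```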
A lot of what we do as humans is break down complex problems into pieces that can be repeated. This increases efficiency and allows us to focus on more complex problems.
R lets you do things in many different ways
Package system allows this to be extended in infinite ways.
Iteration in R
Iteration helps us deal with these repetitive tasks.
For Loops
output #define this to store your result
for (sequence) {
body
output # if you just print, then the output is printed to the screen
}
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
For Loop
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
Same thing with seq_along
Why is this safer? Is this relevant to this audience? Mention, but gloss over?
?base::seq_along or ?seq
for (i in seq_along(df)) {
print(median(df[[i]]))
}
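One way to show why seq_along is safer is the zero-length case. A sketch using an empty data frame (the name empty is just for illustration):

```r
empty <- data.frame()     # zero columns

1:ncol(empty)             # 1 0: a loop over this runs twice and then fails on empty[[1]]
seq_along(empty)          # integer(0): a loop over this runs zero times

n_iterations <- 0
for (i in seq_along(empty)) {
  n_iterations <- n_iterations + 1
}
n_iterations              # 0
```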
Why write a function?
Copying and pasting code is fine
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$a, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
This is a legitimate approach, but it is error prone (the df$c line above subtracts min(df$a) by mistake) and makes it hard to see what you're doing.
When to write a function?
Hadley's advice: write a function once you have copied and pasted a block twice (i.e. you have 3 copies that differ only in the variables)
rescale01 <- function(x) {
  # use the argument x everywhere, not a hard-coded column
  (x - min(x, na.rm = TRUE)) /
    (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
Then:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Do you see the repetition here?
Tidyverse: purrr map/walk family
map allows us to do this
map(df, rescale01)
Also, base::lapply
lapply(df, rescale01)
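Worth noting on the slide: both calls return a list, not a data frame. A base R sketch with a small made-up df (purrr::map(df, rescale01) should give the same list as the lapply() call here):

```r
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) /
    (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Small illustrative data frame (values chosen so the result is easy to check)
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))

rescaled <- lapply(df, rescale01)   # a list with one element per column
class(rescaled)                     # "list"

# Assigning into df[] keeps the data.frame structure
df[] <- lapply(df, rescale01)
df$a                                # 0.0 0.5 1.0
```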
data.table (don't know how to do this, but it's possible)
**summary of how you should write a function**
Simplify your problem (do one thing really well)
Make a working snippet
Convert snippet to use a generic input (function arguments)
simplify if you can
Make this a function: name <- function(arguments) {}
She uses the code from above to prove this point. Perhaps we could skip over this or scroll through it.
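If we do walk through it, the steps could be sketched like this, reusing the rescaling example. The data values are made up, and range() is one possible simplification, not the only one.

```r
# 1. Working snippet on one concrete column
df <- data.frame(a = c(2, 4, 10))
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))

# 2. Same snippet with a generic input
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

# 3. Simplify: range() computes min and max in one call
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])

# 4. Wrap it in a function
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(2, 4, 10))   # 0.00 0.25 1.00
```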
Good function is
Correct and understandable (to humans as well as the computer)
I think this is probably skippable so we can get to map/walk sooner.
naming principles... Definitely skippable or at least a single slide with best practices.
styles vary
Tidy style =
all lower case
sep_by_low_bar
Be consistent
function name should say what it does (why not a verb)
argument name should be a noun (things)
don't overwrite (clobber) existing functions or names
argument order: data first, then details (parameters)
Data first is because tidyverse functions assume this standard!
Pick reasonable default values where appropriate
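For the single best-practices slide, a sketch that applies the conventions above. rescale_range and its lower/upper arguments are hypothetical names made up for this example.

```r
# snake_case name that says what it does; data first, details after, with defaults
rescale_range <- function(x, lower = 0, upper = 1, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

rescale_range(c(1, 2, 3))               # 0.0 0.5 1.0
rescale_range(c(1, 2, 3), upper = 10)   # 0 5 10
```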
Uses cupcake concept
for loop is cupcake recipes for vanilla and chocolate
code is super repeated
makes it hard to see what is important and what is different between things (very little)
shows two for loops
out1 <- vector("double", ncol(mtcars)) #variable of the right type to store results
for (i in seq_along(mtcars)) {
out1[i] <- mean(mtcars[[i]], na.rm = TRUE)
}
out2 <- vector("double", ncol(mtcars)) #variable of the right type to store results
for (i in seq_along(mtcars)) {
out2[i] <- median(mtcars[[i]], na.rm = TRUE)
}
functions can be arguments just like anything else!
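The two loops above differ only in the function they call, so that function can become an argument. A sketch (col_summary is a name chosen for illustration):

```r
# fun is any function that takes a vector and na.rm, e.g. mean or median
col_summary <- function(df, fun) {
  out <- vector("double", ncol(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]], na.rm = TRUE)
  }
  out
}

means   <- col_summary(mtcars, mean)
medians <- col_summary(mtcars, median)
```

This is essentially what map_dbl(mtcars, mean) does for us, which makes it a natural bridge to purrr.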
Steph suggested a read in documents example with map_df
http://data.library.virginia.edu/getting-started-with-the-purrr-package-in-r/
Mix with this your new readtxtfromfolder2() function to import filenames and timestamps as columns
Jenny has a huge purrr site
https://jennybc.github.io/purrr-tutorial/
Specific page mentioning how you can use the various functions
https://jennybc.github.io/purrr-tutorial/ls03_map-function-syntax.html
What else have I used it for?
Nested datasets, iterate over them and do stuff.
Iterate over them and plot
iterate over them and print a paragraph about the plot and its data in an RmD
Should mention how you use map, map2, pmap
In the formula shortcut: map = . (or .x), map2 = .x and .y, pmap = ..1, ..2, ..3, etc...
This was difficult for me to figure out, spend some time on it.
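A side-by-side sketch of the three placeholder conventions, assuming purrr is installed. The input numbers are arbitrary.

```r
library(purrr)

# map(): one input, referred to as . or .x
map_dbl(c(1, 2, 3), ~ .x * 2)                      # 2 4 6

# map2(): two inputs iterated in parallel, .x and .y
map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)     # 11 22 33

# pmap(): any number of inputs in one list, ..1, ..2, ..3, ...
pmap_dbl(list(c(1, 2), c(10, 20), c(100, 200)),
         ~ ..1 + ..2 + ..3)                        # 111 222
```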
Datacamp functional programming...
A data frame is a list of equal-length vectors (its columns)
map() iterates over this list
Show a list that contains numeric vectors first
l = list(a=1:10, b = 10:100)
map(l, function(x) mean(x, na.rm = TRUE))
map can iterate over an atomic vector (simplest case? show first?)
vec <- c(a=2, c= 5)
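A sketch of both cases, assuming purrr is installed. Squaring the elements of vec is just an arbitrary operation to show the iteration.

```r
library(purrr)

l <- list(a = 1:10, b = 10:100)
map(l, function(x) mean(x, na.rm = TRUE))   # a list: $a is 5.5, $b is 55
map_dbl(l, mean)                            # same values as a named double vector

vec <- c(a = 2, c = 5)
map_dbl(vec, function(x) x^2)               # a = 4, c = 25
```

map() always returns a list; the typed variants (map_dbl, map_chr, ...) return an atomic vector and keep the names.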
handy shortcuts listed as a positive, but really only useful if you "get it"
The map functions use the ... ("dot dot dot") argument to pass along additional arguments to .f each time it’s called. For example, we can pass the trim argument to the mean() function:
map_dbl(df, mean, trim = 0.5)
Multiple arguments can be passed along using commas to separate them. For example, we can also pass the na.rm argument to mean():
map_dbl(df, mean, trim = 0.5, na.rm = TRUE)
You don't have to specify the arguments by name, but it is good practice!
You may be wondering why the arguments to map() are .x and .f and not x and f? It's because .x and .f are very unlikely to be argument names you might pass through the ..., thereby preventing confusion about whether an argument belongs to map() or to the function being mapped.
supplying as a formula
~ sum(is.na(.)) The . takes the place of the argument. (I hate this. It's not clear.)
map2(.x, .y, .f, ...)
pmap(.l, .f, ...)
with pmap, you can name the parts directly to what they should be in the function.
example on datacamp is with rnorm(n, mean, sd)
list1 <- list(n = list(5,10,20),
mean = list(5,10, 20),
sd = list(0.1, 0.2, 0.3)
)
pmap(list1, rnorm)
because they are named the same I assume you can mix them up and they go in the right place (nope! POSITIONAL MATCHING IF NAMES ARE NOT MATCHED)
Compare the following two calls to pmap() (run them in the console and compare their output too!):
pmap(list(n, mu, sd), rnorm)
pmap(list(mu, n, sd), rnorm)
What's the difference? By default pmap() matches the elements of the list to the arguments in the function by position. In the first case, n to the n argument of rnorm(), mu to the mean argument of rnorm(), and sd to the sd argument of rnorm(). In the second case mu gets matched to the n argument of rnorm(), which is clearly not what we intended!
Instead of relying on this positional matching, a safer alternative is to provide names in our list. The name of each element should be the argument name we want to match it to.
pmap(list(mean=mu, n=n, sd=sd), rnorm) #name of argument comes first, then list that you're pulling in.
if they are not named, they should go in order? (yes, positional)
finally, you can call them via the function call like this:
pmap(list(mu, n, sd), function(mu, n, sd) rnorm(n, mu, sd))
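The calls above assume n, mu and sd already exist. A runnable sketch with made-up values, assuming purrr is installed; the lengths of the results make the positional mix-up visible.

```r
library(purrr)

# Illustrative inputs (not from the original exercise)
n  <- list(1, 2, 3)
mu <- list(0, 10, 100)
sd <- list(0.1, 0.1, 0.1)

by_position <- pmap(list(n, mu, sd), rnorm)                # n -> n, mu -> mean, sd -> sd
swapped     <- pmap(list(mu, n, sd), rnorm)                # mu is used as n!
by_name     <- pmap(list(mean = mu, n = n, sd = sd), rnorm)

lengths(by_position)   # 1 2 3
lengths(swapped)       # 0 10 100
lengths(by_name)       # 1 2 3
```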
Shit... invoke_map(.f, .x, ...)
reverses the order of the call. .f is first and is a list of function names, .x is a list of the arguments to the various functions (if they differ), if they don't differ you can pass that through with `...`.
Sometimes it's not the arguments to a function you want to iterate over, but a set of functions themselves. Imagine that instead of varying the parameters to rnorm() we want to simulate from different distributions, say, using rnorm(), runif(), and rexp(). How do we iterate over calling these functions?
In purrr, this is handled by the invoke_map() function. The first argument is a list of functions. In our example, something like:
f <- list("rnorm", "runif", "rexp")
The second argument specifies the arguments to the functions. In the simplest case, all the functions take the same argument, and we can specify it directly, relying on ... to pass it to each function. In this case, call each function with the argument n = 5:
invoke_map(f, n = 5)
In more complicated cases, the functions may take different arguments, or we may want to pass different values to each function. In this case, we need to supply invoke_map() with a list, where each element specifies the arguments to the corresponding function.
Let's use this approach to simulate three samples from the following three distributions: Normal(10, 1), Uniform(0, 5), and Exponential(5).
# Define list of functions
f <- list("rnorm", "runif", "rexp")
# Parameter list for rnorm()
rnorm_params <- list(mean = 10)
# Add a min element with value 0 and max element with value 5
runif_params <- list(min = 0, max = 5)
# Add a rate element with value 5
rexp_params <- list(rate = 5)
# Define params for each function
params <- list(
rnorm_params,
runif_params,
rexp_params
)
# Call invoke_map() on f supplying params as the second argument
invoke_map(f, params, n = 5)
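A caveat for the talk: to my understanding, invoke_map() is deprecated in purrr 1.0.0 and later. One possible replacement (a sketch, not necessarily the approach the purrr authors recommend) is map2() over the functions and their argument lists, with base R's do.call() making each call:

```r
library(purrr)

f <- list("rnorm", "runif", "rexp")
params <- list(
  list(mean = 10),
  list(min = 0, max = 5),
  list(rate = 5)
)

# For each function name and its argument list, call it with n = 5 added
sims <- map2(f, params, function(fn, args) do.call(fn, c(list(n = 5), args)))
lengths(sims)   # 5 5 5
```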
Walk/walk2/pwalk
Same concept but output is a "side effect" = plot, print...
The return value of a walk function is the input so you can output something and then continue to do something with it, in a chain.
Hadley's advice is to keep chains fewer than 10 verbs long. Typically for me a walk is the final thing I want done.
Charlotte's example is that of printing a list one by one and then measuring its length with map_dbl()
# Define list of functions
f <- list(Normal = "rnorm", Uniform = "runif", Exp = "rexp")
# Define params
params <- list(
Normal = list(mean = 10),
Uniform = list(min = 0, max = 5),
Exp = list(rate = 5)
)
# Assign the simulated samples to sims
sims <- invoke_map(f, params, n = 50)
# Use walk() to make a histogram of each element in sims
walk(sims, hist)
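The "walk returns its input" point can be sketched with the print-then-measure example mentioned above. The list x is made up; this assumes purrr is installed (it re-exports the %>% pipe).

```r
library(purrr)

x <- list(1:3, letters[1:5])

# walk() prints each element as a side effect, then hands x on unchanged
lens <- x %>%
  walk(print) %>%
  map_dbl(length)

lens   # 3 5
```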