-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter6.Rmd
738 lines (566 loc) · 26.4 KB
/
chapter6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
---
title: "Rethinking Chapter 6"
author: "Gregor Mathes"
date: "2021-02-02"
slug: Rethinking Chapter 6
categories: []
tags: [Rethinking, Bayes, Statistics]
subtitle: ''
summary: 'This is the fifth part of a series where I work through the practice questions of the second edition of Richard McElreaths Statistical Rethinking'
authors: [Gregor Mathes]
lastmod: '2021-02-02T12:07:04+02:00'
featured: no
projects: [Rethinking]
output:
html_document:
toc: true
toc_depth: 1
number_sections: true
fig_width: 6
mathjax: "default"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.dim=c(7,4), warning=FALSE, message = FALSE)
library(tidyverse)
library(rethinking)
library(ggdag)
library(dagitty)
map <- purrr::map
```
# Introduction
This is the fifth part of a series where I work through the practice questions of the second edition of Richard McElreaths [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/).
Each post covers a new chapter and you can see the posts on previous chapters [here](https://gregor-mathes.netlify.app/tags/rethinking/).
You can find the the lectures and homework accompanying the book [here](https://github.com/rmcelreath/stat_rethinking_2020%3E).
The colours for this blog post are:
```{r colour setup}
coral <- "#CD7672"
mint <- "#138086"
purple <- "#534666"
yellow <- "#EEb462"
```
```{r colour plot, echo=FALSE}
tibble(colours = c(coral, mint, purple, yellow),
colourname = c("coral", "mint", "purple", "yellow")) %>%
arrange(colours) %>%
mutate(colourname = fct_reorder(colourname, colours),
colourname = paste0(colours, "/ ", colourname)) %>%
ggplot() +
geom_bar(aes(y = colours, fill = colours)) +
scale_fill_identity() +
geom_text(aes(x = 0.5, y = colours, label = colourname),
size = 6, colour = "white") +
theme_void()
```
The online version of the *Statistical Rethinking*, provided by the brilliant [Erik Kusch](https://www.erikkusch.com/), is missing a lot of practive questions, so I will focus on the examples from the print version here.
# Easy practices
## Question 6E1
**List three mechanisms by which multiple regression can produce false inferences about causal effect.**
The tree examples mentioned throughout the chapter were:
1. Collinearity
2. Post-treatment bias
3. Collider bias
## Question 6E2
**For one of the mechanisms in the previous problem, provide an example of your choice, perhaps from your own research.**
- Collinearity
If you want to estimate the effect of geographic range on the extinction risk organism in the fossil record, you can choose between a range of potential parameters that express geographic range. For example, you can use the convex hull area or the maximum pairwise great circle distance. However, if you add both parameters in a model their true magnitude of association to extinction
risk is lowered or even hidden, as they both encapsulate the same information.
- Post-treatment bias
Assume you want to estimate the effect of global mean temperature on the extinction risk of marine species in the fossil record. Additionally, you have an amazing data set on continental shelve area through time and would love to include that as well. However, temperature is quite likely causally related to shelve area as it drives eustatic sea level. So including shelve area in a model would shut the path between temperature and extinction risk, even though there is a real causal association.
- Collider bias
If anyone has a good example for a collider bias in palaeobiology, just message me on twitter (@GregorMathes).
## Question 6E3
**List the four elemental confounds. Can you explain the conditional dependencies of each?**
1. Pipe
```{r 6E3 part 1}
dagitty("dag{
X -> Y -> Z}") %>%
impliedConditionalIndependencies()
```
If we condition on `Y`, we shut the path between `X` and `Z`.
2. Fork
```{r 6E3 part 2}
dagitty("dag{
X <- Y -> Z}") %>%
impliedConditionalIndependencies()
```
If we condition on `Y`, then learning `X` tells us nothing about `Z`. All the information is in `Y`.
3. Collider
```{r 6E3 part 3}
dagitty("dag{
X -> Y <- Z}") %>%
impliedConditionalIndependencies()
```
`X` is independent of `Z`. But conditioning on `Y` would open the path, and then `X`
would be dependent on `Z` conditional on `Y`.
4. Descendant
```{r 6E3 part 4}
dagitty("dag{
X -> Y -> Z
Y -> W}") %>%
impliedConditionalIndependencies()
```
This is interesting. In the chapter, it says that if we would condition on `W`,
we would condition on `Y` as well (to a lesser extent). So I would have expected
X _||_ Z | W here.
## Question 6E4
**How is a biased sample like conditioning on a collider? Think of the example at the open of the chapter.**
Assume the collider X -> Y <- Z.Conditioning on a collider `Y` opens a path between `X` and `Z`, and leads to a spurious correlation between between these. This is similar to selection bias, where the researcher that sampled the data (or nature itself) cared about both `X` and `Z` when generating the sample.
# Medium practices
## Question 6M1
**Modify the DAG on page 190 (page 186 in print) to include the variable V, an unobserved cause of C and Y: C <- V -> Y. Reanalyze the DAG. How many paths connect X to Y? Which must be closed? Which variables should you condition on now?**
Let's update the DAG using the `ggdag` package:
```{r 6M1 part 1}
tribble(
~ name, ~ x, ~ y,
"A", 1, 3,
"U", 0, 2,
"C", 2, 2,
"V", 3, 1,
"B", 1, 1,
"X", 0, 0,
"Y", 2, 0
) %>%
dagify(
Y ~ X + C + V,
C ~ V + A,
B ~ C + U,
U ~ A,
X ~ U,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(internal_colour = mint, alpha = 0.8, colour = "white") +
geom_dag_text(aes(label = name), color = purple, size = 5) +
geom_dag_edges(edge_color = purple) +
labs(caption = "Figure 1: Adjusted DAG from the chapter, including V.") +
theme_void()
```
For the visual representations of the DAGs later on in this post, I will not show the code to generate the DAG plots to reduce the visual load on the reader. Anyways, you can look at the Rmd file with the raw code [here](https://github.com/Ischi94/statistical-rethinking/blob/master/chapter6.Rmd).
So let's first identify each path from `X` to `Y`.
(1) X -> Y
(2) X <- U <- A -> C -> Y
(3) X <- U <- A -> C <- V -> Y
(4) X <- U -> B <- C -> Y
(5) X <- U -> B <- C <- V -> Y
C is now a collider in path (3) and the path is closed. Additionally, both paths passing through B are closed as well as `B` stays a collider. The only open backdoor path is (2), and to close this path we can condition on `A`.
## Question 6M2
**Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model. To highlight this, consider the DAG X -> Z -> Y. Simulate data from this DAG so that the correlation between X and Z is very large. Then include both in a model prediction Y. Do you observe any multicollinearity? Why or why not? What is different from the legs example in the chapter?**
The simulation is rather simple. We let `Z` be completely dependent on `X` with just a little bit of noise, and render `Y` then completely dependent on `Z`. Note that I standardize each value to choose the priors more easily.
```{r 6M2 part 1}
sim_dat <- tibble(x = rnorm(1e4),
z = rnorm(1e4, x, 0.5),
y = rnorm(1e4, z)) %>%
mutate(across(everything(), standardize))
```
The correlation between `X` and `Z` is hence really large as all information flows through this path.
```{r 6M2 part 2}
sim_dat %>%
summarise(correlation = cor(x, z))
```
Now let's fit a model using quadratic approximation and plot the coefficients:
```{r 6M2 part 3}
alist(y ~ dnorm(mu, sigma),
mu <- a + bx*x + bz*z,
a ~ dnorm(0, 0.2),
c(bx, bz) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = sim_dat) %>%
precis() %>%
as_tibble(rownames = "estimate") %>%
filter(estimate %in% c("bx", "bz")) %>%
rename(lower_pi = `5.5%`, upper_pi = `94.5%`) %>%
ggplot() +
geom_linerange(aes(xmin = lower_pi, xmax = upper_pi, y = estimate),
size = 1.5, colour = mint) +
geom_point(aes(x = mean, y = estimate),
shape = 21, colour = "grey20", stroke = 1,
size = 5, fill = purple) +
labs(y = NULL, x = "Estimate", caption = "Figure 2: Coefficient plot for simulated data with
collinearity.") +
theme_minimal()
```
Not a big surprise, the effect of `X` is completely hidden as we condition on `Z` in a pipe. There is nothing new about `X` once the model knows `Z`. But to know that `Z` is only a mediatory variable, we need a causal model expressed in a DAG. The difference to the legs example here is that `X` and `Z` are causally related as `X` causes `Z`. It is therefore an example for a post-treatment bias. The leg lengths, on the other hand, where not causing each other, but were caused by a common parent instead:
leg1 <- Parent -> leg2
## Question 6M3
**Learning to analyze DAGs requires practice. For each of the four DAGs below, state which variables, if any, you must adjust for (condition on) to estimate the total causal influence of X on Y.**
```{r 6M3 part 1, echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"A", 2, 1,
"Z", 1, 1,
"X", 0, 0,
"Y", 2, 0
) %>%
dagify(
X ~ Z,
Z ~ A,
Y ~ X + Z + A,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(internal_colour = mint, alpha = 0.8, colour = "white") +
geom_dag_text(aes(label = name), color = purple, size = 5) +
geom_dag_edges(edge_color = purple) +
labs(caption = "Figure 3: Directed acyclic graph number one.") +
theme_void()
```
The are two backdoor paths, X <- Z -> Y and X <- Z <- A -> Y.
Both are open and go through `Z`, so we can simply condition on `Z`.
```{r 6M3 part 2, echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"A", 2, 1,
"Z", 1, 1,
"X", 0, 0,
"Y", 2, 0
) %>%
dagify(
Z ~ A + X,
Y ~ X + Z + A,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(internal_colour = coral, alpha = 0.8, colour = "white") +
geom_dag_text(aes(label = name), color = purple, size = 5) +
geom_dag_edges(edge_color = purple) +
labs(caption = "Figure 4: Directed acyclic graph number two.") +
theme_void()
```
There are two backdoor paths, X -> Z -> Y and X -> Z <- A -> Y.
We want to keep X -> Z -> Y as it encapsulates information flowing from `X` to `Y` (and we
want to capture the total causal influence).
The second path is closed as `Z` is a collider here.
We don't need to adjust for anything.
But note that conditioning on `Z` would open the second path and would be a mistake as it would create association between `X` on `A`.
```{r 6M3 part 3, echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"A", 0, 1,
"Z", 1, 1,
"X", 0, 0,
"Y", 2, 0
) %>%
dagify(
X ~ A,
Z ~ A + X + Y,
Y ~ X,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(internal_colour = yellow, alpha = 0.8, colour = "white") +
geom_dag_text(aes(label = name), color = purple, size = 5) +
geom_dag_edges(edge_color = purple) +
labs(caption = "Figure 5: Directed acyclic graph number three.") +
theme_void()
```
There are two backdoor paths: X <- A -> Z <- Y and X -> Z <- Y.
Both paths are closed as `Z` is a collider for both.
We don't need to condition on anything.
Again, inluding `Z` in a model would create spurios relationships that we want to avoid.
```{r 6M3 part 4, echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"A", 0, 1,
"Z", 1, 1,
"X", 0, 0,
"Y", 2, 0
) %>%
dagify(
X ~ A,
Z ~ A + X,
Y ~ X + Z,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(colour = purple) +
geom_dag_text(aes(label = name), color = "white", size = 5) +
geom_dag_edges(edge_color = coral) +
labs(caption = "Figure 6: Directed acyclic graph number four.") +
theme_void()
```
There are two backdoor paths: X <- A -> Z -> Y and X -> Z -> Y.
We want to keep the second one but close the first one. For this, we can condition on `A`. Conditioning on `Z` would close the first path as well, but also the second one which we want to keep as true causal.
# Hard practices
## Question 6H1
**Use the Waffle House data, data(WaffleDivorce), to find the total causal influence of number of Waffle Houses on divorce rate. Justify your model or models with a causal graph.**
Let's load the data, standardise predictor and outcome and assign better names. We further transform south to an index variable: 1 is coded as not belonging to south, and 2 as belonging to south.
```{r 6H1 part 1}
data("WaffleDivorce")
dat_waffle <- WaffleDivorce %>%
as_tibble() %>%
select(divorce = Divorce, age = MedianAgeMarriage,
m_rate = Marriage, waffle = WaffleHouses,
south = South) %>%
mutate(across(-south, standardize),
south = south + 1)
```
I assume that `age` (medium age at marriage) influence both `divorce` (divorce rate) and `m_rate` (marriage rate). This is something we could see in previous chapter. I further expect that `south` (southerness) influences both `age` and `waffle` (number of Waffle Houses).
```{r 6H1 part 2, echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"south", 0, 2,
"age", 1, 2,
"m_rate", 2, 1,
"waffle", 0, 0,
"divorce", 1, 0
) %>%
dagify(
divorce ~ age + waffle,
waffle ~ south,
age ~ south,
m_rate ~ age,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(colour = purple, shape = 1, size = 20) +
geom_dag_text(aes(label = name), color = "black", size = 4) +
geom_dag_edges(edge_color = coral) +
labs(caption = "Figure 7: The total causal influence of number of Waffle Houses on divorce rate.") +
theme_void()
```
We can see that there is one open backdoor path from `waffle` <- `south` -> `age` -> `divorce`. We can close this path by conditioning on `south`.
```{r 6H1 part 3}
m_waffle <- alist(
divorce ~ dnorm(mu, sigma),
mu <- a[south] + Bwaffle*waffle,
a[south] ~ dnorm(0, 0.5),
Bwaffle ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle)
```
We can see that there is no information in `waffle` about `divorce`, that is not already in `south`.
```{r 6H1 part 4, warning=FALSE, fig.width=1}
m_waffle %>%
precis() %>%
as_tibble(rownames = "estimate") %>%
filter(estimate == "Bwaffle") %>%
mutate(across(where(is.numeric), round, digits = 2)) %>%
knitr::kable()
```
As we will look at coefficient estimates quite often in the hard practices, I will wrap up the code above into a helper function.
```{r 6H1 part 5}
test_independence <- function(model.input, coeff.input, depth.output = 1){
suppressWarnings(
precis(model.input, depth = depth.output) %>%
as_tibble(rownames = "estimate") %>%
filter(estimate %in% {{coeff.input}}) %>%
mutate(across(where(is.numeric), round, digits = 2)) %>%
knitr::kable()
)
}
```
We can visualise the association as well:
```{r 6H1 part 6}
s <- seq(from = -2, to = 2, length.out = 30)
N <- 1e3
m_waffle %>%
link(data = list(waffle = s,
south = 1)) %>%
as_tibble() %>%
pivot_longer(cols = everything(), values_to = "pred_divorce") %>%
add_column(waffle = rep(s, N)) %>%
group_by(waffle) %>%
nest() %>%
mutate(pred_divorce = map(data, "pred_divorce"),
mean_divorce = map_dbl(pred_divorce, mean),
pi = map(pred_divorce, PI),
lower_pi = map_dbl(pi, pluck(1)),
upper_pi = map_dbl(pi, pluck(2))) %>%
select(waffle, mean_divorce, lower_pi, upper_pi) %>%
ggplot() +
geom_ribbon(aes(waffle, ymin = lower_pi, ymax = upper_pi),
fill = mint, alpha = 0.3) +
geom_line(aes(waffle, mean_divorce),
size = 1.5, colour = coral) +
coord_cartesian(ylim = c(-2, 2)) +
labs(x = "Waffle (std)", y = "Divorce (std)",
caption = "Figure 8: Total causal effect of number of Waffle House on Divorce,
while keeping 'South' constant.") +
theme_minimal()
```
## Question 6H2
**Build a series of models to test the implied conditional independencies of the causal graph you used in the previous problem. If any of the tests fail, how do you think the graph needs to be amended? Does the graph need more or fewer arrows? Feel free to nominate variables that are not in the data.**
Let's first check the implied conditional independencies.
```{r 6H2 part 1}
dagitty("dag{
waffle -> divorce <- age <- south -> waffle
age -> m_rate}") %>%
impliedConditionalIndependencies()
```
Let's start with age _||_ waffle | south
```{r 6H2 part 2}
alist(
waffle ~ dnorm(mu, sigma),
mu <- a[south] + Bage*age,
a[south] ~ dnorm(0, 0.5),
c(Bage) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(coeff.input = "Bage")
```
`age` is independent of `waffle` after conditioning on `south`, as the posterior `Bage` shows no consistent association with `waffle`.
******
divorce _||_ m_rate | age
```{r 6H2 part 3}
alist(
m_rate ~ dnorm(mu, sigma),
mu <- a + Bdivorce*divorce + Bage*age,
a ~ dnorm(0, 0.2),
c(Bdivorce, Bage) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(coeff.input = "Bdivorce")
```
`divorce` is independent from `m_rate` after conditioning on `age`.
******
divorce _||_ south | age, waffle
```{r 6H2 part 4}
alist(
divorce ~ dnorm(mu, sigma),
mu <- a[south] + Bage*age + Bwaffle*waffle,
a[south] ~ dnorm(0, 0.5),
c(Bage, Bwaffle) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
```
`south` is independent from `divorce` after conditioning on `age` `and waffle`.
******
m_rate _||_ south | age
```{r 6H2 part 5}
alist(
m_rate ~ dnorm(mu, sigma),
mu <- a[south] + Bage*age,
a[south] ~ dnorm(0, 0.5),
Bage ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(depth.output = 2, coeff.input = c("a[1]", "a[2]"))
```
`south` is independent from `m_rate` after conditioning on `age`.
******
m_rate _||_ waffle | south
```{r 6H2 part 6}
alist(
waffle ~ dnorm(mu, sigma),
mu <- a[south] + Bm_rate*m_rate,
a[south] ~ dnorm(0, 0.5),
Bm_rate ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(coeff.input = "Bm_rate")
```
`m_rate` is independent from `waffle` after conditioning on `south`.
******
m_rate _||_ waffle | age
```{r 6H2 part 7}
alist(
waffle ~ dnorm(mu, sigma),
mu <- a + Bm_rate*m_rate + Bage*age,
a ~ dnorm(0, 0.2),
c(Bm_rate, Bage) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_waffle) %>%
test_independence(coeff.input = "Bm_rate")
```
`m_rate` is independent from `waffle` after conditioning on `age`.
## Data foxes
**All three problems below are based on the same data. The data in data(foxes) are 116 foxes from 30 different urban groups in England. These foxes are like street gangs. Group size varies from 2 to 8 individuals. Each group maintains its own (almost exclusive) urban territory. Some territories are larger than others. The area variable encodes this information. Some territories also have more avgfood than others. We want to model the weight of each fox. For the problems below, assume this DAG:**
```{r data foxes,echo=FALSE}
tribble(
~ name, ~ x, ~ y,
"area", 1, 2,
"avgfood", 0, 1,
"groupsize",2, 1,
"weight", 1, 0
) %>%
dagify(
avgfood ~ area,
groupsize ~ avgfood,
weight ~ avgfood + groupsize,
coords = .) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_node(internal_colour = coral, size = 20,
alpha = 0.8, colour = "white") +
geom_dag_text(aes(label = abbreviate(name)), color = purple, size = 5) +
geom_dag_edges(edge_color = mint) +
labs(caption = "Figure 9: Directed acyclic graph for the foses data.") +
theme_void()
```
## Question 6H3
**Use a model to infer the total causal influence of area on weight . Would increasing the area available to each fox make it heavier (healthier)? You might want to standardize the variables. Regardless, use prior predictive simulation to show that your model’s prior predictions stay within the possible outcome range.**
There are two paths from `area` to `weight`:
(1) area -> avgfood -> weight
(2) area -> avgfood -> groupsize -> weight
Both of these paths are causal and we want to have them in our model. Note that we don't have any backdoor paths and hence a simple linear regression with a single predictor is sufficient.
Let's start by loading the data and standardizing all variables but group, which is a dummy variable.
```{r 6H3 part 1}
data("foxes")
dat_foxes <- foxes %>%
as_tibble() %>%
mutate(across(-group, standardize))
```
Now we are ready to call the model.
```{r 6H3 part 2}
m_foxes1 <- alist(weight ~ dnorm(mu, sigma),
mu <- a + Barea*area,
a ~ dnorm(0, 0.2),
Barea ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes)
```
Using prior predictive simulation as discussed in previous chapters, we can check if our priors are falling outside a realistic range.
```{r 6H3 part 3}
m_foxes1 %>%
extract.prior() %>%
link(m_foxes1, post = .,
data = list(area = seq(-2, 2, length.out = 20))) %>%
as_tibble() %>%
pivot_longer(cols = everything(), values_to = "weight") %>%
add_column(
area = rep(seq(-2, 2, length.out = 20), 1000),
type = rep(as.character(1:1000), each = 20)) %>%
ggplot() +
geom_line(aes(x = area, y = weight, group = type),
alpha = 0.1, colour = coral) +
labs(x = "Area (std)", y = "Weight (std)",
caption = "Figure 9: Prior predictive simulation for standardised area on weight.") +
theme_minimal()
```
It almost seems like some of these lines are too extreme (outside of |-2, 2| sd) but they are still realistic. Anyone with some kind of knowledge about fox ecology can produce better priors than I am using here, but as I am really lazy I will stick to these rather uninformative priors for now.
Let's see the effect of `area` on `weight`.
```{r 6H3 part 4}
m_foxes1 %>%
test_independence(coeff.input = "Barea")
```
There is no substantial association between `area` and `weight`.
## Question 6H4
**Now infer the causal impact of adding food (avgfood) to a territory. Would this make foxes heavier? Which covariates do you need to adjust for to estimate the total causal influence of food?**
There are again two paths:
(1) avgfood -> groupsize -> weight
(2) avgfood -> weight
Both are open and we want to keep both to get the total causal impact of food on weight. Including groupsize (condition on groupsize) would close the first path, which we don't want. Again, we don't have any backdoor paths and a single predictor is enough.
```{r 6H3 part 5}
alist(weight ~ dnorm(mu, sigma),
mu <- a + Bavgf*avgfood,
a ~ dnorm(0, 0.2),
Bavgf ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes) %>%
test_independence(coeff.input = "Bavgf")
```
There is no substantial relationship between `avgfood` and `weight`. This is no big surprise as `avgfood` is only causally related to `area` (at least in our DAG), and `area` showed no dependency as well.
## Question 6H5
**Now infer the causal impact of group size. Which covariates do you need to adjust for? Looking at the posterior distribution of the resulting model, what do you think explains these data? That is, can you explain the estimates for all three problems? How do they make sense together?**
We have two paths:
(1) groupsize -> weight
(2) groupsize -> avgfood -> weight
The second is a backdoor path and we want to close it (note that avgfood is a fork). We can do this by simply adjusting for avgfood.
```{r 6H3 part 6}
alist(weight ~ dnorm(mu, sigma),
mu <- a + Bavgf*avgfood + Bgrps*groupsize,
a ~ dnorm(0, 0.2),
c(Bavgf, Bgrps) ~ dnorm(0, 0.5),
sigma ~ dexp(1)) %>%
quap(data = dat_foxes) %>%
test_independence(coeff.input = c("Bavgf","Bgrps"))
```
We can see that `groupsize` is negatively related to `weight`, and when `groupsize` is in the model the effect of `avgfood` on `weight` is positive. This looks like a masked relationship as the total causal effect of `avgfood` on `weight` was not substantial. It makes sense that foxes with more food are heavier, but more food means also that other foxes are attracted (or born) into the area. Now you have to distribute the food between more foxes, which cancels out the effect of `avgfood` on `weight.` More foxes means also that an individual fox has less food and therefore weighs less, leading to the negative relationship between `groupsize` and `weight`.
## Question 6H6 and 6H7
Questions 6H6 and 6H7 are open research questions and I will hopefully manage to play around with them, once I have more time. So stick around for some updates of this post in the future.
# Summary
I have read *The book of why* by Juda Pearl two years ago and thought to myself that this nicely plays out the route analysis in Palaeobiology (or in Science overall) should go. However, there was no way I could implement the statements from *The book of why* in my own research. There was no guide to go from theory to practice and I was lacking the tools. Turns out that this chapter in *Statistical rethinking* is providing the tools and the basics for causal inference. And this just makes me happy.