---
title: "Rethinking Chapter 4"
author: "Gregor Mathes"
date: "2021-01-01"
slug: Rethinking Chapter 4
categories: []
tags: [Rethinking, Bayes, Statistics]
subtitle: ''
summary: 'This is the third part of a series where I work through the practice questions of the second edition of Richard McElreath''s Statistical Rethinking'
authors: [Gregor Mathes]
lastmod: '2021-01-01T12:07:04+02:00'
featured: no
projects: [Rethinking]
output:
  blogdown::html_page:
    toc: true
    toc_depth: 1
    number_sections: true
    fig_width: 6
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.dim=c(7,4))
library(tidyverse)
library(rethinking)
```
# Introduction
This is the third part of a series where I work through the practice questions of the second edition of Richard McElreath's [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/). Each post covers a new chapter. There are already some awesome sources for this book online, like [Jeffrey Girard](https://jmgirard.com/statistical-rethinking-ch2/) working through the exercises of the first edition, or [Solomon Kurz](https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/) leading through each example of the book with the *brms* and the *tidyverse* packages. You can even watch the [lectures of McElreath](https://www.youtube.com/playlist?list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI) on Youtube and work through the [homework and solutions](https://github.com/rmcelreath/statrethinking_winter2019/tree/master/homework).
However, so far I couldn't find a source providing solutions for the practice questions of the second edition, or the homework practices, in a tidy(-verse) way. My aim here is therefore to provide solutions for each homework and practice question of the second edition, using the *tidyverse* and the *rethinking* packages. The third part of the series will cover chapter 4, which corresponds to week 2 of the lectures and homework.
McElreath himself states that chapter 4 provides the foundation for most of the examples of the later chapters. So I will spend a bit more time on this chapter and go a bit more into detail.
# Homework
## Question 1
**The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% intervals for each of these individuals. That is, fill in the table below, using model-based predictions.**
```{r question 1 part 1, echo=FALSE}
opts <- options(knitr.kable.NA = "")
tibble('Individual' = 1:5,
'weight' = c(46.95, 43.72, 64.78, 32.59, 54.63),
'expected height' = NA,
'89% Interval' = NA) %>%
knitr::kable(align = "ccrr")
```
We can use a linear regression model to predict height from weight. First, let's load the census data and the new weights:
```{r question 1 part 2}
# data
data("Howell1")
d <- Howell1
# new data to predict from
new_weight <- c(46.95, 43.72, 64.78, 32.59, 54.63)
```
For the model, we use the same structure and priors as given on page 102. Further, we use all data (both juveniles and adults) and fit the posterior with quadratic approximation.
```{r question 1 part 3}
# define the average weight
xbar <- d %>% summarise(xbar = mean(weight)) %>% pull
# model formula
m <- alist(
height ~ dnorm(mu, sigma),
mu <- a + b * (weight - xbar),
a ~ dnorm(178, 20),
b ~ dlnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
# model fit using quadratic approximation
quap(data = d)
```
Now we can calculate the posterior distribution of heights for each individual weight value in the table using the `link()` function, as explained on page 105. From these posterior distributions, we can calculate the mean and the 89% HPDI using `summarise_all()`.
```{r question 1 part 4, warning=FALSE}
# predict height
pred_height <- link(m, data = data.frame(weight = new_weight))
# calculate means
expected <- pred_height %>%
as_tibble() %>%
summarise_all(mean) %>%
as_vector()
# calculate 89% interval (HPDI)
interval <- pred_height %>%
as_tibble() %>%
summarise_all(HPDI, prob = 0.89) %>%
as_vector()
```
Now we just have to add the predicted values to the table:
```{r question 1 part 5}
table_height <- tibble(individual = 1:5, weight = new_weight, expected = expected,
lower = interval[c(TRUE, FALSE)], upper = interval[c(FALSE, TRUE)]) %>%
knitr::kable(align = "cccrr")
table_height
```
## Question 2
**Model the relationship between height (cm) and the natural logarithm of weight (log-kg). Use the entire `Howell1` data frame, all 544 rows, adults and non-adults. Fit this model, using quadratic approximation. Use any model type from chapter 4 that you think useful: an ordinary linear regression, a polynomial or a spline. Plot the posterior predictions against the raw data.**
First, let's take a look at the data:
```{r question 2 part 1}
d %>%
mutate(log.weight = log(weight)) %>%
ggplot() +
geom_point(aes(log.weight, height), alpha = 0.5) +
theme_minimal()
```
It actually looks like a decent linear relationship, so a simple linear regression should be sufficient. All we need to change from the previous model is to log-transform the weight.
```{r question 2 part 2}
m_log <- alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight),
a ~ dnorm(178, 20),
b ~ dlnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
quap(data = d)
```
Let's glimpse at the results:
```{r question 2 part 3, warning=FALSE}
precis(m_log) %>% as_tibble() %>%
add_column(parameter = rownames(precis(m_log))) %>%
rename("lower" = '5.5%', "upper" = '94.5%') %>%
select(parameter, everything()) %>%
knitr::kable(align = "lcrrr")
```
Instead of trying to read these estimates, we can just visualise our model. Let's calculate the predicted mean height as a function of weight, the 89% HPDI for the mean, and the 89% PI for predicted heights, following the procedure explained on page 108.
As we will repeat these steps throughout the exercises, we can set up a function for the interval calculation:
```{r question 2 part 4}
# we better make a function out of it since we use it more often
tidy_intervals <- function(my_function, my_model, interval_type,
x_var, x_seq){
# preprocess dataframe
df <- data.frame(col1 = x_seq)
colnames(df) <- x_var
# calculate 89% intervals for each weight
# either link or sim
my_function(my_model, data = df) %>%
as_tibble() %>%
# either PI or HPDI
summarise_all(interval_type, prob = 0.89) %>%
add_column(type = c("lower", "upper")) %>%
pivot_longer(cols = -type, names_to = "cols", values_to = "intervals") %>%
add_column(x_var = rep(x_seq, 2)) %>%
pivot_wider(names_from = type, values_from = intervals)
}
```
And for the mean:
```{r question 2 part 5}
tidy_mean <- function(my_model, data){
link(my_model, data) %>%
as_tibble() %>%
summarise_all(mean) %>%
pivot_longer(cols = everything()) %>%
add_column(x_var = data[[1]])
}
```
Now let's use these functions to calculate the intervals:
```{r question 2 part 6}
# define weight range
weight_seq <- seq(from = min(d$weight), to = max(d$weight), by = 1)
# calculate 89% intervals for each weight
intervals <- tidy_intervals(link, m_log, HPDI,
x_var = "weight", x_seq = weight_seq)
# calculate mean
reg_line <- tidy_mean(m_log, data = data.frame(weight = weight_seq))
# calculate prediction intervals
pred_intervals <- tidy_intervals(sim, m_log, PI,
x_var = "weight", x_seq = weight_seq)
```
Now we can plot the raw data, the posterior mean, the 89% HPDI of the mean (mu), and the region within which the model expects to find 89% of actual heights in the population.
```{r question 2 part 7}
ggplot(d) +
geom_point(aes(weight, height), alpha = 0.5) +
geom_ribbon(aes(x = x_var, ymin = lower, ymax = upper),
alpha=0.8, data = intervals) +
geom_ribbon(aes(x = x_var, ymin = lower, ymax = upper),
alpha=0.2, data = pred_intervals) +
geom_line(aes(x = x_var, y = value), data = reg_line, colour = "coral") +
theme_light()
```
This looks like a decent fit. Let's make a function of the ggplot, so we can call it later on:
```{r question 2 part 8}
plot_regression <- function(df, x, y,
interv = intervals, pred_interv = pred_intervals){
ggplot(df) +
geom_point(aes({{x}}, {{y}}), alpha = 0.5) +
geom_ribbon(aes(x = x_var, ymin = lower, ymax = upper),
alpha=0.8, data = interv) +
geom_ribbon(aes(x = x_var, ymin = lower, ymax = upper),
alpha=0.2, data = pred_interv) +
geom_line(aes(x = x_var, y = value), data = reg_line, colour = "coral") +
theme_light()
}
```
## Question 3
**Plot the prior predictive distribution for the polynomial regression model in Chapter 4. You can modify the code that plots the linear regression prior predictive distribution. 20 or 30 parabolas from the prior should suffice to show where the prior probability resides. Can you modify the prior distributions of a, b1, and b2 so that the prior predictions stay within the biologically reasonable outcome space? That is to say: Do not try to fit the data by hand. But do try to keep the curves consistent with what you know about height and weight, before seeing these exact data.**
Let us first recap how the polynomial regression model in Chapter 4 was built: The first thing was to standardise the predictor variables, in our case the `weight` variable. Additionally, we calculate the square of the standardised weight.
```{r question 3 part 1}
# standardise weight
d_stand <- d %>% mutate(weight_s = (weight - mean(weight)) / sd(weight),
weight_s2 = weight_s^2) %>%
as_tibble()
```
And here's the notation of the model with the priors used in the chapter:
```{r question 3 part 2}
# fit model
m_poly <- alist(height ~ dnorm(mu, sigma),
mu <- a + b1 * weight_s + b2 * weight_s2,
a ~ dnorm(178, 20),
b1 ~ dlnorm(0, 1),
b2 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
quap(data = d_stand)
```
To get the prior prediction, we can sample from the prior distribution using the `extract.prior()` function. We then pass these samples from the prior to the `link()` function for the weight range we are interested in. As we want to try out various priors and see how they affect the predictions, I'll make a function that takes a model definition (an `alist`) and the number of predicted curves as arguments.
```{r question 3 part 3}
modify_prior_poly <- function(my_alist, N) {
# set seed for reproducibility
set.seed(0709)
# fit model
m_poly <- my_alist %>%
quap(data = d_stand)
# make weight sequence with both standardised weight and the square of it
weight_seq <- tibble(weight = seq(
from = min(d_stand$weight),
to = max(d_stand$weight),
by = 1
)) %>%
mutate(weight_s = (weight - mean(weight)) / sd(weight),
weight_s2 = weight_s ^ 2)
# extract samples from the prior
m_poly_prior <- extract.prior(m_poly, n = N)
# now apply the polynomial equation to the priors to get predicted heights
m_poly_mu <- link(
m_poly,
post = m_poly_prior,
data = list(
weight_s = weight_seq$weight_s,
weight_s2 = weight_seq$weight_s2
)
) %>%
as_tibble() %>%
pivot_longer(cols = everything(), values_to = "height") %>%
add_column(
weight = rep(weight_seq$weight, N),
type = rep(as.character(1:N), each = length(weight_seq$weight))
)
# plot it
ggplot(m_poly_mu) +
geom_line(aes(x = weight, y = height, group = type), alpha = 0.5) +
geom_hline(yintercept = c(0, 272), colour = "steelblue4") +
annotate(
geom = "text",
x = c(6, 12),
y = c(11, 285),
label = c("Embryo", "World's tallest person"),
colour = c(rep("steelblue4", 2))
) +
labs(x = "Weight in kg", y = "Height in cm") +
theme_minimal()
}
```
We can now define our own model via an `alist` and then throw it into our function, and directly get the (visualised) output. Let's start with the priors used in the chapter, with 40 predictive curves sampled from the priors.
```{r question 3 part 4}
alist(height ~ dnorm(mu, sigma),
mu <- a + b1 * weight_s + b2 * weight_s2,
a ~ dnorm(178, 20),
b1 ~ dlnorm(0, 1),
b2 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
modify_prior_poly(my_alist = ., N = 40)
```
It seems like our priors are a bit too narrow, as we do not cover the whole realistic outcome space. What we can do is decrease the prior mean of alpha a bit, while increasing its standard deviation:
```{r question 3 part 5}
alist(height ~ dnorm(mu, sigma),
mu <- a + b1 * weight_s + b2 * weight_s2,
a ~ dnorm(130, 35), # decrease mean and increase sd
b1 ~ dlnorm(0, 1),
b2 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
modify_prior_poly(my_alist = ., N = 40)
```
Now that looks more realistic to me, without knowing anything about the data. The prior on alpha now looks good, as we are expressing enough uncertainty. The trend lines, however, are a bit too flat, essentially assuming no relationship between height and weight. But we already know, without looking at the data, that there is a positive relationship. This is prior knowledge we can and should use. So let's try making the slope of our prior predictive lines a bit more positive. We can do so by adjusting the prior on beta1. Tuning the prior of a log-normal distribution is a bit tricky: if we decrease the standard deviation, the distribution gets less skewed, meaning we get fewer very high values but also fewer very low values. Instead, we will directly change the mean of the log-normal. When we increase it, we increase the chance of getting higher values for the slope.
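To check that claim before refitting, here is a quick comparison of samples from the two candidate slope priors (my own addition, just a sketch to illustrate the point):
```{r question 3 lognormal priors}
# compare quantiles of the two log-normal priors for the slope
set.seed(42)
tibble(prior = rep(c("dlnorm(0, 1)", "dlnorm(1, 1)"), each = 1e4),
       slope = c(rlnorm(1e4, 0, 1), rlnorm(1e4, 1, 1))) %>%
  group_by(prior) %>%
  summarise(q05 = quantile(slope, 0.05),
            median = median(slope),
            q95 = quantile(slope, 0.95))
```
Increasing the mean shifts the whole distribution towards larger slopes, which is what we want here: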
```{r question 3 part 6}
alist(height ~ dnorm(mu, sigma),
mu <- a + b1 * weight_s + b2 * weight_s2,
a ~ dnorm(130, 35),
b1 ~ dlnorm(1, 1), # increase mean
b2 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
modify_prior_poly(my_alist = ., N = 40)
```
This looks better now. We are not too certain about the relationship, but at least we have directly included our prior knowledge. One last thing we could change is the prior on beta2. This prior affects the curvature of the line. But, taking into account what we know about the relationship, I am quite happy with the curvature. We could still force it to be positive by using a narrow log-normal distribution, because a negative value would result in a downward curvature (= a negative relationship).
```{r question 3 part 7}
alist(height ~ dnorm(mu, sigma),
mu <- a + b1 * weight_s + b2 * weight_s2,
a ~ dnorm(130, 35),
b1 ~ dlnorm(1, 1),
b2 ~ dlnorm(0, 0.5), # force positivity
sigma ~ dunif(0, 50)) %>%
modify_prior_poly(my_alist = ., N = 40)
```
That's it. This was my first exercise fiddling with priors, and it really forced me to *think* about my model assumptions. This takes a lot of time, but it will eventually result in better models.
# Easy practices
## Question 4E1
**In the model definition below, which line is the likelihood?**
$y_{i} ∼ Normal(μ,σ)$
$μ ∼ Normal(0,10)$
$σ ∼ Exponential(1)$
We can follow the example on page 82: The first line is therefore the likelihood, the second line the prior on the mean, and the third the prior on the standard deviation.
## Question 4E2
**In the model definition just above, how many parameters are in the posterior distribution?**
y is not a parameter; it is the observed data (page 82). Hence, there are two parameters to be estimated in this model: μ and σ.
## Question 4E3
**Using the model definition above, write down the appropriate form of Bayes’ theorem that includes the proper likelihood and priors**
We can simply follow the example on page 83:
$Pr(\mu, \sigma \mid y) = \frac{\prod_i Normal(y_i \mid \mu, \sigma) \, Normal(\mu \mid 0, 10) \, Exponential(\sigma \mid 1)}{\int \int \prod_i Normal(y_i \mid \mu, \sigma) \, Normal(\mu \mid 0, 10) \, Exponential(\sigma \mid 1) \, d\mu \, d\sigma}$
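To make the theorem a bit more concrete, here is a small grid-approximation sketch (my own addition; the data vector `y` is made up purely for illustration). It evaluates the numerator of the formula above on a grid of parameter values and then normalises, which stands in for the double integral in the denominator:
```{r question 4E3 grid sketch}
# hypothetical data, just to make the formula concrete
set.seed(42)
y <- rnorm(20, mean = 1, sd = 2)
# evaluate likelihood times priors on a grid of mu and sigma, then normalise
grid_post <- crossing(mu = seq(-3, 5, length.out = 100),
                      sigma = seq(0.5, 5, length.out = 100)) %>%
  mutate(log_lik = map2_dbl(mu, sigma, ~ sum(dnorm(y, .x, .y, log = TRUE))),
         log_num = log_lik + dnorm(mu, 0, 10, log = TRUE) + dexp(sigma, 1, log = TRUE),
         posterior = exp(log_num - max(log_num)),
         posterior = posterior / sum(posterior))
# the grid point with the highest posterior probability
grid_post %>% slice_max(posterior, n = 1)
```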
## Question 4E4
**In the model definition below, which line is the linear model?**
$y_{i} ∼ Normal(μ,σ)$
$μ_{i} = α + βx_{i}$
$α ∼ Normal(0, 10)$
$β ∼ Normal(0,1)$
$σ ∼ Exponential(2)$
Following the example on page 77: The first line is the likelihood. The second the linear model. The third the prior on α. The fourth the prior on β. And the fifth the prior on σ.
# Medium practices
## Question 4M1
**For the model definition below, simulate observed y values from the prior (not the posterior).**
$y_{i} ∼ Normal(μ,σ)$
$μ ∼ Normal(0,10)$
$σ ∼ Exponential(1)$
We can simply sample from each prior distribution:
```{r question 4M1}
# sample mu
sample_mu <- rnorm(1e4, 0, 10)
# sample sigma
sample_sigma <- rexp(1e4, 1)
# sample y
prior_y <- rnorm(1e4, sample_mu, sample_sigma) %>%
enframe()
# plot
ggplot(prior_y) +
geom_density(aes(value), size = 1) +
theme_light()
```
## Question 4M2
**Translate the model just above into a `quap()` formula**
```{r question 4M2}
quap_frm <- alist(y ~ dnorm(mu, sigma),
mu ~ dnorm(0, 10),
sigma ~ dexp(1))
```
## Question 4M3
**Translate the `quap()` model formula below into a mathematical model definition.**
```{r question 4M3}
flist <- alist(
y ~ dnorm(mu, sigma),
mu <- a + b*x,
a ~ dnorm(0, 50),
b ~ dunif(0, 10),
sigma ~ dunif(0, 50))
```
$y_{i} ∼ Normal(μ_{i},σ)$
$μ_{i} = α + βx_{i}$
$α ∼ Normal(0, 50)$
$β ∼ Uniform(0, 10)$
$σ ∼ Uniform(0, 50)$
## Question 4M4
**A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.**
We can use a simple linear regression approach for this.
First the likelihood, assuming that height is normally distributed:
$h_{i} ∼ Normal(μ_{i},σ)$
The linear model:
$μ_{i} = α + βx_{i}$
The prior for the intercept: the question says nothing about the age of the students, so I use a weak prior covering everything from elementary school children to adults, as shown in the plot below:
$α ∼ Normal(150, 20)$
```{r question 4M4 part 1}
tibble(height = rnorm(10000, 150, 20)) %>%
ggplot() +
geom_density(aes(height)) +
theme_minimal()
```
Likewise, the prior for the growth rate needs to accommodate both fast-growing juveniles and older students who grow more slowly. We can't use a normal distribution centred on zero, because students don't shrink (to my knowledge). A uniform distribution over positive values allows only positive growth rates.
$β ∼ Uniform(0, 7)$
```{r question 4M4 part 2}
tibble(growth = runif(10000, 0, 7)) %>%
ggplot() +
geom_density(aes(growth)) +
labs(x = "Growth rate in cm/year") +
theme_minimal()
```
Now we just need to add a prior for sigma, the standard deviation of height. I choose a weak prior that keeps simulated heights within a reasonable range (first plot below). A sigma around 40 would already lead to height values outside this range (second plot):
$σ ∼ Uniform(0, 30)$
```{r question 4M4 part 3, fig.show="hold", out.width="50%"}
tibble(height = rnorm(10000, 150, 30)) %>%
ggplot() +
geom_density(aes(height)) +
labs(x = "Height in cm", title = "sigma = 30") +
theme_minimal()
tibble(height = rnorm(10000, 150, 40)) %>%
ggplot() +
geom_density(aes(height)) +
labs(x = "Height in cm", title = "sigma = 40") +
theme_minimal()
```
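Putting the pieces together, the model could also be written down as a `quap()`-style formula. This is just a sketch of my own; there is no student data here to actually fit, and `year` would be the year of measurement (1, 2, 3):
```{r question 4M4 part 4}
# sketch only: not fitted, since we have no data for this exercise
m_students_frm <- alist(
  height ~ dnorm(mu, sigma),
  mu <- a + b * year,
  a ~ dnorm(150, 20),
  b ~ dunif(0, 7),
  sigma ~ dunif(0, 30))
```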
## Question 4M5
**Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?**
An increasing height tells us that the students are children or juveniles. We can keep our likelihood and linear model, but need to adjust the priors for alpha, beta, and sigma slightly. For alpha, we can use a new mean of 120. For beta (the growth rate), we can use a log-normal distribution to force positive values and make them slightly bigger than before, as children tend to grow faster. For sigma, we can reduce it slightly, as the spread is probably lower among children.
$h_{i} ∼ Normal(μ_{i},σ)$
$μ_{i} = α + βx_{i}$
$α ∼ Normal(120, 20)$
$β ∼ LogNormal(2, 0.5)$
$σ ∼ Uniform(0, 20)$
## Question 4M6
**Now suppose I tell you that the variance among heights for students of the same age is never more than 64 cm. How does this lead you to revise your priors?**
The variance is the square of σ. If the variance is never more than 64, then σ is never larger than 8 cm. We can update our prior:
$σ ∼ Uniform(0, 8)$
Such a low standard deviation of height has implications for our prior on the intercept alpha. We are now more certain that the intercept is around 120 cm and can narrow the spread a bit:
$α ∼ Normal(120, 10)$
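As a quick check (my own addition), we can simulate heights from these revised priors and confirm that they stay within a plausible range for students of the same age:
```{r question 4M6 prior check}
# prior predictive simulation of heights under the revised priors
set.seed(42)
tibble(a = rnorm(1e4, 120, 10),
       sigma = runif(1e4, 0, 8)) %>%
  mutate(height = rnorm(1e4, a, sigma)) %>%
  ggplot() +
  geom_density(aes(height)) +
  labs(x = "Simulated height in cm") +
  theme_minimal()
```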
## Question 4M7
**Refit model `m4.3` from the chapter but omit the mean weight `xbar`. Compare the new model’s posterior to that of the original model. In particular, look at the covariance among the parameters. What is different?**
First, let us take a look at the old model `m4.3`.
```{r question 4M7 part 1}
# only adults
adults <- Howell1 %>% filter(age >= 18)
# mean weight
xbar <- adults %>%
summarise(my_mean = mean(weight)) %>%
pull()
# fit model
m4.3 <- alist(height ~ dnorm(mu, sigma), # likelihood
mu <- a + b * (weight - xbar), # linear model
a ~ dnorm(178, 20), # alpha
b ~ dlnorm(0, 1), # beta
sigma ~ dunif(0, 50)) %>% # sigma
quap(data = adults) # quadratic approximation
```
Now do the same but without `xbar`.
```{r question 4M7 part 2}
m4.3new <- alist(height ~ dnorm(mu, sigma), # likelihood
mu <- a + b*weight, # linear model
a ~ dnorm(178, 20), # alpha
b ~ dlnorm(0, 1), # beta
sigma ~ dunif(0, 50)) %>% # sigma
quap(data = adults, start = list(a = 115, b = 0.9)) # quadratic approximation
```
We shall look at the covariance of each model.
```{r question 4M7 part 3}
m4.3new %>% vcov() %>% round(digits = 3) # lots of covariation
m4.3 %>% vcov() %>% round(digits = 3) # low covariation
```
So we seem to increase the covariation by not centering. Let's dig deeper by looking at summaries of the posterior distribution for each parameter:
```{r question 4M7 part 4}
m4.3new %>%
extract.samples() %>%
as_tibble() %>%
summarise(alpha = mean(a), beta = mean(b), sigma = mean(sigma))
m4.3 %>%
extract.samples() %>%
as_tibble() %>%
summarise(alpha = mean(a), beta = mean(b), sigma = mean(sigma))
```
While beta and sigma are essentially the same, alpha is drastically lower in the new model without centering. But what does this mean? Before, alpha was the expected height when $x - \bar{x}$ was 0, i.e. when the weight equals the average weight. In the new model, alpha is the expected height for an observation with weight equal to 0. As there are no people with a weight of zero, this alpha is harder to interpret.
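A quick sanity check of this interpretation (my own addition): the uncentered intercept plus the slope times the average weight should approximately recover the centered intercept, i.e. both models agree on the expected height at the average weight.
```{r question 4M7 intercept check}
post_new <- extract.samples(m4.3new) %>% as_tibble()
post_old <- extract.samples(m4.3) %>% as_tibble()
# expected height at the average weight, uncentered model
mean(post_new$a + post_new$b * xbar)
# intercept of the centered model
mean(post_old$a)
```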
But does this actually change the way our model works? We can test this by comparing the posterior predictions, applying our prediction functions to the data:
```{r question 4M7 part 5, include=FALSE}
# define weight range
weight_seq <- seq(from = min(adults$weight), to = max(adults$weight), by = 1)
# calculate 89% intervals for each weight
intervals <- tidy_intervals(link, m4.3, HPDI,
x_var = "weight", x_seq = weight_seq)
# calculate means
reg_line <- tidy_mean(m4.3, data = data.frame(weight = weight_seq))
# calculate prediction intervals
pred_intervals <- tidy_intervals(sim, m4.3, PI,
x_var = "weight", x_seq = weight_seq)
m4.3.plot <- plot_regression(adults, weight, height) +
labs(title = "Centered model (m4.3)")
# the same for m4.3new
# calculate 89% intervals for each weight
intervals <- tidy_intervals(link, m4.3new, HPDI,
x_var = "weight", x_seq = weight_seq)
# calculate means
reg_line <- tidy_mean(m4.3new, data = data.frame(weight = weight_seq))
# calculate prediction intervals
pred_intervals <- tidy_intervals(sim, m4.3new, PI,
x_var = "weight", x_seq = weight_seq)
m4.3new.plot <- plot_regression(adults, weight, height) +
labs(title = "Uncentered model (m4.3new)")
```
```{r question 4M7 part 6, fig.show="hold", out.width="50%"}
m4.3.plot
m4.3new.plot
```
Short answer: No, it does not. We get the same prediction intervals.
## Question 4M8
**In the chapter, we used 15 knots with the cherry blossom spline. Increase the number of knots and observe what happens to the resulting spline. Then adjust also the width of the prior on the weights - change the standard deviation of the prior and watch what happens. What do you think the combination of knot number and the prior on the weights controls?**
First of all, set up the model environment by importing the `cherry_blossoms` data and loading the `splines` package:
```{r question 4M8 part 1}
data("cherry_blossoms")
# remove na's
cherry_blossoms <- cherry_blossoms %>%
as_tibble() %>%
drop_na()
# load the splines package for the basis functions
library(splines)
```
Instead of typing out the code every time we change the number of knots or the prior on the weights, we make a function that depends on these two parameters and just call it each time. All we change within the function is the number of knots and the prior on the weights; everything else stays as it is. The function also processes and plots the output of the spline regression.
```{r question 4M8 part 2}
# make a function for it, dependent on the number of knots and the prior on weights
cherry_spliner <- function(nr_knots, vl_sigma){
# get knot points
knot_list <- cherry_blossoms$year %>%
quantile(probs = seq(0, 1, length.out = nr_knots)) %>%
discard(~ . %in% c(851, 1980))
# construct basis functions
B <- cherry_blossoms$year %>% bs(knots = knot_list, degree = 3, intercept = TRUE)
# run bspline regression
m4.7 <- alist(D ~ dnorm(mu, sigma),
mu <- a + B %*% w,
a ~ dnorm(100, 10),
w ~ dnorm(0, my_sigma),
sigma ~ dexp(1)) %>%
quap(data = list(D = cherry_blossoms$doy, B = B, my_sigma = vl_sigma),
start = list(w = rep(0, ncol(B))))
# get 97% posterior interval for mean
post_int <- m4.7 %>%
link() %>% as_tibble() %>%
map_dfr(PI, prob = 0.97) %>%
select("lower_pi" = '2%', "upper_pi" = '98%') %>%
add_column(year = cherry_blossoms$year) %>%
# add doy
left_join(cherry_blossoms, by = "year")
# plot it and add nr_knots and vl_sigma to plot title
ggplot(post_int, aes(year, doy)) +
geom_point(colour = "steelblue4", alpha = 0.8) +
geom_ribbon(aes(ymin = lower_pi, ymax = upper_pi), alpha = 0.9) +
labs(y = "Day in year", x = "year",
title = paste(nr_knots, " knots, prior on weights ~ N(0,", vl_sigma, ")")) +
theme_minimal()
}
```
Now we can increase the number of knots. We start with the original model using 15 knots and a prior standard deviation of 10.
```{r question 4M8 part 3, fig.show="hold", out.width="50%", message=FALSE}
cherry_spliner(nr_knots = 15, vl_sigma = 10)
cherry_spliner(nr_knots = 20, vl_sigma = 10)
cherry_spliner(nr_knots = 30, vl_sigma = 10)
```
The more knots we have, the *wigglier* our trend line gets, as the spline can capture more of the local variation.
Now we can play around with the prior on weights. First, we decrease it significantly to 1, and then increase it to 100, while keeping the number of knots equal.
```{r question 4M8 part 4, fig.show="hold", out.width="50%", message=FALSE}
cherry_spliner(nr_knots = 15, vl_sigma = 1)
cherry_spliner(nr_knots = 15, vl_sigma = 10)
cherry_spliner(nr_knots = 15, vl_sigma = 100)
```
So by increasing the standard deviation of the prior on the weights, we make our trend line *wigglier* as well. This makes sense: if the standard deviation is small, the weights are pulled closer to 0 and the curve stays closer to the mean line. If we allow larger weights, the curve can have more pronounced peaks.
# Hard practices
## Question 4H1
**The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% intervals (either HPDI or PI) for each of these individuals. That is, fill in the table below, using model-based predictions.**
```{r question 4H1 part 1, echo=FALSE}
opts <- options(knitr.kable.NA = "")
tibble('Individual' = 1:5,
'weight' = c(46.95, 43.72, 64.78, 32.59, 54.63),
'expected height' = NA,
'89% Interval' = NA) %>%
knitr::kable(align = "ccrr")
```
This question is similar to question 1 in the homework. We solved it there using a linear regression, but this time we model the relationship between height (cm) and the natural logarithm of weight (log-kg). Let's see if that makes any difference at all.
```{r question 4H1 part 2}
# new data to predict from
new_weight <- c(46.95, 43.72, 64.78, 32.59, 54.63)
# fit the logarithmic model
m_log <- alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight),
a ~ dnorm(178, 20),
b ~ dlnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
quap(data = d,
start = list(a = mean(d$height), b = 1.5, sigma = 9))
```
Now we can make predictions from our model:
```{r question 4H1 part 3}
# predict height
pred_height <- link(m_log, data = data.frame(weight = new_weight))
# calculate means
expected <- pred_height %>%
as_tibble() %>%
summarise_all(mean) %>%
as_vector()
# calculate 89% interval (HPDI)
interval <- pred_height %>%
as_tibble() %>%
summarise_all(HPDI, prob = 0.89) %>%
as_vector()
```
And add the predictions to the table.
```{r question 4H1 part 4}
tibble(individual = 1:5, weight = new_weight, expected = expected,
lower = interval[c(TRUE, FALSE)], upper = interval[c(FALSE, TRUE)]) %>%
knitr::kable(align = "cccrr")
```
And now we can compare our results to the predictions from the regular model, which we named `table_height`.
```{r question 4H1 part 5}
table_height
```
And we can see that it actually makes a big difference, especially for those with a large weight (*individual 3*), or with a particularly low weight (*individual 4*).
## Question 4H2
**Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.**
```{r question 4H2 part 1}
young <- d %>% filter(age < 18)
# double-check
str(young)
```
### Part 1
**Fit a linear regression to these data, using `quap()`. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?**
We can use the same priors as before, but should lower the prior mean of alpha to account for the lower overall height in our (non-adult) data set. It is important to center the weight values to keep the intercept interpretable.
```{r question 4H2 part 2, warning=FALSE}
# again, get mean weight for centering
xbar <- young %>% summarise(xbar = mean(weight)) %>% pull
m_young <- alist(
height ~ dnorm(mu, sigma),
mu <- a + b * (weight - xbar),
a ~ dnorm(120, 30),
b ~ dlnorm(0, 1),
sigma ~ dunif(0, 60)) %>%
quap(data = young)
# results
m_young_res <- precis(m_young) %>%
as_tibble() %>%
add_column(parameter = rownames(precis(m_young)), .before = "mean") %>%
rename("lower" = '5.5%', "upper" = '94.5%')
knitr::kable(m_young_res)
```
b can be interpreted as the slope of our regression: for each unit increase in weight, height increases by `r m_young_res %>% filter(parameter == "b") %>% pull(mean) %>% round(2)` cm. Hence, for every 10 units of increase in weight, the model predicts that a child gets `r m_young_res %>% filter(parameter == "b") %>% pull(mean) %>% round(2)*10` cm taller.
### Part 2
**Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the `quap` regression line and 89% HPDI for the mean. Also superimpose the 89% HPDI for predicted heights.**
And this is the reason why we built our functions earlier: we don't need to do everything manually all over again.
```{r question 4H2 part 3}
# define weight range
weight_seq <- seq(from = min(young$weight), to = max(young$weight), by = 1)
# calculate 89% intervals for each weight
intervals <- tidy_intervals(link, m_young, HPDI,
x_var = "weight", x_seq = weight_seq)
# calculate means
reg_line <- tidy_mean(m_young, data = data.frame(weight = weight_seq))
# calculate prediction intervals
pred_intervals <- tidy_intervals(sim, m_young, PI,
x_var = "weight", x_seq = weight_seq)
plot_regression(young, weight, height) +
labs(title = "Linear model (m_young)")
```
### Part 3
**What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.**
The model clearly fails to estimate height at both low (<10 kg) and high (>30 kg) weights. Just eyeballing the data, the assumption of a linear relationship is not realistic. To improve the model, we could either transform one of the variables, or use a polynomial regression.
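For illustration only (the question explicitly does not require new code), here is a minimal sketch of one such alternative: the log-weight model from question 2, restricted to the non-adult data. Question 4H3 below fits essentially this model to the full data set.
```{r question 4H2 part 4}
# sketch only: a log-weight model formula for the non-adults
m_young_log_frm <- alist(
  height ~ dnorm(mu, sigma),
  mu <- a + b * log(weight),
  a ~ dnorm(120, 30),
  b ~ dlnorm(0, 1),
  sigma ~ dunif(0, 60))
```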
## Question 4H3
**Suppose a colleague of yours, who works on allometry, glances at the practice problems just above. Your colleague exclaims, “That’s silly. Everyone knows that it’s only the logarithm of body weight that scales with height!” Let’s take your colleague’s advice and see what happens.**
### Part 1
**Model the relationship between height (cm) and the natural logarithm of weight (log-kg). Use the entire Howell1 data frame, all 544 rows, adults and non-adults. Can you interpret the resulting estimates?**
All we need to do is make a formula with the log of weight, and then fit it with `quap()`. Remember that this uses all the data, so we should adjust our priors accordingly. Then we get the coefficients with `precis()`.
```{r question 4H3 part 1, warning=FALSE}
# fit model
m_log <- alist(
height ~ dnorm(mu, sigma),
mu <- a + b * log(weight),
a ~ dnorm(178, 100),
b ~ dlnorm(0, 1),
sigma ~ dunif(0, 50)) %>%
quap(data = d)
# get coefficients
m_log_res <- precis(m_log) %>% as_tibble() %>%
add_column(parameter = rownames(precis(m_log)), .before = "mean") %>%
rename("lower" = '5.5%', "upper" = '94.5%')
knitr::kable(m_log_res, align = "lccc")
```
We get a weird alpha estimate of -24. What does this mean? It's just the predicted height of an individual with a weight of 0 log-kg (that is, a body weight of 1 kg). Beta shows the predicted increase (41 cm) for a 1 log-kg increase in weight. The standard deviation of the height prediction, sigma, is around 5 cm. We can see that transforming a predictor makes the coefficients less directly interpretable.
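One way to make the slope more tangible (my own addition): since the predictor is the natural logarithm of weight, a doubling of weight is always associated with the same predicted change in height, namely beta times log(2):
```{r question 4H3 slope doubling}
# predicted height gain associated with a doubling of body weight
extract.samples(m_log) %>%
  as_tibble() %>%
  summarise(height_gain_per_doubling = mean(b) * log(2))
```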
### Part 2
**Begin with this plot: `plot(height ~ weight, data = Howell1, col = col.alpha(rangi2, 0.4))`. Then use samples from the quadratic approximate posterior of the model in (a) to superimpose on the plot: (1) the predicted mean height as a function of weight, (2) the 97% HPDI for the mean, and (3) the 97% HPDI for predicted heights.**
Again, our predefined functions come in quite handy here (note that they are hard-coded to 89% intervals rather than the 97% asked for in the question). But first, let's recreate the plot in `ggplot2`.
```{r question 4H3 part 2}
ggplot(data = d) +
geom_point(aes(x = weight, y = height), colour = "steelblue4", alpha = 0.5) +
theme_minimal()
```
Now we can calculate the predictions from our model:
```{r question 4H3 part 3}
# define weight range
weight_seq <- seq(from = min(d$weight), to = max(d$weight), by = 1)
# calculate 89% intervals for each weight
intervals <- tidy_intervals(link, m_log, HPDI,
x_var = "weight", x_seq = weight_seq)
# calculate means
reg_line <- tidy_mean(m_log, data = data.frame(weight = weight_seq))
# calculate prediction intervals
pred_intervals <- tidy_intervals(sim, m_log, PI,
x_var = "weight", x_seq = weight_seq)
# plot it
plot_regression(d, weight, height)
```
We can see a pretty good fit for the relationship between height (cm) and the natural logarithm of weight (log-kg).
## Question 4H4
**Plot the prior predictive distribution for the parabolic polynomial regression model by modifying the code that plots the linear regression prior predictive distribution. Can you modify the prior distributions of a, b1, and b2 so that the prior predictions stay within the biologically reasonable outcome space? That is to say: Do not try to fit the data by hand. But do try to keep the curves consistent with what you know about height and weight, before seeing these exact data.**
We already covered this in *homework question 3* above.
## Question 4H5
**Return to data(cherry_blossoms) and model the association between blossom date (`doy`) and March temperature (`temp`). Note that there are many missing values in both variables. You may consider a linear model, a polynomial, or a spline on temperature. How well does the temperature trend predict the blossom trend?**
The easiest way to deal with missing values is to omit them. There are more sophisticated approaches, but for now this is sufficient. Let's load the data, select the two variables, and add a column with the standardised temperature in case we apply a polynomial regression.
```{r question 4H5 part 2}
data("cherry_blossoms")
cherry_blossoms <- cherry_blossoms %>%
as_tibble() %>%
select(doy, temp) %>%
drop_na() %>%
mutate(temp_sc = (temp - mean(temp))/ sd(temp))
```
Now let's take a quick look at the data to choose which model to use.
```{r question 4H5 part 3}
ggplot(cherry_blossoms) +
geom_point(aes(temp_sc, doy), alpha = 0.5) +
labs(x = "Centered temperature", y = "Day in year") +
theme_minimal()
```
It seems like there is a subtle negative relationship, but with a lot of noise. Using a polynomial regression or a spline would probably capture too much of this noise without providing much insight. I will therefore use a simple linear regression with temperature centred on its mean (but not standardised). I will mainly use weakly informative, wide priors; there is enough data that the priors do not matter much anyway. The prior on alpha covers most realistic values, and the prior on beta slightly favours negative values (a negative slope, i.e. a negative relationship).
```{r question 4H5 part 4}
# define average temp
xbar <- cherry_blossoms %>%
summarise(mean_temp = mean(temp)) %>%
pull()
# fit model
cherry_linear <-
# define formula
alist(doy ~ dnorm(mu, sigma),
mu <- a + b * (temp - xbar),
a ~ dnorm(115, 30),
b ~ dnorm(-2, 5),
sigma ~ dunif(0, 50)) %>%
# calculate maximum a posterior
quap(data = cherry_blossoms)
```
Let's glimpse at the estimated parameters:
```{r question 4H5 part 5, message=FALSE, warning=FALSE}
precis(cherry_linear) %>% as_tibble() %>%
add_column(parameter = rownames(precis(cherry_linear))) %>%
rename("lower" = '5.5%', "upper" = '94.5%') %>%
select(parameter, everything()) %>%
knitr::kable(align = "lcrrr")
```
The intercept is around day 105. The relationship is indeed negative: with every degree Celsius of warming, the day of blossom is about 3 days earlier on average. The interval for beta is entirely negative as well, showing that this relationship is fairly strong (phew, I almost said *significant*).
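To put a rough number on how strong the relationship is (my own addition), we can ask what proportion of the posterior samples for beta fall below zero:
```{r question 4H5 slope sign}
# proportion of posterior samples with a negative slope
cherry_linear %>%
  extract.samples() %>%
  as_tibble() %>%
  summarise(prob_negative_slope = mean(b < 0))
```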
Let's look at the 89% prediction intervals:
```{r question 4H5 part 6}
# define sequence of temperature