-
Notifications
You must be signed in to change notification settings - Fork 38
/
Copy pathExercise_07_CI.Rmd
246 lines (186 loc) · 15.1 KB
/
Exercise_07_CI.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
title: "Lab 08 - Interval Estimation and model selection"
author: "EE 509"
output: html_document
---
The objective for this lab is to apply the frequentist tools for generating confidence and predictive intervals and for selecting among models. In the course of this lab we will revisit some of the case studies we presented earlier in the semester when we were focused on point estimation.
## Case Study: Lab 4 pine cones data
In Lab 4 we investigated the effect that elevated CO2 has on tree reproduction by fitting a nonlinear model that accounted for both tree reproductive maturation and overall tree fecundity. In that Lab we performed a Likelihood Ratio Test that showed that CO2 did have a significant effect on tree reproduction. In this lab we will complete this analysis by estimating confidence intervals on the model parameters, as well as model confidence and predictive intervals, using bootstrap methods. We will also compare the results of the LRT with model selection using AIC.
As a reminder, our model had included tree size, as measured by stem diameter (dia), as the primary covariate. This model had two parts. First, tree reproductive maturation was modeled as a probit function, $\theta = probit(x \vert b_0,b_1)$, that described the probability that a tree was reproductive. Second, the function $g = a_0x^2$ described the fecundity in terms of the number of pine cones produced conditional on the tree being reproductive. Recall that in order to account for observation error (observing no cones on a tree that is mature) we had to fit both models simultaneously.
We'll pick up from where we were at the end of Lab 4 by loading up the the state of the R workspace at the end of that lab:
```{r}
load("data/Lab7_cone.RData")
```
At this point we had fit the pine cone count data to the whole data set (out), to just the ambient CO2 data (out.amb) and to just the elevated CO2 data (out.elev).
### AIC vs Likelihood Ratio Test
Begin by calculating the AIC for the two models. We had previously stored the negative log likelihoods in a vector named lnL that contained the values for (combined, amb, elev). To estimate AIC we need to know the total number of parameters for each model. For the model that analyzes the combined data there are 3 parameters, a0, b0, and b1. Because we did not use a Normal likelihood there's not an additional variance parameter that needs to be estimated. The model that treats ambient and elevated CO2 as separate treatments thus has 6 parameters.
** Lab Report Task 1**
Compare the AICs for the two models, interpret the results for this test, and compare with the results from the LRT previously performed.
## Bootstrapped Interval Estimates
Next, let's calculate the bootstrap confidence and predictive intervals. This begins by setting up variables that control the sampling and storage for the output. All of the code to follow for calculating bootstrapped intervals is very similar to the code we've used in the last two labs to generate interval estimates from MCMC output. One important difference is that since all of the bootstrap estimates are independent we can generally get away with a smaller sample size because there is no issue of autocorrelation or burn-in. The other important difference is in how we are treating the data and the parameters – in the bootstrap we're generating random DATA while in the MCMC the data are fixed but the parameters are random.
**For your report, make sure to set nboot to an appropriate number of replicates**
```{r}
## bootstrap
nboot <- 100 ## number of bootstrap samples
npred <- 31 ## number of X values you predict for
dseq <- seq(0,30,length=npred) ## diameter sequence
mle <- matrix(NA,nrow=nboot,ncol=3) ## storage for parameter estimates
conf.mat <- matrix(NA,nrow=nboot,ncol=npred) ## storage for maturation CI
conf.cone <- matrix(NA,nrow=nboot,ncol=npred) ## storage for fecundity CI
pred.mat <- matrix(NA,nrow=nboot,ncol=npred) ## storage for predictive maturation
pred.cone <- matrix(NA,nrow=nboot,ncol=npred) ## storage for predictive fecundity
```
Next we'll specify the data set that we'll be sampling from. In this case we'll begin by constructing interval estimates for the ambient CO2 data
```{r}
## Ambient
dia.t <- b$diam[tmt=="AMB"] ## tree diameters
cones.t <- (b$c00 > 0) ## whether cones are present
cones.t <- cones[tmt=="AMB"]
ncone.t <- b$c00[tmt=="AMB"] ## number of cones
```
Next is the main bootstrap loop. There are a number of steps to this process as we'll be generating both the parameters CI and the model CI and PI in one large loop. In the first part we define the resampled data set, making sure to sample so that we keep all xy pairs of data together. Next we'll fit the model to the resampled data using numerical optimization and store those parameters. After that we'll calculate the mean model prediction given the most recent set of parameters in order to calculate the model confidence interval. Finally we'll simulate pseudodata from the model to calculate the model predictive interval.
```{r,echo=FALSE}
likfit = function(param,dia,ncone){
a0 = param[1]
b0 = param[2]
b1 = param[3]
cones = ncone > 0
## trees with cones
dia.cone = dia[cones > 0] ##find just the trees with cones
g.cone = a0 * dia.cone^2 ## Fecundity fnc - g(x)
theta.cone = pnorm(dia.cone,b0,b1,log.p=TRUE) ## maturation probit
prob.cone = theta.cone + dpois(ncone[cones],g.cone,log=TRUE)
##trees with zero counts
dia.zero = dia[cones == 0]
g.zero = a0 * dia.zero^2
theta.zero = pnorm(dia.zero,b0,b1) ##maturation probit
prob.zero = log((1-theta.zero) + theta.zero*dpois(0,g.zero))
return(-sum(prob.cone,prob.zero))
}
```
```{r}
for(i in 1:nboot){
if(i%%100 == 0) print(i) ## progress indicator
samp <- sample(length(dia.t),replace=T) ##Sample row indices
out.boot <- optim(param,likfit,method="L-BFGS-B", ## fit model to sample
lower=c(0.001,10,1),upper=c(1,30,30),dia=dia.t[samp],ncone=ncone.t[samp])
mle[i,] <- out.boot$par ## store parameters
}
```
Once we've generated the bootstrap sample we'll want to begin by looking at the parameter level estimates. Lets first calculate the confidence intervals for each parameter using “quantile” and then use a pairs plot to assess parameter correlations.
```{r}
colnames(mle) = c("a0","b0","b1")
a0.ci <- quantile(mle[,1],c(0.025,0.975))
b0.ci <- quantile(mle[,2],c(0.025,0.975))
b1.ci <- quantile(mle[,3],c(0.025,0.975))
pairs(mle)
```
Next, construct density plots for each of these parameters and add both the MLE and the confidence intervals.
```{r}
## density plots
plot(density(mle[,1],width=0.005),type='l',xlim=c(0,0.05),ylim=c(0,120))
abline(v=a0.ci,lty=2)
abline(v=a0[2])
```
**Lab Report Task 2**
Generate density and CI plots for all 3 model parameters. Include three figures and a table of parameter estimates, standard errors, and CI.
### Model Intervals
Let’s now move on to constructing the model confidence intervals and predictive intervals. Since we have two process models for this analysis we'll start with the intervals on the maturation model. First, we'll want to generate the Monte Carlo predictions
```{r}
for(i in 1:nboot){
conf.mat[i,] <- pnorm(dseq,mle[i,2],mle[i,3]) ## store model | parms
pred.mat[i,] <- rbinom(npred,1,conf.mat[i,])
}
```
Next we'll want to generate the CI and PI from the full distribution of predictions
```{r}
ci.mat <- apply(conf.mat,2,quantile,c(0.025,0.5,0.975))
pi.mat <- apply(pred.mat,2,quantile,c(0.025,0.975))
```
Finally, lets plot the model confidence and predictive intervals and compare them to the data
```{r}
## maturation plot
cones = b$c00 > 0
plot(b$diam,cones,col = b$tmt,ylim=c(0,1))
lines(dseq,ci.mat[2,],col=2,lwd=3) ## median model
lines(dseq,ci.mat[1,],col=2,lty=2,lwd=3) ## 95% CI
lines(dseq,ci.mat[3,],col=2,lty=2,lwd=3)
lines(dseq,pi.mat[1,],col=3,lty=2,lwd=3) ## 95% PI
lines(dseq,pi.mat[2,],col=3,lty=2,lwd=3)
```
**Lab Report Task 3**
Generate and plot the confidence and predictive interval for the full fecundity process model (which combines both maturation and fecundity|maturation). Include the observed data in the plot and interpret both this plot and the maturation plot in your lab report.
**Lab Report Task 4**
Repeat the bootstrap analysis for the elevated CO2 treatment. Include the figures and tables from tasks 2 & 3 for the elevated plots as well as a final plot that compares the model confidence intervals between the elevated and ambient data on one figure
** Extra Credit: Likelihood Profile Interval Estimates: Lab 3 Fire Scar data**
Recall in Lab 3 we investigated the distribution of fire return intervals in Itasca State Park in northwestern Minnesota. In that lab we fit a Weibull distribution to the fire interval data. Recall that the Weibull is a generalization of the exponential that allows the rate parameter to either increase or decrease with time and the exponential can be recovered by setting c=1. In this lab we revisit this analysis in order to estimate confidence intervals for the MLE parameters we previously found to test whether or not the fire risk is constant or increasing with time.
For estimation of parameter confidence intervals we will use the likelihood profile technique. Following this we will use a Likelihood Ratio Test to test the Weibull model against the Exponential.
Rather than re-running the whole code for Lab 3, let's load up the saved image of the workspace as it was at the end of that lab.
```{r}
load("data/Lab7_fire.RData")
```
Where we left off we had just fit the MLE numerically, the output of which is stored in the variable “out”, and then constructed a 2D contour plot of the likelihood surface by calculating the likelihood on a grid “z” for values that ranged from 20%-200% of the MLE (variable rp). Let's start by replotting that contour plot
```{r}
c_mle = out$par[1] ## MLE for the parameter “C”
lam_mle = out$par[2] ## MLE for the parameter “Lambda”
contour(rp*c_mle,rp*lam_mle,z,levels=c(350,360,370,380),xlab="C",ylab="Lambda")
## add the MLE's
abline(v=c_mle,lty=2)
abline(h=lam_mle,lty=2)
```
In order to estimate the confidence intervals for each parameter we need to be able to go from this likelihood surface to the likelihood profiles for each variable. Because of the covariance between these variables this is not simply the profile along the horizontal and vertical lines we've plotted. Instead it requires that we specify a sequence of values for the parameter of interest, and then sequentially find the MLE for all other parameters in the model while holding the parameter of interest constant at a specified value along the profile. For each parameter, we'll loop over a vector of parameter values, find the MLE at each point, and we record the corresponding likelihood. In this way we'll end up with the likelihood profile for that parameter To estimate these profiles we need to define two new likelihood functions that each find the minimum likelihood for a specific value of our parameter of interest by searching over the values of the other parameter(s).
```{r}
## likelihood profile
prof_lam <- function(c,lambda){ ## find the MLE holding lambda at a constant
-sum(dweibull(firedata,c,lambda,log=TRUE))
}
prof_c <- function(lambda,c){ ## find the MLE holding C at a constant
-sum(dweibull(firedata,c,lambda,log=TRUE))
}
```
While these functions LOOK identical, in the first the value for lambda is being held constant and the optimization routine will vary the value of C to find the MLE while in the second the C is held constant and lambda is varied. To construct the likelihood profiles we'll then loop over a sequence of values for one variable and find the MLE of the other. When looking at the following code, remember that optimize returns a minimum, which is the parameter value, and an objective that is result of evaluating the function at that minimum (which in our case is the negative log likelihood).
Let's begin with the profile for lambda
```{r}
lambdas = lam_mle*rp ## sequence of lambda values being evaluated
lambda.like = numeric(nstep) ## storage for lambda likelihood values
CgivenL = numeric(nstep) ## storage for c value that pairs with the lambda
for(i in 1:nstep){ ## set lambda, solve for MLE
lout <- optimize(prof_lam,interval=range(c_mle*rp),maximum=FALSE,lambda=lambdas[i])
lambda.like[i] <- lout$objective ## save the likelihood
CgivenL[i] = lout$minimum ## save the MLE of c
}
contour(rp*c_mle,rp*lam_mle,z,levels=c(350,360,370,380),xlab="C",ylab="Lambda")
abline(v=c_mle,lty=2)
abline(h=lam_mle,lty=2)
lines(CgivenL,lambdas,type='b',pch="+",cex=0.5) ## add points to the contour plot
c.like = rp*0 ## defined just so Knitr runs, replace with estimation
```
In the last line of the code we add points to the likelihood surface plot that show's the MLE of C at each of the lambda values we evaluated. In other words, this is the line along the likelihood surface that lets you know where we are computing the likelihood profile (which we'll plot in just a minute). If you imagine the likelihood surface as a valley, as you walked from north to south carrying a GPS this path would always keep you at the minimum you could be at with each step and the resulting plot of elevation (likelihood) vs latitude (lambda) would be the likelihood profile itself. The equivalent profile for C would be the lowest path walking east to west.
**Lab Report Task 5**
Modify the previous section of code to calculate the vector c.like, that is found by fixing the value of C and fitting using the prof_c likelihood function. Be aware that you will need to flip many components of the code, including the order of the two variables in the “points” command. Turn in the following:
* R code for calculating c.like
* Likelihood surface plot that has the points for BOTH the profile of lambda and the profile of C
Now let's plot the likelihood profile for each variable
```{r}
c.seq = c_mle*rp
lambda.seq = lam_mle*rp
plot(c.seq,c.like,type='l',xlab="C",ylab="Likelihood")
plot(lambda.seq,lambda.like,type='l',xlab="Lambda",ylab="Likelihood")
```
In order to find the CI from the likelihood profiles we'll need to know the threshold deviance from the chi-square test and some way of finding which values in our vectors correspond to those values. Recall that “out” is the output from the optimization we performed in Lab 3 and that out$value is the negative log likelihood.
```{r}
Dtest = qchisq(0.95,1) ## Deviance for 95% CI
Dthresh = 2*out$value + Dtest ## Deviance threshold value
findThresh <- function(vec,Thr){ ## find indices where vector crosses Thr
which(as.logical(diff(sign(vec-Thr))))
}
```
Next let's calculate our CI bounds and add them to the plot
```{r}
c.CI = c.seq[findThresh(c.like,Dthresh/2)]
plot(c.seq,c.like,type='l',xlab="C",ylab="Likelihood")
abline(h=Dthresh/2) ## deviance is divided by 2 to convert to neg log likelihood
abline(v=c.CI)
```
** Lab Report Task 6**
Modify the previous section of code to calculate the CI for lambda as well. Turn in the likelihood profile plots with the CI and threshold for both variables and report the numeric values for the MLE and CI for both parameters