-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathlec08.rnw
579 lines (506 loc) · 16.2 KB
/
lec08.rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
\documentclass[table,10pt]{beamer}
\mode<presentation>{
%\usetheme{Goettingen}
\usetheme{Boadilla}
\usecolortheme{default}
}
\usepackage{CJK}
\usepackage{graphicx}
\usepackage{amsmath, amsopn}
\usepackage{xcolor}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{enumerate}
\usepackage{multirow}
\usepackage{url}
\ifx\hypersetup\undefined
\AtbBeginDocument{%
\hypersetup{unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
breaklinks=false,pdfborder={0 0 0},pdfborderstyle={},backref=false,colorlinks=false}
}
\else
\hypersetup{unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
breaklinks=false,pdfborder={0 0 0},pdfborderstyle={},backref=false,colorlinks=false}
\fi
\usepackage{breakurl}
\usepackage{color}
\usepackage{times}
\usepackage{xcolor}
\usepackage{listings}
\lstset{
language=R,
keywordstyle=\color{blue!70}\bfseries,
basicstyle=\ttfamily,
commentstyle=\ttfamily,
showspaces=false,
showtabs=false,
frame=shadowbox,
rulesepcolor=\color{red!20!green!20!blue!20},
breaklines=true}
\setlength{\parskip}{.5em}
\title[BI476]{BI476: Biostatistics - Case Studies}
\subtitle[survival]{Lec08: Semiparametric Survival Data Analysis}
\author[Maoying Wu]{Maoying,Wu\\{\small [email protected]}}
\institute[CBB] % (optional, but mostly needed)
{
\inst{}
Dept. of Bioinformatics \& Biostatistics\\
Shanghai Jiao Tong University
}
\date{Spring, 2018}
\AtBeginSection[]
{
\begin{frame}<beamer>{Next Section ...}
\tableofcontents[currentsection]
\end{frame}
}
\begin{document}
\begin{CJK*}{UTF8}{gbsn}
<<setup, include=FALSE>>=
library(knitr)
opts_chunk$set(fig.path='figure/beamer-', fig.align='center',
fig.show='hold',size='footnotesize', out.width='0.60\\textwidth')
carcinoma <- read.table("data/carcinoma-ct-it.txt", header=T)
@
\begin{frame}
\titlepage
\end{frame}
\begin{frame}
\frametitle{Outline}
\tableofcontents
\end{frame}
\section{Semiparametric: Cox proportional hazard regression}
\begin{frame}[t,containsverbatim]
\frametitle{Semiparametric modeling}
\begin{itemize}
\item A semiparametric model contains two portions:
\begin{itemize}
\item Nonparametric portion: The underlying survival
distribution.
\item Parametric portion: The way in which covariates
affect that underlying distribution.
\end{itemize}
\item Two popular semiparametric models for survival data
regression
\begin{itemize}
\item Proportional hazard (PH) framework
\item Accelerated failure time (AFT) framework
\end{itemize}
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Semiparametric Models: Introduction}
\begin{itemize}
\item For the AFT model:
$$
Y_i = x_i^T \beta + W_i, \mbox{ where } Y_i = \log T_i
$$
\item The parametric aspect of the model lies in the $x_i^T
\beta$ portion, while the nonparametric aspect
involves assuming that $W_i \stackrel{\text{i.i.d}}{\sim} F$,
where $F$ is some generic, unspecified distribution.
\item That is, $\beta$ should be inferred without depending
on a specific distribution $F$.
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Cox proportional hazard regression}
\begin{itemize}
\item Proposed by David Cox (1972)
\item The model is specified in terms of the hazard function
instead of the survival function
\item Assumption of changes in the concomitant variable
corresponding to multiplicative changes in the hazard
function.
\item Additive changes in the log of hazard function.
\end{itemize}
$$
h(t | X) = \exp(X\beta) h_0(t)
$$
where
\begin{itemize}
\item $X$ is a vector of concomitant, covariate or regressor
information $X = (x_1, x_2, \dots, x_p)$
\item $\beta$ is the column vector of parameters
\item $\exp(X\beta)$ is the \textbf{parametric portion}.
\item $h_0(t)$ (\textbf{nonparametric portion}) is the
time-specific baseline hazard function, referred to
as a homogeneous or baseline hazard function.
\end{itemize}
$$
\frac{h(t|X_1)}{h(t|X_2)} = \frac{\exp(X_1\beta)}{\exp(X_2\beta)}
$$
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Special case: Two observations}
\begin{itemize}
\item Assume that we have only two observations, $T_1$ and
$T_2$, with hazard functions $\lambda_1(t)$ and
$\lambda_2(t)$
\item Suppose that the first failure occurs at time $t$, how
likely is the failure subject 1?
\item \textbf{Proposition}
$$
\mathbb{P}(T_1=t | T_{(1)} = t) = \frac{\lambda_1(t)}{\lambda_1(t)+\lambda_2(t)}
$$
\item Under the PH assumption, we have
$$
\mathbb{P}(T_1 < T_2) = \frac{\exp(x_1^T \beta)}{\exp(x_1^T \beta) + \exp(x_2^T \beta)}
$$
Now the baseline hazard, $\lambda_0(t)$, is canceled out.
\item This can be extended to multiple subjects \textbf{without censoring}:
$$
\mathbb{P}(T_1 < T_2 < \dots < T_J) = \prod_{j=1}^J \frac{\exp{x_j^T \beta}}{\sum_{k=j}^J \exp(x_k^T \beta)}
$$
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{PH model with censoring}
\begin{itemize}
\item The probability that subject $j$ fails at time $t$ given
that one of the subjects from the risk set $R(t)$
failed at time $t$ is
$$
\frac{\exp(x_j^T \beta)}{\sum_{k \in R(t)} \exp(x_k^T \beta)}
$$
\item Therefore, the full likelihood is
$$
L(\beta) = \prod_j \frac{\exp(x_j^T \beta)}{\sum_{k \in R(t)} \exp(x_k^T \beta)}
$$
\item This is not exactly a likelihood, but a \textbf{partial
likelihood}.
\item But the partial likelihood, \textbf{Cox partial
likelihood} still yields a score with mean
zero and a variance given by the negative Hessian
matrix.
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Cox proportional hazard regression estimator}
The estimation is based on the principle of \textbf{maximum partial
likelihood} through working out the score vector and Hessian matrix and
then applying an \textbf{iterative Newton-Raphson procedure}.
$$
L(\beta) = \prod_j \frac{\exp(x_j^T \beta)}{\sum_{k \in R(t_j)} \exp(x_k^T \beta)}
$$
\begin{itemize}
\item The denominator can be written as
$$
\sum_{k=1}^n Y_k(t_j) \exp(x_k^T \beta)
$$
where
$$
Y_i(t) = \begin{cases}
1 & \mbox{subject } i \mbox{ is at risk at time } t\\
0 & \mbox{otherwise}
\end{cases}
$$
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Cox's PH Model Inference}
Let
$$
w_j = \exp(\mathbf{x}^T \beta), W_i = \sum_{j \in R_i} w_j
$$
then
$$
L(\beta) = \prod_i \left\{ \frac{w_i}{\sum_{R_i} w_j}\right\}^{d_i}
$$
where $w_i$ is the relative probability of failure for subject $i$.
With these, we can compute the absolute probability of failure for subject $i$ at time $t_j$:
$$
\pi_{ij} = Y_i(t_j) \frac{w_i}{W_j}
$$
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Score function}
The partial log-likelihood is:
$$
\begin{array}{lcl}
l &=& \sum_i d_i \log w_i - \sum_i d_i \log W_i\\
&=& \sum_i d_i \eta_i - \sum_i d_i \log W_i
\end{array}
$$
Note that $W_i$ contains many $\eta$ terms in addition to $\eta_i$.
Then the score equation is obtained by maximizing the log-likelihood function:
$$
\frac{\partial l}{\partial \beta} = 0
$$
we start by
$$
\frac{\partial l}{\partial \eta_k} = d_k - \sum_i \pi_{ki} d_i
$$
which can be written as the matrix-form:
$$
\mathbf{u(\eta) = d - Pd}
$$
where $\mathbf{P} \in \mathbb{R}^{n \times n} = (\pi_{ij})_{n \times n}$
Then we can write:
$$
\frac{\partial l}{\partial \beta} = \frac{\partial l}{\partial \eta} \frac{\partial \eta}{\partial \beta} = \mathbf{X^T (d - Pd)}
$$
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Hessian function}
The hessian function can be obtained by:
$$
\begin{array}{lcl}
\frac{\partial^2 l}{\partial \eta_k^2} &=& - \sum_i d_i \pi_{ki} (1-\pi_{ki})\\
\frac{\partial^2 l}{\partial \eta_k \partial \eta_j} &=& \sum_i d_i \pi_{ki} \pi_{ji}
\end{array}
$$
and the Hessian matrix can be computed as
$$
\begin{array}{lcl}
H(\beta) &=& -\mathbf{X^T W X}\\
&=& - \sum_j \sum_k \pi_{kj} (x_k - E_j x) (x_k - E_j x)^T
\end{array}
$$
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Newton-Raphson Updates}
$$
\hat{\beta}^{(m+1)} = \mathbf{(X^T W X)^{-1} X^T (d - Pd)} + \hat{\beta}^{(m)}
$$
\begin{lstlisting}
for (i in 1:m){
eta <- X %*% b
haz <- as.numeric(exp(eta)) # w[i]
rsk <- rev(cumsum(rev(haz))) # W[i]
P <- outer(haz, rsk, '/')
P[upper.tri(P)] <- 0
W <- - P %*% diag(d) %*% t(1-P)
diag(W) <- diag(P %*% diag(P) %*% t(1-P))
b <- solve(t(X) %*% W %*% X) %*% t(X) %*% (d - P%*%d) + b
}
\end{lstlisting}
The above R code assumes that the data has been sorted by time on study, and
assumes that no ties are present.
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Fit Cox Regression Model}
<<eval=TRUE>>=
# fit Cox
fit.Cox <- coxph(Surv(Time, Status==0) ~ TRT, data=carcinoma)
summary(fit.Cox)
@
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Fit Cox Regression Model with Covariate}
<<eval=TRUE>>=
# fit Cox
fit.Cox.age <- coxph(Surv(Time, Status==0) ~ TRT + Age, data=carcinoma)
summary(fit.Cox.age)
@
\end{frame}
\subsection{Cox Model inference}
\begin{frame}[t,containsverbatim]
\frametitle{Cox model inference}
\begin{itemize}
\item Wald test
\item Score test
\item Likelihood ratio test (LRT)
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Wald test}
\begin{itemize}
\item Wald inference is based on the asymptotic result:
$$
\hat{\beta} \stackrel{\cdot}{\sim} N(\beta, (X^T W X)^{-1}),
$$
\item The confidence interval are constructed using $\hat{\beta} \pm z_{\alpha/2} \sqrt{I_{jj}^{-1}}$
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Likelihood ratio test}
Let $\hat{\beta}_0$ denote the fit of the first model and $\hat{\beta}_1$ denote the fit
of the second model, the likelihood ratio test (which is only requires fitting two nested
models) is based on
$$
2 (l(\hat{\beta}_1) - l(\hat{\beta}_0)) \stackrel{\cdot}{\sim} \chi_k^2
$$
where $k$ is the difference of number of parameters for the two models.
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Score test}
The score test statistic for testing $H_0: \beta = 0$ is
$$
\begin{array}{lcl}
u(0) &=& \sum_j (x_j - E_j x)\\
&=& \sum_j \left(d_{1j} - d_j \frac{n_{1j}}{n_j} \right)
\end{array}
$$
The score test is in some sense equivalent to the log-rank test, although the variances
are calculated differently and therefore do not produce the exact same $p$-value.
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{coxph function}
\begin{itemize}
\item The function for fitting Cox proportional hazards models in
\texttt{survival} package.
\item The syntax is similar to other model-fitting functions in R:
\begin{lstlisting}
fit <- coxph(Surv(time, status) ~ treatment + stage + hepato, pbc)
\end{lstlisting}
\item Several functions on the \texttt{coxph} object:
\begin{itemize}
\item \texttt{coef(fit)}: return the MLE of the coefficient vector.
\item \texttt{vcov(fit)}: return the inverse of the information matrix
$\mathbf{(X^T W X)^{-1}}$
\item \texttt{model.matrix(fit)}: return $X$.
\item The Wald, score, LRT tests are testing the global hypotheses $H_0:
\mathbf{\beta} = 0$
\item \texttt{anova(fit0, fit1)}
\item \texttt{loglik(fit), AIC(fit), BIC(fit)}
\item \texttt{predict(fit)}
\end{itemize}
\end{itemize}
\end{frame}
\section{Cox PH Model Assessment}
\begin{frame}[t,containsverbatim]
\frametitle{Assesssment of the Cox Model}
The Cox (PH) model:
$$
\lambda(t| Z(t)) = \lambda_0(t) \exp(Z(t)\beta)
$$
Assumption of this model:
\begin{enumerate}[(1)]
\item the regression effect of $\beta$ is constant over time (PH assumption)
\item linear combination of the covariates (including possibly higher order
terms, interactions)
\item the link function is exponential
\end{enumerate}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Using residuals for checking assumptions}
In the regression analysis, the residuals are often used to check the assumptions.
However, residuals for survival data are somewhat different from those for the other
type of data models, mainly due to the existence of censoring:
\begin{itemize}
\item Generalized (Cox-Snell) residuals
\item \textbf{Schoenfeld residuals}
\item Martingale residuals
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{\texttt{cox.zph} function}
<<eval=TRUE,echo=TRUE>>=
check <- cox.zph(fit.Cox, transform="log", global=TRUE)
print(check)
plot(check)
@
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Summary}
Test of proportionality. The \texttt{cox.zph} function will test proportionality of
all the predictors in the model by creating interactions with time using the
transformation of time specified in the transform option. In this example we are
testing proportionality by looking at the interactions with \textbf{log(time)}. The
column \textbf{rho} is the Pearson product-moment correlation between the scaled
Schoenfeld residuals and \textbf{log(time)} for each covariate. The last row contains
the global test for all the interactions tested at once. A $p$-value less than 0.05
indicates a violation of the proportionality assumption.
\end{frame}
\section{Competing Risk}
\begin{frame}[t,containsverbatim]
\frametitle{Stem cell transplant for acute leukemia}
In this clinical trial, 117 acute leukemia patients received stem cell
transplant. The aim of the analysis was to estimate the cumulative
incidence of relapse in the presence of transplant-related death, which
deals with competing events. The effect on relapse of predictive
factors and covariates such as Sex, Disease (lymphoblastic or
myeloblastic leukemia), Phase at transplant (Relapse, CR1, CR2, CR3),
Source of stem cells (bone marrow and peripheral blood (BM+PB), or
peripheral blood (PB), and Age will be evaluated.
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Importing the data}
<<eval=TRUE>>=
bmt <- read.csv("data/bmtcrr.csv", header=TRUE)
head(bmt)
@
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Type-specific hazard}
\begin{itemize}
\item Assume that we have multiple failure types
\item Let $T$ denote the time until failure, and $J$ indicate
the type of failure
\item The type-specific hazard is then defined as
$$
\lambda_j(t) = \lim_{h \to 0} \frac{\mathcal{P}(t \le T < t + h, J=j|T \ge t)}{h}
$$
\item Then the overall hazard is
$$
\lambda(t) = \sum_j \lambda_j(t)
$$
\item Therefore the overall survival becomes
$$
S(t) = \exp\left\{ -\int_0^t \lambda(s) ds \right\}
$$
\end{itemize}
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Recoding the covariates}
<<eval=TRUE>>=
attach(bmt)
source("code/crr-addson.R")
x <- cbind(Age, factor2ind(Sex, "M"), factor2ind(D, "ALL"),
factor2ind(Phase, "Relapse"), factor2ind(Source))
head(x)
@
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Competing risk modeling}
The main function to fit regression models for competing risks data is
\texttt{crr()}, which is contained in the \texttt{cmprsk} package. In the simplest
form it requires
\begin{itemize}
\item a vector of follow-up times,
\item a vector of status with a code for each failure type or censoring,
\item a matrix of fixed covariates.
\end{itemize}
By default, the censoring code for status is set by the optional argument \texttt{cencode=0},
and the code that denotes the failure type of interest is set by the optional argument
\texttt{failcode=1}.
In our example, transplant-related death, which is the competing event, is coded with 2.
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Regresison model for relapse}
<<eval=TRUE>>=
mod1 <- crr(ftime, Status, x)
summary(mod1)
@
\end{frame}
\begin{frame}[t,containsverbatim]
\frametitle{Analysis of Results}
\begin{itemize}
\item The first part of the output shows for each term in the
design matrix
\begin{itemize}
\item the estimated coefficient $\hat{\beta}_j$
\item the relative risk $\exp(\hat{\beta}_j)$
\item the standard error
\item the $z$-value
\item the corresponding $P$-value
\end{itemize}
\item For Sex,
\begin{itemize}
\item $\hat{\beta}_{female} = -0.0352$
\item relative risk $0.965$ is the ratio of risk for
female w.r.t. the male, with all other
covariates equal.
\item $p$-value of 0.9000 indicates no significant
effect.
\item for age, the relative risk of 0.982 is the
relative risk for a 1 year increase in age.
\end{itemize}
\end{itemize}
\end{frame}
\end{CJK*}
\end{document}