Skip to content

Commit

Permalink
Add RTX prediction notebook (#18)
Browse files Browse the repository at this point in the history
* Ignore private data

* Add exploratory data analyses for RTX

* Add notebook for predicting response from baseline samples (RTX)
  • Loading branch information
jaclyn-taroni authored Jun 24, 2018
1 parent ca9bb19 commit 2e5cd97
Show file tree
Hide file tree
Showing 6 changed files with 1,013 additions and 0 deletions.
266 changes: 266 additions & 0 deletions 25-predict_response.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
---
title: "First pass at predicting response from baseline samples in RTX trial"
output: html_notebook
---

**J. Taroni 2018**

## Install `caret`

```{r}
# devtools::install_github('topepo/caret/pkg/caret',
# ref = "6546939345fe10649cefcbfee55d58fb682bc902")
# devtools::install_version("e1071", version = "1.6-8")
```


## Functions and directory set up

```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
```

```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "25")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "25")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```

## Read in data

#### Covariates

```{r}
covariate.df <- readr::read_tsv(file.path("data", "rtx",
"RTX_full_covariates.tsv"))
```

#### Gene expression data

This is gene-level expression data that has been vst-transformed and filtered
to only genes that are in the recount2 PLIER model.

```{r}
exprs <- readRDS(file.path("data", "rtx", "VST_blind_filtered.RDS"))
```

#### recount2 `B`

The multiPLIER approach

```{r}
recount.b <- readRDS(file.path("data", "rtx", "RTX_recount2_B.RDS"))
```

#### RTX PLIER model

```{r}
rtx.plier <- readRDS(file.path("data", "rtx", "RTX_PLIER_model.RDS"))
rtx.b <- rtx.plier$B
```


## LASSO

### Prep data

First, we'll change the sample names to match the barcodes in the covariates.
The first six characters of the current column/sample names should correspond
to a barcode.

```{r}
# in the expression data
colnames(exprs) <- substr(colnames(exprs), start = 1, stop = 6)
all(covariate.df$barcode == colnames(exprs))
```

```{r}
# in the recount B data
colnames(recount.b) <- substr(colnames(recount.b), start = 1, stop = 6)
all(covariate.df$barcode == colnames(recount.b))
```

```{r}
# in the RTX B
colnames(rtx.b) <- substr(colnames(rtx.b), start = 1, stop = 6)
all(covariate.df$barcode == colnames(rtx.b))
```

The `mainclass` column in `covariate.df` is what we are interested in
predicting; it contains whether or not a patient is a _nonresponder_ or
_responder_ (divided into _tolerant_ or _nontolerant_ depending on, I believe,
long-term outcome) to treatment.
(We'll exclude samples with `NA` in this column.)

We'll want to try and predict this from baseline samples
(`covariate.df$timepoint == "BL"`).
We will not be adjusting for covariates at this point.
The earlier publications on this trial suggest that the majority of
covariates have no significant association with response.

Let's take a look at the sample size and class balance.

```{r}
table(covariate.df$mainclass, covariate.df$timepoint)
```

We can see that there are `r sum(covariate.df$timepoint == "BL")` baseline
samples and that the three classes (`Nonresponder`, `Nontolerant`, and
`Tolerant`) are pretty balanced.
If we use these three classes, we can likely use a metric like total accuracy
to evaluate performance.
Also, the small sample size lends itself to leave-one-out cross-validataion
(LOOCV).

```{r}
# Do all baseline samples have response labels? No, one is NA
baseline.covariate.df <- covariate.df %>%
dplyr::filter(timepoint == "BL") %>%
dplyr::select(c("barcode", "timepoint", "mainclass")) %>%
dplyr::filter(complete.cases(.))
```

```{r}
# we only want the baseline samples with a class label
baseline.samples <- baseline.covariate.df$barcode
```

```{r}
baseline.exprs <- t(exprs[, which(colnames(exprs) %in% baseline.samples)])
dim(baseline.exprs)
```

```{r}
recount.baseline.b <-
t(recount.b[, which(colnames(recount.b) %in% baseline.samples)])
dim(recount.baseline.b)
```

```{r}
rtx.baseline.b <- t(rtx.b[, which(colnames(rtx.b) %in% baseline.samples)])
dim(rtx.baseline.b)
```

```{r}
all(rownames(recount.baseline.b) == baseline.covariate.df$barcode)
```

```{r}
all(rownames(baseline.exprs) == baseline.covariate.df$barcode)
```

```{r}
all(rownames(rtx.baseline.b) == baseline.covariate.df$barcode)
```


### Prediction

```{r}
set.seed(12345)
```


#### Expression data

```{r}
exprs.results <- glmnet::cv.glmnet(x = baseline.exprs,
y = baseline.covariate.df$mainclass,
type.measure = "class",
family = "multinomial",
nfolds = nrow(baseline.exprs)) # LOOCV
saveRDS(exprs.results, file.path(results.dir, "expression_cv.glmnet.RDS"))
```

```{r}
exprs.predicted.labels <- stats::predict(exprs.results,
baseline.exprs,
s = exprs.results$lambda.1se,
type = "class")
caret::confusionMatrix(data = as.factor(exprs.predicted.labels),
reference = as.factor(baseline.covariate.df$mainclass))
```

#### recount2 `B`

```{r}
recount.b.results <- glmnet::cv.glmnet(x = recount.baseline.b,
y = baseline.covariate.df$mainclass,
type.measure = "class",
family = "multinomial",
nfolds = nrow(recount.baseline.b)) # LOOCV
saveRDS(recount.b.results, file.path(results.dir, "recount2_B_cv.glmnet.RDS"))
```

```{r}
recount.b.predicted.labels <- stats::predict(recount.b.results,
recount.baseline.b,
s = recount.b.results$lambda.1se,
type = "class")
caret::confusionMatrix(data = as.factor(recount.b.predicted.labels),
reference = as.factor(baseline.covariate.df$mainclass))
```

#### RTX `B`

```{r}
rtx.b.results <- glmnet::cv.glmnet(x = rtx.baseline.b,
y = baseline.covariate.df$mainclass,
type.measure = "class",
family = "multinomial",
nfolds = nrow(rtx.baseline.b)) # LOOCV
saveRDS(rtx.b.results, file.path(results.dir, "RTX_B_cv.glmnet.RDS"))
```

```{r}
rtx.b.predicted.labels <- stats::predict(rtx.b.results,
rtx.baseline.b,
s = rtx.b.results$lambda.1se,
type = "class")
caret::confusionMatrix(data = as.factor(rtx.b.predicted.labels),
reference = as.factor(baseline.covariate.df$mainclass))
```

### Plotting accuracy

```{r}
acc.df <- data.frame(Model = c("Expression", "RTX LVs", "multiPLIER LVs"),
Accuracy = c(0.9444, 1, 0.3889),
Lower = c(0.8134, 0.9026, 0.2314),
Upper = c(0.9932, 1, 0.5654))
```

```{r}
acc.df %>%
ggplot2::ggplot() +
ggplot2::geom_pointrange(mapping = ggplot2::aes(x = Model, y = Accuracy,
ymin = Lower, ymax = Upper)) +
ggplot2::theme_bw() +
ggplot2::labs(title = "Predicting response with LASSO") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold")) +
ggplot2::theme(text = ggplot2::element_text(size = 15))
```

```{r}
ggplot2::ggsave(file.path(plot.dir, "total_accuracy_CI.pdf"),
plot = ggplot2::last_plot())
```

I wonder if the poor performance in the case of the multiPLIER LVs could be due
to a smaller range of values.

```{r}
summary(as.vector(baseline.exprs))
```

```{r}
summary(as.vector(recount.baseline.b))
```

```{r}
summary(as.vector(rtx.baseline.b))
```
738 changes: 738 additions & 0 deletions 25-predict_response.nb.html

Large diffs are not rendered by default.

Binary file added plots/25/total_accuracy_CI.pdf
Binary file not shown.
3 changes: 3 additions & 0 deletions results/25/RTX_B_cv.glmnet.RDS
Git LFS file not shown
3 changes: 3 additions & 0 deletions results/25/expression_cv.glmnet.RDS
Git LFS file not shown
3 changes: 3 additions & 0 deletions results/25/recount2_B_cv.glmnet.RDS
Git LFS file not shown

0 comments on commit 2e5cd97

Please sign in to comment.