Commit

Merge pull request carpentries-incubator/issues/166 from alanocallaghan/captions-and-coefs

Captions and coefs
ailithewing authored Apr 2, 2024
2 parents 67549ee + a204ef7 commit 587caf6
Showing 9 changed files with 154 additions and 146 deletions.
Binary file modified data/coefHorvath.rds
18 changes: 18 additions & 0 deletions data/cres.R
@@ -0,0 +1,18 @@
# These two functions are to help us make crescents. Don't worry if you do not understand all this code.
# The important bit is the object "cres", which consists of two columns (x and y coordinates of two crescents).
is.insideCircle <- function(co, r=0.5, offs=c(0,0)){
sqrt((co[,1]+offs[1])^2 + (co[,2]+offs[2])^2) <= r
}
make.crescent <- function(n){
raw <- cbind(x=runif(n)-0.5, y=runif(n)-0.5)
raw[is.insideCircle(raw) & !is.insideCircle(raw, offs=c(0, -0.2)),]
}
# make x/y data in shape of two crescents
set.seed(123)
cres1 <- make.crescent(1000) # 1st crescent
cres2 <- make.crescent(1000) # 2nd crescent
cres2[,2] <- -cres2[,2] -0.1 # flip 2nd crescent upside-down and shift down
cres2[,1] <- cres2[,1] + 0.5 # shift second crescent to the right

cres <- rbind(cres1, cres2) # concatenate x/y values
saveRDS(cres, here::here("data/cres.rds"))
Binary file added data/cres.rds
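A quick way to see what the new `data/cres.R` script produces is to plot the saved coordinates. This is an editorial sketch, not part of the commit; it assumes the script above has been run so that `data/cres.rds` exists.

```r
# Sketch: visualise the two crescents written by data/cres.R
cres <- readRDS(here::here("data/cres.rds"))
plot(cres, pch = 16, cex = 0.5, asp = 1,
     xlab = "x", ylab = "y", main = "Two interleaved crescents")
```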
12 changes: 6 additions & 6 deletions episodes/01-introduction-to-high-dimensional-data.Rmd
@@ -142,18 +142,18 @@ of the challenges we are facing when working with high-dimensional data.
> >
> >
> > ```{r dim-prostate, eval = FALSE}
> > dim(prostate) #print the number of rows and columns
> > dim(prostate) # print the number of rows and columns
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(prostate) # examine the variable names
> > head(prostate) #print the first 6 rows
> > names(prostate) # examine the variable names
> > head(prostate) # print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
> > names(prostate) #examine column names
> > ```{r pairs-prostate, fig.cap="Pairwise plots of the 'prostate' dataset.", fig.alt="A set of pairwise scatterplots of variables in the 'prostate' dataset, namely lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa. The plots are shown in a grid."}
> > names(prostate) # examine column names
> >
> > pairs(prostate) #plot each pair of variables against each other
> > pairs(prostate) # plot each pair of variables against each other
> > ```
> > The `pairs()` function plots relationships between each of the variables in
> > the `prostate` dataset. This is possible for datasets with smaller numbers
47 changes: 19 additions & 28 deletions episodes/02-high-dimensional-regression.Rmd
Expand Up @@ -94,7 +94,7 @@ methyl_mat <- assay(methylation)
The distribution of these M-values looks like this:

```{r histx, fig.cap="Methylation levels are generally bimodally distributed.", fig.alt="Histogram of M-values for all features. The distribution appears to be bimodal, with a large number of unmethylated features as well as many methylated features, and many intermediate features."}
hist(methyl_mat, breaks = "FD", xlab = "M-value")
hist(methyl_mat, xlab = "M-value")
```

You can see that there are two peaks in this distribution, corresponding
@@ -105,7 +105,11 @@ sample-level metadata we have relating to these data. In this case, the
metadata, phenotypes, and groupings in the `colData` look like this for
the first 6 samples:

```{r datatable}
```{r, eval=FALSE}
head(colData(methylation))
```

```{r datatable, echo=FALSE}
knitr::kable(head(colData(methylation)), row.names = FALSE)
```
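As a brief aside (an editorial sketch, not part of the commit): in a `SummarizedExperiment` such as `methylation`, the rows of `colData()` correspond to the columns of the assay matrix, which is why the sample metadata shown above can be matched to the M-values.

```r
# Sketch: sample metadata rows line up with assay columns
all(rownames(colData(methylation)) == colnames(assay(methylation)))
```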

@@ -1029,15 +1033,10 @@ conservative, especially with a lot of features!
```{r p-fwer, fig.cap="Bonferroni correction often produces very large p-values, especially with low sample sizes.", fig.alt="Plot of Bonferroni-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
library("ggplot2")
ggplot() +
aes(p_raw, p_fwer) +
geom_point() +
scale_x_log10() + scale_y_log10() +
geom_abline(slope = 1, linetype = "dashed") +
geom_hline(yintercept = 0.05, linetype = "dashed", col = "red") +
geom_vline(xintercept = 0.05, linetype = "dashed", col = "red") +
labs(x = "Raw p-value", y = "Bonferroni p-value")
plot(p_raw, p_fwer, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
```
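As an aside (editorial sketch, assuming `p_raw` from the chunk above): the Bonferroni adjustment computed by `p.adjust()` is simply each raw p-value multiplied by the number of tests, capped at one, which is why so many adjusted values end up exactly at one.

```r
# Sketch: Bonferroni adjustment is p multiplied by the number of tests, capped at 1
n_tests <- length(p_raw)
all.equal(pmin(p_raw * n_tests, 1), p.adjust(p_raw, method = "bonferroni"))
```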

You can see that the p-values are exactly one for the vast majority of
@@ -1090,7 +1089,7 @@ experiment over and over.
> > \frac{0.05}{100} = 0.0005
> > $$
> >
> > 2. Trick question! We can't say what proportion of these genes are
> > 2. We can't say what proportion of these genes are
> > truly different. However, if we repeated this experiment and
> > statistical test over and over, on average 5% of the results
> > from each run would be false discoveries.
@@ -1100,25 +1099,17 @@ experiment over and over.
> >
> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
> > p_fdr <- p.adjust(p_raw, method = "BH")
> > ggplot() +
> > aes(p_raw, p_fdr) +
> > geom_point() +
> > scale_x_log10() + scale_y_log10() +
> > geom_abline(slope = 1, linetype = "dashed") +
> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
> > labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
> > plot(p_raw, p_fdr, pch = 16, log="xy")
> > abline(0:1, lty = "dashed")
> > abline(v = 0.05, lty = "dashed", col = "red")
> > abline(h = 0.05, lty = "dashed", col = "red")
> > ```
> >
> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
> > ggplot() +
> > aes(p_fdr, p_fwer) +
> > geom_point() +
> > scale_x_log10() + scale_y_log10() +
> > geom_abline(slope = 1, linetype = "dashed") +
> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
> > plot(p_fwer, p_fdr, pch = 16, log="xy")
> > abline(0:1, lty = "dashed")
> > abline(v = 0.05, lty = "dashed", col = "red")
> > abline(h = 0.05, lty = "dashed", col = "red")
> > ```
> >
> {: .solution}
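The Benjamini-Hochberg adjustment used in the solution above can also be written out by hand. This editorial sketch (assuming `p_raw` is still available) mirrors what `p.adjust(..., method = "BH")` does internally.

```r
# Sketch: manual Benjamini-Hochberg adjustment
# For sorted p-values p_(1) <= ... <= p_(n), the adjusted value for p_(i) is
# the minimum over j >= i of (n / j) * p_(j), capped at 1.
n <- length(p_raw)
o <- order(p_raw, decreasing = TRUE)
manual_bh <- pmin(1, cummin(n / (n:1) * p_raw[o]))[order(o)]
all.equal(manual_bh, p.adjust(p_raw, method = "BH"))
```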
83 changes: 47 additions & 36 deletions episodes/03-regression-regularisation.Rmd
@@ -124,8 +124,8 @@ when we have more features than observations.
> `methylation` dataset (we selected 500 features only as it can be hard to
> visualise patterns when there are too many features!).
>
> ```{r corr-mat-meth, fig.cap="Cap", fig.alt="Alt"}
library("ComplexHeatmap")
> ```{r corr-mat-meth, fig.cap="Heatmap of pairwise feature-feature correlations between CpG sites in DNA methylation data", fig.alt="A symmetrical heatmap where rows and columns are features in a DNA methylation dataset. Colour corresponds to correlation, with red being large positive correlations and blue being large negative correlations. There are large blocks of deep red and blue throughout the plot."}
> library("ComplexHeatmap")
> small <- methyl_mat[, 1:500]
> cor_mat <- cor(small)
> Heatmap(cor_mat,
@@ -188,9 +188,9 @@ different combinations of slope and intercept accomplish this objective.
library("viridis")
set.seed(42)
noise_sd <- 1
nobs <- 50
slope <- 2
intercept <- 2
nobs <- 200
slope <- 1
intercept <- 3
maxlim <- max(abs(slope), abs(intercept)) * 2
maxlim <- max(maxlim, 5)
@@ -295,7 +295,11 @@ avoid estimating very large effect sizes.
> > "ordinary least squares". The "ordinary" really means "original" here,
> > to distinguish between this method, which dates back to ~1800, and some
> > more "recent" (think 1940s...) methods.
> > 2. Least squares assumes that, when we account for the change in the mean of
> > 2. Squared error is useful because it ignores the *sign* of the residuals
> > (whether they are positive or negative).
> > It also penalises large errors much more than small errors.
> > On top of all this, it also makes the solution very easy to compute mathematically.
> > Least squares assumes that, when we account for the change in the mean of
> > the outcome based on changes in the income, the data are normally
> > distributed. That is, the *residuals* of the model, or the error
> > left over after we account for any linear relationships in the data,
@@ -331,29 +335,36 @@ coefficient values minimise the training error, but they don't minimise the
test error on unseen data. First, we'll go through an example of what exactly
this means.

For the next few challenges, we'll work with a set of features
known to be associated with age from a paper by Horvath et al.[^2]. Horvath et al
use methylation markers alone to predict the biological age of an individual.
This is useful in studying age-related disease amongst many other things.


To compare the training and test errors for a model for age using the Hovarth data, we'll split the
To compare the training and test errors for a model of methylation features and age, we'll split the
data into training and test sets, fit a linear model and calculate the errors. First, let's
calculate the training error. Let's start by splitting the data into training and test sets:



```{r coefhorvath}
coef_horvath <- readRDS(here::here("data/coefHorvath.rds"))
```{r data}
methylation <- readRDS(here::here("data/methylation.rds"))
library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))
```
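One small sanity check worth noting (editorial sketch, not part of the commit): after transposing the assay, samples are the rows of `methyl_mat`, so its row count should match the length of `age`.

```r
# Sketch: one row per sample after transposing the assay matrix
nrow(methyl_mat) == length(age)
```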

We will then subset the data to a set of CpG markers that are known to be associated with
age from a previous study by Horvath et al.[^2].

```{r echo=FALSE, eval=FALSE}
# these CpGs were obtained using:
dput(coef_horvath$CpGmarker[1:20])
```

```{r coefhorvath}
cpg_markers <- c("cg16241714", "cg14424579", "cg22736354", "cg02479575", "cg00864867",
"cg25505610", "cg06493994", "cg04528819", "cg26297688", "cg20692569",
"cg04084157", "cg22920873", "cg10281002", "cg21378206", "cg26005082",
"cg12946225", "cg25771195", "cg26845300", "cg06144905", "cg27377450"
)
coef_horvath <- coef_horvath[1:20, ]
features <- coef_horvath$CpGmarker
horvath_mat <- methyl_mat[, features]
horvath_mat <- methyl_mat[, cpg_markers]
## Generate an index to split the data
set.seed(42)
@@ -364,7 +375,6 @@ train_mat <- horvath_mat[train_ind, ]
train_age <- age[train_ind]
test_mat <- horvath_mat[-train_ind, ]
test_age <- age[-train_ind]
```

Now let's fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or "mean squared error" (MSE).
@@ -381,9 +391,9 @@ Now let's fit a linear model to our training data and calculate the training err
fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
## Function to calculate the (mean squared) error
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
## Calculate the training error
err_lm_train <- mse(train_age, fitted(fit_horvath))
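A small cross-check (editorial sketch, assuming `fit_horvath` and `err_lm_train` from the chunk above): the training MSE is the same as the mean of the squared residuals of the fitted model.

```r
# Sketch: training MSE equals the mean squared residual
all.equal(err_lm_train, mean(residuals(fit_horvath)^2))
```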
@@ -521,7 +531,7 @@ of 20 features (stored as `features`) selected earlier in this episode (these
are a subset of the features identified by Horvath et al.), using both
regularised and ordinary least squares.
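For reference (an editorial note; the exact scaling of lambda differs slightly in `glmnet`'s internal parameterisation), ridge regression minimises the least-squares criterion plus a squared-coefficient penalty:

$$
\min_{\beta_0,\, \beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

Larger values of lambda penalise large coefficients more heavily, which is what produces the shrinkage paths plotted below.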

```{r plot-ridge, fig.cap="Cap", fig.alt="Alt"}
```{r plot-ridge, fig.cap="A line plot of coefficient estimates against log lambda for a ridge regression model.", fig.alt="A line plot of coefficient estimates against log lambda for a ridge regression model. Lines are depicted in different colours, with coefficients generally having large values on the left of the plot (small log lambda) and moving smoothly and gradually towards zero to the right of the plot (large log lambda). Some coefficients appear to increase and then decrease in magnitude as lambda increases, or switch signs."}
library("glmnet")
## glmnet() performs scaling by default, supply un-scaled data:
@@ -553,7 +563,7 @@ regression predicts the test data better than the model with no regularisation.
Let's generate our predictions under the ridge regression model and calculate
the mean squared error in the test set:

```{r pred-ridge-lm, fig.width = 10, fig.cap="Cap", fig.alt="Alt"}
```{r pred-ridge-lm}
# Obtain a matrix of predictions from the ridge model,
# where each column corresponds to a different lambda value
pred_ridge <- predict(ridge_fit, newx = test_mat)
@@ -581,7 +591,7 @@ We can see where on the continuum of lambdas we've picked a model by plotting
the coefficient paths again. In this case, we've picked a model with fairly
modest coefficient shrinking.

```{r chooselambda, fig.cap="Cap", fig.alt="Alt"}
```{r chooselambda, fig.cap="A line plot of coefficient estimates against log lambda for a ridge regression model, showing the optimal value based on the minimal test error.", fig.alt="A line plot of coefficient estimates against log lambda for a ridge regression model. A dashed vertical line depicts the optimal lambda value towards the left of the plot. Lines are depicted in different colours, with coefficients generally having large values on the left of the plot (small log lambda) and moving smoothly and gradually towards zero to the right of the plot (large log lambda). Some coefficients appear to increase and then decrease in magnitude as lambda increases, or switch signs."}
chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)]
plot(ridge_fit, xvar = "lambda")
abline(v = log(chosen_lambda), lty = "dashed")
@@ -605,7 +615,7 @@ abline(v = log(chosen_lambda), lty = "dashed")
> > err_lm
> > ```
> > 2. The ridge ones are much less spread out with far fewer extreme predictions.
> > ```{r plot-ridge-prediction, fig.width = 10, fig.cap="Cap", fig.alt="Alt"}
> > ```{r plot-ridge-prediction, fig.width = 10, fig.cap="Two plots showing OLS predictions (left) and ridge regression predictions (right) of age (y) against true age (x).", fig.alt="Two plots showing OLS predictions and ridge regression predictions of age (y) against true age (x). A dashed line shows the line y=x. In the OLS plot, predictions are quite extreme, while in the ridge regression plot, they are generally more conservative."}
> > all <- c(pred_lm, test_age, pred_min_ridge)
> > lims <- range(all)
> > par(mfrow = 1:2)
@@ -648,8 +658,8 @@ drawplot <- function() {
set.seed(42)
noise_sd <- 1
nobs <- 200
slope <- 2
intercept <- 2
slope <- 1
intercept <- 3
maxlim <- max(abs(slope), abs(intercept)) * 2
maxlim <- max(maxlim, 5)
lims <- c(-maxlim, maxlim)
@@ -734,8 +744,9 @@ drawplot()
> > to the coefficients. When we instead plot the L1 norm on the x-axis,
> > increasing L1 norm means that we are allowing our
> > coefficients to take on increasingly large values.
> > ```{r plotlas, fig.width = 10, fig.height = 20, echo=FALSE}
> > par(mfrow = c(2, 1))
> >
> > ```{r plotlas, fig.width = 10, fig.cap="Line plots showing coefficient values from a LASSO model against log lambda (left) and L1 norm (right).", fig.alt="Two line plots side-by-side, showing coefficient values from a LASSO model against log lambda (left) and L1 norm (right). The coefficients, generally, suddenly become zero as log lambda increases or, equivalently, L1 norm decreases. However, some coefficients increase in size before decreasing as log lamdba increases."}
> > par(mfrow = c(1, 2))
> > plot(fit_lasso, xvar = "lambda")
> > plot(fit_lasso)
> > ```
@@ -771,7 +782,7 @@ that minimises the error across each of the test and training splits. In R:

```{r lasso-cv, fig.cap="Cross-validated mean squared error for different values of lambda under a LASSO penalty.", fig.alt="A plot of cross-validated mean squared error against log lambda for a LASSO model. Red points show the mean cross-validated error for each lambda and grey error bars show one standard error around them. Two dashed vertical lines mark the lambda with the lowest error and the largest lambda within one standard error of the minimum."}
# fit lasso model with cross-validation across a range of lambda values
lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)
lasso <- cv.glmnet(methyl_mat, age, alpha = 1)
plot(lasso)
# Extract the coefficients from the model with the lowest mean squared error from cross-validation
@@ -822,7 +833,7 @@ plot_elastic <- function(alpha) {
noise_sd <- 1
nobs <- 100
slope <- 1
intercept <- 1
intercept <- 3
norm <- 1
maxlim <- max(abs(slope), abs(intercept)) * 2
@@ -900,20 +911,20 @@ plot_elastic(0.75)
> > You can see that coefficients tend to go exactly to zero,
> > but the paths are a bit less
> > extreme than with pure LASSO; similar to ridge.
> > ```{r elastic}
> > ```{r elastic, fig.cap="Line plot showing coefficient values from an elastic net model against L1 norm.", fig.alt="A line plot showing coefficient values from an elastic net model against L1 norm. The coefficients, generally, suddenly become zero as L1 norm decreases. However, some coefficients increase in size before decreasing as L1 norm decreases."}
> > elastic <- glmnet(methyl_mat[, -1], age, alpha = 0.5)
> > plot(elastic)
> > ```
> > 2. The process of model selection is similar for elastic net models as for
> > LASSO models.
> > ```{r elastic-cv, fig.cap="Elastic", fig.alt="Alt"}
> > ```{r elastic-cv, fig.cap="A plot of the cross-validation mean squared error of an elastic net model against log lambda.", fig.alt="A plot depicting mean squared error (MSE) against discrete values of lambda, with red points showing the average mean squared error and grey error bars showing the 1 standard error interval around them. The MSE rises as log lambda increases, indicating a larger prediction error. Two dashed lines indicate the lambda value corresponding to the lowest overall MSE and the largest lambda with MSE within one standard error of the minimum."}
> > elastic_cv <- cv.glmnet(methyl_mat[, -1], age, alpha = 0.5)
> > plot(elastic_cv)
> > ```
> > 3. You can see that the coefficients from these two methods are broadly
> > similar, but the elastic net coefficients are a bit more conservative.
> > Further, more coefficients are exactly zero in the LASSO model.
> > ```{r elastic-plot, fig.cap="LASSO-Elastic", fig.alt="Alt"}
> > ```{r elastic-plot, fig.cap="Scatter plot of coefficient estimates from an elastic net model (y) against coefficient estimates from a LASSO model (x).", fig.alt="A scatter plot depicting coefficient estimates from elastic net and LASSO models. Generally, the elastic net model coefficients are more conservative, with one notable outlier, whereas the LASSO model coefficients are more frequently exactly equal to zero."}
> > coefe <- coef(elastic_cv, elastic_cv$lambda.1se)
> > sum(coefe[, 1] == 0)
> > sum(coefl[, 1] == 0)
@@ -999,7 +1010,7 @@ plot_elastic(0.75)
> The [package documentation](https://glmnet.stanford.edu/articles/glmnet.html)
> explains this in more detail.
>
> ```{r binomial, fig.cap = "Title", fig.alt = "Alt"}
> ```{r binomial, fig.cap="A plot of the cross-validation binomial deviance of a logistic regression elastic net model against log lambda.", fig.alt="A plot depicting binomial deviance against discrete values of lambda, with red points showing the average deviance and grey error bars showing the 1 standard error interval around them. The deviance decreases as log lambda increases, indicating a smaller prediction error. Two dashed lines indicate the lambda value corresponding to the lowest overall deviance and the largest lambda with deviance within one standard error of the minimum, with both at the far right of the plot, indicating an intercept-only model."}
> smoking <- as.numeric(factor(methylation$smoker)) - 1
> # binary outcome
> table(smoking)