Commit

Merge pull request carpentries-incubator/issues/166 from alanocallaghan/captions-and-coefs

Captions and coefs
ailithewing authored Apr 2, 2024
2 parents 67549ee + a204ef7 commit 587caf6
Showing 9 changed files with 154 additions and 146 deletions.
Binary file modified data/coefHorvath.rds
18 changes: 18 additions & 0 deletions data/cres.R
@@ -0,0 +1,18 @@
# These two functions are to help us make crescents. Don't worry if you do not understand all this code.
# The important bit is the object "cres", which consists of two columns (x and y coordinates of two crescents).
is.insideCircle <- function(co, r=0.5, offs=c(0,0)){
sqrt((co[,1]+offs[1])^2 + (co[,2]+offs[2])^2) <= r
}
make.crescent <- function(n){
raw <- cbind(x=runif(n)-0.5, y=runif(n)-0.5)
raw[is.insideCircle(raw) & !is.insideCircle(raw, offs=c(0, -0.2)),]
}
# make x/y data in shape of two crescents
set.seed(123)
cres1 <- make.crescent(1000) # 1st crescent
cres2 <- make.crescent(1000) # 2nd crescent
cres2[,2] <- -cres2[,2] -0.1 # flip 2nd crescent upside-down and shift down
cres2[,1] <- cres2[,1] + 0.5 # shift second crescent to the right

cres <- rbind(cres1, cres2) # concatenate x/y values
saveRDS(cres, here::here("data/cres.rds"))
Binary file added data/cres.rds
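A quick way to see what the new `data/cres.R` script produces is to plot the saved coordinates. This is an editorial sketch, not part of the commit; it assumes the script above has been run so that `data/cres.rds` exists.

```r
# Sketch: visualise the two crescents written by data/cres.R
cres <- readRDS(here::here("data/cres.rds"))
plot(cres, pch = 16, cex = 0.5, asp = 1,
     xlab = "x", ylab = "y", main = "Two interleaved crescents")
```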
12 changes: 6 additions & 6 deletions episodes/01-introduction-to-high-dimensional-data.Rmd
@@ -142,18 +142,18 @@ of the challenges we are facing when working with high-dimensional data.
> >
> >
> > ```{r dim-prostate, eval = FALSE}
> > dim(prostate) #print the number of rows and columns
> > dim(prostate) # print the number of rows and columns
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(prostate) # examine the variable names
> > head(prostate) #print the first 6 rows
> > names(prostate) # examine the variable names
> > head(prostate) # print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
> > names(prostate) #examine column names
> > ```{r pairs-prostate, fig.cap="Pairwise plots of the 'prostate' dataset.", fig.alt="A set of pairwise scatterplots of variables in the 'prostate' dataset, namely lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa. The plots are shown in a grid."}
> > names(prostate) # examine column names
> >
> > pairs(prostate) #plot each pair of variables against each other
> > pairs(prostate) # plot each pair of variables against each other
> > ```
> > The `pairs()` function plots relationships between each of the variables in
> > the `prostate` dataset. This is possible for datasets with smaller numbers
47 changes: 19 additions & 28 deletions episodes/02-high-dimensional-regression.Rmd
Expand Up @@ -94,7 +94,7 @@ methyl_mat <- assay(methylation)
The distribution of these M-values looks like this:

```{r histx, fig.cap="Methylation levels are generally bimodally distributed.", fig.alt="Histogram of M-values for all features. The distribution appears to be bimodal, with a large number of unmethylated features as well as many methylated features, and many intermediate features."}
hist(methyl_mat, breaks = "FD", xlab = "M-value")
hist(methyl_mat, xlab = "M-value")
```

You can see that there are two peaks in this distribution, corresponding
@@ -105,7 +105,11 @@ sample-level metadata we have relating to these data. In this case, the
metadata, phenotypes, and groupings in the `colData` look like this for
the first 6 samples:

```{r datatable}
```{r, eval=FALSE}
head(colData(methylation))
```

```{r datatable, echo=FALSE}
knitr::kable(head(colData(methylation)), row.names = FALSE)
```
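As a brief aside (an editorial sketch, not part of the commit): in a `SummarizedExperiment` such as `methylation`, the rows of `colData()` correspond to the columns of the assay matrix, which is why the sample metadata shown above can be matched to the M-values.

```r
# Sketch: sample metadata rows line up with assay columns
all(rownames(colData(methylation)) == colnames(assay(methylation)))
```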

@@ -1029,15 +1033,10 @@ conservative, especially with a lot of features!
```{r p-fwer, fig.cap="Bonferroni correction often produces very large p-values, especially with low sample sizes.", fig.alt="Plot of Bonferroni-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
library("ggplot2")
ggplot() +
aes(p_raw, p_fwer) +
geom_point() +
scale_x_log10() + scale_y_log10() +
geom_abline(slope = 1, linetype = "dashed") +
geom_hline(yintercept = 0.05, linetype = "dashed", col = "red") +
geom_vline(xintercept = 0.05, linetype = "dashed", col = "red") +
labs(x = "Raw p-value", y = "Bonferroni p-value")
plot(p_raw, p_fwer, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
```
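As an aside (editorial sketch, assuming `p_raw` from the chunk above): the Bonferroni adjustment computed by `p.adjust()` is simply each raw p-value multiplied by the number of tests, capped at one, which is why so many adjusted values end up exactly at one.

```r
# Sketch: Bonferroni adjustment is p multiplied by the number of tests, capped at 1
n_tests <- length(p_raw)
all.equal(pmin(p_raw * n_tests, 1), p.adjust(p_raw, method = "bonferroni"))
```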

You can see that the p-values are exactly one for the vast majority of
@@ -1090,7 +1089,7 @@ experiment over and over.
> > \frac{0.05}{100} = 0.0005
> > $$
> >
> > 2. Trick question! We can't say what proportion of these genes are
> > 2. We can't say what proportion of these genes are
> > truly different. However, if we repeated this experiment and
> > statistical test over and over, on average 5% of the results
> > from each run would be false discoveries.
@@ -1100,25 +1099,17 @@ experiment over and over.
> >
> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
> > p_fdr <- p.adjust(p_raw, method = "BH")
> > ggplot() +
> > aes(p_raw, p_fdr) +
> > geom_point() +
> > scale_x_log10() + scale_y_log10() +
> > geom_abline(slope = 1, linetype = "dashed") +
> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
> > labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
> > plot(p_raw, p_fdr, pch = 16, log="xy")
> > abline(0:1, lty = "dashed")
> > abline(v = 0.05, lty = "dashed", col = "red")
> > abline(h = 0.05, lty = "dashed", col = "red")
> > ```
> >
> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
> > ggplot() +
> > aes(p_fdr, p_fwer) +
> > geom_point() +
> > scale_x_log10() + scale_y_log10() +
> > geom_abline(slope = 1, linetype = "dashed") +
> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
> > plot(p_fwer, p_fdr, pch = 16, log="xy")
> > abline(0:1, lty = "dashed")
> > abline(v = 0.05, lty = "dashed", col = "red")
> > abline(h = 0.05, lty = "dashed", col = "red")
> > ```
> >
> {: .solution}
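The Benjamini-Hochberg adjustment used in the solution above can also be written out by hand. This editorial sketch (assuming `p_raw` is still available) mirrors what `p.adjust(..., method = "BH")` does internally.

```r
# Sketch: manual Benjamini-Hochberg adjustment
# For sorted p-values p_(1) <= ... <= p_(n), the adjusted value for p_(i) is
# the minimum over j >= i of (n / j) * p_(j), capped at 1.
n <- length(p_raw)
o <- order(p_raw, decreasing = TRUE)
manual_bh <- pmin(1, cummin(n / (n:1) * p_raw[o]))[order(o)]
all.equal(manual_bh, p.adjust(p_raw, method = "BH"))
```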
83 changes: 47 additions & 36 deletions episodes/03-regression-regularisation.Rmd
@@ -124,8 +124,8 @@ when we have more features than observations.
> `methylation` dataset (we selected 500 features only as it can be hard to
> visualise patterns when there are too many features!).
>
> ```{r corr-mat-meth, fig.cap="Cap", fig.alt="Alt"}
library("ComplexHeatmap")
> ```{r corr-mat-meth, fig.cap="Heatmap of pairwise feature-feature correlations between CpG sites in DNA methylation data", fig.alt="A symmetrical heatmap where rows and columns are features in a DNA methylation dataset. Colour corresponds to correlation, with red being large positive correlations and blue being large negative correlations. There are large blocks of deep red and blue throughout the plot."}
> library("ComplexHeatmap")
> small <- methyl_mat[, 1:500]
> cor_mat <- cor(small)
> Heatmap(cor_mat,
@@ -188,9 +188,9 @@ different combinations of slope and intercept accomplish this objective.
library("viridis")
set.seed(42)
noise_sd <- 1
nobs <- 50
slope <- 2
intercept <- 2
nobs <- 200
slope <- 1
intercept <- 3
maxlim <- max(abs(slope), abs(intercept)) * 2
maxlim <- max(maxlim, 5)
@@ -295,7 +295,11 @@ avoid estimating very large effect sizes.
> > "ordinary least squares". The "ordinary" really means "original" here,
> > to distinguish between this method, which dates back to ~1800, and some
> > more "recent" (think 1940s...) methods.
> > 2. Least squares assumes that, when we account for the change in the mean of
> > 2. Squared error is useful because it ignores the *sign* of the residuals
> > (whether they are positive or negative).
> > It also penalises large errors much more than small errors.
> > On top of all this, it also makes the solution very easy to compute mathematically.
> > Least squares assumes that, when we account for the change in the mean of
> > the outcome based on changes in the income, the data are normally
> > distributed. That is, the *residuals* of the model, or the error
> > left over after we account for any linear relationships in the data,
@@ -331,29 +335,36 @@ coefficient values minimise the training error, but they don't minimise the
test error on unseen data. First, we'll go through an example of what exactly
this means.

For the next few challenges, we'll work with a set of features
known to be associated with age from a paper by Horvath et al.[^2]. Horvath et al
use methylation markers alone to predict the biological age of an individual.
This is useful in studying age-related disease amongst many other things.


To compare the training and test errors for a model for age using the Hovarth data, we'll split the
To compare the training and test errors for a model of methylation features and age, we'll split the
data into training and test sets, fit a linear model and calculate the errors. First, let's
calculate the training error. Let's start by splitting the data into training and test sets:



```{r coefhorvath}
coef_horvath <- readRDS(here::here("data/coefHorvath.rds"))
```{r data}
methylation <- readRDS(here::here("data/methylation.rds"))
library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))
```
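One small sanity check worth noting (editorial sketch, not part of the commit): after transposing the assay, samples are the rows of `methyl_mat`, so its row count should match the length of `age`.

```r
# Sketch: one row per sample after transposing the assay matrix
nrow(methyl_mat) == length(age)
```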

We will then subset the data to a set of CpG markers that are known to be associated with
age from a previous study by Horvath et al.[^2].

```{r echo=FALSE, eval=FALSE}
# these CpGs were obtained using:
dput(coef_horvath$CpGmarker[1:20])
```

```{r coefhorvath}
cpg_markers <- c("cg16241714", "cg14424579", "cg22736354", "cg02479575", "cg00864867",
"cg25505610", "cg06493994", "cg04528819", "cg26297688", "cg20692569",
"cg04084157", "cg22920873", "cg10281002", "cg21378206", "cg26005082",
"cg12946225", "cg25771195", "cg26845300", "cg06144905", "cg27377450"
)
coef_horvath <- coef_horvath[1:20, ]
features <- coef_horvath$CpGmarker
horvath_mat <- methyl_mat[, features]
horvath_mat <- methyl_mat[, cpg_markers]
## Generate an index to split the data
set.seed(42)
@@ -364,7 +375,6 @@ train_mat <- horvath_mat[train_ind, ]
train_age <- age[train_ind]
test_mat <- horvath_mat[-train_ind, ]
test_age <- age[-train_ind]
```

Now let's fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or "mean squared error" (MSE).
@@ -381,9 +391,9 @@ Now let's fit a linear model to our training data and calculate the training err
fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
## Function to calculate the (mean squared) error
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
## Calculate the training error
err_lm_train <- mse(train_age, fitted(fit_horvath))
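A small cross-check (editorial sketch, assuming `fit_horvath` and `err_lm_train` from the chunk above): the training MSE is the same as the mean of the squared residuals of the fitted model.

```r
# Sketch: training MSE equals the mean squared residual
all.equal(err_lm_train, mean(residuals(fit_horvath)^2))
```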
@@ -521,7 +531,7 @@ of 20 features (stored as `features`) selected earlier in this episode (these
are a subset of the features identified by Horvath et al.), using both
regularised and ordinary least squares.
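For reference (an editorial note; the exact scaling of lambda differs slightly in `glmnet`'s internal parameterisation), ridge regression minimises the least-squares criterion plus a squared-coefficient penalty:

$$
\min_{\beta_0,\, \beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

Larger values of lambda penalise large coefficients more heavily, which is what produces the shrinkage paths plotted below.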

```{r plot-ridge, fig.cap="Cap", fig.alt="Alt"}
```{r plot-ridge, fig.cap="A line plot of coefficient estimates against log lambda for a ridge regression model.", fig.alt="A line plot of coefficient estimates against log lambda for a ridge regression model. Lines are depicted in different colours, with coefficients generally having large values on the left of the plot (small log lambda) and moving smoothly and gradually towards zero to the right of the plot (large log lambda). Some coefficients appear to increase and then decrease in magnitude as lambda increases, or switch signs."}
library("glmnet")
## glmnet() performs scaling by default, supply un-scaled data:
@@ -553,7 +563,7 @@ regression predicts the test data better than the model with no regularisation.
Let's generate our predictions under the ridge regression model and calculate
the mean squared error in the test set:

```{r pred-ridge-lm, fig.width = 10, fig.cap="Cap", fig.alt="Alt"}
```{r pred-ridge-lm}
# Obtain a matrix of predictions from the ridge model,
# where each column corresponds to a different lambda value
pred_ridge <- predict(ridge_fit, newx = test_mat)
@@ -581,7 +591,7 @@ We can see where on the continuum of lambdas we've picked a model by plotting
the coefficient paths again. In this case, we've picked a model with fairly
modest coefficient shrinking.

```{r chooselambda, fig.cap="Cap", fig.alt="Alt"}
```{r chooselambda, fig.cap="A line plot of coefficient estimates against log lambda for a ridge regression model, showing the optimal value based on the minimal test error.", fig.alt="A line plot of coefficient estimates against log lambda for a ridge regression model. A dashed vertical line depicts the optimal lambda value towards the left of the plot. Lines are depicted in different colours, with coefficients generally having large values on the left of the plot (small log lambda) and moving smoothly and gradually towards zero to the right of the plot (large log lambda). Some coefficients appear to increase and then decrease in magnitude as lambda increases, or switch signs."}
chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)]
plot(ridge_fit, xvar = "lambda")
abline(v = log(chosen_lambda), lty = "dashed")
@@ -605,7 +615,7 @@ abline(v = log(chosen_lambda), lty = "dashed")
> > err_lm
> > ```
> > 2. The ridge ones are much less spread out with far fewer extreme predictions.
> > ```{r plot-ridge-prediction, fig.width = 10, fig.cap="Cap", fig.alt="Alt"}
> > ```{r plot-ridge-prediction, fig.width = 10, fig.cap="Two plots showing OLS predictions (left) and ridge regression predictions (right) of age (y) against true age (x).", fig.alt="Two plots showing OLS predictions and ridge regression predictions of age (y) against true age (x). A dashed line shows the line y=x. In the OLS plot, predictions are quite extreme, while in the ridge regression plot, they are generally more conservative."}
> > all <- c(pred_lm, test_age, pred_min_ridge)
> > lims <- range(all)
> > par(mfrow = 1:2)
@@ -648,8 +658,8 @@ drawplot <- function() {
set.seed(42)
noise_sd <- 1
nobs <- 200
slope <- 2
intercept <- 2
slope <- 1
intercept <- 3
maxlim <- max(abs(slope), abs(intercept)) * 2
maxlim <- max(maxlim, 5)
lims <- c(-maxlim, maxlim)
@@ -734,8 +744,9 @@ drawplot()
> > to the coefficients. When we instead plot the L1 norm on the x-axis,
> > increasing L1 norm means that we are allowing our
> > coefficients to take on increasingly large values.
> > ```{r plotlas, fig.width = 10, fig.height = 20, echo=FALSE}
> > par(mfrow = c(2, 1))
> >
> > ```{r plotlas, fig.width = 10, fig.cap="Line plots showing coefficient values from a LASSO model against log lambda (left) and L1 norm (right).", fig.alt="Two line plots side-by-side, showing coefficient values from a LASSO model against log lambda (left) and L1 norm (right). The coefficients, generally, suddenly become zero as log lambda increases or, equivalently, L1 norm decreases. However, some coefficients increase in size before decreasing as log lamdba increases."}
> > par(mfrow = c(1, 2))
> > plot(fit_lasso, xvar = "lambda")
> > plot(fit_lasso)
> > ```
@@ -771,7 +782,7 @@ that minimises the error across each of the test and training splits. In R:

```{r lasso-cv, fig.cap="Cross-validated mean squared error for different values of lambda under a LASSO penalty.", fig.alt="A plot of cross-validated mean squared error against log lambda for a LASSO model. Red points show the mean cross-validated error for each lambda and grey error bars show one standard error around them. Two dashed vertical lines mark the lambda with the lowest error and the largest lambda within one standard error of the minimum."}
# fit lasso model with cross-validation across a range of lambda values
lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)
lasso <- cv.glmnet(methyl_mat, age, alpha = 1)
plot(lasso)
# Extract the coefficients from the model with the lowest mean squared error from cross-validation
@@ -822,7 +833,7 @@ plot_elastic <- function(alpha) {
noise_sd <- 1
nobs <- 100
slope <- 1
intercept <- 1
intercept <- 3
norm <- 1
maxlim <- max(abs(slope), abs(intercept)) * 2
@@ -900,20 +911,20 @@ plot_elastic(0.75)
> > You can see that coefficients tend to go exactly to zero,
> > but the paths are a bit less
> > extreme than with pure LASSO; similar to ridge.
> > ```{r elastic}
> > ```{r elastic, fig.cap="Line plot showing coefficient values from an elastic net model against L1 norm.", fig.alt="A line plot showing coefficient values from an elastic net model against L1 norm. The coefficients, generally, suddenly become zero as L1 norm decreases. However, some coefficients increase in size before decreasing as L1 norm decreases."}
> > elastic <- glmnet(methyl_mat[, -1], age, alpha = 0.5)
> > plot(elastic)
> > ```
> > 2. The process of model selection is similar for elastic net models as for
> > LASSO models.
> > ```{r elastic-cv, fig.cap="Elastic", fig.alt="Alt"}
> > ```{r elastic-cv, fig.cap="A plot of the cross-validation mean squared error of an elastic net model against log lambda.", fig.alt="A plot depicting mean squared error (MSE) against discrete values of lambda, with red points showing the average mean squared error and grey error bars showing the 1 standard error interval around them. The MSE rises as log lambda increases, indicating a larger prediction error. Two dashed lines indicate the lambda value corresponding to the lowest overall MSE and the largest lambda with MSE within one standard error of the minimum."}
> > elastic_cv <- cv.glmnet(methyl_mat[, -1], age, alpha = 0.5)
> > plot(elastic_cv)
> > ```
> > 3. You can see that the coefficients from these two methods are broadly
> > similar, but the elastic net coefficients are a bit more conservative.
> > Further, more coefficients are exactly zero in the LASSO model.
> > ```{r elastic-plot, fig.cap="LASSO-Elastic", fig.alt="Alt"}
> > ```{r elastic-plot, fig.cap="Scatter plot of coefficient estimates from an elastic net model (y) against coefficient estimates from a LASSO model (x).", fig.alt="A scatter plot depicting coefficient estimates from elastic net and LASSO models. Generally, the elastic net model coefficients are more conservative, with one notable outlier, whereas the LASSO model coefficients are more frequently exactly equal to zero."}
> > coefe <- coef(elastic_cv, elastic_cv$lambda.1se)
> > sum(coefe[, 1] == 0)
> > sum(coefl[, 1] == 0)
@@ -999,7 +1010,7 @@ plot_elastic(0.75)
> The [package documentation](https://glmnet.stanford.edu/articles/glmnet.html)
> explains this in more detail.
>
> ```{r binomial, fig.cap = "Title", fig.alt = "Alt"}
> ```{r binomial, fig.cap="A plot of the cross-validation binomial deviance of a logistic regression elastic net model against log lambda.", fig.alt="A plot depicting binomial deviance against discrete values of lambda, with red points showing the average deviance and grey error bars showing the 1 standard error interval around them. The deviance decreases as log lambda increases, indicating a smaller prediction error. Two dashed lines indicate the lambda value corresponding to the lowest overall deviance and the largest lambda with deviance within one standard error of the minimum, with both at the far right of the plot, indicating an intercept-only model."}
> smoking <- as.numeric(factor(methylation$smoker)) - 1
> # binary outcome
> table(smoking)