
Commit

Merge pull request carpentries-incubator#99 from carpentries-incubator/cata

Changes to episodes 1 and 2
hannesbecher authored Feb 6, 2023
2 parents ce990ee + d115cb9 commit a43927f
Showing 10 changed files with 479 additions and 454 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,8 @@
__pycache__
_site
.Rproj.user
*.Rproj
.Rbuildignore
.Rhistory
.RData
.bundle/
1 change: 1 addition & 0 deletions AUTHORS
@@ -3,3 +3,4 @@ Gail Robertson
Catalina Vallejos
Ailith Ewing
Alison Meynert
Hannes Becher
1 change: 1 addition & 0 deletions README.md
@@ -51,6 +51,7 @@ Current maintainers of this lesson are
* Alan O'Callaghan
* Ailith Ewing
* Catalina Vallejos
* Hannes Becher


## Authors
784 changes: 397 additions & 387 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd

Large diffs are not rendered by default.

121 changes: 73 additions & 48 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -1,11 +1,11 @@
---
title: "Regression with many features"
title: "Regression with many outcomes"
source: Rmd
teaching: 60
exercises: 30
questions:
- "How can we apply linear regression in a high-dimensional setting?"
- "How can we benefit from the fact that we have many variables?"
- "How can we benefit from the fact that we have many outcomes?"
- "How can we control for the fact that we do many tests?"
objectives:
- "Perform and critically analyse high dimensional regression."
@@ -40,7 +40,8 @@ knitr_fig_path("02-")
For the following few episodes, we will be working with human DNA
methylation data from flow-sorted blood samples. DNA methylation assays
measure, for each of many sites in the genome, the proportion of DNA
that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in
that carries a methyl mark (a chemical modification that does not alter the
DNA sequence). In this case, the methylation data come in
the form of a matrix of normalised methylation levels (M-values), where negative
values correspond to unmethylated DNA and positive values correspond to
methylated DNA. Along with this, we have a number of sample phenotypes
@@ -54,6 +55,9 @@ library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
```

Note: the code that we used to download these data from their source is available
[here](https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/main/data/methylation.R).
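
A side note on the values themselves: M-values are commonly defined from beta
values (the proportion of methylated DNA at a site) as M = log2(beta / (1 - beta)).
The short sketch below is background only and is not part of the lesson code;
the object names are illustrative.

```r
# Background sketch (illustrative, not lesson code): converting beta values
# (proportion of methylated DNA) to M-values.
beta <- c(0.1, 0.5, 0.9)
m_values <- log2(beta / (1 - beta))
m_values  # negative = mostly unmethylated, positive = mostly methylated
```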

This `methylation` object is a `GenomicRatioSet`, a Bioconductor data
object derived from the `SummarizedExperiment` class. These
`SummarizedExperiment` objects contain `assay`s, in this case
@@ -102,46 +106,41 @@ knitr::kable(head(colData(methylation)), row.names = FALSE)
```

In this episode, we will focus on the association between age and
methylation. The association between age and methylation status in blood
samples has been studied extensively, and is actually a well-known
application of some high-dimensional regression techniques, with the
idea of "epigenetic age" having been derived using these techniques. The
methylation levels for these data can be presented in a heatmap:
methylation. The following heatmap summarises age and methylation levels
available in this dataset:

```{r heatmap, echo=FALSE, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
```{r heatmap, fig.cap="Visualising the data as a heatmap, it is clear that there are too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
age <- methylation$Age
library("ComplexHeatmap")
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]
acol <- circlize::colorRamp2(
breaks = seq(min(age), max(age), length.out = 50),
colors = viridis::viridis(50)
)
Heatmap(methyl_mat_ord,
cluster_columns = FALSE,
# cluster_rows = FALSE,
name = "M-value",
col = RColorBrewer::brewer.pal(10, "RdYlBu"),
top_annotation = columnAnnotation(
age = age_ord,
col = list(age = acol)
),
show_row_names = FALSE,
show_column_names = FALSE,
row_title = "Feature",
column_title = "Sample",
use_raster = FALSE
)
name = "M-value",
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
row_title = "Feature",
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
```
Depending on the scientific question of interest, two types of high-dimensional
problems could be explored in this context:

We would like to identify features that are related to our outcome of
interest (age). It is clear from the heatmap that there are too many
features to do so manually.
1. To predict age using methylation levels as predictors. In this case, we would
have a single outcome (age) which would be predicted using 5000 covariates
(methylation levels across the genome).

> ## Exercise
2. To predict methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

The examples in this episode will focus on the second type of problem, whilst
the next episode will focus on the first.
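
To make the two setups concrete, here is a brief sketch using the `age` vector
and the feature-by-sample matrix `methyl_mat` used elsewhere in this episode
(an illustration only, not a prescribed analysis):

```r
# Setup 2 (this episode): many outcomes, one covariate.
# One linear model per feature, for example the first CpG site:
fit_one_feature <- lm(methyl_mat[1, ] ~ age)

# Setup 1 (next episode): one outcome, many covariates.
# With ~5000 predictors and far fewer samples, ordinary least squares cannot
# estimate all of the coefficients, which motivates the regularisation methods
# covered later. Shown for illustration only:
# fit_all_features <- lm(age ~ t(methyl_mat))
```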

> ## Challenge 1
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
Expand Down Expand Up @@ -202,15 +201,16 @@ features to do so manually.
> therefore can be easier to work with in statistical models.
{: .callout}

# Identifying associations using linear regression
# Regression with many outcomes

In high-throughput studies, it is common to have one or more phenotypes
or groupings that we want to relate to features of interest (eg, gene
expression, DNA methylation levels). In general, we want to identify
differences in the features of interest that are related to a phenotype
or grouping of our samples. Identifying features of interest that vary
along with phenotypes or groupings can allow us to understand how
phenotypes arise or manifest.
phenotypes arise or manifest. Analyses of this type are sometimes referred
to as *differential analysis*.

For example, we might want to identify genes that are expressed at a
higher level in mutant mice relative to wild-type mice to understand the
@@ -305,6 +305,7 @@ from the model object:

```{r fit1}
age <- methylation$Age
# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
```
@@ -344,21 +345,21 @@ relationship between the predictor variable(s) and the outcome. This may
not always be the most realistic or useful null hypothesis, but it is
the one we have!

For this linear model, we can use `tidy()` from the **`broom`** package to extract detailed
information about the coefficients and the associated hypothesis tests
in this model:
For this linear model, we can use `tidy()` from the **`broom`** package to
extract detailed information about the coefficients and the associated
hypothesis tests in this model:

```{r tidyfit}
library("broom")
tidy(lm_age_methyl1)
```

The standard errors (`std.error`) represent the uncertainty in our
The standard errors (`std.error`) represent the statistical uncertainty in our
effect size estimate. The test statistics and p-values represent
measures of how (un)likely it would be to observe results like this
under the "null hypothesis".
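
As a sanity check, the slope's test statistic and p-value reported by `tidy()`
can be reproduced by hand from the fitted model. The sketch below assumes the
`lm_age_methyl1` object fitted above:

```r
# The t-statistic is the estimate divided by its standard error; the p-value is
# the probability of a t value at least this extreme under the null hypothesis.
coefs <- summary(lm_age_methyl1)$coefficients
t_slope <- coefs["age", "Estimate"] / coefs["age", "Std. Error"]
p_slope <- 2 * pt(-abs(t_slope), df = lm_age_methyl1$df.residual)
c(statistic = t_slope, p.value = p_slope)  # should match the `tidy()` row for age
```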

> ## Exercise
> ## Challenge 2
>
> In the model we fitted, the p-value for the intercept term is
> significant at $p < 0.05$. What does this mean?
@@ -488,7 +489,9 @@ the test statistic under the null hypothesis that is equal to or greater than
the value we observe for the intercept term. This shaded region is small
relative to the total area of the null distribution; therefore, the
p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
3)`$). The blue-ish shaded region represents the same measure for the slope term; this is larger, relative to the total area of the distribution, therefore the p-value is larger than the one for the intercept term
3)`$). The blue-ish shaded region represents the same measure for the slope term;
this is larger relative to the total area of the distribution, and therefore the
p-value is larger than the one for the intercept term
($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). You can see that
the p-value is a function of the size of the effect we're estimating and
the uncertainty we have in that effect. A large effect with large
@@ -528,7 +531,7 @@ small uncertainty may lead to a small p-value.
>
>{: .callout}
# Sharing information between features
# Sharing information across outcome variables
Now that we understand how hypothesis tests work in the
linear model framework, we are going to introduce an idea that allows us to
@@ -576,8 +579,8 @@ statistic by shrinking the estimates of standard errors towards a common
value. This results in a *moderated t-statistic*.
The process of running a model in **`limma`** is somewhat different to what you
may have seen when running linear models. Here, we define a *model
matrix* or *design matrix*, which is a way of representing the
may have seen when running linear models. Here, we define a *model matrix* or
*design matrix*, which is a way of representing the
coefficients that should be fit in each linear model. These are used in
similar ways in many different modelling libraries.
@@ -588,6 +591,27 @@ dim(design_age)
head(design_age)
```
> ## What is a model matrix?
> When R fits a regression model, it chooses a vector of regression coefficients
> that minimises the differences between outcome values and those values
> predicted by using the covariates (or predictor variables). But how do we get
> from a set of predictors and regression coefficients to predicted values? This
> is done via matrix multiplication. The matrix of predictors is (matrix)
> multiplied by the vector of coefficients. That matrix is called the
> **model matrix** (or design matrix). It has one row for each observation and
> one column for each predictor plus (by default) one additional column of ones
> (the intercept column). Many R libraries (but not **`limma`**) construct the
> model matrix behind the scenes. Usually, it can be extracted from a model fit
> using the function `model.matrix()`. Here is an example:
>
> ```{r}
> data(cars)
> head(cars)
> mod1 <- lm(dist ~ speed, data=cars) # fit regression model using speed as a predictor
> head(model.matrix(mod1)) # the model matrix contains two columns: intercept and speed
> ```
{: .callout}
As you can see, the design matrix has the same number of rows as our
methylation data has samples. It also has two columns - one for the
intercept (similar to the linear model we fit above) and one for age.
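
For orientation, the sketch below shows the usual **`limma`** workflow with
these objects; the exact code used later in this episode may differ, and the
coefficient name `"age"` is assumed from the design matrix column shown above.

```r
library("limma")
# Fit one linear model per row (feature) of methyl_mat using the design matrix,
# then apply empirical Bayes shrinkage to the standard errors.
fit <- lmFit(methyl_mat, design = design_age)
fit <- eBayes(fit)
# Extract the features most strongly associated with age (moderated t-statistics).
topTable(fit, coef = "age", number = 5)
```
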
@@ -671,7 +695,7 @@ continuous measures like these, it is often convenient to obtain a list
of features which we are confident have non-zero effect sizes. This is
made more difficult by the number of tests we perform.

> ## Exercise
> ## Challenge 3
>
> The effect size estimates are very small, and yet many of the p-values
> are well below a usual significance level of p \< 0.05. Why is this?
@@ -755,7 +779,7 @@ understand, but it is useful to develop an intuition why these approaches may be
precise and sensitive than the naive approach of fitting a model to each
feature separately.

> ## Exercise
> ## Challenge 4
>
> 1. Try to run the same kind of linear model with smoking status as
> covariate instead of age, and making a volcano plot. *Note:
@@ -898,7 +922,7 @@ threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.

> ## Exercise
> ## Challenge 5
>
> 1. If we run `r nrow(methylation)` tests under the null hypothesis,
> how many of them (on average) will be statistically significant at
@@ -1022,7 +1046,7 @@ experiment over and over.
| \- Very conservative | \- Does not control probability of making errors |
| \- Requires larger statistical power | \- May result in false discoveries |
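
Both kinds of correction are available in R through `p.adjust()`. The sketch
below uses a toy vector of p-values (not the methylation results) purely to
illustrate the two adjustments:

```r
# Toy p-values for illustration only.
p_raw <- c(0.0001, 0.004, 0.02, 0.03, 0.6)
p.adjust(p_raw, method = "bonferroni")  # multiply by the number of tests (capped at 1)
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg: controls the false discovery rate
```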

> ## Exercise
> ## Challenge 6
>
> 1. At a significance level of 0.05, with 100 tests performed, what is
> the Bonferroni significance threshold?
@@ -1090,10 +1114,11 @@ experiment over and over.
## Further reading
- [limma tutorial by Kasper D.
- [**`limma`** tutorial by Kasper D.
Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html)
- [limma user
- [**`limma`** user
manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf).
- [The **`VariancePartition`** package](https://bioconductor.org/packages/release/bioc/vignettes/variancePartition/inst/doc/dream.html) has similar functionality to **`limma`** but allows the inclusion of random effects.
## Footnotes
2 changes: 2 additions & 0 deletions _episodes_rmd/03-regression-regularisation.Rmd
@@ -60,6 +60,8 @@ Then, we try to fit a model of age using all of the 5,000 features in this
dataset (average methylation levels, M-values, across different sites in the genome).

```{r fitall, R.options=list(max.print=20)}
# By using methyl_mat in the formula below, R fits a multiple regression
# model in which each of the columns of methyl_mat is used as a predictor.
fit <- lm(age ~ methyl_mat)
summary(fit)
```
3 changes: 3 additions & 0 deletions data/methylation.R
@@ -1,3 +1,6 @@
# This follows the download instructions available here:
# https://bioconductor.org/packages/release/data/experiment/vignettes/FlowSorted.Blood.EPIC/inst/doc/FlowSorted.Blood.EPIC.html

pkgs <- c("FlowSorted.Blood.EPIC", "ExperimentHub", "here")
BiocManager::install(pkgs, upgrade = FALSE, ask = FALSE)
for (pkg in pkgs) {
Binary file modified data/prostate.rds
Binary file not shown.
19 changes: 0 additions & 19 deletions data/small_methylation.R

This file was deleted.

Binary file removed data/small_methylation.rds
Binary file not shown.
