
Commit

Merge pull request carpentries-incubator#99 from carpentries-incubator/cata

Changes to episodes 1 and 2
hannesbecher authored Feb 6, 2023
2 parents ce990ee + d115cb9 commit a43927f
Showing 10 changed files with 479 additions and 454 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,8 @@
__pycache__
_site
.Rproj.user
*.Rproj
.Rbuildignore
.Rhistory
.RData
.bundle/
1 change: 1 addition & 0 deletions AUTHORS
@@ -3,3 +3,4 @@ Gail Robertson
Catalina Vallejos
Ailith Ewing
Alison Meynert
Hannes Becher
1 change: 1 addition & 0 deletions README.md
@@ -51,6 +51,7 @@ Current maintainers of this lesson are
* Alan O'Callaghan
* Ailith Ewing
* Catalina Vallejos
* Hannes Becher


## Authors
784 changes: 397 additions & 387 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd

Large diffs are not rendered by default.

121 changes: 73 additions & 48 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -1,11 +1,11 @@
---
title: "Regression with many features"
title: "Regression with many outcomes"
source: Rmd
teaching: 60
exercises: 30
questions:
- "How can we apply linear regression in a high-dimensional setting?"
- "How can we benefit from the fact that we have many variables?"
- "How can we benefit from the fact that we have many outcomes?"
- "How can we control for the fact that we do many tests?"
objectives:
- "Perform and critically analyse high dimensional regression."
@@ -40,7 +40,8 @@ knitr_fig_path("02-")
For the following few episodes, we will be working with human DNA
methylation data from flow-sorted blood samples. DNA methylation assays
measure, for each of many sites in the genome, the proportion of DNA
that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in
that carries a methyl mark (a chemical modification that does not alter the
DNA sequence). In this case, the methylation data come in
the form of a matrix of normalised methylation levels (M-values), where negative
values correspond to unmethylated DNA and positive values correspond to
methylated DNA. Along with this, we have a number of sample phenotypes
@@ -54,6 +55,9 @@ library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
```

Note: the code that we used to download these data from their source is available
[here](https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/main/data/methylation.R).
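
A side note on the values themselves: M-values are commonly defined from beta
values (the proportion of methylated DNA at a site) as M = log2(beta / (1 - beta)).
The short sketch below is background only and is not part of the lesson code;
the object names are illustrative.

```r
# Background sketch (illustrative, not lesson code): converting beta values
# (proportion of methylated DNA) to M-values.
beta <- c(0.1, 0.5, 0.9)
m_values <- log2(beta / (1 - beta))
m_values  # negative = mostly unmethylated, positive = mostly methylated
```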

This `methylation` object is a `GenomicRatioSet`, a Bioconductor data
object derived from the `SummarizedExperiment` class. These
`SummarizedExperiment` objects contain `assay`s, in this case
@@ -102,46 +106,41 @@ knitr::kable(head(colData(methylation)), row.names = FALSE)
```

In this episode, we will focus on the association between age and
methylation. The association between age and methylation status in blood
samples has been studied extensively, and is actually a well-known
application of some high-dimensional regression techniques, with the
idea of "epigenetic age" having been derived using these techniques. The
methylation levels for these data can be presented in a heatmap:
methylation. The following heatmap summarises age and methylation levels
available in this dataset:

```{r heatmap, echo=FALSE, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
```{r heatmap, fig.cap="Visualising the data as a heatmap, it is clear that there are too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
age <- methylation$Age
library("ComplexHeatmap")
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]
acol <- circlize::colorRamp2(
breaks = seq(min(age), max(age), length.out = 50),
colors = viridis::viridis(50)
)
Heatmap(methyl_mat_ord,
cluster_columns = FALSE,
# cluster_rows = FALSE,
name = "M-value",
col = RColorBrewer::brewer.pal(10, "RdYlBu"),
top_annotation = columnAnnotation(
age = age_ord,
col = list(age = acol)
),
show_row_names = FALSE,
show_column_names = FALSE,
row_title = "Feature",
column_title = "Sample",
use_raster = FALSE
)
name = "M-value",
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
row_title = "Feature",
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
```
Depending on the scientific question of interest, two types of high-dimensional
problems could be explored in this context:

We would like to identify features that are related to our outcome of
interest (age). It is clear from the heatmap that there are too many
features to do so manually.
1. To predict age using methylation levels as predictors. In this case, we would
have a single outcome (age) which would be predicted using 5000 covariates
(methylation levels across the genome).

> ## Exercise
2. To predict methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

The examples in this episode will focus on the second type of problem, whilst
the next episode will focus on the first.
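
To make the two setups concrete, here is a brief sketch using the `age` vector
and the feature-by-sample matrix `methyl_mat` used elsewhere in this episode
(an illustration only, not a prescribed analysis):

```r
# Setup 2 (this episode): many outcomes, one covariate.
# One linear model per feature, for example the first CpG site:
fit_one_feature <- lm(methyl_mat[1, ] ~ age)

# Setup 1 (next episode): one outcome, many covariates.
# With ~5000 predictors and far fewer samples, ordinary least squares cannot
# estimate all of the coefficients, which motivates the regularisation methods
# covered later. Shown for illustration only:
# fit_all_features <- lm(age ~ t(methyl_mat))
```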

> ## Challenge 1
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
Expand Down Expand Up @@ -202,15 +201,16 @@ features to do so manually.
> therefore can be easier to work with in statistical models.
{: .callout}

# Identifying associations using linear regression
# Regression with many outcomes

In high-throughput studies, it is common to have one or more phenotypes
or groupings that we want to relate to features of interest (eg, gene
expression, DNA methylation levels). In general, we want to identify
differences in the features of interest that are related to a phenotype
or grouping of our samples. Identifying features of interest that vary
along with phenotypes or groupings can allow us to understand how
phenotypes arise or manifest.
phenotypes arise or manifest. Analyses of this type are sometimes referred
to as *differential analysis*.

For example, we might want to identify genes that are expressed at a
higher level in mutant mice relative to wild-type mice to understand the
@@ -305,6 +305,7 @@ from the model object:

```{r fit1}
age <- methylation$Age
# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
```
@@ -344,21 +345,21 @@ relationship between the predictor variable(s) and the outcome. This may
not always be the most realistic or useful null hypothesis, but it is
the one we have!

For this linear model, we can use `tidy()` from the **`broom`** package to extract detailed
information about the coefficients and the associated hypothesis tests
in this model:
For this linear model, we can use `tidy()` from the **`broom`** package to
extract detailed information about the coefficients and the associated
hypothesis tests in this model:

```{r tidyfit}
library("broom")
tidy(lm_age_methyl1)
```

The standard errors (`std.error`) represent the uncertainty in our
The standard errors (`std.error`) represent the statistical uncertainty in our
effect size estimate. The test statistics and p-values represent
measures of how (un)likely it would be to observe results like this
under the "null hypothesis".
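
As a sanity check, the slope's test statistic and p-value reported by `tidy()`
can be reproduced by hand from the fitted model. The sketch below assumes the
`lm_age_methyl1` object fitted above:

```r
# The t-statistic is the estimate divided by its standard error; the p-value is
# the probability of a t value at least this extreme under the null hypothesis.
coefs <- summary(lm_age_methyl1)$coefficients
t_slope <- coefs["age", "Estimate"] / coefs["age", "Std. Error"]
p_slope <- 2 * pt(-abs(t_slope), df = lm_age_methyl1$df.residual)
c(statistic = t_slope, p.value = p_slope)  # should match the `tidy()` row for age
```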

> ## Exercise
> ## Challenge 2
>
> In the model we fitted, the p-value for the intercept term is
> significant at $p < 0.05$. What does this mean?
@@ -488,7 +489,9 @@ the test statistic under the null hypothesis that is equal to or greater than
the value we observe for the intercept term. This shaded region is small
relative to the total area of the null distribution; therefore, the
p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
3)`$). The blue-ish shaded region represents the same measure for the slope term; this is larger, relative to the total area of the distribution, therefore the p-value is larger than the one for the intercept term
3)`$). The blue-ish shaded region represents the same measure for the slope term;
this is larger relative to the total area of the distribution, and therefore the
p-value is larger than the one for the intercept term
($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). You can see that
the p-value is a function of the size of the effect we're estimating and
the uncertainty we have in that effect. A large effect with large
@@ -528,7 +531,7 @@ small uncertainty may lead to a small p-value.
>
>{: .callout}
# Sharing information between features
# Sharing information across outcome variables
Now that we understand how hypothesis tests work in the
linear model framework, we are going to introduce an idea that allows us to
@@ -576,8 +579,8 @@ statistic by shrinking the estimates of standard errors towards a common
value. This results in a *moderated t-statistic*.
The process of running a model in **`limma`** is somewhat different to what you
may have seen when running linear models. Here, we define a *model
matrix* or *design matrix*, which is a way of representing the
may have seen when running linear models. Here, we define a *model matrix* or
*design matrix*, which is a way of representing the
coefficients that should be fit in each linear model. These are used in
similar ways in many different modelling libraries.
@@ -588,6 +591,27 @@ dim(design_age)
head(design_age)
```
> ## What is a model matrix?
> When R fits a regression model, it chooses a vector of regression coefficients
> that minimises the differences between outcome values and those values
> predicted by using the covariates (or predictor variables). But how do we get
> from a set of predictors and regression coefficients to predicted values? This
> is done via matrix multiplication. The matrix of predictors is (matrix)
> multiplied by the vector of coefficients. That matrix is called the
> **model matrix** (or design matrix). It has one row for each observation and
> one column for each predictor plus (by default) one additional column of ones
> (the intercept column). Many R libraries (but not **`limma`**) construct the
> model matrix behind the scenes. Usually, it can be extracted from a model fit
> using the function `model.matrix()`. Here is an example:
>
> ```{r}
> data(cars)
> head(cars)
> mod1 <- lm(dist ~ speed, data=cars) # fit regression model using speed as a predictor
> head(model.matrix(mod1)) # the model matrix contains two columns: intercept and speed
> ```
{: .callout}
As you can see, the design matrix has the same number of rows as our
methylation data has samples. It also has two columns - one for the
intercept (similar to the linear model we fit above) and one for age.
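
For orientation, the sketch below shows the usual **`limma`** workflow with
these objects; the exact code used later in this episode may differ, and the
coefficient name `"age"` is assumed from the design matrix column shown above.

```r
library("limma")
# Fit one linear model per row (feature) of methyl_mat using the design matrix,
# then apply empirical Bayes shrinkage to the standard errors.
fit <- lmFit(methyl_mat, design = design_age)
fit <- eBayes(fit)
# Extract the features most strongly associated with age (moderated t-statistics).
topTable(fit, coef = "age", number = 5)
```
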
@@ -671,7 +695,7 @@ continuous measures like these, it is often convenient to obtain a list
of features which we are confident have non-zero effect sizes. This is
made more difficult by the number of tests we perform.

> ## Exercise
> ## Challenge 3
>
> The effect size estimates are very small, and yet many of the p-values
> are well below a usual significance level of p \< 0.05. Why is this?
@@ -755,7 +779,7 @@ understand, but it is useful to develop an intuition why these approaches may be
precise and sensitive than the naive approach of fitting a model to each
feature separately.

> ## Exercise
> ## Challenge 4
>
> 1. Try to run the same kind of linear model with smoking status as
> covariate instead of age, and making a volcano plot. *Note:
@@ -898,7 +922,7 @@ threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.

> ## Exercise
> ## Challenge 5
>
> 1. If we run `r nrow(methylation)` tests under the null hypothesis,
> how many of them (on average) will be statistically significant at
@@ -1022,7 +1046,7 @@ experiment over and over.
| \- Very conservative | \- Does not control probability of making errors |
| \- Requires larger statistical power | \- May result in false discoveries |
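
Both kinds of correction are available in R through `p.adjust()`. The sketch
below uses a toy vector of p-values (not the methylation results) purely to
illustrate the two adjustments:

```r
# Toy p-values for illustration only.
p_raw <- c(0.0001, 0.004, 0.02, 0.03, 0.6)
p.adjust(p_raw, method = "bonferroni")  # multiply by the number of tests (capped at 1)
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg: controls the false discovery rate
```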

> ## Exercise
> ## Challenge 6
>
> 1. At a significance level of 0.05, with 100 tests performed, what is
> the Bonferroni significance threshold?
@@ -1090,10 +1114,11 @@ experiment over and over.
## Further reading
- [limma tutorial by Kasper D.
- [**`limma`** tutorial by Kasper D.
Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html)
- [limma user
- [**`limma`** user
manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf).
- [The **`VariancePartition`** package](https://bioconductor.org/packages/release/bioc/vignettes/variancePartition/inst/doc/dream.html) has similar functionality to **`limma`** but allows the inclusion of random effects.
## Footnotes
2 changes: 2 additions & 0 deletions _episodes_rmd/03-regression-regularisation.Rmd
@@ -60,6 +60,8 @@ Then, we try to fit a model of age using all of the 5,000 features in this
dataset (average methylation levels, M-values, across different sites in the genome).

```{r fitall, R.options=list(max.print=20)}
# By using methyl_mat in the formula below, R fits a multiple regression
# model in which each of the columns of methyl_mat is used as a predictor.
fit <- lm(age ~ methyl_mat)
summary(fit)
```
3 changes: 3 additions & 0 deletions data/methylation.R
@@ -1,3 +1,6 @@
# This follows the download instructions available here:
# https://bioconductor.org/packages/release/data/experiment/vignettes/FlowSorted.Blood.EPIC/inst/doc/FlowSorted.Blood.EPIC.html

pkgs <- c("FlowSorted.Blood.EPIC", "ExperimentHub", "here")
BiocManager::install(pkgs, upgrade = FALSE, ask = FALSE)
for (pkg in pkgs) {
Binary file modified data/prostate.rds
Binary file not shown.
19 changes: 0 additions & 19 deletions data/small_methylation.R

This file was deleted.

Binary file removed data/small_methylation.rds
Binary file not shown.
