Merge branch 'main' into mary-suggestions-tasks28plus-ep4
mallewellyn authored Mar 21, 2024
2 parents 02054a9 + 223978e commit dc3e1f6
Showing 9 changed files with 465 additions and 424 deletions.
106 changes: 47 additions & 59 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -2,7 +2,7 @@
title: "Introduction to high-dimensional data"
author: "GS Robertson"
source: Rmd
teaching: 20
teaching: 30
exercises: 20
questions:
- What are high-dimensional data and what do these data look like in the
@@ -38,37 +38,31 @@ knitr_fig_path("01-")

# What are high-dimensional data?

*High-dimensional data* are defined as data in which the number of features (variables observed),
$p$, are close to or larger than the number of observations (or data points), $n$.
The opposite is *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which
refers to data with numerous features irrespective of the number of observations (similarly,
*tall data* is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that
come from having more features than observations.

High-dimensional data have become more common in many scientific fields as new
automated data collection techniques have been developed. More and more datasets
have a large number of features and some have as many features as there are rows
in the dataset. Datasets in which $p \geq n$ are becoming more common. Such datasets
pose a challenge for data analysis as standard methods of analysis, such as linear
regression, are no longer appropriate.

High-dimensional datasets are common in the biological sciences. Data sets in subjects like
genomics and medical sciences are often tall (with large $n$) and wide
(with large $p$), and can be difficult to analyse or visualise using
standard statistical tools. An example of high-dimensional data in biological
sciences may include data collected from hospital patients recording symptoms,
blood test results, behaviours, and general health, resulting in datasets with
large numbers of features. Researchers often want to relate these features to
specific patient outcomes (e.g. survival, length of time spent in hospital).
An example of what high-dimensional data might look like in a biomedical study
is shown in the figure below.
*High-dimensional data* are defined as data with many features (variables observed).
In recent years, advances in information technology have allowed large amounts of data to
be collected and stored with relative ease. As such, high-dimensional
data have become more common in many scientific fields, including the biological sciences,
where datasets in subjects like genomics and medical sciences often have a large number of features.
For example, hospital data may record many variables, including symptoms,
blood test results, behaviours, and general health. An example of what high-dimensional data might look like
in a biomedical study is shown in the figure below.



```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. Each row contains the data for individual patients."}
knitr::include_graphics(here::here("fig/intro-table.png"))
knitr::include_graphics("../fig/intro-table.png")
```

Researchers often want to relate such features to specific patient outcomes
(e.g. survival, length of time spent in hospital). However, analysing
high-dimensional data can be extremely challenging since standard methods of analysis,
such as individual plots of features and linear regression,
are no longer appropriate when we have many features.
In this lesson, we will learn alternative methods
for dealing with high-dimensional data and discover how these can be applied
for practical high-dimensional data analysis in the biological sciences.




> ## Challenge 1
@@ -92,10 +86,10 @@
>
> > ## Solution
> >
> > 1. No. The number of observations (100 patients) is far greater than the number of features (3).
> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000+3 features.
> > 3. No. There are many more observations (200 patients) than features (5).
> > 4. Yes. There is only one observation of more than 20,000 features.
> > 1. No. The number of features is relatively small (4, including the response variable, which is itself an observed variable).
> > 2. Yes, this is an example of high-dimensional data. There are 200,004 features.
> > 3. No. The number of features is relatively small (6).
> > 4. Yes. There are 20,008 features.
> {: .solution}
{: .challenge}

@@ -107,20 +101,15 @@ about the challenges we face in analysing them.
# Why is dealing with high-dimensional data challenging?

Most classical statistical methods are set up for use on low-dimensional data
(i.e. data where the number of observations $n$ is much larger than the number
of features $p$). This is because low-dimensional data were much more common in
the past when data collection was more difficult and time consuming. In recent
years advances in information technology have allowed large amounts of data to
be collected and stored with relative ease. This has allowed large numbers of
features to be collected, meaning that datasets in which $p$ matches or exceeds
$n$ are common (collecting observations is often more difficult or expensive
than collecting many features from a single observation).

Datasets with large numbers of features are difficult to visualise. When
exploring low-dimensional datasets, it is possible to plot the response variable
against each of the limited number of explanatory variables to get an idea which
of these are important predictors of the response. With high-dimensional data
the large number of explanatory variables makes doing this difficult. In some
(i.e. with a small number of features, $p$).
This is because low-dimensional data were much more common in
the past when data collection was more difficult and time consuming.

One challenge when analysing high-dimensional data is visualising the many variables.
When exploring low-dimensional datasets, it is possible to plot the response variable
against each of the features to get an idea of which
of these are important predictors of the response. With high-dimensional data,
the large number of features makes doing this difficult. In addition, in some
high-dimensional datasets it can be difficult to identify a single response
variable, making standard data exploration and analysis techniques less useful.
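
As a rough illustration with simulated data (not the hospital dataset above), the sketch below shows why plotting the response against each feature stops being practical as the number of features grows; all object names here are made up for the example.

```r
# Simulated data: 100 observations of 5000 features plus a response.
set.seed(42)
n <- 100
p <- 5000
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("feature_", seq_len(p))))
y <- rnorm(n)

# Plotting the response against a handful of features is easy...
par(mfrow = c(1, 3))
for (j in 1:3) plot(x[, j], y, xlab = colnames(x)[j], ylab = "response")
par(mfrow = c(1, 1))

# ...but covering every feature would require one plot per column:
p  # 5000 plots
```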

@@ -189,17 +178,20 @@ of the challenges we are facing when working with high-dimensional data.
> improve the reproducibility of an analysis.
{: .callout}
Imagine we are carrying out least squares regression on a dataset with 25
As well as the problems caused by having many variables,
having relatively few observations ($n$) compared to the number of features ($p$) causes
additional challenges. To illustrate these challenges,
imagine we are carrying out least squares regression on a dataset with 25
observations. Fitting a best fit line through these data produces a plot shown
in the left-hand panel of the figure below.
However, imagine a situation in which the number of observations and features in a
dataset are almost equal. In that situation the effective number of observations
per features is low. The result of fitting a best fit line through
per feature is low. The result of fitting a best fit line through
few observations can be seen in the right-hand panel below.
```{r intro-figure, echo = FALSE, fig.cap = "Scatter plot of two variables (x and y) from a data set with 25 observations (left) and 2 observations (right) with a fitted regression line (red).", fig.alt = "Two scatter plots side-by-side, each plotting the relationship between two variables. The scatter plot on the left hand side shows 25 observations and a regression line with the points evenly scattered around. The scatter plot on the right hand side shows 2 observations and a regression line that goes through both points."}
knitr::include_graphics(here::here("fig/intro-scatterplot.png"))
knitr::include_graphics("../fig/intro-scatterplot.png")
```
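
A minimal sketch (with simulated data, not the data behind the figure) of how the two situations can be reproduced in R:

```r
# Least squares regression with 25 observations versus only 2 observations.
set.seed(1)

x_many <- rnorm(25)
y_many <- 2 * x_many + rnorm(25)   # 25 noisy observations
fit_many <- lm(y_many ~ x_many)

x_few <- rnorm(2)
y_few <- 2 * x_few + rnorm(2)      # only 2 observations
fit_few <- lm(y_few ~ x_few)

# With 2 observations and 2 coefficients, the fitted line passes exactly
# through both points, so the fit looks "perfect" despite telling us
# very little about the underlying relationship:
summary(fit_many)$r.squared
summary(fit_few)$r.squared         # effectively 1
```
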
In the first situation, the least squares regression line does not fit the data
@@ -258,7 +250,7 @@ explore why high correlations might be an issue in a Challenge.
> > ```
> >
> > Based on these results we conclude that both `gleason` and `pgg45` have a
> > statistically significan univariate effect (also referred to as a marginal
> > statistically significant univariate effect (also referred to as a marginal
> > effect) as predictors of age (5% significance level).
> >
Fitting a multivariate regression model using both `gleason` and `pgg45`
@@ -294,15 +286,11 @@ regression.
# What statistical methods are used to analyse high-dimensional data?
As we found out in the above challenges, carrying out linear regression on
datasets with large numbers of features can be difficult due to: high levels of correlation
between variables; difficulty in identifying a clear response variable; and risk
of overfitting. These problems are common to the analysis of many high-dimensional datasets,
for example, those using genomics data with multiple genes, or species
composition data in an environment where the relative abundance of different species
within a community is of interest. For such datasets, other statistical methods
may be used to examine whether groups of observations show similar characteristics
and whether these groups may relate to other features in the data (e.g.
We have seen that high-dimensional data analysis can be challenging: the many variables are difficult to visualise,
making it hard to identify relationships between variables or a suitable response variable; we may have
relatively few observations compared to features, leading to over-fitting; and features may be highly correlated, making
models difficult to interpret. We therefore require alternative approaches to examine whether, for example,
groups of observations show similar characteristics and whether these groups may relate to other features in the data (e.g.
phenotype in genetics data).
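
As a small simulated sketch of the correlation problem mentioned above (all objects here are made up for the example), two nearly identical predictors can each look strongly associated with the response on their own, yet give unstable estimates when fitted together:

```r
# Two highly correlated predictors of a simulated response.
set.seed(10)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

# Individually, each predictor appears strongly associated with y:
summary(lm(y ~ x1))$coefficients
summary(lm(y ~ x2))$coefficients

# Jointly, the two estimates have much larger standard errors and are
# hard to interpret, because the model cannot separate their effects:
summary(lm(y ~ x1 + x2))$coefficients
```
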
In this course, we will cover four methods that help in dealing with high-dimensional data:
59 changes: 29 additions & 30 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -1,8 +1,8 @@
---
title: "Regression with many outcomes"
source: Rmd
teaching: 60
exercises: 30
teaching: 70
exercises: 50
questions:
- "How can we apply linear regression in a high-dimensional setting?"
- "How can we benefit from the fact that we have many outcomes?"
@@ -113,7 +113,7 @@ In this episode, we will focus on the association between age and
methylation. The following heatmap summarises age and methylation levels
available in the methylation dataset:

```{r heatmap, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
```{r heatmap, fig.cap="Heatmap of methylation values across all features.", fig.alt="Heatmap of methylation values across all features showing that there are many features. Samples are ordered according to age."}
age <- methylation$Age
library("ComplexHeatmap")
@@ -130,26 +130,11 @@ Heatmap(methyl_mat_ord,
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
```
Depending on the scientific question of interest, two types of high-dimensional
problems could be explored in this context:

1. To predict age using methylation levels as predictors. In this case, we would
have a single outcome (age) which will be predicted using 5000 covariates
(methylation levels across the genome).

2. To predict methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

The examples in this episode will focus on the second type of problem, whilst
the next episode will focus on the first.

> ## Challenge 1
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
> assays, and choose all of the significant results at a p-value of
> 0.05?
> Why can we not just fit many linear regression models relating every combination of features
> (`colData` and assays) and draw conclusions by associating all variables with significant model p-values?
>
> > ## Solution
> >
@@ -173,6 +158,19 @@ the next episode will focus on the first.
> {: .solution}
{: .challenge}

In general, it is scientifically interesting to explore two modelling problems using the three types of data:

1. Predicting methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

2. Predicting age using methylation levels as predictors. In this case, we would
have a single outcome (age) which will be predicted using 5000 covariates
(methylation levels across the genome).

The examples in this episode will focus on the first type of problem, whilst
the next episode will focus on the second.
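
As a rough sketch of what the first type of problem involves, we could in principle fit one linear model per methylation feature, with age as the single covariate. This assumes the `methyl_mat` matrix (features in rows, samples in columns) and the `age` vector used elsewhere in this episode; the episode itself goes on to use **`limma`** rather than looping over `lm()`.

```r
# Naive sketch (not the approach used later in this episode): one linear
# model per methylation feature, with age as the single covariate.
# Assumes `methyl_mat` (features in rows, samples in columns) and `age`.
fit_one_feature <- function(feature_values, age) {
  fit <- lm(feature_values ~ age)
  summary(fit)$coefficients["age", ]  # estimate, std. error, t value, p-value
}

# Applying this to every row gives one set of results per feature.
# This is slow for thousands of features, which is one reason limma is used.
results <- t(apply(methyl_mat, 1, fit_one_feature, age = age))
head(results)
```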

> ## Measuring DNA Methylation
>
> DNA methylation is an epigenetic modification of DNA. Generally, we
Expand Down Expand Up @@ -229,12 +227,12 @@ to help us understand how ageing manifests.

Using linear regression, it is possible to identify differences like
these. However, high-dimensional data like the ones we're working with
require some special considerations. A primary consideration, as we saw
require some special considerations. A first consideration, as we saw
above, is that there are far too many features to fit each one-by-one as
we might do when analysing low-dimensional datasets (for example using
`lm` on each feature and checking the linear model assumptions). A
secondary consideration is that statistical approaches may behave
slightly differently in very high-dimensional data, compared to
second consideration is that statistical approaches may behave
slightly differently when applied to very high-dimensional data, compared to
low-dimensional data. A third consideration is the speed at which we can
actually compute statistics for data this large -- methods optimised for
low-dimensional data may be very slow when applied to high-dimensional
@@ -521,7 +519,7 @@ p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
this is larger, relative to the total area of the distribution, therefore the
p-value is larger than the one for the intercept term
($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). The
the p-value is a function of the test statistic: the ratio between the effect size
p-value is a function of the test statistic: the ratio between the effect size
we're estimating and the uncertainty we have in that effect. A large effect with large
uncertainty may not lead to a small p-value, and a small effect with
small uncertainty may lead to a small p-value.
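
As a small worked illustration of that relationship (the numbers here are made up, not taken from the methylation model):

```r
# The t-statistic is the ratio of the effect size estimate to its standard
# error; the p-value follows from the t distribution (35 residual degrees
# of freedom is an arbitrary choice for this illustration).
effect_size    <- 0.5
standard_error <- 0.1
df             <- 35

t_statistic <- effect_size / standard_error
p_value     <- 2 * pt(abs(t_statistic), df = df, lower.tail = FALSE)
t_statistic
p_value

# The same effect size with much larger uncertainty gives a much larger p-value:
2 * pt(abs(0.5 / 0.4), df = df, lower.tail = FALSE)
```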
@@ -708,7 +706,7 @@ while the y-axis is the $-\log_{10}(\text{p-value})$, where larger
values indicate increasing statistical evidence of a non-zero effect
size. A positive effect size represents increasing methylation with
increasing age, and a negative effect size represents decreasing
methylation with increasing age. Points higher on the x-axis represent
methylation with increasing age. Points higher on the y-axis represent
features for which we think the results we observed would be very
unlikely under the null hypothesis.
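
A minimal sketch of how such a volcano plot might be drawn in base R. It assumes `toptab_age`, the **`limma`** results table referred to elsewhere in this episode, with columns `logFC` (effect size) and `P.Value`; treat the object and column names as assumptions rather than a prescribed interface.

```r
# Volcano plot sketch: effect size on the x-axis, -log10(p-value) on the
# y-axis. Assumes `toptab_age` has columns `logFC` and `P.Value`.
plot(
  toptab_age$logFC,
  -log10(toptab_age$P.Value),
  xlab = "Effect size (logFC)",
  ylab = "-log10(p-value)",
  pch = 19
)
abline(h = -log10(0.05), lty = "dashed")  # nominal 0.05 significance line
```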

@@ -749,7 +747,7 @@ or information sharing that **`limma`** performs has on our results. To do
this, let us compare the effect sizes estimates and p-values from the two
approaches.

```{r plot-limma-lm-effect, echo = FALSE}
```{r plot-limma-lm-effect, echo = FALSE, fig.cap = "Plot of effect sizes using limma vs. those using lm.", fig.alt = "A scatter plot of the effect sizes using limma vs. those using lm. The plot also shows a straight line through all points, showing that the effect sizes are the same."}
plot(
coef_df[["estimate"]],
toptab_age[coef_df[["feature"]], "logFC"],
@@ -769,7 +767,7 @@ or moderate the effect size estimates, in the case of **`DESeq2`** by again
sharing information between features about sample-to-sample variability.
In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function:

```{r plot-limma-lm-pval, echo = FALSE}
```{r plot-limma-lm-pval, echo = FALSE, fig.cap = "Plot of p-values using limma vs. those using lm.", fig.alt = "A scatter plot of the p-values using limma vs. those using lm. A dashed y = x line is also displayed; relative to this line, the p-values from limma tend to be smaller than those from lm towards the left of the plot and larger towards the right of the plot."}
plot(
coef_df[["p.value"]],
toptab_age[coef_df[["feature"]], "P.Value"],
Expand All @@ -782,7 +780,7 @@ plot(
abline(0:1, lty = "dashed")
```

we can see that for the vast majority of features, the results are
We can see that for the vast majority of features, the results are
broadly similar. There seems to be a minor general tendency for **`limma`**
to produce smaller p-values, but for several features, the p-values from
limma are considerably larger than the p-values from `lm()`. This is
@@ -925,6 +923,7 @@ run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."}
set.seed(123)
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)
@@ -951,7 +950,7 @@ simply due to chance.

> ## Challenge 5
>
> 1. If we run `r nrow(methylation)` tests under the null hypothesis,
> 1. If we run `r nrow(methylation)` tests, even if there are no true differences,
> how many of them (on average) will be statistically significant at
> a threshold of $p < 0.05$?
> 2. Why would we want to be conservative in labelling features as
@@ -1047,7 +1046,7 @@ tests we performed! This is not ideal sometimes, because unfortunately
we usually don't have very large sample sizes in health sciences.

The second main way of controlling for multiple tests is to control the
*false discovery rate*.[^3] This is the proportion of false positives,
*false discovery rate (FDR)*.[^3] This is the proportion of false positives,
or false discoveries, we'd expect to get each time if we repeated the
experiment over and over.
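
As a small sketch with made-up p-values, both kinds of adjustment are available through R's built-in `p.adjust()`; Bonferroni corresponds to the stricter approach described above, while the Benjamini-Hochberg method controls the FDR:

```r
# Made-up p-values illustrating the two adjustment approaches.
p_values <- c(0.0001, 0.001, 0.01, 0.02, 0.04, 0.2, 0.5, 0.9)

# Bonferroni controls the chance of any false positive (very conservative):
p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate (less conservative):
p.adjust(p_values, method = "BH")
```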
