Commit
Merge pull request carpentries-incubator#78 from hannesbecher/main
Fixing boxes, line breaks, etc
hannesbecher authored Jul 19, 2022
2 parents 92e0f61 + 4c77edd commit 3e868f5
Showing 3 changed files with 74 additions and 74 deletions.
27 changes: 8 additions & 19 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -41,12 +41,11 @@ knitr_fig_path("01-")
$p$, are close to or larger than the number of observations (or data points), $n$.
The opposite is *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which
refers to data with numerous features irrespective of the number of observations (similarly,
*tall data* is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that
come from having more features than observations.
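As a minimal illustration of these definitions (the matrix here is simulated for this sketch, not a dataset from the lesson), we can compare $n$ and $p$ directly in R:

```{r n-vs-p-sketch, eval = FALSE}
# simulate a "wide" dataset: 10 observations (rows), 1000 features (columns)
mat <- matrix(rnorm(10 * 1000), nrow = 10, ncol = 1000)
dim(mat)               # n = 10 observations, p = 1000 features
nrow(mat) < ncol(mat)  # TRUE: p far exceeds n, so these data are high-dimensional
```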


High-dimensional data have become more common in many scientific fields as new
automated data collection techniques have been developed. More and more datasets
have a large number of features and some have as many features as there are rows
@@ -56,7 +55,6 @@ regression, are no longer appropriate.

High-dimensional datasets are common in the biological sciences. Subjects like
genomics and medical sciences often use both tall (in terms of $n$) and wide
(in terms of $p$) datasets that can be difficult to analyse or visualise using
standard statistical tools. An example of high-dimensional data in biological
sciences may include data collected from hospital patients recording symptoms,
@@ -97,7 +95,6 @@ knitr::include_graphics("../fig/intro-table.png")
> > 1. No. The number of observations (100 patients) is far greater than the number of features (3).
> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000 + 3 = 200,003 features.
> > 3. No. There are many more observations (200 patients) than features (5).
> > 4. Yes. There is only one observation of more than 20,000 features.
> {: .solution}
{: .challenge}
@@ -134,18 +131,13 @@ of the challenges we are facing when working with high-dimensional data.
> ## Challenge 2
>
> Load the `Prostate` dataset from the **`lasso2`** package. Although technically
> not a high-dimensional dataset, the `Prostate` data will allow us to explore the
> problems encountered when working with many features.
>
> Examine the dataset (in which each row represents a single patient) to:
> a) Determine how many observations ($n$) and features ($p$) are available (hint: see the `dim()` function)
> b) Examine what variables were measured (hint: see the `names()` and `head()` functions)
> c) Plot the relationship between the variables (hint: see the `pairs()` function). Why does it
> become more difficult to plot relationships between pairs of variables with
> increasing numbers of variables? Discuss in groups.
>
> > ## Solution
> >
@@ -159,9 +151,8 @@ of the challenges we are facing when working with high-dimensional data.
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(Prostate) # examine the variable names
> > head(Prostate)  # print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
@@ -184,11 +175,9 @@ Imagine we are carrying out least squares regression on a dataset with 25
observations. Fitting a best fit line through these data produces the plot shown
in the left-hand panel of the figure below.

However, imagine a situation in which the number of observations and features in a
dataset are almost equal. In that situation the effective number of observations
per feature is low. The result of fitting a best fit line through
few observations can be seen in the right-hand panel below.
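A small simulation can make this concrete (the names `n`, `p`, and `sim_data` are invented for this sketch): with almost as many features as observations, least squares fits the data nearly perfectly even when every feature is pure noise.

```{r overfit-sketch, eval = FALSE}
set.seed(42)
n <- 25                                    # observations
p <- 24                                    # features, almost as many as observations
sim_data <- data.frame(y = rnorm(n), matrix(rnorm(n * p), nrow = n))

fit <- lm(y ~ ., data = sim_data)          # least squares using all features
summary(fit)$r.squared                     # ~1: a perfect-looking fit to pure noise
```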
```{r intro-figure, echo = FALSE}
@@ -255,7 +244,7 @@ regression.
# What statistical methods are used to analyse high-dimensional data?
As we found out in the above challenges, carrying out linear regression on
datasets with large numbers of features is difficult due to: high levels of correlation
between variables; difficulty in identifying a clear response variable; and risk
of overfitting. These problems are common to the analysis of many high-dimensional datasets,
for example, those using genomics data with multiple genes, or species
@@ -325,7 +314,7 @@ plot(xgroups, col = selected, pch = 19)
> > is possible to convince ourselves that there are clusters in the data just
> > by colouring the data points by their respective groups! Formal cluster
> > analysis and validation is necessary to determine whether visual clusters
> > in data are likely 'real'.
> >
> {: .solution}
{: .challenge}
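As a quick hedged illustration of why validation matters (simulated data and invented object names, not part of the lesson), k-means will happily split even structureless noise into convincing-looking groups:

```{r spurious-clusters-sketch, eval = FALSE}
set.seed(123)
noise <- matrix(rnorm(200 * 2), ncol = 2)   # 200 points with no true group structure
km <- kmeans(noise, centers = 3)            # ask k-means for 3 clusters anyway
plot(noise, col = km$cluster, pch = 19)     # the colouring makes "clusters" look real
```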
118 changes: 65 additions & 53 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -142,8 +142,8 @@ features to do so manually.

> ## Exercise
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
> assays, and choose all of the significant results at a p-value of
> 0.05?
>
@@ -162,7 +162,8 @@ features to do so manually.
> > associations in the data, we'd be likely to observe some strong
> > spurious associations that arise just from random noise.
> >
> {: .solution}
{: .challenge}

> ## Measuring DNA Methylation
>
@@ -197,8 +198,8 @@ features to do so manually.
> $$
>
> M-values are not bounded to an interval as Beta values are, and
> therefore can be easier to work with in statistical models.
{: .callout}
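For reference, the standard conversion between the two scales is $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$; a minimal sketch in R (with `beta_vals` as a made-up placeholder name):

```{r beta-to-m-sketch, eval = FALSE}
beta_vals <- c(0.1, 0.5, 0.9)                 # Beta values are bounded in [0, 1]
m_vals <- log2(beta_vals / (1 - beta_vals))   # M-values are unbounded
m_vals                                        # approx -3.17, 0, 3.17
```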

# Identifying associations using linear regression

@@ -342,7 +343,7 @@ relationship between the predictor variable(s) and the outcome. This may
not always be the most realistic or useful null hypothesis, but it is
the one we have!

For this linear model, we can use `tidy()` from the **`broom`** package to extract detailed
information about the coefficients and the associated hypothesis tests
in this model:
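A minimal sketch of that call (assuming the fitted model object is `lm_age_methyl1`, the name used later in this episode):

```{r tidy-sketch, eval = FALSE}
library(broom)
tidy(lm_age_methyl1)  # one row per coefficient: estimate, std.error, statistic, p.value
```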

@@ -364,12 +365,13 @@ under the "null hypothesis".
> > ## Solution
> >
> > The intercept for this model indicates that the mean of methylation,
> > when age is zero, is 0.902. However, this is not a particularly
> > noteworthy finding! Also, regardless of the p-value, it is unlikely
> > to be a reliable finding, as we do not have any observations with age
> > zero (nor even any with age \< 20).
> >
> {: .solution}
{: .challenge}

# Fitting a lot of linear models

@@ -485,8 +487,8 @@ the test statistic under the null hypothesis that is equal or greater to
the value we observe for the intercept term. This shaded region is small
relative to the total area of the null distribution; therefore, the
p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
3)`$). The blue-ish shaded region represents the same measure for the
slope term; this is larger, relative to the total area of the
distribution, therefore the p-value is larger than the one for the
intercept term ($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$).
You can see that
the p-value is a function of the size of the effect we're estimating and
the uncertainty we have in that effect. A large effect with large
uncertainty may not lead to a small p-value, and a small effect with
@@ -501,16 +503,18 @@ small uncertainty may lead to a small p-value.
>
> Since the statistic in a linear model is a t-statistic, it follows a
> student t-distribution under the null hypothesis, with degrees of
> freedom (a parameter of the student t-distribution) given by the
> number of observations minus the number of coefficients fitted, in
> this case
> $`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = `r lm_age_methyl1$df`$.
> We want to know what portion of the distribution function of the test
> statistic is as extreme as, or more extreme than, the value we observed.
> The function `pt()` (similar to `pnorm()`, etc.) can give us this information.
>
> Since we're not sure if the coefficient will be larger or smaller than
> zero, we want to do a 2-tailed test. Therefore we take the absolute
> value of the t-statistic, and look at the upper rather than the lower
> tail. Because in a 2-tailed test we're looking at "half" of the
> t-distribution, we also multiply the p-value by 2.
>
> Combining all of this gives us:
@@ -520,7 +524,8 @@ small uncertainty may lead to a small p-value.
> ```{r}
> # As described above: take the absolute t-statistic, use the upper tail of the
> # t-distribution with the model's residual degrees of freedom, and double it
> pvals <- 2 * pt(abs(table_age_methyl1$statistic), df = lm_age_methyl1$df, lower.tail = FALSE)
> all.equal(table_age_methyl1$p.value, pvals)
> ```
{: .callout}
# Sharing information between features
@@ -655,7 +660,7 @@ methylation with increasing age. Points higher on the x-axis represent
features for which we think the results we observed would be very
unlikely under the null hypothesis.

Since we want to identify features that have different methylation levels
in different age groups, in an ideal case there would be clear
separation between "null" and "non-null" features. However, usually we
observe results as we do here: there is a continuum of effect sizes and
@@ -681,14 +686,15 @@ made more difficult by the number of tests we perform.
> > Because the uncertainty in our estimates is much smaller than the
> > estimates themselves, the p-values are also small.
> >
> > If we predicted age using methylation level, it is likely we would see
> > much larger coefficients, though broadly similar p-values!
> >
> {: .solution}
{: .challenge}

It is worth considering what effect the *moderation* or information
sharing that **`limma`** performs has on our results. To do this, let us
compare the effect size estimates and p-values from the two approaches.

```{r plot-limma-lm-effect, echo = FALSE}
@@ -703,13 +709,13 @@ plot(
abline(0:1, lty = "dashed")
```

These are exactly identical! This is because **`limma`** does not perform
any sharing of information when estimating effect sizes. This is in
contrast to similar packages that apply shrinkage to the effect size
estimates, like **`DESeq2`**. These often use information sharing to shrink
or moderate the effect size estimates, in the case of **`DESeq2`** by again
sharing information between features about sample-to-sample variability.
In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function:

```{r plot-limma-lm-pval, echo = FALSE}
plot(
@@ -724,10 +730,10 @@ plot(
abline(0:1, lty = "dashed")
```

We can see that for the vast majority of features, the results are
broadly similar. There seems to be a minor general tendency for **`limma`**
to produce smaller p-values, but for several features, the p-values from
**`limma`** are considerably larger than the p-values from `lm()`. This is
because the information sharing tends to shrink large standard error
estimates downwards and small estimates upwards. When the degree of
statistical significance is due to an abnormally small standard error
@@ -744,15 +750,15 @@ is not much opportunity to generate pooled estimates, and the evidence
of the data can easily outweigh the pooling.

Shrinkage methods like these can be complex to implement and
understand, but it is useful to develop an intuition for why these
approaches may be more precise and sensitive than the naive approach of
fitting a model to each feature separately.
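As a rough sketch of what this workflow looks like in practice (assuming `methyl_mat` holds the features-by-samples methylation matrix and `age` the matching covariate; the exact object names used in the lesson may differ):

```{r limma-workflow-sketch, eval = FALSE}
library(limma)

design <- model.matrix(~age)               # intercept plus an age coefficient
fit <- lmFit(methyl_mat, design = design)  # fit one linear model per feature
fit <- eBayes(fit)                         # pool variance information across features
topTable(fit, coef = 2, number = 10)       # summarise the age term for the top features
```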

> ## Exercise
>
> 1. Try to run the same kind of linear model with smoking status as
>    a covariate instead of age, and make a volcano plot. *Note:
>    smoking status is stored as* `methylation$smoker`.
> 2. We saw in the example in the lesson that this information sharing
> can lead to larger p-values. Why might this be preferable?
>
@@ -777,7 +783,9 @@ feature separately.
> > hypothesis is based more on a small standard error resulting
> > from abnormally low levels of variability for a given feature,
> > we might want to be a bit more conservative in our expectations.
> {: .solution}
{: .challenge}


```{r limma-app-ex, echo = FALSE, eval = FALSE}
> ## Exercise
@@ -825,11 +833,11 @@ feature separately.
> information* about the effect size between schools and shrink our
> estimates towards a common value.
>
> For example in **`DESeq2`**, the authors used the observation that genes
> with similar expression counts in RNAseq data have similar
> *dispersion*, and a better estimate of these dispersion parameters
> makes estimates of fold changes much more stable. Similarly, in
> **`limma`** the authors made the assumption that in the absence of
> biological effects, we can often expect the technical variation in the
> measurement of the expression of each of the genes to be broadly
> similar. Again, better estimates of variability allow us to prioritise
@@ -842,7 +850,8 @@ feature separately.
> Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
> - [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
> - [a (relatively technical) book by Gelman and
> Hill](http://www.stat.columbia.edu/~gelman/arm/)
{: .callout}

```{r, eval=FALSE, echo=FALSE}
# todo: callout box explaining DESeq2
Expand All @@ -852,14 +861,14 @@ feature separately.

With such a large number of features, it would be useful to decide which
features are "interesting" or "significant" for further study. However,
if we were to apply a normal significance threshold of 0.05, it would be
likely that we end up with a lot of false positives. This is because a
p-value threshold like this represents a $\frac{1}{20}$ chance that we
observe results as extreme or more extreme under the null hypothesis
(that there is no association between age and methylation level). If we
carry out many more than 20 such tests, we can expect to see situations
where, despite the null hypothesis being true, we observe significant
p-values due to random chance. To demonstrate this, it is useful to see
what happens if we permute (scramble) the age values and
run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."}
@@ -878,14 +887,14 @@ plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
abline(h = -log10(0.05), lty = "dashed", col = "red")
```

Since we have generated a random sequence of ages, we have no reason to
suspect that there is a true association between methylation levels and
this sequence of random numbers. However, you can see that the p-value
for many features is still lower than a traditional significance level
of $p=0.05$. In fact, here `r sum(toptab_age_perm$P.Value < 0.05)`
features are significant at p \< 0.05. If we were to use this fixed
threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.
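The same phenomenon is easy to reproduce with a throwaway simulation (everything here is invented for illustration): testing thousands of truly null features still flags roughly 5% of them at $p < 0.05$:

```{r null-pvals-sketch, eval = FALSE}
set.seed(1)
# p-values from 5000 tests where the null hypothesis is true by construction
null_pvals <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)
mean(null_pvals < 0.05)  # close to 0.05: hundreds of false positives from noise alone
```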

> ## Exercise
@@ -915,19 +924,21 @@ simply due to chance.
> > 3. One approach to controlling for the number of tests is to divide
> > our significance threshold by the number of tests performed.
> > This is termed "Bonferroni correction" and we'll discuss this
> > further now.
> {: .solution}
{: .challenge}

# Adjusting for multiple tests

When performing many statistical tests to categorise features, we are
effectively classifying features as "non-significant" or "significant",
the latter meaning those for which we reject the null hypothesis. We also
generally hope that there is a subset of features for which the null
hypothesis is truly false, as well as many for which the null truly does
hold. We hope that for all features for which the null hypothesis is
true, we accept it, and for all features for which the null hypothesis
is not true, we reject it. As we showed in the example with permuted
age, with a large number of tests it is inevitable that we will get some of
these wrong.
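In R, adjustments like these are available through `p.adjust()`; a minimal sketch with made-up p-values, comparing the Bonferroni correction described above with the Benjamini-Hochberg method that appears in the exercise solutions:

```{r p-adjust-sketch, eval = FALSE}
pvals <- c(0.0001, 0.004, 0.01, 0.03, 0.2, 0.7)  # made-up p-values from 6 tests
p.adjust(pvals, method = "bonferroni")  # each p-value multiplied by the number of tests
p.adjust(pvals, method = "BH")          # Benjamini-Hochberg, controls false discovery rate
```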

We can think of these features as being "truly different" or "not truly
@@ -1059,7 +1070,8 @@ experiment over and over.
> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
> > ```
> >
> {: .solution}
{: .challenge}