Commit
Merge pull request carpentries-incubator#78 from hannesbecher/main
Fixing boxes, line breaks, etc
hannesbecher authored Jul 19, 2022
2 parents 92e0f61 + 4c77edd commit 3e868f5
Showing 3 changed files with 74 additions and 74 deletions.
27 changes: 8 additions & 19 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -41,12 +41,11 @@ knitr_fig_path("01-")
$p$, are close to or larger than the number of observations (or data points), $n$.
The opposite is *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which
refers to data with numerous features irrespective of the number of observations (similarly,
*tall data* is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that
come from having more features than observations.
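As a minimal illustration of these definitions (the matrix here is simulated for this sketch, not a dataset from the lesson), we can compare $n$ and $p$ directly in R:

```{r n-vs-p-sketch, eval = FALSE}
# simulate a "wide" dataset: 10 observations (rows), 1000 features (columns)
mat <- matrix(rnorm(10 * 1000), nrow = 10, ncol = 1000)
dim(mat)               # n = 10 observations, p = 1000 features
nrow(mat) < ncol(mat)  # TRUE: p far exceeds n, so these data are high-dimensional
```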


High-dimensional data have become more common in many scientific fields as new
automated data collection techniques have been developed. More and more datasets
have a large number of features and some have as many features as there are rows
@@ -56,7 +55,6 @@ regression, are no longer appropriate.

High-dimensional datasets are common in the biological sciences. Subjects like
genomics and medical sciences often use both tall (in terms of $n$) and wide
(in terms of $p$) datasets that can be difficult to analyse or visualise using
standard statistical tools. An example of high-dimensional data in biological
sciences may include data collected from hospital patients recording symptoms,
@@ -97,7 +95,6 @@ knitr::include_graphics("../fig/intro-table.png")
> > 1. No. The number of observations (100 patients) is far greater than the number of features (3).
> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000 + 3 = 200,003 features.
> > 3. No. There are many more observations (200 patients) than features (5).
> > 4. Yes. There is only one observation of more than 20,000 features.
> {: .solution}
{: .challenge}
@@ -134,18 +131,13 @@ of the challenges we are facing when working with high-dimensional data.
> ## Challenge 2
>
> Load the `Prostate` dataset from the **`lasso2`** package. Although technically
> not a high-dimensional dataset, the `Prostate` data will allow us to explore the
> problems encountered when working with many features.
>
> Examine the dataset (in which each row represents a single patient) to:
> a) Determine how many observations ($n$) and features ($p$) are available (hint: see the `dim()` function)
> b) Examine what variables were measured (hint: see the `names()` and `head()` functions)
> c) Plot the relationship between the variables (hint: see the `pairs()` function). Why does it
> become more difficult to plot relationships between pairs of variables with
> increasing numbers of variables? Discuss in groups.
>
> > ## Solution
> >
@@ -159,9 +151,8 @@ of the challenges we are facing when working with high-dimensional data.
> > ```
> >
> > ```{r head-prostate, eval = FALSE}
> > names(Prostate) # examine the variable names
> > head(Prostate)  # print the first 6 rows
> > ```
> >
> > ```{r pairs-prostate}
@@ -184,11 +175,9 @@ Imagine we are carrying out least squares regression on a dataset with 25
observations. Fitting a best fit line through these data produces the plot shown
in the left-hand panel of the figure below.

However, imagine a situation in which the number of observations and features in a
dataset are almost equal. In that situation the effective number of observations
per feature is low. The result of fitting a best fit line through
few observations can be seen in the right-hand panel below.
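A small simulation can make this concrete (the names `n`, `p`, and `sim_data` are invented for this sketch): with almost as many features as observations, least squares fits the data nearly perfectly even when every feature is pure noise.

```{r overfit-sketch, eval = FALSE}
set.seed(42)
n <- 25                                    # observations
p <- 24                                    # features, almost as many as observations
sim_data <- data.frame(y = rnorm(n), matrix(rnorm(n * p), nrow = n))

fit <- lm(y ~ ., data = sim_data)          # least squares using all features
summary(fit)$r.squared                     # ~1: a perfect-looking fit to pure noise
```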
```{r intro-figure, echo = FALSE}
@@ -255,7 +244,7 @@ regression.
# What statistical methods are used to analyse high-dimensional data?
As we found out in the above challenges, carrying out linear regression on
datasets with large numbers of features is difficult due to: high levels of correlation
between variables; difficulty in identifying a clear response variable; and risk
of overfitting. These problems are common to the analysis of many high-dimensional datasets,
for example, those using genomics data with multiple genes, or species
@@ -325,7 +314,7 @@ plot(xgroups, col = selected, pch = 19)
> > is possible to convince ourselves that there are clusters in the data just
> > by colouring the data points by their respective groups! Formal cluster
> > analysis and validation is necessary to determine whether visual clusters
> > in data are likely 'real'.
> >
> {: .solution}
{: .challenge}
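As a quick hedged illustration of why validation matters (simulated data and invented object names, not part of the lesson), k-means will happily split even structureless noise into convincing-looking groups:

```{r spurious-clusters-sketch, eval = FALSE}
set.seed(123)
noise <- matrix(rnorm(200 * 2), ncol = 2)   # 200 points with no true group structure
km <- kmeans(noise, centers = 3)            # ask k-means for 3 clusters anyway
plot(noise, col = km$cluster, pch = 19)     # the colouring makes "clusters" look real
```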
118 changes: 65 additions & 53 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -142,8 +142,8 @@ features to do so manually.

> ## Exercise
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
> assays, and choose all of the significant results at a p-value of
> 0.05?
>
@@ -162,7 +162,8 @@ features to do so manually.
> > associations in the data, we'd be likely to observe some strong
> > spurious associations that arise just from random noise.
> >
> {: .solution}
{: .challenge}

> ## Measuring DNA Methylation
>
@@ -197,8 +198,8 @@ features to do so manually.
> $$
>
> M-values are not bounded to an interval as Beta values are, and
> therefore can be easier to work with in statistical models.
{: .callout}
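For reference, the standard conversion between the two scales is $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$; a minimal sketch in R (with `beta_vals` as a made-up placeholder name):

```{r beta-to-m-sketch, eval = FALSE}
beta_vals <- c(0.1, 0.5, 0.9)                 # Beta values are bounded in [0, 1]
m_vals <- log2(beta_vals / (1 - beta_vals))   # M-values are unbounded
m_vals                                        # approx -3.17, 0, 3.17
```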

# Identifying associations using linear regression

@@ -342,7 +343,7 @@ relationship between the predictor variable(s) and the outcome. This may
not always be the most realistic or useful null hypothesis, but it is
the one we have!

For this linear model, we can use `tidy()` from the **`broom`** package to extract detailed
information about the coefficients and the associated hypothesis tests
in this model:
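A minimal sketch of that call (assuming the fitted model object is `lm_age_methyl1`, the name used later in this episode):

```{r tidy-sketch, eval = FALSE}
library(broom)
tidy(lm_age_methyl1)  # one row per coefficient: estimate, std.error, statistic, p.value
```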

@@ -364,12 +365,13 @@ under the "null hypothesis".
> > ## Solution
> >
> > The intercept for this model indicates that the mean of methylation,
> > when age is zero, is 0.902. However, this is not a particularly
> > noteworthy finding! Also, regardless of the p-value, it is unlikely
> > to be a reliable finding, as we do not have any observations with age
> > zero (nor even any with age \< 20).
> >
> {: .solution}
{: .challenge}

# Fitting a lot of linear models

@@ -485,8 +487,8 @@ the test statistic under the null hypothesis that is equal or greater to
the value we observe for the intercept term. This shaded region is small
relative to the total area of the null distribution; therefore, the
p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
3)`$). The blue-ish shaded region represents the same measure for the
slope term; this is larger, relative to the total area of the
distribution, therefore the p-value is larger than the one for the
intercept term ($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$).
You can see that
the p-value is a function of the size of the effect we're estimating and
the uncertainty we have in that effect. A large effect with large
uncertainty may not lead to a small p-value, and a small effect with
@@ -501,16 +503,18 @@ small uncertainty may lead to a small p-value.
>
> Since the statistic in a linear model is a t-statistic, it follows a
> student t-distribution under the null hypothesis, with degrees of
> freedom (a parameter of the student t-distribution) given by the
> number of observations minus the number of coefficients fitted, in
> this case
> $`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = `r lm_age_methyl1$df`$.
> We want to know what portion of the distribution function of the test
> statistic is as extreme as, or more extreme than, the value we observed.
> The function `pt()` (similar to `pnorm()`, etc.) can give us this information.
>
> Since we're not sure if the coefficient will be larger or smaller than
> zero, we want to do a 2-tailed test. Therefore we take the absolute
> value of the t-statistic, and look at the upper rather than the lower
> tail. Because in a 2-tailed test we're looking at "half" of the
> t-distribution, we also multiply the p-value by 2.
>
> Combining all of this gives us:
@@ -520,7 +524,8 @@ small uncertainty may lead to a small p-value.
> ```{r}
> # As described above: take the absolute t-statistic, use the upper tail of the
> # t-distribution with the model's residual degrees of freedom, and double it
> pvals <- 2 * pt(abs(table_age_methyl1$statistic), df = lm_age_methyl1$df, lower.tail = FALSE)
> all.equal(table_age_methyl1$p.value, pvals)
> ```
{: .callout}
# Sharing information between features
@@ -655,7 +660,7 @@ methylation with increasing age. Points higher on the x-axis represent
features for which we think the results we observed would be very
unlikely under the null hypothesis.

Since we want to identify features that have different methylation levels
in different age groups, in an ideal case there would be clear
separation between "null" and "non-null" features. However, usually we
observe results as we do here: there is a continuum of effect sizes and
@@ -681,14 +686,15 @@ made more difficult by the number of tests we perform.
> > Because the uncertainty in our estimates is much smaller than the
> > estimates themselves, the p-values are also small.
> >
> > If we predicted age using methylation level, it is likely we would see
> > much larger coefficients, though broadly similar p-values!
> >
> {: .solution}
{: .challenge}

It is worth considering what effect the *moderation* or information
sharing that **`limma`** performs has on our results. To do this, let us
compare the effect size estimates and p-values from the two approaches.

```{r plot-limma-lm-effect, echo = FALSE}
@@ -703,13 +709,13 @@ plot(
abline(0:1, lty = "dashed")
```

These are exactly identical! This is because **`limma`** does not perform
any sharing of information when estimating effect sizes. This is in
contrast to similar packages that apply shrinkage to the effect size
estimates, like **`DESeq2`**. These often use information sharing to shrink
or moderate the effect size estimates, in the case of **`DESeq2`** by again
sharing information between features about sample-to-sample variability.
In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function:

```{r plot-limma-lm-pval, echo = FALSE}
plot(
@@ -724,10 +730,10 @@ plot(
abline(0:1, lty = "dashed")
```

We can see that for the vast majority of features, the results are
broadly similar. There seems to be a minor general tendency for **`limma`**
to produce smaller p-values, but for several features, the p-values from
**`limma`** are considerably larger than the p-values from `lm()`. This is
because the information sharing tends to shrink large standard error
estimates downwards and small estimates upwards. When the degree of
statistical significance is due to an abnormally small standard error
@@ -744,15 +750,15 @@ is not much opportunity to generate pooled estimates, and the evidence
of the data can easily outweigh the pooling.

Shrinkage methods like these can be complex to implement and
understand, but it is useful to develop an intuition for why these
approaches may be more precise and sensitive than the naive approach of
fitting a model to each feature separately.
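As a rough sketch of what this workflow looks like in practice (assuming `methyl_mat` holds the features-by-samples methylation matrix and `age` the matching covariate; the exact object names used in the lesson may differ):

```{r limma-workflow-sketch, eval = FALSE}
library(limma)

design <- model.matrix(~age)               # intercept plus an age coefficient
fit <- lmFit(methyl_mat, design = design)  # fit one linear model per feature
fit <- eBayes(fit)                         # pool variance information across features
topTable(fit, coef = 2, number = 10)       # summarise the age term for the top features
```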

> ## Exercise
>
> 1. Try to run the same kind of linear model with smoking status as
>    a covariate instead of age, and make a volcano plot. *Note:
>    smoking status is stored as* `methylation$smoker`.
> 2. We saw in the example in the lesson that this information sharing
> can lead to larger p-values. Why might this be preferable?
>
@@ -777,7 +783,9 @@ feature separately.
> > hypothesis is based more on a small standard error resulting
> > from abnormally low levels of variability for a given feature,
> > we might want to be a bit more conservative in our expectations.
> {: .solution}
{: .challenge}


```{r limma-app-ex, echo = FALSE, eval = FALSE}
> ## Exercise
@@ -825,11 +833,11 @@ feature separately.
> information* about the effect size between schools and shrink our
> estimates towards a common value.
>
> For example in **`DESeq2`**, the authors used the observation that genes
> with similar expression counts in RNAseq data have similar
> *dispersion*, and a better estimate of these dispersion parameters
> makes estimates of fold changes much more stable. Similarly, in
> **`limma`** the authors made the assumption that in the absence of
> biological effects, we can often expect the technical variation in the
> measurement of the expression of each of the genes to be broadly
> similar. Again, better estimates of variability allow us to prioritise
@@ -842,7 +850,8 @@ feature separately.
> Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
> - [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
> - [a (relatively technical) book by Gelman and
> Hill](http://www.stat.columbia.edu/~gelman/arm/)
{: .callout}

```{r, eval=FALSE, echo=FALSE}
# todo: callout box explaining DESeq2
Expand All @@ -852,14 +861,14 @@ feature separately.

With such a large number of features, it would be useful to decide which
features are "interesting" or "significant" for further study. However,
if we were to apply a normal significance threshold of 0.05, it would be
likely that we end up with a lot of false positives. This is because a
p-value threshold like this represents a $\frac{1}{20}$ chance that we
observe results as extreme or more extreme under the null hypothesis
(that there is no association between age and methylation level). If we
carry out many more than 20 such tests, we can expect to see situations
where, despite the null hypothesis being true, we observe significant
p-values due to random chance. To demonstrate this, it is useful to see
what happens if we permute (scramble) the age values and
run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."}
@@ -878,14 +887,14 @@ plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
abline(h = -log10(0.05), lty = "dashed", col = "red")
```

Since we have generated a random sequence of ages, we have no reason to
suspect that there is a true association between methylation levels and
this sequence of random numbers. However, you can see that the p-value
for many features is still lower than a traditional significance level
of $p=0.05$. In fact, here `r sum(toptab_age_perm$P.Value < 0.05)`
features are significant at p \< 0.05. If we were to use this fixed
threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.
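The same phenomenon is easy to reproduce with a throwaway simulation (everything here is invented for illustration): testing thousands of truly null features still flags roughly 5% of them at $p < 0.05$:

```{r null-pvals-sketch, eval = FALSE}
set.seed(1)
# p-values from 5000 tests where the null hypothesis is true by construction
null_pvals <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)
mean(null_pvals < 0.05)  # close to 0.05: hundreds of false positives from noise alone
```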

> ## Exercise
@@ -915,19 +924,21 @@ simply due to chance.
> > 3. One approach to controlling for the number of tests is to divide
> > our significance threshold by the number of tests performed.
> > This is termed "Bonferroni correction" and we'll discuss this
> > further now.
> {: .solution}
{: .challenge}

# Adjusting for multiple tests

When performing many statistical tests to categorise features, we are
effectively classifying features as "non-significant" or "significant",
the latter meaning those for which we reject the null hypothesis. We also
generally hope that there is a subset of features for which the null
hypothesis is truly false, as well as many for which the null truly does
hold. We hope that for all features for which the null hypothesis is
true, we accept it, and for all features for which the null hypothesis
is not true, we reject it. As we showed in the example with permuted
age, with a large number of tests it is inevitable that we will get some of
these wrong.
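In R, adjustments like these are available through `p.adjust()`; a minimal sketch with made-up p-values, comparing the Bonferroni correction described above with the Benjamini-Hochberg method that appears in the exercise solutions:

```{r p-adjust-sketch, eval = FALSE}
pvals <- c(0.0001, 0.004, 0.01, 0.03, 0.2, 0.7)  # made-up p-values from 6 tests
p.adjust(pvals, method = "bonferroni")  # each p-value multiplied by the number of tests
p.adjust(pvals, method = "BH")          # Benjamini-Hochberg, controls false discovery rate
```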

We can think of these features as being "truly different" or "not truly
@@ -1059,7 +1070,8 @@ experiment over and over.
> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
> > ```
> >
> {: .solution}
{: .challenge}