Merge branch 'main' into mary-suggestions-tasks28plus-ep4
mallewellyn authored Mar 21, 2024
2 parents 02054a9 + 223978e commit dc3e1f6
Showing 9 changed files with 465 additions and 424 deletions.
106 changes: 47 additions & 59 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -2,7 +2,7 @@
title: "Introduction to high-dimensional data"
author: "GS Robertson"
source: Rmd
teaching: 20
teaching: 30
exercises: 20
questions:
- What are high-dimensional data and what do these data look like in the
@@ -38,37 +38,31 @@ knitr_fig_path("01-")

# What are high-dimensional data?

*High-dimensional data* are defined as data in which the number of features (variables observed),
$p$, are close to or larger than the number of observations (or data points), $n$.
The opposite is *low-dimensional data* in which the number of observations,
$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which
refers to data with numerous features irrespective of the number of observations (similarly,
*tall data* is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that
come from having more features than observations.

High-dimensional data have become more common in many scientific fields as new
automated data collection techniques have been developed. More and more datasets
have a large number of features and some have as many features as there are rows
in the dataset. Datasets in which $p \geq n$ are becoming more common. Such datasets
pose a challenge for data analysis as standard methods of analysis, such as linear
regression, are no longer appropriate.

High-dimensional datasets are common in the biological sciences. Data sets in subjects like
genomics and medical sciences are often tall (with large $n$) and wide
(with large $p$), and can be difficult to analyse or visualise using
standard statistical tools. An example of high-dimensional data in biological
sciences may include data collected from hospital patients recording symptoms,
blood test results, behaviours, and general health, resulting in datasets with
large numbers of features. Researchers often want to relate these features to
specific patient outcomes (e.g. survival, length of time spent in hospital).
An example of what high-dimensional data might look like in a biomedical study
is shown in the figure below.
*High-dimensional data* are defined as data with many features (variables observed).
In recent years, advances in information technology have allowed large amounts of data to
be collected and stored with relative ease. As such, high-dimensional
data have become more common in many scientific fields, including the biological sciences,
where datasets in subjects like genomics and medical sciences often have a large number of features.
For example, hospital data may record many variables, including symptoms,
blood test results, behaviours, and general health. An example of what high-dimensional data might look like
in a biomedical study is shown in the figure below.



```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. Each row contains the data for individual patients."}
knitr::include_graphics(here::here("fig/intro-table.png"))
knitr::include_graphics("../fig/intro-table.png")
```

Researchers often want to relate such features to specific patient outcomes
(e.g. survival, length of time spent in hospital). However, analysing
high-dimensional data can be extremely challenging since standard methods of analysis,
such as individual plots of features and linear regression,
are no longer appropriate when we have many features.
In this lesson, we will learn alternative methods
for dealing with high-dimensional data and discover how these can be applied
for practical high-dimensional data analysis in the biological sciences.




> ## Challenge 1
@@ -92,10 +86,10 @@
>
> > ## Solution
> >
> > 1. No. The number of observations (100 patients) is far greater than the number of features (3).
> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000+3 features.
> > 3. No. There are many more observations (200 patients) than features (5).
> > 4. Yes. There is only one observation of more than 20,000 features.
> > 1. No. The number of features is relatively small (4, including the response variable, which is itself an observed variable).
> > 2. Yes, this is an example of high-dimensional data. There are 200,004 features.
> > 3. No. The number of features is relatively small (6).
> > 4. Yes. There are 20,008 features.
> {: .solution}
{: .challenge}

@@ -107,20 +101,15 @@ about the challenges we face in analysing them.
# Why is dealing with high-dimensional data challenging?

Most classical statistical methods are set up for use on low-dimensional data
(i.e. data where the number of observations $n$ is much larger than the number
of features $p$). This is because low-dimensional data were much more common in
the past when data collection was more difficult and time consuming. In recent
years advances in information technology have allowed large amounts of data to
be collected and stored with relative ease. This has allowed large numbers of
features to be collected, meaning that datasets in which $p$ matches or exceeds
$n$ are common (collecting observations is often more difficult or expensive
than collecting many features from a single observation).

Datasets with large numbers of features are difficult to visualise. When
exploring low-dimensional datasets, it is possible to plot the response variable
against each of the limited number of explanatory variables to get an idea which
of these are important predictors of the response. With high-dimensional data
the large number of explanatory variables makes doing this difficult. In some
(i.e. with a small number of features, $p$).
This is because low-dimensional data were much more common in
the past when data collection was more difficult and time consuming.

One challenge when analysing high-dimensional data is visualising the many variables.
When exploring low-dimensional datasets, it is possible to plot the response variable
against each of the features to get an idea of which
of these are important predictors of the response. With high-dimensional data,
the large number of features makes doing this difficult. In addition, in some
high-dimensional datasets it can be difficult to identify a single response
variable, making standard data exploration and analysis techniques less useful.
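
As a rough illustration with simulated data (not the hospital dataset above), the sketch below shows why plotting the response against each feature stops being practical as the number of features grows; all object names here are made up for the example.

```r
# Simulated data: 100 observations of 5000 features plus a response.
set.seed(42)
n <- 100
p <- 5000
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("feature_", seq_len(p))))
y <- rnorm(n)

# Plotting the response against a handful of features is easy...
par(mfrow = c(1, 3))
for (j in 1:3) plot(x[, j], y, xlab = colnames(x)[j], ylab = "response")
par(mfrow = c(1, 1))

# ...but covering every feature would require one plot per column:
p  # 5000 plots
```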

@@ -189,17 +178,20 @@ of the challenges we are facing when working with high-dimensional data.
> improve the reproducibility of an analysis.
{: .callout}
Imagine we are carrying out least squares regression on a dataset with 25
As well as the problems caused by having many variables,
having relatively few observations ($n$) compared to the number of features ($p$) causes
additional challenges. To illustrate these challenges,
imagine we are carrying out least squares regression on a dataset with 25
observations. Fitting a best fit line through these data produces a plot shown
in the left-hand panel of the figure below.
However, imagine a situation in which the number of observations and features in a
dataset are almost equal. In that situation the effective number of observations
per features is low. The result of fitting a best fit line through
per feature is low. The result of fitting a best fit line through
few observations can be seen in the right-hand panel below.
```{r intro-figure, echo = FALSE, fig.cap = "Scatter plot of two variables (x and y) from a data set with 25 observations (left) and 2 observations (right) with a fitted regression line (red).", fig.alt = "Two scatter plots side-by-side, each plotting the relationship between two variables. The scatter plot on the left hand side shows 25 observations and a regression line with the points evenly scattered around. The scatter plot on the right hand side shows 2 observations and a regression line that goes through both points."}
knitr::include_graphics(here::here("fig/intro-scatterplot.png"))
knitr::include_graphics("../fig/intro-scatterplot.png")
```
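
A minimal sketch (with simulated data, not the data behind the figure) of how the two situations can be reproduced in R:

```r
# Least squares regression with 25 observations versus only 2 observations.
set.seed(1)

x_many <- rnorm(25)
y_many <- 2 * x_many + rnorm(25)   # 25 noisy observations
fit_many <- lm(y_many ~ x_many)

x_few <- rnorm(2)
y_few <- 2 * x_few + rnorm(2)      # only 2 observations
fit_few <- lm(y_few ~ x_few)

# With 2 observations and 2 coefficients, the fitted line passes exactly
# through both points, so the fit looks "perfect" despite telling us
# very little about the underlying relationship:
summary(fit_many)$r.squared
summary(fit_few)$r.squared         # effectively 1
```
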
In the first situation, the least squares regression line does not fit the data
@@ -258,7 +250,7 @@ explore why high correlations might be an issue in a Challenge.
> > ```
> >
> > Based on these results we conclude that both `gleason` and `pgg45` have a
> > statistically significan univariate effect (also referred to as a marginal
> > statistically significant univariate effect (also referred to as a marginal
> > effect) as predictors of age (5% significance level).
> >
Fitting a multivariate regression model using both `gleason` and `pgg45`
@@ -294,15 +286,11 @@ regression.
# What statistical methods are used to analyse high-dimensional data?
As we found out in the above challenges, carrying out linear regression on
datasets with large numbers of features can be difficult due to: high levels of correlation
between variables; difficulty in identifying a clear response variable; and risk
of overfitting. These problems are common to the analysis of many high-dimensional datasets,
for example, those using genomics data with multiple genes, or species
composition data in an environment where the relative abundance of different species
within a community is of interest. For such datasets, other statistical methods
may be used to examine whether groups of observations show similar characteristics
and whether these groups may relate to other features in the data (e.g.
We have seen that high-dimensional data analysis can be challenging: the many variables are difficult to visualise,
making it hard to identify relationships between variables or a suitable response variable; we may have
relatively few observations compared to features, leading to over-fitting; and features may be highly correlated, making
models difficult to interpret. We therefore require alternative approaches to examine whether, for example,
groups of observations show similar characteristics and whether these groups may relate to other features in the data (e.g.
phenotype in genetics data).
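
As a small simulated sketch of the correlation problem mentioned above (all objects here are made up for the example), two nearly identical predictors can each look strongly associated with the response on their own, yet give unstable estimates when fitted together:

```r
# Two highly correlated predictors of a simulated response.
set.seed(10)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

# Individually, each predictor appears strongly associated with y:
summary(lm(y ~ x1))$coefficients
summary(lm(y ~ x2))$coefficients

# Jointly, the two estimates have much larger standard errors and are
# hard to interpret, because the model cannot separate their effects:
summary(lm(y ~ x1 + x2))$coefficients
```
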
In this course, we will cover four methods that help in dealing with high-dimensional data:
59 changes: 29 additions & 30 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -1,8 +1,8 @@
---
title: "Regression with many outcomes"
source: Rmd
teaching: 60
exercises: 30
teaching: 70
exercises: 50
questions:
- "How can we apply linear regression in a high-dimensional setting?"
- "How can we benefit from the fact that we have many outcomes?"
@@ -113,7 +113,7 @@ In this episode, we will focus on the association between age and
methylation. The following heatmap summarises age and methylation levels
available in the methylation dataset:

```{r heatmap, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."}
```{r heatmap, fig.cap="Heatmap of methylation values across all features.", fig.alt="Heatmap of methylation values across all features showing that there are many features. Samples are ordered according to age."}
age <- methylation$Age
library("ComplexHeatmap")
@@ -130,26 +130,11 @@ Heatmap(methyl_mat_ord,
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
```
Depending on the scientific question of interest, two types of high-dimensional
problems could be explored in this context:

1. To predict age using methylation levels as predictors. In this case, we would
have a single outcome (age) which will be predicted using 5000 covariates
(methylation levels across the genome).

2. To predict methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

The examples in this episode will focus on the second type of problem, whilst
the next episode will focus on the first.

> ## Challenge 1
>
> Why can we not just fit many linear regression models, one for each of the columns
> in the `colData` above against each of the features in the matrix of
> assays, and choose all of the significant results at a p-value of
> 0.05?
> Why can we not just fit many linear regression models relating every combination of features
> (`colData` and assays) and draw conclusions by associating all variables with significant model p-values?
>
> > ## Solution
> >
@@ -173,6 +158,19 @@ the next episode will focus on the first.
> {: .solution}
{: .challenge}

In general, it is scientifically interesting to explore two modelling problems using the three types of data:

1. Predicting methylation levels using age as a predictor. In this case, we
would have 5000 outcomes (methylation levels across the genome) and a single
covariate (age).

2. Predicting age using methylation levels as predictors. In this case, we would
have a single outcome (age) which will be predicted using 5000 covariates
(methylation levels across the genome).

The examples in this episode will focus on the first type of problem, whilst
the next episode will focus on the second.
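
As a rough sketch of what the first type of problem involves, we could in principle fit one linear model per methylation feature, with age as the single covariate. This assumes the `methyl_mat` matrix (features in rows, samples in columns) and the `age` vector used elsewhere in this episode; the episode itself goes on to use **`limma`** rather than looping over `lm()`.

```r
# Naive sketch (not the approach used later in this episode): one linear
# model per methylation feature, with age as the single covariate.
# Assumes `methyl_mat` (features in rows, samples in columns) and `age`.
fit_one_feature <- function(feature_values, age) {
  fit <- lm(feature_values ~ age)
  summary(fit)$coefficients["age", ]  # estimate, std. error, t value, p-value
}

# Applying this to every row gives one set of results per feature.
# This is slow for thousands of features, which is one reason limma is used.
results <- t(apply(methyl_mat, 1, fit_one_feature, age = age))
head(results)
```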

> ## Measuring DNA Methylation
>
> DNA methylation is an epigenetic modification of DNA. Generally, we
Expand Down Expand Up @@ -229,12 +227,12 @@ to help us understand how ageing manifests.

Using linear regression, it is possible to identify differences like
these. However, high-dimensional data like the ones we're working with
require some special considerations. A primary consideration, as we saw
require some special considerations. A first consideration, as we saw
above, is that there are far too many features to fit each one-by-one as
we might do when analysing low-dimensional datasets (for example using
`lm` on each feature and checking the linear model assumptions). A
secondary consideration is that statistical approaches may behave
slightly differently in very high-dimensional data, compared to
second consideration is that statistical approaches may behave
slightly differently when applied to very high-dimensional data, compared to
low-dimensional data. A third consideration is the speed at which we can
actually compute statistics for data this large -- methods optimised for
low-dimensional data may be very slow when applied to high-dimensional
@@ -521,7 +519,7 @@ p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits =
this is larger, relative to the total area of the distribution, therefore the
p-value is larger than the one for the intercept term
($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). The
the p-value is a function of the test statistic: the ratio between the effect size
p-value is a function of the test statistic: the ratio between the effect size
we're estimating and the uncertainty we have in that effect. A large effect with large
uncertainty may not lead to a small p-value, and a small effect with
small uncertainty may lead to a small p-value.
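
As a small worked illustration of that relationship (the numbers here are made up, not taken from the methylation model):

```r
# The t-statistic is the ratio of the effect size estimate to its standard
# error; the p-value follows from the t distribution (35 residual degrees
# of freedom is an arbitrary choice for this illustration).
effect_size    <- 0.5
standard_error <- 0.1
df             <- 35

t_statistic <- effect_size / standard_error
p_value     <- 2 * pt(abs(t_statistic), df = df, lower.tail = FALSE)
t_statistic
p_value

# The same effect size with much larger uncertainty gives a much larger p-value:
2 * pt(abs(0.5 / 0.4), df = df, lower.tail = FALSE)
```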
@@ -708,7 +706,7 @@ while the y-axis is the $-\log_{10}(\text{p-value})$, where larger
values indicate increasing statistical evidence of a non-zero effect
size. A positive effect size represents increasing methylation with
increasing age, and a negative effect size represents decreasing
methylation with increasing age. Points higher on the x-axis represent
methylation with increasing age. Points higher on the y-axis represent
features for which we think the results we observed would be very
unlikely under the null hypothesis.
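
A minimal sketch of how such a volcano plot might be drawn in base R. It assumes `toptab_age`, the **`limma`** results table referred to elsewhere in this episode, with columns `logFC` (effect size) and `P.Value`; treat the object and column names as assumptions rather than a prescribed interface.

```r
# Volcano plot sketch: effect size on the x-axis, -log10(p-value) on the
# y-axis. Assumes `toptab_age` has columns `logFC` and `P.Value`.
plot(
  toptab_age$logFC,
  -log10(toptab_age$P.Value),
  xlab = "Effect size (logFC)",
  ylab = "-log10(p-value)",
  pch = 19
)
abline(h = -log10(0.05), lty = "dashed")  # nominal 0.05 significance line
```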

@@ -749,7 +747,7 @@ or information sharing that **`limma`** performs has on our results. To do
this, let us compare the effect sizes estimates and p-values from the two
approaches.

```{r plot-limma-lm-effect, echo = FALSE}
```{r plot-limma-lm-effect, echo = FALSE, fig.cap = "Plot of effect sizes using limma vs. those using lm.", fig.alt = "A scatter plot of the effect sizes using limma vs. those using lm. The plot also shows a straight line through all points, showing that the effect sizes are the same."}
plot(
coef_df[["estimate"]],
toptab_age[coef_df[["feature"]], "logFC"],
@@ -769,7 +767,7 @@ or moderate the effect size estimates, in the case of **`DESeq2`** by again
sharing information between features about sample-to-sample variability.
In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function:

```{r plot-limma-lm-pval, echo = FALSE}
```{r plot-limma-lm-pval, echo = FALSE, fig.cap = "Plot of p-values using limma vs. those using lm.", fig.alt = "A scatter plot of the p-values using limma vs. those using lm. A dashed y = x line is also displayed; relative to this line, the p-values from limma tend to be smaller than those from lm towards the left of the plot and larger towards the right of the plot."}
plot(
coef_df[["p.value"]],
toptab_age[coef_df[["feature"]], "P.Value"],
Expand All @@ -782,7 +780,7 @@ plot(
abline(0:1, lty = "dashed")
```

we can see that for the vast majority of features, the results are
We can see that for the vast majority of features, the results are
broadly similar. There seems to be a minor general tendency for **`limma`**
to produce smaller p-values, but for several features, the p-values from
limma are considerably larger than the p-values from `lm()`. This is
@@ -925,6 +923,7 @@ run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."}
set.seed(123)
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)
@@ -951,7 +950,7 @@ simply due to chance.

> ## Challenge 5
>
> 1. If we run `r nrow(methylation)` tests under the null hypothesis,
> 1. If we run `r nrow(methylation)` tests, even if there are no true differences,
> how many of them (on average) will be statistically significant at
> a threshold of $p < 0.05$?
> 2. Why would we want to be conservative in labelling features as
@@ -1047,7 +1046,7 @@ tests we performed! This is not ideal sometimes, because unfortunately
we usually don't have very large sample sizes in health sciences.

The second main way of controlling for multiple tests is to control the
*false discovery rate*.[^3] This is the proportion of false positives,
*false discovery rate (FDR)*.[^3] This is the proportion of false positives,
or false discoveries, we'd expect to get each time if we repeated the
experiment over and over.
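
As a small sketch with made-up p-values, both kinds of adjustment are available through R's built-in `p.adjust()`; Bonferroni corresponds to the stricter approach described above, while the Benjamini-Hochberg method controls the FDR:

```r
# Made-up p-values illustrating the two adjustment approaches.
p_values <- c(0.0001, 0.001, 0.01, 0.02, 0.04, 0.2, 0.5, 0.9)

# Bonferroni controls the chance of any false positive (very conservative):
p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate (less conservative):
p.adjust(p_values, method = "BH")
```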
