diff --git a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd index 68d43de7..07cb9150 100644 --- a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd +++ b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd @@ -2,7 +2,7 @@ title: "Introduction to high-dimensional data" author: "GS Robertson" source: Rmd -teaching: 20 +teaching: 30 exercises: 20 questions: - What are high-dimensional data and what do these data look like in the @@ -38,37 +38,31 @@ knitr_fig_path("01-") # What are high-dimensional data? -*High-dimensional data* are defined as data in which the number of features (variables observed), -$p$, are close to or larger than the number of observations (or data points), $n$. -The opposite is *low-dimensional data* in which the number of observations, -$n$, far outnumbers the number of features, $p$. A related concept is *wide data*, which -refers to data with numerous features irrespective of the number of observations (similarly, -*tall data* is often used to denote data with a large number of observations). -Analyses of high-dimensional data require consideration of potential problems that -come from having more features than observations. - -High-dimensional data have become more common in many scientific fields as new -automated data collection techniques have been developed. More and more datasets -have a large number of features and some have as many features as there are rows -in the dataset. Datasets in which $p \geq n$ are becoming more common. Such datasets -pose a challenge for data analysis as standard methods of analysis, such as linear -regression, are no longer appropriate. - -High-dimensional datasets are common in the biological sciences. Data sets in subjects like -genomics and medical sciences are often tall (with large $n$) and wide -(with large $p$), and can be difficult to analyse or visualise using -standard statistical tools. An example of high-dimensional data in biological -sciences may include data collected from hospital patients recording symptoms, -blood test results, behaviours, and general health, resulting in datasets with -large numbers of features. Researchers often want to relate these features to -specific patient outcomes (e.g. survival, length of time spent in hospital). -An example of what high-dimensional data might look like in a biomedical study -is shown in the figure below. +*High-dimensional data* are defined as data with many features (variables observed). +In recent years, advances in information technology have allowed large amounts of data to +be collected and stored with relative ease. As such, high-dimensional +data have become more common in many scientific fields, including the biological sciences, +where datasets in subjects like genomics and medical sciences often have a large numbers of features. +For example, hospital data may record many variables, including symptoms, +blood test results, behaviours, and general health. An example of what high-dimensional data might look like +in a biomedical study is shown in the figure below. + + ```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. 
Each row contains the data for individual patients."} -knitr::include_graphics(here::here("fig/intro-table.png")) +knitr::include_graphics("../fig/intro-table.png") ``` +Researchers often want to relate such features to specific patient outcomes +(e.g. survival, length of time spent in hospital). However, analysing +high-dimensional data can be extremely challenging since standard methods of analysis, +such as individual plots of features and linear regression, +are no longer appropriate when we have many features. +In this lesson, we will learn alternative methods +for dealing with high-dimensional data and discover how these can be applied +for practical high-dimensional data analysis in the biological sciences. + + > ## Challenge 1 @@ -92,10 +86,10 @@ knitr::include_graphics(here::here("fig/intro-table.png")) > > > ## Solution > > -> > 1. No. The number of observations (100 patients) is far greater than the number of features (3). -> > 2. Yes, this is an example of high-dimensional data. There are only 100 observations but 200,000+3 features. -> > 3. No. There are many more observations (200 patients) than features (5). -> > 4. Yes. There is only one observation of more than 20,000 features. +> > 1. No. The number of features is relatively small (4 including the response variable since this is an observed variable). +> > 2. Yes, this is an example of high-dimensional data. There are 200,004 features. +> > 3. No. The number of features is relatively small (6). +> > 4. Yes. There are 20,008 features. > {: .solution} {: .challenge} @@ -107,20 +101,15 @@ about the challenges we face in analysing them. # Why is dealing with high-dimensional data challenging? Most classical statistical methods are set up for use on low-dimensional data -(i.e. data where the number of observations $n$ is much larger than the number -of features $p$). This is because low-dimensional data were much more common in -the past when data collection was more difficult and time consuming. In recent -years advances in information technology have allowed large amounts of data to -be collected and stored with relative ease. This has allowed large numbers of -features to be collected, meaning that datasets in which $p$ matches or exceeds -$n$ are common (collecting observations is often more difficult or expensive -than collecting many features from a single observation). - -Datasets with large numbers of features are difficult to visualise. When -exploring low-dimensional datasets, it is possible to plot the response variable -against each of the limited number of explanatory variables to get an idea which -of these are important predictors of the response. With high-dimensional data -the large number of explanatory variables makes doing this difficult. In some +(i.e. with a small number of features, $p$). +This is because low-dimensional data were much more common in +the past when data collection was more difficult and time consuming. + +One challenge when analysing high-dimensional data is visualising the many variables. +When exploring low-dimensional datasets, it is possible to plot the response variable +against each of features to get an idea which +of these are important predictors of the response. With high-dimensional data, +the large number of features makes doing this difficult. In addition, in some high-dimensional datasets it can also be difficult to identify a single response variable, making standard data exploration and analysis techniques less useful. 
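+To see why, consider the sketch below (a minimal illustration, not part of the original lesson code), which assumes a
+hypothetical data frame `patient_data` whose first column is the outcome of interest and whose remaining columns are
+features. Looping over features like this is feasible for a handful of columns, but not for the thousands of features
+in a typical high-dimensional dataset.
+
+```{r plot-each-feature-sketch, eval=FALSE}
+# Plot the outcome against each feature in turn. With p features this produces
+# p separate plots, which quickly becomes impractical as p grows.
+outcome <- patient_data[, 1]
+features <- patient_data[, -1]
+for (feature_name in colnames(features)) {
+  plot(features[[feature_name]], outcome,
+       xlab = feature_name, ylab = "Outcome")
+}
+```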
@@ -189,17 +178,20 @@ of the challenges we are facing when working with high-dimensional data. > improve the reproducibility of an analysis. {: .callout} -Imagine we are carrying out least squares regression on a dataset with 25 +As well as many variables causing problems when working with high-dimensional data, +having relatively few observations ($n$) compared to the number of features ($p$) causes +additional challenges. To illustrate these challenges, +imagine we are carrying out least squares regression on a dataset with 25 observations. Fitting a best fit line through these data produces a plot shown in the left-hand panel of the figure below. However, imagine a situation in which the number of observations and features in a dataset are almost equal. In that situation the effective number of observations -per features is low. The result of fitting a best fit line through +per feature is low. The result of fitting a best fit line through few observations can be seen in the right-hand panel below. ```{r intro-figure, echo = FALSE, fig.cap = "Scatter plot of two variables (x and y) from a data set with 25 observations (left) and 2 observations (right) with a fitted regression line (red).", fig.alt = "Two scatter plots side-by-side, each plotting the relationship between two variables. The scatter plot on the left hand side shows 25 observations and a regression line with the points evenly scattered around. The scatter plot on the right hand side shows 2 observations and a regression line that goes through both points."} -knitr::include_graphics(here::here("fig/intro-scatterplot.png")) +knitr::include_graphics("../fig/intro-scatterplot.png") ``` In the first situation, the least squares regression line does not fit the data @@ -258,7 +250,7 @@ explore why high correlations might be an issue in a Challenge. > > ``` > > > > Based on these results we conclude that both `gleason` and `pgg45` have a -> > statistically significan univariate effect (also referred to as a marginal +> > statistically significant univariate effect (also referred to as a marginal > > effect) as predictors of age (5% significance level). > > > > Fitting a multivariate regression model using both both `gleason` and `pgg45` @@ -294,15 +286,11 @@ regression. # What statistical methods are used to analyse high-dimensional data? -As we found out in the above challenges, carrying out linear regression on -datasets with large numbers of features can be difficult due to: high levels of correlation -between variables; difficulty in identifying a clear response variable; and risk -of overfitting. These problems are common to the analysis of many high-dimensional datasets, -for example, those using genomics data with multiple genes, or species -composition data in an environment where the relative abundance of different species -within a community is of interest. For such datasets, other statistical methods -may be used to examine whether groups of observations show similar characteristics -and whether these groups may relate to other features in the data (e.g. +We have discussed so far that high-dimensional data analysis can be challenging since variables are difficult to visualise, +leading to challenges identifying relationships between variables and suitable response variables; we may have +relatively few observations compared to features, leading to over-fitting; and features may be highly correlated, leading to +challenges interpreting models. 
We therefore require alternative approaches to examine whether, for example, +groups of observations show similar characteristics and whether these groups may relate to other features in the data (e.g. phenotype in genetics data). In this course, we will cover four methods that help in dealing with high-dimensional data: diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 9f4bda95..2097bac7 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -1,8 +1,8 @@ --- title: "Regression with many outcomes" source: Rmd -teaching: 60 -exercises: 30 +teaching: 70 +exercises: 50 questions: - "How can we apply linear regression in a high-dimensional setting?" - "How can we benefit from the fact that we have many outcomes?" @@ -113,7 +113,7 @@ In this episode, we will focus on the association between age and methylation. The following heatmap summarises age and methylation levels available in the methylation dataset: -```{r heatmap, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."} +```{r heatmap, fig.cap="Heatmap of methylation values across all features.", fig.alt="Heatmap of methylation values across all features showing that there are many features. Samples are ordered according to age."} age <- methylation$Age library("ComplexHeatmap") @@ -130,26 +130,11 @@ Heatmap(methyl_mat_ord, column_title = "Sample", top_annotation = columnAnnotation(age = age_ord)) ``` -Depending on the scientific question of interest, two types of high-dimensional -problems could be explored in this context: - -1. To predict age using methylation levels as predictors. In this case, we would -have a single outcome (age) which will be predicted using 5000 covariates -(methylation levels across the genome). - -2. To predict methylation levels using age as a predictor. In this case, we -would have 5000 outcomes (methylation levels across the genome) and a single -covariate (age). - -The examples in this episode will focus on the second type of problem, whilst -the next episode will focus on the first. > ## Challenge 1 > -> Why can we not just fit many linear regression models, one for each of the columns -> in the `colData` above against each of the features in the matrix of -> assays, and choose all of the significant results at a p-value of -> 0.05? +> Why can we not just fit many linear regression models relating every combination of feature +> (`colData` and assays) and draw conclusions by associating all variables with significant model p-values? > > > ## Solution > > @@ -173,6 +158,19 @@ the next episode will focus on the first. > {: .solution} {: .challenge} +In general, it is scientifically interesting to explore two modelling problems using the three types of data: + +1. Predicting methylation levels using age as a predictor. In this case, we +would have 5000 outcomes (methylation levels across the genome) and a single +covariate (age). + +2. Predicting age using methylation levels as predictors. In this case, we would +have a single outcome (age) which will be predicted using 5000 covariates +(methylation levels across the genome). + +The examples in this episode will focus on the first type of problem, whilst +the next episode will focus on the second. 
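+As a small, concrete illustration of the first type of problem, the sketch below (not part of the original lesson code)
+fits a single linear model for one methylation feature, using age as the only predictor. It assumes `methyl_mat` is the
+matrix of methylation values used in the heatmap above, with features in rows and samples in columns, and `age` is the
+vector of sample ages.
+
+```{r single-feature-lm-sketch, eval=FALSE}
+# Model one outcome (the first methylation feature) as a function of age.
+# This episode is about doing this kind of regression for every feature,
+# that is, fitting thousands of such models at once.
+fit_one_feature <- lm(methyl_mat[1, ] ~ age)
+summary(fit_one_feature)$coefficients
+```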
+ > ## Measuring DNA Methylation > > DNA methylation is an epigenetic modification of DNA. Generally, we @@ -229,12 +227,12 @@ to help us understand how ageing manifests. Using linear regression, it is possible to identify differences like these. However, high-dimensional data like the ones we're working with -require some special considerations. A primary consideration, as we saw +require some special considerations. A first consideration, as we saw above, is that there are far too many features to fit each one-by-one as we might do when analysing low-dimensional datasets (for example using `lm` on each feature and checking the linear model assumptions). A -secondary consideration is that statistical approaches may behave -slightly differently in very high-dimensional data, compared to +second consideration is that statistical approaches may behave +slightly differently when applied to very high-dimensional data, compared to low-dimensional data. A third consideration is the speed at which we can actually compute statistics for data this large -- methods optimised for low-dimensional data may be very slow when applied to high-dimensional @@ -521,7 +519,7 @@ p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = this is larger, relative to the total area of the distribution, therefore the p-value is larger than the one for the intercept term ($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). The -the p-value is a function of the test statistic: the ratio between the effect size +p-value is a function of the test statistic: the ratio between the effect size we're estimating and the uncertainty we have in that effect. A large effect with large uncertainty may not lead to a small p-value, and a small effect with small uncertainty may lead to a small p-value. @@ -708,7 +706,7 @@ while the y-axis is the $-\log_{10}(\text{p-value})$, where larger values indicate increasing statistical evidence of a non-zero effect size. A positive effect size represents increasing methylation with increasing age, and a negative effect size represents decreasing -methylation with increasing age. Points higher on the x-axis represent +methylation with increasing age. Points higher on the y-axis represent features for which we think the results we observed would be very unlikely under the null hypothesis. @@ -749,7 +747,7 @@ or information sharing that **`limma`** performs has on our results. To do this, let us compare the effect sizes estimates and p-values from the two approaches. -```{r plot-limma-lm-effect, echo = FALSE} +```{r plot-limma-lm-effect, echo = FALSE, fig.cap = "Plot of effect sizes using limma vs. those using lm.", fig.alt = "A scatter plot of the effect size using limmma vs. those using lm. The plot also shows a straight line through all points showing that the effect sizes are the same."} plot( coef_df[["estimate"]], toptab_age[coef_df[["feature"]], "logFC"], @@ -769,7 +767,7 @@ or moderate the effect size estimates, in the case of **`DESeq2`** by again sharing information between features about sample-to-sample variability. In contrast, let us look at the p-values from **`limma`** and R's built-in `lm()` function: -```{r plot-limma-lm-pval, echo = FALSE} +```{r plot-limma-lm-pval, echo = FALSE, fig.cap = "Plot of p-values using limma vs. those using lm.", fig.alt = "A scatter plot of the p-values using limma vs. those using lm. 
A straight line is also displayed, showing that the p-values for limma tend to be smaller than those using lm towards the left of the plot and higher towards the right of the plot."} plot( coef_df[["p.value"]], toptab_age[coef_df[["feature"]], "P.Value"], @@ -782,7 +780,7 @@ plot( abline(0:1, lty = "dashed") ``` -we can see that for the vast majority of features, the results are +We can see that for the vast majority of features, the results are broadly similar. There seems to be a minor general tendency for **`limma`** to produce smaller p-values, but for several features, the p-values from limma are considerably larger than the p-values from `lm()`. This is @@ -925,6 +923,7 @@ run the same test again: ```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. A dashed line represents a 0.05 significance level."} +set.seed(123) age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))] design_age_perm <- model.matrix(~age_perm) @@ -951,7 +950,7 @@ simply due to chance. > ## Challenge 5 > -> 1. If we run `r nrow(methylation)` tests under the null hypothesis, +> 1. If we run `r nrow(methylation)` tests, even if there are no true differences, > how many of them (on average) will be statistically significant at > a threshold of $p < 0.05$? > 2. Why would we want to be conservative in labelling features as @@ -1047,7 +1046,7 @@ tests we performed! This is not ideal sometimes, because unfortunately we usually don't have very large sample sizes in health sciences. The second main way of controlling for multiple tests is to control the -*false discovery rate*.[^3] This is the proportion of false positives, +*false discovery rate (FDR)*.[^3] This is the proportion of false positives, or false discoveries, we'd expect to get each time if we repeated the experiment over and over. diff --git a/_episodes_rmd/03-regression-regularisation.Rmd b/_episodes_rmd/03-regression-regularisation.Rmd index 634a6168..58c50061 100644 --- a/_episodes_rmd/03-regression-regularisation.Rmd +++ b/_episodes_rmd/03-regression-regularisation.Rmd @@ -1,8 +1,8 @@ --- title: "Regularised regression" source: Rmd -teaching: 60 -exercises: 20 +teaching: 110 +exercises: 60 questions: - "What is regularisation?" - "How does regularisation work?" @@ -42,11 +42,12 @@ feature selection and it is particularly useful when dealing with high-dimension One reason that we need special statistical tools for high-dimensional data is that standard linear models cannot handle high-dimensional data sets -- one cannot fit a linear model where there are more features (predictor variables) than there are observations -(data points). In the previous lesson we dealt with this problem by fitting individual +(data points). In the previous lesson, we dealt with this problem by fitting individual models for each feature and sharing information among these models. Now we will -take a look at an alternative approach called regularisation. Regularisation can be used to -stabilise coefficient estimates (and thus to fit models with more features than observations) -and even to select a subset of relevant features. +take a look at an alternative approach that can be used to fit models with more +features than observations by stabilising coefficient estimates. This approach is called +regularisation. 
Compared to many other methods, regularisation is also often very fast +and can therefore be extremely useful in practice. First, let us check out what happens if we try to fit a linear model to high-dimensional data! We start by reading in the data from the last lesson: @@ -76,34 +77,34 @@ summary(fit) You can see that we're able to get some effect size estimates, but they seem very high! The summary also says that we were unable to estimate effect sizes for `r format(sum(is.na(coef(fit))), big.mark=",")` features -because of "singularities". What this means is that R couldn't find a way to -perform the calculations necessary due to the fact that we have more features -than observations. - +because of "singularities". We clarify what singularities are in the note below +but this means that R couldn't find a way to +perform the calculations necessary to fit the model. Large effect sizes and singularities are common +when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data), +often since the model cannot distinguish between the effects of many, correlated features or +when we have more features than observations. > ## Singularities > > The message that `lm` produced is not necessarily the most intuitive. What -> are "singularities", and why are they an issue? A singular matrix +> are "singularities" and why are they an issue? A singular matrix > is one that cannot be -> [inverted](https://en.wikipedia.org/wiki/Invertible_matrix). -> The inverse of an $n \times n$ square matrix $A$ is the matrix $B$ for which -> $AB = BA = I_n$, where $I_n$ is the $n \times n$ identity matrix. -> -> Why is the inverse important? Well, to find the -> coefficients of a linear model of a matrix of predictor features $X$ and an -> outcome vector $y$, we may perform the calculation +> [inverted](https://en.wikipedia.org/wiki/Invertible_matrix). R uses +> inverse operations to fit linear models (find the coefficients) using: > > $$ -> (X^TX)^{-1}X^Ty +> (X^TX)^{-1}X^Ty, > $$ > -> You can see that, if we're unable to find the inverse of the matrix $X^TX$, -> then we'll be unable to find the regression coefficients. +> where $X$ is a matrix of predictor features and $y$ is the outcome vector. +> Thus, if the matrix $X^TX$ cannot be inverted to give $(X^TX)^{-1}$, R +> cannot fit the model and returns the error that there are singularities. > -> Why might this be the case? +> Why might R be unable to calculate $(X^TX)^{-1}$ and return the error that there are singularities? > Well, when the [determinant](https://en.wikipedia.org/wiki/Determinant) -> of the matrix is zero, we are unable to find its inverse. +> of the matrix is zero, we are unable to find its inverse. The determinant +> of the matrix is zero when there are more features than observations or often when +> the features are highly correlated. > > ```{r determinant} > xtx <- t(methyl_mat) %*% methyl_mat @@ -113,14 +114,11 @@ than observations. > ## Correlated features -- common in high-dimensional data > -> So, we can't fit a standard linear model to high-dimensional data. But there -> is another issue. In high-dimensional datasets, there +> In high-dimensional datasets, there > are often multiple features that contain redundant information (correlated features). -> -> We have seen in the first episode that correlated features can make it hard -> (or impossible) to correctly infer parameters. 
If we visualise the level of +> If we visualise the level of > correlation between sites in the methylation dataset, we can see that many -> of the features essentially represent the same information - there are many +> of the features represent the same information - there are many > off-diagonal cells, which are deep red or blue. For example, the following > heatmap visualises the correlations for the first 500 features in the > `methylation` dataset (we selected 500 features only as it can be hard to @@ -138,10 +136,6 @@ library("ComplexHeatmap") > ) > ``` > -> Correlation between features can be problematic for technical reasons. If it is -> very severe, it may even make it impossible to fit a model! This is in addition to -> the fact that with more features than observations, we can't even estimate -> the model properly. Regularisation can help us to deal with correlated features. {: .callout} @@ -149,7 +143,7 @@ library("ComplexHeatmap") > ## Challenge 1 > -> Discuss in groups: +> Consider or discuss in groups: > > 1. Why would we observe correlated features in high-dimensional biological > data? @@ -172,6 +166,13 @@ library("ComplexHeatmap") > {: .solution} {: .challenge} +Regularisation can help us to deal with correlated features, as well as effectively reduce +the number of features in our model. Before we describe regularisation, let's recap what's +going on when we fit a linear model. + + + + # Coefficient estimates of a linear model When we fit a linear model, we're finding the line through our data that @@ -276,10 +277,7 @@ the sum above is the line of best fit through our data when considering this goal of minimising the sum of squared error. However, it is not the only possible line we could use! For example, we might want to err on the side of caution when estimating effect sizes (coefficients). That is, we might want to -avoid estimating very large effect sizes. This can help us to create *generalisable* -models. This is important when models that are fitted (trained) on one dataset -and then used to predict outcomes from a new dataset. Restricting parameter -estimates is particularly important when analysing high-dimensional data. +avoid estimating very large effect sizes. > ## Challenge 2 @@ -338,6 +336,13 @@ known to be associated with age from a paper by Horvath et al.[^2]. Horvath et a use methylation markers alone to predict the biological age of an individual. This is useful in studying age-related disease amongst many other things. + +To compare the training and test errors for a model for age using the Hovarth data, we'll split the +data into training and test sets, fit a linear model and calculate the errors. First, let's +calculate the training error. Let's start by splitting the data into training and test sets: + + + ```{r coefhorvath} coef_horvath <- readRDS(here::here("data/coefHorvath.rds")) methylation <- readRDS(here::here("data/methylation.rds")) @@ -353,73 +358,68 @@ horvath_mat <- methyl_mat[, features] ## Generate an index to split the data set.seed(42) train_ind <- sample(nrow(methyl_mat), 25) + +## Split the data +train_mat <- horvath_mat[train_ind, ] +train_age <- age[train_ind] +test_mat <- horvath_mat[-train_ind, ] +test_age <- age[-train_ind] + +``` + +Now let's fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or "mean squared error" (MSE). 
+ +```{r trainlm} +## Fit a linear model +# as.data.frame() converts train_mat into a data.frame +# Using the `.` syntax above together with a `data` argument will lead to +# the same result as using `train_age ~ train_mat`: R will fit a multivariate +# regression model in which each of the columns in `train_mat` is used as +# a predictor. We opted to use the `.` syntax because it will help us to +# obtain model predictions using the `predict()` function. + +fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat)) + +## Function to calculate the (mean squared) error + mse <- function(true, prediction) { + mean((true - prediction)^2) + } + +## Calculate the training error +err_lm_train <- mse(train_age, fitted(fit_horvath)) +err_lm_train ``` +The training error appears very low here -- on average we're only off by +about a year! + > ## Challenge 3 > -> 1. Split the methylation data matrix and the age vector -> into training and test sets. -> 2. Fit a model on the training data matrix and training age -> vector. -> 3. Check the mean squared error on this model. +> For the fitted model above, calculate the mean squared error for the test set. > > > > ## Solution > > -> > 1. Splitting the data involves using our index to split up the matrix and -> > the age vector into two each. We can use a negative subscript to create -> > the test data. -> > -> > ```{r testsplit} -> > train_mat <- horvath_mat[train_ind, ] -> > train_age <- age[train_ind] -> > test_mat <- horvath_mat[-train_ind, ] -> > test_age <- age[-train_ind] -> > ``` -> > The solution to this exercise is important because the generated objects -> > (`train_mat`, `train_age`, `test_mat` and `test_age`) will be used later in -> > this episode. Please make sure that you use the same object names. -> > 2. We can fit a model using the training data and the `lm` function. -> > ```{r trainfit} -> > # we convert train_mat to a data frame to use the `.` syntax in lm -> > fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat)) -> > ``` -> > -> > Using the `.` syntax above together with a `data` argument will lead to -> > the same result as usign `train_age ~ tran_mat`: R will fit a multivariate -> > regression model in which each of the colums in `train_mat` is used as -> > a predictor. We opted to use the `.` syntax because it will help us to -> > obtain model predictions using the `predict()` function. +> > First, let's find the predicted values for the 'unseen' test data: +> > ```{r modelpreds} +> > pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat)) +> > ``` > > -> > 3. The mean squared error of the model is the mean of the square of the -> > residuals. This seems very low here -- on average we're only off by -> > about a year! -> > ```{r testerror} -> > mean(residuals(fit_horvath)^2) -> > ``` +> > The mean squared error for the test set is the mean of the squared +> > error between the predicted values and true test data. +> > ```{r testerror} +> > err_lm <- mse(test_age, pred_lm) +> > err_lm +> > ``` > > > {: .solution} {: .challenge} -Having trained this model, now we can check how well it does in predicting age -from new dataset (the test data). -Here we use the mean of the squared difference between our predictions and the -true ages for the test data, or "mean squared error" (MSE). Unfortunately, it -seems like this is a lot higher than the error on the training data! 
- -```{r} -mse <- function(true, prediction) { - mean((true - prediction)^2) -} -pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat)) -err_lm <- mse(test_age, pred_lm) -err_lm -``` - -Further, if we plot true age against predicted age for the samples in the test -set, we can see how well we're really doing - ideally these would line up -exactly! +Unfortunately, the test error is a lot higher than the training error. +If we plot true age against predicted age for the samples in the test +set, we can gain more insight into the performance of the model on the test set. +Ideally, the predicted values should be close to the test data. ```{r test-plot-lm, fig.cap="A scatter plot of observed age versus predicted age for individuals in the test set. Each dot represents one individual. Dashed line is used as a reference to indicate how perfect predictions would look (observed = predicted).", fig.alt="A scatter plot of observed age versus predicted age for individuals in the test set. Each dot represents one individual. Dashed line is used as a reference to indicate how perfect predictions would look (observed = predicted). In this case we observe high prediction error in the test set."} par(mfrow = c(1, 1)) @@ -433,17 +433,16 @@ good, the dots should follow a line. Regularisation can help us to make the model more generalisable, improving predictions for the test dataset (or any other dataset that is not used when fitting our model). -# Using regularisation to impove generalisability +# Regularisation + +Regularisation can be used to reduce correlation between predictors, the number of features, +and improve generalisability by restricting model parameter estimates. Essentially, we +add another condition to the problem we're solving with linear regression that controls the +total size of the coefficients that come out and shrinks many coefficients to zero (or near zero). -As stated above, restricting model parameter estimates can improve a model's -generalisability. This can be done with *regularisation*. The idea to add another -condition to the problem we're solving with linear regression. This condition -controls the total size of the coefficients that come out. -For example, we might say that the point representing the slope and intercept -must fall within a certain distance of the origin, $(0, 0)$. Note that we are -still trying to solve for the line that minimises the square of the residuals; -we are just adding this extra constraint to our solution. +For example, we might say that the point representing the slope and intercept +must fall within a certain distance of the origin, $(0, 0)$. For the 2-parameter model (slope and intercept), we could visualise this constraint as a circle with a given radius. We want to find the "best" solution (in terms of minimising the @@ -487,8 +486,10 @@ points( par(mfrow = c(1, 1)) ``` -There are multiple ways to define the distance that our solution must fall in, -though. The one we've plotted above controls the squared sum of the +# Ridge regression + +There are multiple ways to define the distance that our solution must fall in. +The one we've plotted above controls the squared sum of the coefficients, $\beta$. This is also sometimes called the $L^2$ norm. This is defined as @@ -505,21 +506,17 @@ $$ \sum_{i=1}^N \biggl( y_i - x'_i\beta\biggr)^2 + \lambda \left\lVert \beta \right\lVert_2 ^2 $$ +This type of regularisation is called *ridge regression*. 
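+To make the objective above concrete, here is a minimal sketch (not part of the original lesson code) that evaluates
+the penalised sum of squares for a model with just an intercept and one slope; the data and trial coefficients are made
+up purely for illustration.
+
+```{r ridge-objective-sketch}
+# Penalised least squares: residual sum of squares plus lambda times the
+# squared L2 norm of the coefficients (here, intercept and slope).
+ridge_objective <- function(intercept, slope, x, y, lambda) {
+  rss <- sum((y - (intercept + slope * x))^2)
+  penalty <- lambda * (intercept^2 + slope^2)
+  rss + penalty
+}
+
+# For these made-up data the residuals are exactly zero, so the value returned
+# is just the penalty: 10 * (1^2 + 2^2) = 50. Setting lambda = 0 recovers the
+# ordinary least squares objective.
+x <- 1:10
+y <- 1 + 2 * x
+ridge_objective(intercept = 1, slope = 2, x = x, y = y, lambda = 10)
+```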
Another way of thinking about this is that when finding the best model, we're weighing up a balance of the ordinary least squares objective and a "penalty" term that punishes models with large coefficients. The balance between the penalty and the ordinary least squares objective is controlled by $\lambda$ - -when $\lambda$ is large, we care a lot about the size of the coefficients. -When it's small, we don't really care a lot. When it's zero, we're back to -just using ordinary least squares. This type of regularisation is called *ridge regression*. - -# Why would we want to restrict our model? +when $\lambda$ is large, we want to penalise large coefficients. In other words, we care a lot about the size of the coefficients +and we want to reduce the complexity of our model. +When $\lambda$ is small, we don't really care a lot about shrinking our coefficients and we opt for a more complex model. When it's zero, we're back to +just using ordinary least squares. We see how a penalty term, $\lambda$, might be chosen later in this episode. -It may seem an odd thing to do: to restrict the possible values of our model -parameters! Why would we want to do this? Firstly, as discussed earlier, our -model estimates can be very unstable or even difficult to calculate when we have -many correlated features. Secondly, this type of approach can make our model more -generalisable to new data. To show this, we'll fit a model using the same set +For now, to see how regularisation might improve a model, let's fit a model using the same set of 20 features (stored as `features`) selected earlier in this episode (these are a subset of the features identified by Horvarth et al), using both regularised and ordinary least squares. @@ -541,7 +538,7 @@ This plot shows how the estimated coefficients for each CpG site change as we increase the penalty, $\lambda$. That is, as we decrease the size of the region that solutions can fall into, the values of the coefficients that we get back tend to decrease. In this case, -coefficients trend towards zero but generally don't reach it until the penalty +coefficients tend towards zero but generally don't reach it until the penalty gets very large. We can see that initially, some parameter estimates are really, really large, and these tend to shrink fairly rapidly. @@ -552,22 +549,37 @@ or correlated predictors. As we reduce the importance of one feature, we can weight to another feature that represents similar information. Since we split the data into test and training data, we can prove that ridge -regression gives us a better prediction in this case: +regression predicts the test data better than the model with no regularisation. 
+Let's generate our predictions under the ridge regression model and calculate +the mean squared error in the test set: ```{r pred-ridge-lm, fig.width = 10, fig.cap="Cap", fig.alt="Alt"} +# Obtain a matrix of predictions from the ridge model, +# where each column corresponds to a different lambda value pred_ridge <- predict(ridge_fit, newx = test_mat) -err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col)) -min(err_ridge) -err_lm -which_min_err <- which.min(err_ridge) +# Calculate MSE for every column of the prediction matrix against the vector of true ages +err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col)) min_err_ridge <- min(err_ridge) + +# Identify the lambda value that results in the lowest MSE (ie, the "best" lambda value) +which_min_err <- which.min(err_ridge) pred_min_ridge <- pred_ridge[, which_min_err] + +## Return errors +min_err_ridge ``` +This is much lower than the test error for the model without regularisation: + +```{r pred-ridge-lm-noreg} +err_lm +``` + + We can see where on the continuum of lambdas we've picked a model by plotting the coefficient paths again. In this case, we've picked a model with fairly -modest shrinkage. +modest coefficient shrinking. ```{r chooselambda, fig.cap="Cap", fig.alt="Alt"} chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)] @@ -576,6 +588,7 @@ abline(v = log(chosen_lambda), lty = "dashed") ``` + > ## Challenge 4 > > 1. Which performs better, ridge or OLS? @@ -736,63 +749,47 @@ drawplot() # Cross-validation to find the best value of $\lambda$ -There are various methods to select the "best" -value for $\lambda$. One is to split +Ideally, we want $\lambda$ to be large enough to reduce the complexity of +our model, thus reducing the number of and correlations between the features, and improving generalisability. However, +we don't want the value of $\lambda$ to be so large that we lose a lot of the valuable information in the features. + +Various methods can be used to balance this trade-off and thus select the "best" +value for $\lambda$. One method splits the data into $K$ chunks. We then use $K-1$ of these as the training set, and the remaining $1$ chunk as the test set. We can repeat this until we've rotated through all $K$ chunks, giving us a good estimate of how well each of the lambda values work in our data. This is called cross-validation, and doing this repeated test/train split -gives us a better estimate of how generalisable our model is. Cross-validation -is a really deep topic that we're not going to cover in more detail today, though! +gives us a better estimate of how generalisable our model is. ```{r cvfig, echo = FALSE, fig.cap = "Schematic representiation of a $K$-fold cross-validation procedure.", fig.alt = "The data is divided into $K$ chunks. For each cross-validation iteration, one data chunk is used as the test set. The remaining $K-1$ chunks are combined into a training set."} knitr::include_graphics("../fig/cross_validation.png") ``` -We can use this new idea to choose a lambda value, by finding the lambda -that minimises the error across each of the test and training splits. +We can use this new idea to choose a lambda value by finding the lambda +that minimises the error across each of the test and training splits. 
In R: ```{r lasso-cv, fig.cap="Cross-validated mean squared error for different values of lambda under a LASSO penalty.", fig.alt="Alt"} +# fit lasso model with cross-validation across a range of lambda values lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1) plot(lasso) -coefl <- coef(lasso, lasso$lambda.min) -selected_coefs <- as.matrix(coefl)[which(coefl != 0), 1] -## load the horvath signature to compare features -coef_horvath <- readRDS(here::here("data/coefHorvath.rds")) -## We select some of the same features! Hooray -intersect(names(selected_coefs), coef_horvath$CpGmarker) +# Extract the coefficients from the model with the lowest mean squared error from cross-validation +coefl <- coef(lasso, lasso$lambda.min) +# select only non-zero coefficients +selection <- which(coefl != 0) +# and convert to a normal matrix +selected_coefs <- as.matrix(coefl)[selection, 1] +selected_coefs ``` -```{r heatmap-lasso, echo = FALSE, fig.cap="Heatmap showing methylation values for the selected CpG and how the vary with age.", fig.alt="Overall, we observe either increasing or decreasing methylation patterns as a function of age."} -hmat <- t(scale(methyl_mat[, names(selected_coefs)[-1]])) -ord <- order(age) -hmat_ord <- hmat[, ord] -age_ord <- age[ord] -col <- circlize::colorRamp2(seq(min(age), max(age), length.out = 100), viridis(100)) -Heatmap( - hmat_ord, - name = "Scaled\nmethylation level", - top_annotation = columnAnnotation( - age = age_ord, - col = list(age = col) - ), - show_column_names = FALSE, - column_title = "Sample", - row_title = "Feature", - row_names_side = "left", - row_title_side = "left", - column_title_side = "bottom", - show_row_dend = FALSE, - cluster_columns = FALSE -) -``` +We can see that cross-validation has selected a value of $\lambda$ resulting in `r length(selected_coefs)-1` features and the intercept. + -# Blending ridge regression and the LASSO - elastic nets +# Elastic nets: blending ridge regression and the LASSO -So far, we've used ridge regression, where `alpha = 0`, and LASSO regression, -where `alpha = 1`. What if `alpha` is set to a value between zero and one? +So far, we've used ridge regression (where `alpha = 0`) and LASSO regression, +(where `alpha = 1`). What if `alpha` is set to a value between zero and one? Well, this actually lets us blend the properties of ridge and LASSO regression. This allows us to have the nice properties of the LASSO, where uninformative variables are dropped automatically, and the nice properties diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index d2b5bda3..60b33db2 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -3,7 +3,7 @@ title: "Principal component analysis" author: "GS Robertson" source: Rmd teaching: 90 -exercises: 30 +exercises: 40 questions: - What is principal component analysis (PCA) and when can it be used? - How can we perform a PCA in R? @@ -29,7 +29,7 @@ keypoints: - The first principal component represents the dimension along which there is maximum variation in the data. Subsequent principal components represent dimensions with progressively less variation. -- "Screeplots and biplots may be used to show: +- "Scree plots and biplots may be used to show: 1. how much variation in the data is explained by each principal component and 2. how data points cluster according to principal component scores and which variables are associated with these scores." 
@@ -165,7 +165,7 @@ Disadvantages: The first principal component is the direction of the data along which the observations vary the most. The second principal component is the direction of the data along which the observations show the next highest amount of variation. -For example, Figure 1 shows biodiversity index versus percentage area left +For example, the figure below shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which there is greatest variability in the data. Projecting points onto this line @@ -180,7 +180,7 @@ components as there are variables in your dataset, but as we'll see, some are more useful at explaining your data than others. By definition, the first principal component explains more variation than other principal components. -```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} +```{r fig1, echo=FALSE, fig.cap="Scatter plots of the farm data with the first principal component in red and second principal component in green.", fig.alt="Side-by-side scatter plots of the farm data. The left figure displays a scatter plot of biodiversity versus percentage area fallow. The first principal component is shown by a red line and the second principal component is shown by a green line. The right figure displays the same scatter plot rotated so that the first principal component is horizontal and the second principal component is shown perpendicular to this."} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") ``` @@ -195,7 +195,7 @@ equilibrium. We then use the length of the springs from the rod as the first principal component. This is explained in more detail on [this Q&A website](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues). -```{r pendulum, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} +```{r pendulum, echo=FALSE, fig.cap="Animation showing the iterative process by which principal components are found. Data points are shown in blue, distances from the line as red dashed lines and the principal component is shown by a black line.", fig.alt="Animated scatter plot displaying the iterative process by which principal components are found. The data points are shown in blue and the principal component is shown by a solid black line. The distances of the points from the line are shown by red dashed lines. The animation initially starts with the principal component far away from the direction of variability in the points and, as time goes on, eventually finds the direction of variability exhibiting a springing motion."} knitr::include_graphics("../fig/pendulum.gif") ``` @@ -230,14 +230,14 @@ Columns include: outer walls of prostate), - `gleason` (Gleason score; grade of cancer cells), - `pgg45` (percentage Gleason scores 4 or 5), -- `lpsa` (log-tranformed prostate specific antigen; level of PSA in blood). +- `lpsa` (log-transformed prostate specific antigen; level of PSA in blood). - `age` (patient age in years). Here we will calculate principal component scores for each of the rows in this dataset, using five principal components (one for each variable included in the PCA). 
We will include five clinical variables in our PCA, each of the continuous variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. Standard PCAs are carried +representing clinical markers of cancer progression. Standard PCA is carried out using continuous variables only. First, we will examine the `prostate` dataset (originally part of the @@ -265,7 +265,7 @@ head(pros2) Now we compare the variances between variables in the dataset. -```{r var-hist, fig.cap="Caption", fig.cap="Alt"} +```{r var-hist, fig.cap="Histograms of two variables from the prostate data set.", fig.alt="Side-by-side histograms of two variables from the dataset, lweight on the left and lbph on the right. The histogram for the lweight data ranges from around 2 to 6 on the x axis, while the histogram for the lbph ranges from around -2 to 3 on the x axis."} apply(pros2, 2, var) par(mfrow = c(1, 2)) hist(pros2$lweight, breaks = "FD") @@ -364,6 +364,13 @@ Let's look at the relative importance of (percentage of the variance explained b ```{r summ} prop.var <- round(pca.pros$variance, 4) #round to 4 decimal places prop.var +summary(pca.pros) +``` + +```{r, echo = FALSE} +# Get proportions of variance explained by each principal component (rounded to 2 decimal places) +prop.var <- round(summary(pca.pros)$importance["Proportion of Variance", ], 2) * + 100 ``` This returns the proportion of variance in the data explained by each of the @@ -389,10 +396,30 @@ screeplot(pca.pros, axisLabSize = 5, titleLabSize = 8, #add line to scree plot to visualise the elbow ``` +```{r vardf-plot, fig.cap="Scree plot showing the percentage of the variance explained by the principal components calculated from the prostate data.", fig.alt="A scree plot showing the percentage of variance explained by each principal component versus the principal component number. The points are joined by lines to indicate where the elbow of the scree plot occurs."} +plot(varDF, type="b") +``` + The scree plot shows that the first principal component explains most of the -variance in the data (>50%), the first two principal components -explain >70% of variance in the data, and and each subsequent principal component explains -less and less of the total variance. +variance in the data (>50%) and each subsequent principal component explains +less and less of the total variance. The first two principal components +explain >70% of variance in the data. + +Selecting a subset of principal components results in loss of +information from the original dataset but selecting principal components to +summarise the _majority_ of the information in the original dataset is useful +from a practical dimension reduction perspective. There are no clear guidelines +on how many principal components should be included in PCA. +We often look at the 'elbow’ on the scree plot as an +indicator that the addition of principal components does not drastically +contribute to explaining the remaining variance. In this case, the 'elbow' of the scree plot appears +to be at the third principal component. We may therefore conclude that three principal +components are sufficient to explain the majority of the variability in the original +dataset. We may also choose another criterion, such as an arbitrary cut-off for +the proportion of the variance explained compared to other principal components. 
+Essentially, the criterion used to select principal components should be determined +based on what is deemed a sufficient level of information retention for a specific +dataset and question. ## Loadings and principal component scores @@ -403,7 +430,7 @@ has as many rows as there were observations in the input matrix. These scores are what is usually visualised or used for down-stream analyses. The matrix of loadings (also called rotation matrix) has as many rows as there are features in the original data. It contains information about how the -(usually centered and scaled) original data relate to the PC scores. +(usually centered and scaled) original data relate to the principal component scores. We can print the loadings for the **`PCAtools`** implementation using @@ -419,14 +446,14 @@ callout demonstrates this. > ## Computing a PCA "by hand" > The rotation matrix obtained in a PCA is identical to the eigenvectors > of the covariance matrix of the data. Multiplying these with the (centered and scaled) -> data yields the PC scores: -> ```{r pca-by-hand} -> pros2.scaled <- scale(pros2) # centre and scale the Prostate data -> pros2.cov <- cov(pros2.scaled) #generate covariance matrix +> data yields the principal component scores: +> ```{r pca-by-hand, fig.cap="Scatter plots of the second versus first principal components calculated by prcomp() (left) and by hand (right).", fig.alt="Side-by-side scatter plots of the second versus first principal components calculated by prcomp() (left) and by hand (right). The left scatter plot is the same as the right scatter plot but with swapped axes."} +> pros2.scaled <- scale(pros2) # centre and scale the prostate data +> pros2.cov <- cov(pros2.scaled) # generate covariance matrix > pros2.cov > pros2.eigen <- eigen(pros2.cov) # preform eigen decomposition > pros2.eigen # The slot $vectors = rotation of the PCA -> # generate PC scores by by hand, using matrix multiplication +> # generate principal component scores by by hand, using matrix multiplication > my.pros2.pcs <- pros2.scaled %*% pros2.eigen$vectors > # compare results > par(mfrow=c(1,2)) @@ -453,8 +480,9 @@ implementations in different R libraries. It is thus a good idea to specify the desired package when calling `biplot()`. A biplot of the first two principal components can be generated as follows: -```{r stats-biplot, fig.cap="Caption", fig.cap="Alt"} -PCAtools::biplot(pca.pros, lab=NULL, singlecol=TRUE) #plot biplot using a single colour for the points + +```{r stats-biplot, fig.cap="Scatter plot of the first two principal components with observations numerically labelled. The loadings are shown by red arrows.", fig.alt="Scatter plot of the second principal component versus the first principal component. Observations as points, numerically labelled. The loadings are shown by red arrows, each labelled by their associated variable."} +stats::biplot(pca.pros, xlim = c(-0.3, 0.3)) ``` This biplot shows the position of each patient on a 2-dimensional plot where @@ -469,11 +497,9 @@ on the top and right of the plot are used to interpret the loadings, where loadings are scaled by the standard deviation of the principal components (`pca.pros$sdev`) times the square root the number of observations. -Finally, you need to know that PC scores and rotations may have different slot names, +Finally, you need to know that principal component scores and rotations may have different slot names, depending on the PCA implementation you use. 
Here are some examples: - - # Using PCA to analyse gene expression data @@ -500,13 +526,13 @@ metadata <- colData(cancer) ``` ```{r mat, eval=FALSE} -View(mat) +head(mat) #nrow=22215 probes #ncol=91 samples ``` ```{r meta, eval=FALSE} -View(metadata) +head(metadata) #nrow=91 #ncol=8 ``` @@ -527,12 +553,11 @@ was involved in their cancer and the grade and size of the cancer for each sample (represented by rows). Microarray data are difficult to analyse for several reasons. Firstly, -they are typically high-dimensional and therefore are subject to the same -difficulties associated with analysing high dimensional data outlined above -(i.e. *p*>*n*, large numbers of rows, multiple possible response variables, -curse of dimensionality). Secondly, formulating a research question using -microarray data can be difficult, especially if not much is known a priori -about which genes code for particular phenotypes of interest. Finally, +formulating a research question using microarray data can be difficult, +especially if not much is known a priori +about which genes code for particular phenotypes of interest. +Secondly, they are typically high-dimensional and therefore are subject to the same +difficulties associated with analysing high-dimensional data. Finally, exploratory analysis, which can be used to help formulate research questions and display relationships, is difficult using microarray data due to the number of potentially interesting response variables (i.e. expression data from probes @@ -560,7 +585,7 @@ represents. > > Let us assume we only care about the principal components accounting for the top > 80% of the variance in the dataset. Use the `removeVar` argument in `pca()` to remove -> the PCs accounting for the bottom 20%. +> the principal components accounting for the bottom 20%. > > As in the example using prostate data above, examine the first 5 rows and > columns of rotated data and loadings from your PCA. @@ -568,21 +593,25 @@ represents. > > ## Solution > > > > ```{r pca-ex} -> > pc <- pca(mat, metadata = metadata) -> > #Many PCs explain a very small amount of the total variance in the data -> > #Remove the lower 20% of PCs with lower variance +> > pc <- pca(mat, metadata = metadata) # implement a PCA +> > #We can check the scree plot to see that many principal components explain a very small amount of +> > the total variance in the data +> > +> > #Let's remove the principal components with lowest 20% of the variance > > pc <- pca(mat, metadata = metadata, removeVar = 0.2) > > #Explore other arguments provided in pca -> > pc$rotated[1:5, 1:5] -> > pc$loadings[1:5, 1:5] +> > pc$rotated[1:5, 1:5] # obtain the first 5 principal component scores for the first 5 observations +> > pc$loadings[1:5, 1:5] #obtain the first 5 principal component loadings for the first 5 features > > -> > which.max(pc$loadings[, 1]) -> > pc$loadings[49, ] +> > which.max(pc$loadings[, 1]) # index of the maximal loading for the first principal component +> > pc$loadings[which.max(pc$loadings[, 1]), ] # principal component loadings for the feature +> > with the maximal loading on the first principal component. > > -> > which.max(pc$loadings[, 2]) -> > pc$loadings[27, ] +> > which.max(pc$loadings[, 2]) # index of the maximal loading for the second principal component +> > pc$loadings[which.max(pc$loadings[, 2]), ] # principal component loadings for the feature +> > with the maximal loading on the second principal component. 
> > ``` -> > The function `pca()` is used to perform PCA, and uses as inputs a matrix +> > The function `pca()` is used to perform PCA, and uses as input a matrix > > (`mat`) containing continuous numerical data > > in which rows are data variables and columns are samples, and `metadata` > > associated with the matrix in which rows represent samples and columns @@ -609,29 +638,23 @@ represents. > > Note that this is different from normalising gene expression data. Gene expression > data have to be normalised before donwstream analyses can be -> carried out. This is to reduce to effect technical and other potentially confounding -> factors. We assume that the expression data we use had been noralised previously. +> carried out. This is to reduce technical and other potentially confounding +> factors. We assume that the expression data we use had been normalised previously. {: .callout} ## Choosing how many components are important to explain the variance in the data -As in the example using the `prostate` dataset we can use a screeplot to +As in the example using the `prostate` dataset, we can use a scree plot to compare the proportion of variance in the data explained by each principal component. This allows us to understand how much information in the microarray dataset is lost by projecting the observations onto the first few principal components and whether these principal components represent a reasonable amount of the variation. The proportion of variance explained should sum to one. -There are no clear guidelines on how many principal components should be -included in PCA: your choice depends on the total variability of the data and -the size of the dataset. We often look at the 'elbow’ on the screeplot as an -indicator that the addition of principal components does not drastically -contribute to explain the remaining variance or choose an arbitory cut off for -proportion of variance explained. > ## Challenge 4 > -> Using the `screeplot()` function in **`PCAtools`**, create a screeplot to show +> Using the `screeplot()` function in **`PCAtools`**, create a scree plot to show > proportion of variance explained by each principal component. Explain the > output of the screeplot in terms of proportion of the variance in the data explained > by each principal component. @@ -675,13 +698,14 @@ are two functions called `biplot()`, one in the package **`PCAtools`** and one i > ## Challenge 5 > > Create a biplot of the first two principal components from your PCA -> using `biplot()` function in **`PCAtools`**. +> using `biplot()` function in **`PCAtools`**. See `help("PCAtools::biplot")` for +> arguments and their meaning. For instance, `lab` or `colby` may be useful. > > Examine whether the data appear to form clusters. Explain your results. > > > ## Solution > > -> > ```{r biplot-ex, fig.cap="Caption", fig.cap="Alt"} +> > ```{r biplot-ex} > > biplot(pc, lab = NULL, colby = 'Grade', legendPosition = 'top') > > ``` > > The biplot shows the position of patient samples relative to PC1 and PC2 @@ -709,7 +733,7 @@ Let's consider this biplot in more detail, and also display the loadings: > > > ## Solution > > -> > ```{r pca-biplot-ex2, fig.cap="Caption", fig.cap="Alt"} +> > ```{r pca-biplot-ex2} > > PCAtools::biplot(pc, > > lab = paste0(pc$metadata$Age,'years'), > > colby = 'ER', @@ -724,7 +748,7 @@ So far, we have only looked at a biplot of PC1 versus PC2 which only gives part of the picture. 
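Before turning to `pairsplot()` below, note that a single biplot of any other pair of components can be drawn directly. A short sketch, assuming the `pc` object from the challenges above and the `x`/`y` arguments of `PCAtools::biplot()`; the choice of PC3 and PC4 here is purely illustrative:

```r
# Biplot of the third versus fourth principal components, coloured by tumour grade
PCAtools::biplot(pc,
                 x = "PC3", y = "PC4",
                 lab = NULL,
                 colby = "Grade",
                 legendPosition = "top")
```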
The `pairsplot()` function in **`PCAtools`** can be used to create multiple biplots including different principal components. -```{r pairsplot, fig.cap="Caption", fig.cap="Alt"} +```{r pairsplot, fig.cap="Multiple biplots produced by pairsplot().", fig.alt="An upper triangular grid of scatter plots of each principal component versus the others."} pairsplot(pc) ``` @@ -738,22 +762,21 @@ you to set custom colours. Finally, it can sometimes be of interest to compare how certain variables contribute to different principal components. This can be visualised with `plotloadings()` from the **`PCAtools`** package. The function checks the range of loadings for each -principal component specified (default: first five PCs). It then selects the features +principal component specified (default: first five principal components). It then selects the features in the top and bottom 5% of these ranges and displays their loadings. This behaviour can be adjusted with the `rangeRetain` argument, which has 0.1 as the default value (i.e. 5% on each end of the range). NB, if there are too many labels to be plotted, you will see a warning. This is not a serious problem. -```{r loadingsplots} +```{r loadingsplots, fig.cap=c("Plot of the features in the top and bottom 5 % of the loadings range for the first principal component.", "Plot of the features in the top and bottom 5 % of the loadings range for the second principal component.", "Plot of the features in the top and bottom 5 % of the loadings range for either the first or second principal component."), fig.alt=c("A plot with component loading versus principal component index. The plot shows the loadings of features at the top and bottom 5 % of the loadings range and only for the first principal component. The component loading values for four features are shown. The features with the highest loadings are shown at the top of the plot and consist of three features, each with a blue dot and feature label. The feature with the lowest loading is shown at the bottom of the plot and delineated by a yellow dot with a feature label.", "A plot with component loadings versus principal component index. The plot shows the loadings of features at the top and bottom 5 % of the loadings range and only for the second principal component. The component loading values for 9 features are shown. The features with the highest loadings are shown at the top of the plot and consist of 6 features, each with a blue dot and feature label. The 3 features at the lower range are at the bottom of the plot and delineated by a yellow dot with a feature label.", "A plot with component loading versus principal component index. The plot shows the loadings of features at the top and bottom 5 % of the loadings range for either principal component. The component loading values for several features are shown with loading score-delineated colours ranging from dark blue for the highest loadings and dark yellow for the lowest loadings. Each point has a feature label. The loadings of feature points for the first principal component are concentrated towards the top, while the loadings of feature points for the second principal component are concentrated at the top, middle and bottom of the range.")} plotloadings(pc, c("PC1"), rangeRetain = 0.1) plotloadings(pc, c("PC2"), rangeRetain = 0.1) plotloadings(pc, c("PC1", "PC2"), rangeRetain = 0.1) ``` -You can see how the third code line prooces more dots, some of which do not have -extreme loadings.
This is because all loadings selected for any PC are shown for all -other PCs. For instance, it is plausible that features which have high loadings on -PC1 may have lower ones on PC2. +You can see how the third code line produces more dots, some of which do not have +extreme loadings. This is because all loadings selected for any principal component +are shown for all other principal components. For instance, it is plausible that features which have high loadings on PC1 may have lower ones on PC2. # Using PCA output in further analysis @@ -775,7 +798,7 @@ they can be included together in a single linear regression. > and it allows researchers to examine the effect of several correlated > explanatory variables on a single response variable in cases where a high > degree of correlation initially prevents them from being included in the same -> model. This is called principal componenet regression (PCR) and is just one +> model. Principal component regression (PCR) is just one > example of how principal components can be used in further analysis of data. > When carrying out PCR, the variable of interest (response/dependent variable) > is regressed against the principal components calculated using PCA, rather diff --git a/_episodes_rmd/05-factor-analysis.Rmd b/_episodes_rmd/05-factor-analysis.Rmd index 71e8c431..261d925f 100644 --- a/_episodes_rmd/05-factor-analysis.Rmd +++ b/_episodes_rmd/05-factor-analysis.Rmd @@ -2,7 +2,7 @@ title: "Factor analysis" author: "GS Robertson" source: Rmd -teaching: 25 +teaching: 30 exercises: 10 questions: - What is factor analysis and when can it be used? @@ -31,16 +31,22 @@ knitr_fig_path("06-") Biologists often encounter high-dimensional datasets from which they wish to extract underlying features – they need to carry out dimensionality -reduction. The last episode dealt with one method to achieve this this, -called principal component analysis (PCA). Here, we introduce more general -set of methods called factor analysis (FA). - +reduction. The last episode dealt with one method to achieve this, +called principal component analysis (PCA), which expressed new dimension-reduced components +as linear combinations of the original features in the dataset. Principal components can therefore +be difficult to interpret. Here, we introduce a related but more interpretable +method called factor analysis (FA), which constructs new components, called _factors_, +that explicitly represent underlying _(latent)_ constructs in our data. Like PCA, FA uses +linear combinations, but here each observed feature is modelled as a linear combination +of a small number of latent factors. FA is therefore often +more interpretable and useful when we would like to extract meaning from our dimension-reduced +set of variables. + There are two types of FA, called exploratory and confirmatory factor analysis (EFA and CFA). Both EFA and CFA aim to reproduce the observed relationships among a group of features with a smaller set of latent variables. EFA -is used in a descriptive, data-driven manner to uncover which +is used in a descriptive (exploratory) manner to uncover which measured variables are reasonable indicators of the various latent dimensions. -In contrast, CFA is conducted in an a-priori, +In contrast, CFA is conducted in an _a priori_, hypothesis-testing manner that requires strong empirical or theoretical foundations. We will mainly focus on EFA here, which is used to group features into a specified number of latent factors.
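To make the idea of latent factors concrete before the worked example that follows, here is a small simulated sketch (all object and variable names are illustrative, not part of the lesson data): six observed variables are generated from two hidden factors, and an exploratory factor analysis should recover the two groups.

```r
# Simulate six observed variables driven by two latent factors
set.seed(1)
n  <- 300
f1 <- rnorm(n)   # latent factor 1
f2 <- rnorm(n)   # latent factor 2
x  <- cbind(
  v1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  v2 = 0.8 * f1 + rnorm(n, sd = 0.4),
  v3 = 0.7 * f1 + rnorm(n, sd = 0.4),
  v4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  v5 = 0.8 * f2 + rnorm(n, sd = 0.4),
  v6 = 0.7 * f2 + rnorm(n, sd = 0.4)
)
fa <- factanal(x, factors = 2)
fa$loadings  # v1-v3 load mainly on one factor, v4-v6 on the other
```

The loadings matrix shows each observed variable loading strongly on one factor and weakly on the other, which is exactly the kind of structure EFA is designed to uncover.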
@@ -51,7 +57,7 @@ exploratory data analysis methods (including PCA) to provide an initial estimate of how many factors adequately explain the variation observed in a dataset. In practice, a range of different values is usually tested. -## An example +## Motivating example: student scores One scenario for using FA would be whether student scores in different subjects can be summarised by certain subject categories. Take a look at the hypothetical @@ -60,7 +66,7 @@ can be summarised well by two factors, which we can then interpret. We have labelled these hypothetical factors “mathematical ability” and “writing ability”. -```{r table, echo = FALSE} +```{r table, echo = FALSE, fig.cap="Student scores data across several subjects with hypothesised factors.", fig.alt="A table displaying data of student scores across several subjects. Each row displays the scores across different subjects for a given individual. The plot is annotated at the top with a curly bracket labelled Factor 1: mathematical ability and encompasses the data for the student scores is Arithmetic, Algebra, Geometry, and Statistics. Similarly, the subjects Creative Writing, Literature, Spelling/Grammar are encompassed by a different curly bracket with label Factor 2: writing ability."} knitr::include_graphics("../fig/table_for_fa.png") # ![Figure 1: Student exam scores per subject. Subjects can be split into two factors representing mathematical ability and writing ability](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Table for FA.png) ``` @@ -71,31 +77,6 @@ as many principal components as there are features in the dataset, each component representing a different linear combination of features. The principal components are ordered by the amount of variance they account for. -# Advantages and disadvantages of Factor Analysis - -There are several advantages and disadvantages of using FA as a -dimensionality reduction method. - -Advantages: - -* FA is a useful way of combining different groups of data into known - representative factors, thus reducing dimensionality in a dataset. -* FA can take into account researchers' expert knowledge when choosing - the number of factors to use, and can be used to identify latent or hidden - variables which may not be apparent from using other analysis methods. -* It is easy to implement with many software tools available to carry out FA. -* Confirmatory FA can be used to test hypotheses. - -Disadvantages: - -* Justifying the choice of - number of factors to use may be difficult if little is known about the - structure of the data before analysis is carried out. -* Sometimes, it can be difficult to interpret what factors mean after - analysis has been completed. -* Like PCA, standard methods of carrying out FA assume that input variables - are continuous, although extensions to FA allow ordinal and binary - variables to be included (after transforming the input matrix). # Prostate cancer patient data @@ -103,6 +84,9 @@ The prostate dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The data have 97 rows and 9 columns. +Although not strictly a high-dimensional dataset, as with other episodes, +we use this dataset to explore the method. 
+ Columns are: @@ -129,12 +113,10 @@ Let's subset the data to just include the log-transformed clinical variables for the purposes of this episode: ```{r prostate} +library("here") prostate <- readRDS(here("data/prostate.rds")) ``` -```{r view, eval=FALSE} -View(prostate) -``` ```{r dims} nrow(prostate) @@ -193,18 +175,23 @@ factors, while negative values show a negative relationship between variables and factors. Loading values are missing for some variables because R does not print loadings less than 0.1. +# How many factors do we need? + There are numerous ways to select the “best” number of factors. One is to use the minimum number of features that does not leave a significant amount of -variance unaccounted for. In practise, we repeat the factor -analysis using different values in the `factors` argument. If we have an -idea of how many factors there will be before analysis, we can start with -that number. The final section of the analysis output shows the results of +variance unaccounted for. In practice, we repeat the factor +analysis for different numbers of factors (by specifying different values +in the `factors` argument). If we have an idea of how many factors there +will be before analysis, we can start with that number. The final +section of the analysis output then shows the results of a hypothesis test in which the null hypothesis is that the number of factors used in the model is sufficient to capture most of the variation in the -dataset. If the p-value is less than 0.05, we reject the null hypothesis -and accept that the number of factors included is too small. If the p-value -is greater than 0.05, we accept the null hypothesis that the number of -factors used captures variation in the data. +dataset. If the p-value is less than our significance level (for example 0.05), +we reject the null hypothesis that the number of factors is sufficient and we repeat the analysis with +more factors. When the p-value is greater than our significance level, we do not reject +the null hypothesis that the number of factors used captures variation +in the data. We may therefore conclude that +this number of factors is sufficient. Like PCA, the fewer factors that can explain most of the variation in the dataset, the better. It is easier to explore and interpret results using a @@ -219,7 +206,7 @@ for by the FA model. *Uniqueness* is the opposite of communality and represents the amount of variation in a variable that is not accounted for by the FA model. Uniqueness is calculated by subtracting the communality value from 1. If uniqueness is high for -a given variable, that means this variable is not well explaind/accounted for +a given variable, that means this variable is not well explained/accounted for by the factors identified. ```{r common-unique} @@ -232,7 +219,7 @@ Similar to a biplot as we produced in the PCA episode, we can “plot the loadings”. This shows how each original variable contributes to each of the factors we chose to visualise. -```{r biplot} +```{r biplot, fig.cap = "Factor 2 loadings versus factor 1 loadings for each feature.", fig.alt="A scatter plot of the factor 2 loadings for each feature versus the factor 2 loadings for each feature. The lpsa, lcavol and lcp feature points are located in the east of the plot, indicating a high loading on factor 1 and close to zero loading on factor 2. 
The lbph and lweight features are located in the north of the plot, indicating a close to zero loading on factor 1 and a high loading on factor 2."} #First, carry out factor analysis using two factors pros_fa <- factanal(pros2, factors = 2) @@ -264,7 +251,7 @@ text( > the results of your analysis. > > What variables are most important in explaining each factor? Do you think -> this makes sense biologically? Discuss in groups. +> this makes sense biologically? Consider or discuss in groups. > > > ## Solution > > @@ -282,6 +269,31 @@ text( > {: .solution} {: .challenge} +# Advantages and disadvantages of Factor Analysis + +There are several advantages and disadvantages of using FA as a +dimensionality reduction method. + +Advantages: + +* FA is a useful way of combining different groups of data into known + representative factors, thus reducing dimensionality in a dataset. +* FA can take into account researchers' expert knowledge when choosing + the number of factors to use, and can be used to identify latent or hidden + variables which may not be apparent from using other analysis methods. +* It is easy to implement with many software tools available to carry out FA. +* Confirmatory FA can be used to test hypotheses. + +Disadvantages: + +* Justifying the choice of + number of factors to use may be difficult if little is known about the + structure of the data before analysis is carried out. +* Sometimes, it can be difficult to interpret what factors mean after + analysis has been completed. +* Like PCA, standard methods of carrying out FA assume that input variables + are continuous, although extensions to FA allow ordinal and binary + variables to be included (after transforming the input matrix). diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 0d34c971..6f442946 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -1,8 +1,8 @@ --- title: "K-means" source: Rmd -teaching: 45 -exercises: 15 +teaching: 60 +exercises: 20 questions: - How do we detect real clusters in high-dimensional data? - How does K-means work and when should it be used? @@ -77,7 +77,7 @@ according to data values. We then plot these data and their allocated clusters and put ellipses around the clusters using the `stat_ellipse` function in `ggplot`. -```{r fake-cluster, echo = FALSE} +```{r fake-cluster, echo = FALSE, fig.cap = "Example of artificial clusters fitted to data points.", fig.alt="A scatter plot of random data y versus x. The points are horizontally partitioned at 2 random groups, forming three colour coded clusters. Circles are drawn around each cluster. The data shown appears to have no clusters but the colours and circles give the appearance of clusters artificially."} set.seed(11) library("MASS") library("ggplot2") @@ -124,7 +124,7 @@ We then follow these two steps until convergence: We can see this process in action in this animation: -```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"} +```{r kmeans-animation, echo = FALSE, fig.cap="Animation showing the iterative process of K-means clustering on data y versus x.", fig.alt="An animated scatter plot of data y versus x. The animation starts by identifying 3 initial points, delineated by different colours. 
The animations then colour codes all points by an associated cluster colour, delineating three distinct and non-overlapping clusters in the space of the scatter plot."} knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and @@ -140,8 +140,8 @@ number of clusters that the data should be partitioned into. > configuration can have a significant effect on the final configuration of the > clusters, so dealing with this limitation is an important part > of K-means clustering. Some strategies to deal with this problem are: -> - Choose $K$ points at random from the data as the cluster centroids. -> - Randomly split the data into $K$ groups, and then average these groups. +> - Choose $k$ points at random from the data as the cluster centroids. +> - Randomly split the data into $k$ groups, and then average these groups. > - Use the K-means++ algorithm to choose initial values. > > These each have advantages and disadvantages. In general, it's good to be @@ -151,10 +151,10 @@ number of clusters that the data should be partitioned into. > {: .callout} -# K-means clustering applied to single-cell RNAseq data +# K-means clustering applied to the single-cell RNA sequencing data Let's carry out K-means clustering in `R` using some real high-dimensional data. -We're going to work with single-cell RNAseq data in these clustering challenges, +We're going to work with single-cell RNA sequencing data, "scRNAseq", in these clustering challenges, which is often *very* high-dimensional. Commonly, experiments profile the expression level of 10,000+ genes in thousands of cells. Even after filtering the data to remove low quality observations, the dataset we're using in this @@ -165,11 +165,12 @@ earlier in the course - dimensionality reduction. Dimensionality reduction allows us to visualise this incredibly complex data in a small number of dimensions. In this case, we'll be using principal component analysis (PCA) to compress the data by identifying the major axes of variation in the data, -before running our clustering algorithms on this lower-dimensional data. +before running our clustering algorithms on this lower-dimensional data to explore +the similarity of features. The `scater` package has some easy-to-use tools to calculate a PCA for `SummarizedExperiment` objects. -Let's load the `scRNAseq` data and calculate some principal components. +Let's load the scRNAseq data and calculate some principal components. ```{r data} library("SingleCellExperiment") @@ -185,9 +186,9 @@ we can visualise those easily, and they're a quantitative representation of the underlying data, representing the two largest axes of variation. We can now run K-means clustering on the first and second principal components -of the `scRNAseq` data using the `kmeans` function. +of the scRNAseq data using the `kmeans` function. -```{r kmeans, fig.cap = "Title", fig.alt = "Alt"} +```{r kmeans, fig.cap = "Scatter plot of principal component 2 versus principal component 1 with points colour coded according to the cluster to which they belong.", fig.alt = "A scatter plot of principal component 2 versus principal component 1 of the scrnaseq data. Each point is one of four colours, representing cluster membership. Points of the same colour appear in the same areas of the plot, showing four distinct clusters in the data."} set.seed(42) cluster <- kmeans(pcs, centers = 4) scrnaseq$kmeans <- as.character(cluster$cluster) @@ -200,7 +201,7 @@ here? 
> ## Challenge 1 > -> Cluster the data using a $K$ of 5, and plot it using `plotReducedDim`. +> Perform clustering to group the data into $k=5$ clusters, and plot it using `plotReducedDim`. > Save this with a variable name that's different to what we just used, > because we'll use this again later. > @@ -229,7 +230,7 @@ here? > The following example shows how cluster centroids differ when created using > medians rather than means. > -> ```{r} +> ```{r, fig.cap="Scatter plot of random data y versus x with the (mean(x), mean(y)) co-ordinate point shown in red and the (median(x), median(y)) co-ordinate point shown in blue.", fig.alt="Scatter plot of random data y versus x. There are many black points on the plot representing the data. Two additional points are shown: the (mean(x), mean(y)) co-ordinate point in red and the (median(x), median(y)) co-ordinate point in blue. The median co-ordinate point in blue has a lower x value and is shown to the left of the red mean co-ordinate point."} > x <- rnorm(20) > y <- rnorm(20) > x[10] <- x[10] + 10 @@ -244,9 +245,10 @@ here? # Cluster separation When performing clustering, it is important for us to be able to measure -how well our clusters are separated. One measure to test this is silhouette width. -This is a number that is computed for every observation. It can range from -1 to 1. -A high silhouette width means an observation is closer to other observations +how well our clusters are separated. One measure to test this is silhouette width, +which measures how similar a data point is to points in the same cluster compared +to other clusters. The silhouette width is computed for every observation and can +range from -1 to 1. A high silhouette width means an observation is closer to other observations within the same cluster. For each cluster, the silhouette widths can then be averaged or an overall average can be taken. @@ -272,7 +274,7 @@ Here we use the `silhouette` function from the `cluster` package to calculate th silhouette width of our K-means clustering using a distance matrix of distances between points in the clusters. -```{r silhouette} +```{r silhouette, fig.cap="Silhouette plot for each point according to cluster.", fig.alt="Plot with horizontal axis silhouette width. The plot shows the silhouette width for each point in the data set according to cluster. Cluster 4 contains over half of the points in the data set and largely consists of points with a large silhouette width, leading to a bar that extends to the right side of the graph. The other clusters contain many fewer points and have slightly lower silhouette widths. The bars therefore reach further to the left than cluster 4."} library("cluster") dist_mat <- dist(pcs) sil <- silhouette(cluster$cluster, dist = dist_mat) @@ -284,7 +286,7 @@ Let's plot the silhouette score on the original dimensions used to cluster the data. Here, we're mapping cluster membership to point shape, and silhouette width to colour. -```{r plot-silhouette} +```{r plot-silhouette, fig.cap="Scatter plot of y versus x coloured according to silhouette width and point characters grouped according to cluster membership.", fig.alt="A scatter plot of the random y versus x data. Cluster membership is delineated using different point characters. Data points in the same cluster have the same point character. Each point is coloured by its silhouette width: solid red delineating a silhouette width of 1 and white delineating a silhouette width of 0. Colours in between delineate the intermediate colours.
Many points are red and fade to white at the boundaries of each cluster. "} pc <- as.data.frame(pcs) colnames(pc) <- c("x", "y") pc$sil <- sil[, "sil_width"] @@ -310,8 +312,8 @@ belong to. > ## Challenge 2 > -> Calculate the silhouette width for the K of 5 clustering we did earlier. -> Is it better or worse than before? +> Calculate the silhouette width for the $k=5$ clustering we did earlier. +> Are 5 clusters appropriate? Why/why not? > > Can you identify where the differences lie? > @@ -339,7 +341,7 @@ belong to. > > ``` > > This seems to be because some observations in clusters 3 and 5 seem to be > > more similar to other clusters than the one they have been assigned to. -> > This may indicate that K is too high. +> > This may indicate that k is too high. > {: .solution} {: .challenge} @@ -366,17 +368,19 @@ belong to. {: .callout} -# Cluster robustness +# Cluster robustness: bootstrapping When we cluster data, we want to be sure that the clusters we identify are -not a result of the exact properties of the input data. That is, if the -data we observed were slightly different, the clusters we would identify -in this different data would be very similar. This makes it more -likely that these can be reproduced. - -To assess this, we can use the *bootstrap*. What we do here is to take a sample -from the data with replacement. Sampling with replacement means that in the -sample that we take, we can include points from the input data more than once. +not a result of the exact properties of the input data. That is, we want +to ensure that the clusters identified do not change substantially +if the observed data change slightly. This makes it more +likely that the clusters can be reproduced. + +To assess this, we can *bootstrap*. We first +resample the data with replacement to +produce a 'new' data set. We can then calculate new clusters for this +data set and compare these to those from the original data set, +thus helping us to see how the clusters may change for small changes in the data. This is maybe easier to see with an example. First, we define some data: ```{r bs-data} @@ -405,33 +409,32 @@ replicate(10, sample(data, 5, replace = TRUE)) ``` -> ## Bootstrapping -> -> The bootstrap is a powerful and common statistical technique. -> -> We would like to know about the sampling distribution of a statistic, -> but we don't have any knowledge of its behaviour under the null hypothesis. -> -> For example, we might want to understand the uncertainty around an estimate -> of the mean of our data. To do this, we could resample the data with -> replacement and calculate the mean of each average. -> -> ```{r boots} -> boots <- replicate(1000, mean(sample(data, 5, replace = TRUE))) -> hist(boots, -> breaks = "FD", -> main = "1,000 bootstrap samples", -> xlab = "Mean of sample" -> ) -> ``` -> -> In this case, the example is simple, but it's possible to -> devise more complex statistical tests using this kind of approach. -> -> The bootstrap, along with permutation testing, can be a very flexible and -> general solution to many statistical problems. -> -{: .callout} +## Bootstrapping + +Bootstrapping is a powerful and common statistical technique. + +We would like to know about the sampling distribution of a statistic, +but we don't have any knowledge of its behaviour under the null hypothesis. + +For example, we might want to understand the uncertainty around an estimate +of the mean of our data. To do this, we could resample the data with +replacement and calculate the mean of each resample.
+ +```{r boots, fig.cap="Histogram of mean of bootstapped samples.", fig.alt="A histogram of the mean of each bootstrapped sample. The histogram appears roughly symmetric around 2.8 on the x axis."} +boots <- replicate(1000, mean(sample(data, 5, replace = TRUE))) +hist(boots, + breaks = "FD", + main = "1,000 bootstrap samples", + xlab = "Mean of sample" + ) +``` + +In this case, the example is simple, but it's possible to +devise more complex statistical tests using this kind of approach. + +The bootstrap, along with permutation testing, can be a very flexible and +general solution to many statistical problems. + In applying the bootstrap to clustering, we want to see two things: 1. Will observations within a cluster consistently cluster together in @@ -445,7 +448,7 @@ boostrap replicate. Similarly, the off-diagonal elements represent how often observations swap between clusters in bootstrap replicates. High scores indicate that observations rarely swap between clusters. -```{r bs-heatmap} +```{r bs-heatmap, fig.cap="Grid of empirical cluster swapping behaviour estimated by the bootstrap samples.", fig.alt="Grid of 16 squares labelled 1-4 on each of the x and y axes. The diagonal and off-diagonal squares of the grid are coloured in green, indicating the highest scoring value of 1 according to the legend. The lower triangular squares are coloured in grey, indicating NA values since these would be the same as the upper triangular squares."} library("pheatmap") library("bluster") library("viridis") @@ -466,7 +469,7 @@ These are “good” (despite missing in the colour bar). > ## Challenge 3 > -> Repeat the bootstrapping process with K=5. Are the results better or worse? +> Repeat the bootstrapping process with k=5. Do the results appear better or worse? > Can you identify where the differences occur on the `plotReducedDim`? > > > ## Solution @@ -483,7 +486,7 @@ These are “good” (despite missing in the colour bar). > > breaks = seq(0, 1, length.out = 10) > > ) > > ``` -> > When K=5, we can see that the values on the diagonal of the matrix are +> > When k=5, we can see that the values on the diagonal of the matrix are > > smaller, indicating that the clusters aren't exactly reproducible in the > > bootstrap samples. > > @@ -498,7 +501,7 @@ These are “good” (despite missing in the colour bar). > ## Consensus clustering > > One useful and generic method of clustering is *consensus clustering*. -> This method can use k-means, or other clustering methods. +> This method can use k-means or other clustering methods. > > The idea behind this is to bootstrap the data repeatedly, and cluster > it each time, perhaps using different numbers of clusters. @@ -507,7 +510,7 @@ These are “good” (despite missing in the colour bar). > > This is really computationally demanding but has been shown to perform very > well in some situations. It also allows you to visualise how cluster -> membership changes over different values of K. +> membership changes over different values of k. > {: .callout} @@ -533,7 +536,7 @@ These are “good” (despite missing in the colour bar). > > Similarly, approximate nearest neighbour methods like > [Annoy](https://pypi.org/project/annoy/) can be used to identify what the -> $K$ closest points are in the data, and this can be used in some clustering +> $k$ closest points are in the data, and this can be used in some clustering > methods (for example, graph-based clustering). > > Generally, these methods sacrifice a bit of accuracy for a big gain in speed. 
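As a rough illustration of this speed-versus-accuracy trade-off (a sketch only, not the mini-batch or approximate nearest neighbour algorithms named above), one simple strategy is to fit K-means on a random subset of a large dataset and then assign every remaining point to its nearest fitted centre:

```r
# Cluster a random subset, then assign all points to the nearest centre
set.seed(42)
big <- matrix(rnorm(2e5), ncol = 2)      # a hypothetical large dataset (100,000 x 2)
sub <- big[sample(nrow(big), 1000), ]    # work with 1,000 randomly chosen rows
fit <- kmeans(sub, centers = 4)

# Squared Euclidean distance from every point to each fitted centre
d <- sapply(seq_len(nrow(fit$centers)),
            function(i) colSums((t(big) - fit$centers[i, ])^2))
big_labels <- max.col(-d)                # index of the closest centre per point
table(big_labels)
```

The result will usually be close to, but not identical to, a full K-means fit on all of the data; the subset size controls how much accuracy is traded for speed.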
diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 1c39c57f..3a97cfc5 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -1,8 +1,8 @@ --- title: "Hierarchical clustering" source: Rmd -teaching: 60 -exercises: 10 +teaching: 70 +exercises: 20 questions: - What is hierarchical clustering and how does it differ from other clustering methods? - How do we carry out hierarchical clustering in R? diff --git a/_extras/guide.md b/_extras/guide.md index 393791ee..5ea51dfb 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -1,6 +1,26 @@ --- title: "Instructor Notes" --- -Coming soon. + +Thank you for teaching high dimensional statistics with R! We hope you enjoy teaching the lesson. This page contains additional information for instructors. + +The materials for each episode are self-contained and can be found through the episode links on the home page. The slides in the Extras section can be used to supplement these materials. + +In previous rounds of teaching, the lesson was taught in four sessions, each lasting 2 hours and 50 minutes. We also advise allowing around 40 minutes of additional time for breaks. The recommended timings for each session are as follows: + +Session 1: +- Introduction to high-dimensional data (episode 1): 30 minutes of teaching time, 20 minutes for exercises (total: 50 minutes). +- Regression with many outcomes (episode 2): 70 minutes of teaching time, 50 minutes for exercises (total: 120 minutes). + +Session 2: +- Regularised regression (episode 3): 110 minutes of teaching time, 60 minutes for exercises (total: 170 minutes). + +Session 3: +- Principal component analysis (episode 4): 90 minutes of teaching time, 40 minutes for exercises (total: 130 minutes). +- Factor analysis (episode 5): 30 minutes of teaching time, 10 minutes for exercises (total: 40 minutes). + +Session 4: +- K-means (episode 6): 60 minutes of teaching time, 20 minutes for exercises (total: 80 minutes). +- Hierarchical clustering (episode 7): 70 minutes of teaching time, 20 minutes for exercises (total: 90 minutes). {% include links.md %} diff --git a/reference.md b/reference.md index 0ed3c171..f7cdcb6c 100644 --- a/reference.md +++ b/reference.md @@ -4,6 +4,5 @@ layout: reference ## Glossary -Coming soon. Feel free to suggest entries via GitHub Issues! {% include links.md %}