diff --git a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd index 07cb9150..cd1d8e52 100644 --- a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd +++ b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd @@ -124,11 +124,11 @@ of the challenges we are facing when working with high-dimensional data. > encountered when working with many features in a high-dimensional data set. > > First, make sure you have completed the setup instructions [here](https://carpentries-incubator.github.io/high-dimensional-stats-r/setup.html). -> Next, let's Load the `Prostate` dataset as follows: +> Next, let's Load the `prostate` dataset as follows: > > ```{r prostate} > library("here") -> Prostate <- readRDS(here("data/prostate.rds")) +> prostate <- readRDS(here("data/prostate.rds")) > ``` > > Examine the dataset (in which each row represents a single patient) to: @@ -142,21 +142,21 @@ of the challenges we are facing when working with high-dimensional data. > > > > > > ```{r dim-prostate, eval = FALSE} -> > dim(Prostate) #print the number of rows and columns +> > dim(prostate) #print the number of rows and columns > > ``` > > > > ```{r head-prostate, eval = FALSE} -> > names(Prostate) # examine the variable names -> > head(Prostate) #print the first 6 rows +> > names(prostate) # examine the variable names +> > head(prostate) #print the first 6 rows > > ``` > > > > ```{r pairs-prostate} -> > names(Prostate) #examine column names +> > names(prostate) #examine column names > > -> > pairs(Prostate) #plot each pair of variables against each other +> > pairs(prostate) #plot each pair of variables against each other > > ``` > > The `pairs()` function plots relationships between each of the variables in -> > the `Prostate` dataset. This is possible for datasets with smaller numbers +> > the `prostate` dataset. This is possible for datasets with smaller numbers > > of variables, but for datasets in which $p$ is larger it becomes difficult > > (and time consuming) to visualise relationships between all variables in the > > dataset. Even where visualisation is possible, fitting models to datasets @@ -211,7 +211,7 @@ explore why high correlations might be an issue in a Challenge. > ## Challenge 3 > > Use the `cor()` function to examine correlations between all variables in the -> `Prostate` dataset. Are some pairs of variables highly correlated using a threshold of +> `prostate` dataset. Are some pairs of variables highly correlated using a threshold of > 0.75 for the correlation coefficients? > > Use the `lm()` function to fit univariate regression models to predict patient @@ -224,11 +224,11 @@ explore why high correlations might be an issue in a Challenge. > > > ## Solution > > -> > Create a correlation matrix of all variables in the Prostate dataset +> > Create a correlation matrix of all variables in the `prostate` dataset > > > > ```{r cor-prostate} -> > cor(Prostate) -> > round(cor(Prostate), 2) # rounding helps to visualise the correlations +> > cor(prostate) +> > round(cor(prostate), 2) # rounding helps to visualise the correlations > > ``` > > > > As seen above, some variables are highly correlated. In particular, the @@ -238,15 +238,15 @@ explore why high correlations might be an issue in a Challenge. > > as predictors. 
> > > > ```{r univariate-prostate} -> > model1 <- lm(age ~ gleason, data = Prostate) -> > model2 <- lm(age ~ pgg45, data = Prostate) +> > model_gleason <- lm(age ~ gleason, data = prostate) +> > model_pgg45 <- lm(age ~ pgg45, data = prostate) > > ``` > > > > Check which covariates have a significant efffect > > > > ```{r summary-prostate} -> > summary(model1) -> > summary(model2) +> > summary(model_gleason) +> > summary(model_pgg45) > > ``` > > > > Based on these results we conclude that both `gleason` and `pgg45` have a @@ -257,8 +257,8 @@ explore why high correlations might be an issue in a Challenge. > > as predictors > > > > ```{r multivariate-prostate} -> > model3 <- lm(age ~ gleason + pgg45, data = Prostate) -> > summary(model3) +> > model_multivar <- lm(age ~ gleason + pgg45, data = prostate) +> > summary(model_multivar) > > ``` > > > > Although `gleason` and `pgg45` have statistically significant univariate effects, @@ -298,7 +298,7 @@ In this course, we will cover four methods that help in dealing with high-dimens (3) dimensionality reduction, and (4) clustering. Here are some examples of when each of these approaches may be used: -(1) Regression with numerous outcomes refers to situations in which there are +1. Regression with numerous outcomes refers to situations in which there are many variables of a similar kind (expression values for many genes, methylation levels for many sites in the genome) and when one is interested in assessing whether these variables are associated with a specific covariate of interest, @@ -308,7 +308,7 @@ predictor) could be fitted independently. In the context of high-dimensional molecular data, a typical example are *differential gene expression* analyses. We will explore this type of analysis in the *Regression with many outcomes* episode. -(2) Regularisation (also known as *regularised regression* or *penalised regression*) +2. Regularisation (also known as *regularised regression* or *penalised regression*) is typically used to fit regression models when there is a single outcome variable or interest but the number of potential predictors is large, e.g. there are more predictors than observations. Regularisation can help to prevent @@ -318,14 +318,14 @@ been often used when building *epigenetic clocks*, where methylation values across several thousands of genomic sites are used to predict chronological age. We will explore this in more detail in the *Regularised regression* episode. -(3) Dimensionality reduction is commonly used on high-dimensional datasets for +3. Dimensionality reduction is commonly used on high-dimensional datasets for data exploration or as a preprocessing step prior to other downstream analyses. For instance, a low-dimensional visualisation of a gene expression dataset may be used to inform *quality control* steps (e.g. are there any anomalous samples?). This course contains two episodes that explore dimensionality reduction techniques: *Principal component analysis* and *Factor analysis*. -(4) Clustering methods can be used to identify potential grouping patterns +4. Clustering methods can be used to identify potential grouping patterns within a dataset. A popular example is the *identification of distinct cell types* through clustering cells with similar gene expression patterns. The *K-means* episode will explore a specific method to perform clustering analysis. 
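To give a concrete flavour of approach (1), the short sketch below fits one univariate
linear model per feature and collects the p-values. It uses simulated data and a
made-up covariate (`age`) rather than any of the lesson datasets, purely for illustration;
the *Regression with many outcomes* episode shows how to carry out this kind of analysis
efficiently with the **`limma`** package.

```{r many-models-sketch, eval = FALSE}
## simulate expression values for 100 genes (rows) measured on 20 samples (columns)
## and a covariate of interest (age) for each sample
set.seed(42)
expr <- matrix(rnorm(100 * 20), nrow = 100)
age <- rnorm(20, mean = 50, sd = 10)

## fit one linear regression per gene and extract the p-value for the age effect
pvals <- apply(expr, 1, function(gene) {
    summary(lm(gene ~ age))$coefficients["age", "Pr(>|t|)"]
})
head(pvals)
```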
diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 2097bac7..b4ca084b 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -621,7 +621,7 @@ head(design_age) > that minimises the differences between outcome values and those values > predicted by using the covariates (or predictor variables). But how do we get > from a set of predictors and regression coefficients to predicted values? This -> is done via matrix multipliciation. The matrix of predictors is (matrix) +> is done via matrix multiplication. The matrix of predictors is (matrix) > multiplied by the vector of coefficients. That matrix is called the > **model matrix** (or design matrix). It has one row for each observation and > one column for each predictor plus (by default) one aditional column of ones @@ -669,8 +669,7 @@ of the input matrix. ```{r ebayes-toptab} toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age)) -orderEffSize <- rev(order(abs(toptab_age$logFC))) # order by effect size (absolute log-fold change) -head(toptab_age[orderEffSize, ]) +head(toptab_age) ``` The output of `topTable` includes the coefficient, here termed a log diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 1b3fc4e5..6e16306d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -46,74 +46,26 @@ knitr_fig_path("05-") # Introduction -Imagine a dataset which contains many variables ($p$), close to the total number -of rows in the dataset ($n$). Some of these variables are highly correlated and -several form groups which you might expect to represent the same overall effect. -Such datasets are challenging to analyse for several reasons, with the main -problem being how to reduce dimensionality in the dataset while retaining the -important features. +If a dataset contains many variables ($p$), it is likely that some of these +variables will be highly correlated. Variables may even be so highly correlated +that they represent the same overall effect. Such datasets are challenging +to analyse for several reasons, with the main problem being how to reduce +dimensionality in the dataset while retaining the important features. In this episode we will explore *principal component analysis* (PCA) as a -popular method of analysing high-dimensional data. PCA is an unsupervised +popular method of analysing high-dimensional data. PCA is a statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that -explain most of the variability in the original dataset. This is useful, -for example, during initial data exploration as it allows correlations among -data points to be observed and principal components to be calculated for -inclusion in further analysis (e.g. linear regression). An example of PCA might -be reducing several variables representing aspects of patient health -(blood pressure, heart rate, respiratory rate) into a single feature. +explain most of the variability in the original dataset. As an example, +PCA might reduce several variables representing aspects of patient health +(blood pressure, heart rate, respiratory rate) into a single feature capturing +an overarching "patient health" effect. 
This is useful from an exploratory point +of view, discovering how variables might be associated and combined. The +resulting principal component could also be used as an effect in further analysis +(e.g. linear regression). - -# Advantages and disadvantages of PCA - -Advantages: -* It is a relatively easy to use and popular method. -* Various software/packages are available to run a PCA. -* The calculations used in a PCA are easy to understand for statisticians and - non-statisticians alike. - -Disadvantages: -* It assumes that variables in a dataset are correlated. -* It is sensitive to the scale at which input variables are measured. - If input variables are measured at different scales, the variables - with large variance relative to the scale of measurement will have - greater impact on the principal components relative to variables with smaller - variance. In many cases, this is not desirable. -* It is not robust against outliers, meaning that very large or small data - points can have a large effect on the output of the PCA. -* PCA assumes a linear relationship between variables which is not always a - realistic assumption. -* It can be difficult to interpret the meaning of the principal components, - especially when including them in further analyses (e.g. inclusion in a linear - regression). - - -> ## Supervised vs unsupervised learning -> Most statistical problems fall into one of two categories: supervised or -> unsupervised learning. -> Examples of supervised learning problems include linear regression and include -> analyses in which each observation has both at least one independent variable -> ($x$) as well as a dependent variable ($y$). In supervised learning problems -> the aim is to predict the value of the response given future observations or -> to understand the relationship between the dependent variable and the -> predictors. In unsupervised learning for each observation there is no -> dependent variable ($y$), but only -> a series of independent variables. In this situation there is no need for -> prediction, as there is no dependent variable to predict (hence the analysis -> can be thought as being unsupervised by the dependent variable). Instead -> statistical analysis can be used to understand relationships between the -> independent variables or between observations themselves. Unsupervised -> learning problems often occur when analysing high-dimensional datasets in -> which there is no obvious dependent variable to be -> predicted, but the analyst would like to understand more about patterns -> between groups of observations or reduce dimensionality so that a supervised -> learning process may be used. -{: .callout} - - > ## Challenge 1 > > Descriptions of three datasets and research questions are given below. For @@ -153,32 +105,21 @@ Disadvantages: {: .challenge} -# What is a principal component? +# Principal component analysis +PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. 
Once the dataset has been transformed into principal components, we can extract a subset of the principal components in order of the variance they explain (starting with the first principal component that by definition explains the most variability, and then the second), giving new variables that explain a lot of the variability in the original dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. -```{r, eval=FALSE, echo=FALSE} -# A PCA is carried out by calculating a matrix of Pearson's correlations from -# the original dataset which shows how each of the variables in the dataset -# relate to each other. -``` +To see what these new principal components may look like, +Figure 1 shows biodiversity index versus percentage area left +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points, one for each individual observation called _scores_. +The red line on the plot represents the line passing through the scores (points) of the first principal component for each observation. +The angle that the first principal component line passes through the data points at is set to the direction with the highest +variability. The plotted first principal components can therefore be thought of reflecting the +effect in the data that has the highest variability. The second principal component explains the next highest amount of variability +in the data and is represented by the line perpendicular to the first (the green line). The second principal component can be thought of as +capturing the overall effect in the data that has the second-highest variability. -The first principal component is the direction of the data along which the -observations vary the most. The second principal component is the direction of -the data along which the observations show the next highest amount of variation. -For example, the figure below shows biodiversity index versus percentage area left -fallow for 50 farms in southern England. The red line represents the first -principal component direction of the data, which is the direction along which -there is greatest variability in the data. Projecting points onto this line -(i.e. by finding the location on the line closest to the point) would give a -vector of points with the greatest possible variance. The next highest amount -of variability in the data is represented by the line perpendicular to first -regression line which represents the second principal component (green line). - -The second principal component is a linear combination of the variables that -is uncorrelated with the first principal component. There are as many principal -components as there are variables in your dataset, but as we'll see, some are -more useful at explaining your data than others. By definition, the first -principal component explains more variation than other principal components. ```{r fig1, echo=FALSE, fig.cap="Scatter plots of the farm data with the first principal component in red and second principal component in green.", fig.alt="Side-by-side scatter plots of the farm data. The left figure displays a scatter plot of biodiversity versus percentage area fallow. The first principal component is shown by a red line and the second principal component is shown by a green line. 
The right figure displays the same scatter plot rotated so that the first principal component is horizontal and the second principal component is shown perpendicular to this."} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) @@ -199,23 +140,22 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics("../fig/pendulum.gif") ``` +> ## Mathematical description of PCA +> Mathematically, each principal component is a linear combination +> of the variables in the dataset. That is, the first principal +> component values or _scores_, $Z_1$, are a linear combination +> of variables in the dataset, $X_1...X_p$, given by +> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ +> where $a_{11}...a_{p1}$ represent principal component _loadings_. +{: .callout} -The first principal component's scores ($Z_1$) are calculated using the equation: - -$$ - Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p -$$ - -$X_1...X_p$ represents variables in the original dataset and $a_{11}...a_{p1}$ -represent principal component loadings, which can be thought of as the degree to -which each variable contributes to the calculation of the principal component. -We will come back to principal component scores and loadings further below. +In summary, the principal components' values are called _scores_. The loadings can +be thought of as the degree to which each original variable contributes to +the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. # How do we perform a PCA? -## A prostate cancer dataset - -The `prostate` dataset represents data from 97 +To illustrate how to perform PCA initially, we start with a low-dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. @@ -233,12 +173,9 @@ Columns include: - `lpsa` (log-transformed prostate specific antigen; level of PSA in blood). - `age` (patient age in years). -Here we will calculate principal component scores for each of the rows in this -dataset, using five principal components (one for each variable included in the -PCA). We will include five clinical variables in our PCA, each of the continuous -variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. Standard PCA is carried -out using continuous variables only. +We will perform PCA on the five continuous clinical variables in our dataset +so that we can create fewer variables representing clinical markers of cancer progression. + First, we will examine the `prostate` dataset (originally part of the **`lasso2`** package): @@ -261,9 +198,11 @@ pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) ``` -## Do we need to standardise the data? +## Do we need to scale the data? + +PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. 
Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA. -Now we compare the variances between variables in the dataset. +In this example, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first: ```{r var-hist, fig.cap="Histograms of two variables from the prostate data set.", fig.alt="Side-by-side histograms of two variables from the dataset, lweight on the left and lbph on the right. The histogram for the lweight data ranges from around 2 to 6 on the x axis, while the histogram for the lbph ranges from around -2 to 3 on the x axis."} apply(pros2, 2, var) @@ -272,10 +211,10 @@ hist(pros2$lweight, breaks = "FD") hist(pros2$lbph, breaks = "FD") ``` -Note that variance is greatest for `lbph` and lowest for `lweight`. It is clear -from this output that we need to scale each of these variables before including -them in a PCA analysis to ensure that differences in variances between variables -do not drive the calculation of principal components. In this example we +Note that variance is greatest for `lbph` and lowest for `lweight`. Since we +want each of the variables to be treated equally in our PCA, but there are large differences in the variances of the variables, we need to scale each of the variables before including +them in a PCA to ensure that differences in variances +do not drive the calculation of principal components. In this example, we standardise all five variables to have a mean of 0 and a standard deviation of 1. @@ -343,6 +282,7 @@ library("PCAtools") The input data (`pros2`) is in the form of a matrix and we need to take the transpose of this matrix and convert it to a data frame for use within the **`PCAtools`** package. We set the `scale = TRUE` argument to standardise the variables to have a mean 0 and standard deviation of 1 as above. + ```{r prcomp} pros2t <- data.frame(t(pros2)) #create transposed data frame colnames(pros2t)<-seq(1:dim(pros2t)[2]) #rename columns for use within PCAtools @@ -361,6 +301,7 @@ variation in the data, without having extra components that are unnecessary? Let's look at the relative importance of (percentage of the variance explained by) the components: + ```{r summ} prop.var <- round(pca.pros$variance, 4) #round to 4 decimal places prop.var @@ -374,8 +315,7 @@ around `r round(prop.var[[5]])`%. Let's visualise this. A plot of the amount of variance accounted for by each PC is also called a scree plot. Often, scree plots show a characteristic pattern -where initially, the variance drops -rapidly with each additional principal component. But then there is an “elbow” after which the +where initially, the variance drops rapidly with each additional principal component. But then there is an “elbow” after which the variance decreases more slowly. The total variance explained up to the elbow point is sometimes interpreted as structural variance that is relevant and should be retained versus noise which may be discarded after the elbow. 
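As a minimal base R illustration, the proportions of variance stored in `prop.var`
above can be plotted directly and the "elbow" judged by eye (this is just a quick
sketch; a dedicated scree plot function can produce a more polished figure):

```{r scree-base-r, fig.cap="Scree plot of the percentage of variance explained by each principal component of the prostate data.", fig.alt="A line and point plot of the percentage of variance explained against principal component number for the five principal components calculated from the prostate data."}
## plot the percentage of variance explained by each principal component;
## the "elbow" is the point at which this curve starts to flatten out
plot(prop.var, type = "b",
     xlab = "Principal component",
     ylab = "Percentage of variance explained")
```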
@@ -486,8 +426,28 @@ on the top and right of the plot are used to interpret the loadings, where loadings are scaled by the standard deviation of the principal components (`pca.pros$sdev`) times the square root the number of observations. -Finally, you need to know that principal component scores and rotations may have different slot names, -depending on the PCA implementation you use. Here are some examples: + +# Advantages and disadvantages of PCA + +Advantages: +* It is a relatively easy to use and popular method. +* Various software/packages are available to run a PCA. +* The calculations used in a PCA are simple to understand compared to other methods for dimension reduction. + +Disadvantages: +* It assumes that variables in a dataset are correlated. +* It is sensitive to the scale at which input variables are measured. + If input variables are measured at different scales, the variables + with large variance relative to the scale of measurement will have + greater impact on the principal components relative to variables with smaller + variance. In many cases, this is not desirable. +* It is not robust against outliers, meaning that very large or small data + points can have a large effect on the output of the PCA. +* PCA assumes a linear relationship between variables which is not always a + realistic assumption. +* It can be difficult to interpret the meaning of the principal components, + especially when including them in further analyses (e.g. inclusion in a linear + regression). # Using PCA to analyse gene expression data @@ -509,7 +469,7 @@ associated metadata, downloaded from the ```{r se} library("SummarizedExperiment") -cancer <- readRDS(here::here("data/cancer_expression.rds")) +cancer <- readRDS(here("data/cancer_expression.rds")) mat <- assay(cancer) metadata <- colData(cancer) ``` diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 6f442946..1ea085e5 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -32,31 +32,84 @@ knitr_fig_path("08-") # Introduction -High-dimensional data, especially in biological settings, has -many sources of heterogeneity. Some of these are stochastic variation -arising from measurement error or random differences between organisms. -In some cases, a known grouping causes this heterogeneity (sex, treatment -groups, etc). In other cases, this heterogeneity arises from the presence of -unknown subgroups in the data. **Clustering** is a set of techniques that allows -us to discover unknown groupings like this, which we can often use to -discover the nature of the heterogeneity we're investigating. - -**Cluster analysis** involves finding groups of observations that are more -similar to each other (according to some feature) than they are to observations -in other groups. Cluster analysis is a useful statistical tool for exploring -high-dimensional datasets as -visualising data with large numbers of features is difficult. It is commonly -used in fields such as bioinformatics, genomics, and image processing in which -large datasets that include many features are often produced. Once groups -(or clusters) of observations have been identified using cluster analysis, -further analyses or interpretation can be carried out on the groups, for -example, using metadata to further explore groups. +As we saw in previous episodes, visualising high-dimensional +data with a large amount of features is difficult and can +limit our understanding of the data and associated processes. 
+In some cases, heterogeneity in the data arises from a known grouping
+(sex, treatment groups, etc). In other cases, heterogeneity
+may arise from the presence of unknown subgroups in the data.
+While PCA can be used to reduce the dimension of the dataset
+into a smaller set of uncorrelated variables and factor analysis
+can be used to identify underlying factors, clustering is a set
+of techniques that allow us to discover unknown groupings.
+
+Cluster analysis involves finding groups of observations that
+are more similar to each other (according to some feature)
+than they are to observations in other groups and are thus
+likely to represent the same source of heterogeneity.
+Once groups (or clusters) of observations have been identified
+using cluster analysis, further analyses or interpretation can be
+carried out on the groups, for example, using metadata to further
+explore groups.
+
+Cluster analysis is commonly used to discover unknown groupings
+in fields such as bioinformatics, genomics, and image processing,
+in which large datasets that include many features are often produced.

There are various ways to look for clusters of observations in a dataset using
different *clustering algorithms*. One way of clustering data is to minimise
distance between observations within a cluster and maximise distance between
-proposed clusters. Clusters can be updated in an iterative process so that over
-time we can become more confident in size and shape of clusters.
+proposed clusters. Using this process, we can also iteratively update clusters
+so that we become more confident about the shape and size of the clusters.
+
+
+# What is K-means clustering?
+
+**K-means clustering** groups data points into a
+user-defined number of distinct, non-overlapping clusters.
+To create clusters of 'similar' data points, K-means
+clustering creates clusters that minimise the
+within-cluster variation and thus the amount that
+data points within a cluster differ from each other.
+The distance between data points within a cluster is
+used as a measure of within-cluster variation.
+
+To carry out K-means clustering, we first pick $k$ initial points as centres or
+"centroids" of our clusters. There are a few ways to choose these initial "centroids",
+and this is discussed below. Once we have picked the initial points, we then follow
+these two steps until appropriate clusters have been formed:
+
+1. Assign each data point to the cluster with the closest centroid.
+2. Update centroid positions as the average of the points in that cluster.
+
+We can see this process in action in this animation:
+
+```{r kmeans-animation, echo = FALSE, fig.cap="Animation showing the iterative process of K-means clustering.", fig.alt="An animation showing the iterative process of K-means clustering: each data point is assigned to the cluster with the closest centroid, centroid positions are updated as the average of the points in each cluster, and these two steps repeat until the clusters stabilise."}
+knitr::include_graphics("../fig/kmeans.gif")
+```
+
+While K-means has some advantages over other clustering methods (it is easy to implement
+and to understand), it does have some disadvantages, particularly the difficulty of choosing
+the initial clusters to which observations are assigned and the need for the user to specify the
+number of clusters that the data should be partitioned into.
+
+> ## Initialisation
+>
+> The algorithm used in K-means clustering finds a *local* rather than a
+> *global* optimum, so that results of clustering are dependent on the initial
+> cluster that each observation is randomly assigned to. This initial
+> configuration can have a significant effect on the final configuration of the
+> clusters, so dealing with this limitation is an important part
+> of K-means clustering.
Some strategies to deal with this problem are: +> - Choose $K$ points at random from the data as the cluster centroids. +> - Randomly split the data into $K$ groups, and then average these groups. +> - Use the K-means++ algorithm to choose initial values. +> +> These each have advantages and disadvantages. In general, it's good to be +> aware of this limitation of K-means clustering and that this limitation can +> be addressed by choosing a good initialisation method, initialising clusters +> manually, or running the algorithm from multiple different starting points. +> +{: .callout} # Believing in clusters diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 3a97cfc5..e35a4329 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -40,18 +40,20 @@ knitr_fig_path("09-") When analysing high-dimensional data in the life sciences, it is often useful to identify groups of similar data points to understand more about the relationships -within the dataset. In *hierarchical clustering* an algorithm groups similar +within the dataset. General clustering algorithms group similar data points (or observations) into groups (or clusters). This results in a set of clusters, where each cluster is distinct, and the data points within each cluster have similar characteristics. The clustering algorithm works by iteratively grouping data points so that different clusters may exist at different stages of the algorithm's progression. -Unlike K-means clustering, *hierarchical clustering* does not require the +Here, we describe *hierarchical clustering*. Unlike K-means clustering, +hierarchical clustering does not require the number of clusters $k$ to be specified by the user before the analysis is carried -out. Hierarchical clustering also provides an attractive *dendrogram*, a -tree-like diagram showing the degree of similarity between clusters. +out. +Hierarchical clustering also provides an attractive *dendrogram*, a +tree-like diagram showing the degree of similarity between clusters. The dendrogram is a key feature of hierarchical clustering. This tree-shaped graph allows the similarity between data points in a dataset to be visualised and the arrangement of clusters produced by the analysis to be illustrated. Dendrograms are created @@ -83,17 +85,17 @@ To start with, we measure distance of the dendrogram, each observation is considered to be in its own individual cluster. We start the clustering procedure by fusing the two observations that are most similar according to a distance matrix. Next, the next-most similar observations are fused -so that the total number of clusters is *number of observations* - 2 (see +so that the total number of clusters is *number of observations* minus 2 (see panel below). Groups of observations may then be merged into a larger cluster (see next panel below, green box). This process continues until all the observations are included in a single cluster. -```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Figure 1a: Example data showing two clusters of observation pairs"} +```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Example data showing two clusters of observation pairs.", fig.alt="Scatter plot of observations x2 versus x1. 
Two clusters of pairs of observations are shown by blue and red boxes, each grouping two observations that are close in their x and y distance."} knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Figure 1b: Example data showing fusing of one observation into larger cluster"} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time. Blue and red boxes contain two observations each. The two boxes contain points that are relatively far apart. A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` @@ -119,7 +121,7 @@ Looking at a heatmap of these data, we may spot some patterns -- many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. -```{r heatmap-noclust, echo=FALSE} +```{r heatmap-noclust, echo=FALSE, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of methylation level with individuals along the y axis and methylation sites along the x axis. Red colours indicate high methylation levels (up to around 4), blue colours indicate low methylation levels (to around -4) and white indicates methylation levels close to zero. There are many vertical blue and red stripes."} Heatmap(methyl_mat, name = "Methylation level", @@ -129,13 +131,20 @@ Heatmap(methyl_mat, ) ``` +Looking at a heatmap of these data, we may spot some patterns -- looking at the +vertical stripes, many columns +appear to have a similar methylation levels across all rows. However, they are +all quite jumbled at the moment, so it's hard to tell how many line up exactly. +In addition, it is challenging to tell how many groups containing similar methylation +levels we may have or what the similarities and differences are between groups. + We can order these data to make the patterns more clear using hierarchical clustering. To do this, we can change the arguments we pass to `Heatmap()` from the **`ComplexHeatmap`** package. `Heatmap()` groups features based on dissimilarity (here, Euclidean distance) and orders rows and columns to show clustering of features and observations. -```{r heatmap-clust} +```{r heatmap-clust, fig.cap="Heatmap of methylation data clustered by methylation sites and individuals.", fig.alt="Heatmap of methylation level with individuals along the y axis and methylation sites along the x axis, clustered by methylation sites and individuals. Red colours indicate high methylation levels (up to around 4), blue colours indicate low methylation levels (to around -4) and white indicates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. There are two, arguably three, white stripes towards the middle of the plot."} Heatmap(methyl_mat, name = "Methylation level", cluster_rows = TRUE, cluster_columns = TRUE, @@ -156,12 +165,15 @@ cause of this is -- it could be a batch effect, or a known grouping (e.g., old vs young samples). However, clustering like this can be a useful part of exploratory analysis of data to build hypotheses. 
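Under the hood, `Heatmap()` is essentially performing the kind of hierarchical
clustering that we build up step by step in the rest of this episode. As a brief
preview (a minimal sketch; the order in which `Heatmap()` displays the samples may
differ slightly, because it can reorder the dendrogram it draws):

```{r heatmap-clust-preview, eval = FALSE}
## hierarchical clustering of the samples (rows of methyl_mat) based on the
## Euclidean distances between their methylation profiles
sample_clust <- hclust(dist(methyl_mat))
## one ordering of the samples implied by this clustering
sample_clust$order
```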
+# Hierarchical clustering + Now, let's cover the inner workings of hierarchical clustering in more detail. -There are two things to consider before carrying out clustering: +Hierarchical clustering is a type of clustering that also allows us to estimate the number +of clusters. There are two things to consider before carrying out clustering: * how to define dissimilarity between observations using a distance matrix, and * how to define dissimilarity between clusters and when to fuse separate clusters. -# Creating the distance matrix +# Defining the dissimilarity between observations: creating the distance matrix Agglomerative hierarchical clustering is performed in two steps: calculating the distance matrix (containing distances between pairs of observations) and iteratively grouping observations into clusters using this matrix. @@ -184,7 +196,7 @@ clustering can have a big effect on the resulting tree. The decision of which distance matrix to use before carrying out hierarchical clustering depends on the type of data and question to be addressed. -# Linkage methods +# Defining the dissimilarity between clusters: Linkage methods The second step in performing hierarchical clustering after defining the distance matrix (or another function defining similarity between data points) @@ -206,27 +218,24 @@ method used. Complete linkage (the default in `hclust()`) works by computing all pairwise dissimilarities between data points in different clusters. For each pair of two clusters, -it sets their dissimilarity ($d$) to the maximum dissimilarity value observed +it sets their dissimilarity to the maximum dissimilarity value observed between any of these clusters' constituent points. The two clusters -with smallest value of $d$ are then fused. +with smallest dissimilarity value are then fused. # Computing a dendrogram -Dendograms are useful tools to visualise the grouping of points and clusters into bigger clusters. +Dendograms are useful tools that plot the grouping of points and clusters into bigger clusters. We can create and plot dendrograms in R using `hclust()` which takes -a distance matrix as input and creates the associated tree using hierarchical +a distance matrix as input and creates a tree using hierarchical clustering. Here we create some example data to carry out hierarchical clustering. Let's generate 20 data points in 2D space. Each -point belongs to one of three classes. Suppose we did not know which class -data points belonged to and we want to identify these via cluster analysis. -Hierarchical clustering carried out on the data can be used to produce a -dendrogram showing how the data is partitioned into clusters. But how do we -interpret this dendrogram? Let's explore this using our example data. +point is generated to belong to one of three classes/groups. Suppose we did not know which class +data points belonged to and we want to identify these via cluster analysis. Let's first generate and plot our data: -```{r plotexample} +```{r plotexample, fig.cap="Scatter plot of randomly-generated data x2 versus x1.", fig.alt="A scatter plot of randomly-generated data x2 versus x1. 
The points appear fairly randomly scattered, arguably centered towards the bottom of the plot."}
#First, create some example data with two variables x1 and x2
set.seed(450)
example_data <- data.frame(
@@ -247,6 +256,9 @@ text(
dist_m <- dist(example_data, method = "euclidean")
```

+Hierarchical clustering carried out on the data can be used to produce a
+dendrogram showing how the data is partitioned into clusters. But how do we interpret this dendrogram? Let's explore this using our example data in a Challenge.
+
> ## Challenge 1
>
> Use `hclust()` to implement hierarchical clustering using the
>
@@ -256,16 +268,18 @@ dist_m <- dist(example_data, method = "euclidean")
>
> > ## Solution:
> >
-> > ```{r plotclustex}
+> > ```{r plotclustex, fig.cap="Dendrogram of the example data.", fig.alt="A dendrogram of the example data, in which each of the 20 data points is a leaf at the bottom of the tree and leaves fuse into branches as the height increases."}
> > clust <- hclust(dist_m, method = "complete")
> > plot(clust)
> > ```
> {: .solution}
{: .challenge}

-This dendrogram shows similarities/differences in distances between data points.
-Each leaf of the dendrogram represents one of the 20 data points. These leaves
-fuse into branches as the height increases. Observations that are similar fuse into
+A dendrogram, such as the one generated in Challenge 1,
+shows similarities/differences in distances between data points.
+Each vertical line at the bottom of the dendrogram ('leaf') represents
+one of the 20 data points. These leaves
+fuse into fewer vertical lines ('branches') as the height increases. Observations that are similar fuse into
the same branches. The height at which any two data points fuse indicates how
different these two points are. Points that fuse at the top of the tree are
very different from each other compared with two
@@ -273,20 +287,56 @@ points that fuse at the bottom of the tree, which are quite similar. You can
see this by comparing the position of similar/dissimilar points according to
the scatterplot with their position on the tree.

-# Identifying clusters based on the dendrogram
+# Identifying the number of clusters
+
+To identify the number of clusters, we can make a horizontal cut through the dendrogram at a user-defined height.
+The sets of observations beneath this cut can be thought of as distinct clusters. Equivalently,
+we can count the vertical lines we encounter crossing the horizontal cut. For
+example, a cut at height 10 produces 2 downstream clusters for the dendrogram in Challenge 1,
+while a cut at height 4 produces 6 downstream clusters.

-To do this, we can make a horizontal cut through the dendrogram at a user-defined height.
-The sets of observations beneath this cut can be thought of as distinct clusters. For
-example, a cut at height 10 produces two downstream clusters while a cut at
-height 4 produces six downstream clusters.
+
+# Dendrogram visualisation
+
-We can cut the dendrogram to determine number of clusters at different heights
+We can first visualise cluster membership by highlighting branches in dendrograms.
+In this example, we calculate a distance matrix between
+samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`.
+
+```{r plot-clust-method-overview, fig.cap="Dendrogram with boxes around clusters.", fig.alt="A dendrogram for the methyl_mat data with boxes overlaid on the clusters obtained by cutting the tree into three clusters and into two clusters."}
+## create a distance matrix using euclidean distance
+distmat <- dist(methyl_mat)
+## hierarchical clustering using complete method
+clust_methyl <- hclust(distmat)
+## plot resulting dendrogram
+plot(clust_methyl)
+
+## draw border around three clusters
+rect.hclust(clust_methyl, k = 3, border = 2:6) #border argument specifies the colours
+## draw border around two clusters
+rect.hclust(clust_methyl, k = 2, border = 2:6)
+```
+
+We can also colour clusters downstream of a specified cut using `color_branches()`
+from the **`dendextend`** package.
+
+```{r plot-coloured-branches-overview, fig.cap="Dendrogram with branches coloured by cluster.", fig.alt="A dendrogram for the methyl_mat data with branches coloured according to the clusters obtained by cutting the tree at a height of 50."}
+## cut the tree at a height of 50
+cut <- cutree(clust_methyl, h = 50)
+
+library("dendextend")
+avg_dend_obj <- as.dendrogram(clust_methyl)
+## colour branches of dendrogram depending on clusters
+plot(color_branches(avg_dend_obj, h = 50))
+```
+
+## Numerical visualisation
+
+In addition to visualising clusters directly on the dendrogram, we can cut
+the dendrogram to determine the number of clusters at different heights
using `cutree()`. This function cuts a dendrogram into several
groups (or clusters) where the number of desired groups is controlled by the
user, by defining either `k` (number of groups) or `h` (height at which tree is
-cut).
+cut). The function outputs the cluster label of each data point, in the order in which the points appear in the data.

-```{r cutree}
+```{r cutree, fig.cap="Scatter plot of data x2 versus x1, coloured by cluster.", fig.alt="A scatter plot of the example data x2 versus x1, coloured by 8 different clusters. There are two clusters in the bottom right of the plot, 4 clusters in the top left of the plot, and a final cluster consisting of one point in the top right of the plot."}
## k is a user defined parameter determining
## the desired number of clusters at which to cut the treee
cutree(clust, k = 3)
@@ -305,7 +355,7 @@ count(example_cl, cluster)

#plot cluster each point belongs to on original scatterplot
library(ggplot2)
-ggplot(example_cl, aes(x = x2, y = x1, color = factor(cluster))) + geom_point()
+ggplot(example_cl, aes(x = x1, y = x2, color = factor(cluster))) + geom_point()
```

Note that this cut produces 8 clusters (two before the cut and another six
@@ -348,7 +398,7 @@ In addition to visualising cluster identity in scatter plots, it is also possibl
highlight branches in dentrograms. In this example, we calculate a distance matrix between
samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`.

-```{r plot-clust-method}
+```{r plot-clust-method, fig.cap="Dendrogram with boxes around clusters.", fig.alt="A dendrogram for the methyl_mat data with boxes overlaid on clusters. There are 5 boxes in total, each indicating separate clusters."}
## create a distance matrix using euclidean method
distmat <- dist(methyl_mat)
## hierarchical clustering using complete method
clust <- hclust(distmat)
## plot resulting dendrogram
plot(clust)

## draw border around three clusters
rect.hclust(clust, k = 3, border = 2:6) #border argument specifies the colours
## draw border around two clusters
rect.hclust(clust, k = 2, border = 2:6)
```

We can also colour clusters downstream of a specified cut using `color_branches()`
from the **`dendextend`** package.

-```{r plot-coloured-branches}
+```{r plot-coloured-branches, fig.cap="Dendrogram with coloured branches delineating different clusters.", fig.alt="A dendrogram with the different clusters in 4 different colours."}
## cut tree at height = 4
cut <- cutree(clust, h = 50)

library("dendextend")
avg_dend_obj <- as.dendrogram(clust)
## colour branches of dendrogram depending on clusters
plot(color_branches(avg_dend_obj, h = 50))
```

# The effect of different linkage methods

Now let us look into changing the default behaviour of `hclust()`. Imagine we
have two crescent-shaped point clouds as shown below.
-```{r crescents} +```{r crescents, fig.cap="Scatter plot of data simulated according to two crescent-shaped point clouds.", fig.alt="A scatter plot of data simulated to form two crescent shapes. The crescents are horizontally orientated with a a rough line of vertical symmetry."} # These two functions are to help us make crescents. Don't worry it you do not understand all this code. # The importent bit is the object "cres", which consists of two columns (x and y coordinates of two crescents). is.insideCircle <- function(co, r=0.5, offs=c(0,0)){ @@ -399,7 +449,7 @@ plot(cres) We might expect that the crescents are resolved into separate clusters. But if we run hierarchical clustering with the default arguments, we get this: -```{r cresClustDefault} +```{r cresClustDefault, fig.cap="Scatter plot of crescent-shaped simulated data, coloured according to clusters calculated using Euclidean distance.", fig.alt="A scatter plot of the crescent-shaped simulated data calculated using Euclidean distance. The points are coloured in black or red according to their membership to 2 clusters. The points in the tails of each crescent have inherited the colour of the opposite crescent."} cresClass <- cutree(hclust(dist(cres)), k=2) # save partition for colouring plot(cres, col=cresClass) # colour scatterplot by partition ``` @@ -468,7 +518,7 @@ other crescent and so it splits both crescents. So far, we've been using Euclidean distance to define the dissimilarity or distance between observations. However, this isn't always the best -metric for how dissimilar different observations are. Let's make an +metric for how dissimilar the observations are. Let's make an example to demonstrate. Here, we're creating two samples each with ten observations of random noise: @@ -498,24 +548,26 @@ head(cor_example) ``` If we plot a heatmap of this, we can see that `sample_a` and `sample_b` are -grouped together because they have a small distance to each other, despite +grouped together because they have a small distance from each other, despite being quite different in their pattern across the different features. In contrast, `sample_a` and `sample_c` are very distant, despite having *exactly* the same pattern across the different features. -```{r heatmap-cor-example} +```{r heatmap-cor-example, fig.cap="Heatmap of simulated data.", fig.alt="Heatmap of simulated data: feature versus sample. The grid cells of the heatmap are coloured from red (high) to blue (low) according to value of the simulated data."} Heatmap(as.matrix(cor_example)) ``` We can see that more clearly if we do a line plot: -```{r lineplot-cor-example} +```{r lineplot-cor-example, fig.cap="Line plot of simulated value versus observation number, coloured by sample.", fig.alt="A line plot of simulated value versus observation number, coloured by sample. Samples a and b are concentrated at the bottom of the plot, while sample c is concentrated at the top of the plot. Samples a and c have exactly the same vertical pattern."} ## create a blank plot (type = "n" means don't draw anything) ## with an x range to hold the number of features we have. ## the range of y needs to be enough to show all the values for every feature plot( 1:nrow(cor_example), rep(range(cor_example), 5), - type = "n" + type = "n", + xlab = "Feature number", + ylab = "Value" ) ## draw a red line for sample_a lines(cor_example$sample_a, col = "firebrick") @@ -531,7 +583,7 @@ the values, they have a high distance to each other. 
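Before clustering, we can make this concrete by comparing the two measures directly
for the `cor_example` data frame created above: the Euclidean distances between the
samples and the correlations between them.

```{r dist-vs-cor-check}
## Euclidean distances between the samples: sample_a and sample_c are far apart,
## even though they share the same pattern across the features
dist(t(cor_example))

## correlations between the samples: sample_a and sample_c are highly correlated
## because they share the same pattern, while sample_b follows a different one
cor(cor_example)
```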
We can see that if we cluster and plot the data ourselves using Euclidean distance: -```{r clust-euc-cor-example} +```{r clust-euc-cor-example, fig.cap="Dendogram of the example simulated data clustered according to Euclidean distance.", fig.alt="A dendogram of the example simulated data clustered according to Euclidean distance. The dendogram shows that sample c definitively forms its own cluster for any cut height and samples a and b merge into a cluster at a height of around 6."} clust_dist <- hclust(dist(t(cor_example))) plot(clust_dist) ``` @@ -546,7 +598,7 @@ we can use `1 - cor(x)` as the distance metric. The input to `hclust()` must be a `dist` object, so we also need to call `as.dist()` on it before passing it in. -```{r clust-cor-cor-example} +```{r clust-cor-cor-example, fig.cap="Dendogram of the example simulated data clustered according to correlation.", fig.alt="A dendogram of the example simulated data clustered according to correlation. The dendogram shows that sample b definitively forms its own cluster and samples a and c form definitively form their own cluster for any cut height."} cor_as_dist <- as.dist(1 - cor(cor_example)) clust_cor <- hclust(cor_as_dist) plot(clust_cor) @@ -557,10 +609,10 @@ are grouped together, while `sample_b` is seen as distant because it has a different pattern, even though its values are closer to `sample_a`. Using your own distance function is often useful, especially if you have missing or unusual data. It's often possible to use correlation and other custom -distance functions to functions that perform hierarchical clustering, such as +dissimilarity measures in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: -```{r heatmap-cor-cor-example} +```{r heatmap-cor-cor-example, fig.cap="Heatmaps of features versus samples clustered in the samples according to correlation.", fig.alt="Heatmaps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} ## pheatmap allows you to select correlation directly pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation") ## Using the built-in stats::heatmap @@ -596,7 +648,7 @@ Dunn index, the better defined the clusters. Let's calculate the Dunn index for clustering carried out on the `methyl_mat` dataset using the **`clValid`** package. -```{r plot-clust-dunn} +```{r plot-clust-dunn, fig.cap="Dendogram for clustering of methylation data.", fig.alt="A dendogram for clustering of methylation data. Identical to that in the section Highlighting dendrogram branches, without the colour overlay to show clusters."} ## calculate dunn index ## (ratio of the smallest distance between obs not in the same cluster ## to the largest intra-cluster distance) @@ -660,7 +712,7 @@ between sets of clusters with larger values being preferred. The figures below show in a more systematic way how changing the values of `k` and `h` using `cutree()` affect the Dunn index. -```{r hclust-fig3, echo=TRUE, fig.cap="Figure 3: Dunn index"} +```{r hclust-fig3, echo=TRUE, fig.cap="Dunn index versus cut height for methylation data.", fig.alt="Scatter plot of Dunn index versus cut height for methylation data. The Dunn index is high (around 1.6) for height values up to 20. 
The Dunn index drops around height 20 and the points fluctuate around 0.8 and 1 as height increases."} h_seq <- 70:10 h_dunn <- sapply(h_seq, function(x) dunn(distance = distmat, cutree(clust, h = x))) k_seq <- seq(2, 10) @@ -673,7 +725,7 @@ is not very useful - cutting the given tree at a low `h` value like 15 leads to ending up each in its own cluster. More relevant is the second maximum in the plot, around `h=55`. Looking at the dendrogram, this corresponds to `k=4`. -```{r hclust-fig4, echo=TRUE, fig.cap="Figure 4: Dunn index continued"} +```{r hclust-fig4, echo=TRUE, fig.cap="Scatter plot of Dunn index versus the number of clusters for the methylation data.", fig.alt="A scatter plot of the Dunn index versus the number of clusters for the methylation data. The points appear randomly scattered around the plot area between Dunn indices of 0.77 to 0.85, apart from for 4 clusters where the Dunn index reaches just over 0.88."} plot(k_seq, k_dunn, xlab = "Number of clusters (k)", ylab = "Dunn index") grid() ``` @@ -685,7 +737,7 @@ clustering results, due to its high sensitivity to noise in the dataset. An alternative is to use silhouette scores (see the k-means clustering episode). As we said before (see previous episode), clustering is a non-trivial task. -It is important to think about the nature of your data and your expactations +It is important to think about the nature of your data and your expectations rather than blindly using a some algorithm for clustering or cluster validation. # Further reading