From 21f4858d62f3df5f7a0ad71b5ffb4d5bb889079e Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:20:57 +0000
Subject: [PATCH 01/11] rewrite introduction, tasks 1-3

mainly to motivate by clarifying differences compared to pca and fa since these are already discussed
---
 _episodes_rmd/06-k-means.Rmd | 46 ++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 0d34c971..5a8ef56d 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -32,31 +32,35 @@ knitr_fig_path("08-")
 
 # Introduction
 
-High-dimensional data, especially in biological settings, has
-many sources of heterogeneity. Some of these are stochastic variation
-arising from measurement error or random differences between organisms.
-In some cases, a known grouping causes this heterogeneity (sex, treatment
-groups, etc). In other cases, this heterogeneity arises from the presence of
-unknown subgroups in the data. **Clustering** is a set of techniques that allows
-us to discover unknown groupings like this, which we can often use to
-discover the nature of the heterogeneity we're investigating.
-
-**Cluster analysis** involves finding groups of observations that are more
-similar to each other (according to some feature) than they are to observations
-in other groups. Cluster analysis is a useful statistical tool for exploring
-high-dimensional datasets as
-visualising data with large numbers of features is difficult. It is commonly
-used in fields such as bioinformatics, genomics, and image processing in which
-large datasets that include many features are often produced. Once groups
-(or clusters) of observations have been identified using cluster analysis,
-further analyses or interpretation can be carried out on the groups, for
-example, using metadata to further explore groups.
+As we saw in previous episodes, visualising high-dimensional
+data with a large amount of features is difficult and can
+limit our understanding of the data and associated processes.
+In some cases, a known grouping causes this heterogeneity
+(sex, treatment groups, etc). In other cases, heterogeneity
+may arise from the presence of unknown subgroups in the data.
+While PCA can be used to reduce the dimension of the dataset
+into a smaller set of uncorrelated variables and factor analysis
+can be used to identify underlying factors, clustering is a set
+of techniques that allow us to discover unknown groupings.
+
+Cluster analysis involves finding groups of observations that
+are more similar to each other (according to some feature)
+than they are to observations in other groups and are thus
+likely to represent the same source of heterogeneity.
+Once groups (or clusters) of observations have been identified
+using cluster analysis, further analyses or interpretation can be
+carried out on the groups, for example, using metadata to further
+explore groups.
+
+Cluster analysis is commonly used to discover unknown groupings
+in fields such as bioinformatics, genomics, and image processing,
+in which large datasets that include many features are often produced.
 
 There are various ways to look for clusters of observations in a dataset using
 different *clustering algorithms*. One way of clustering data is to minimise
 distance between observations within a cluster and maximise distance between
-proposed clusters. Clusters can be updated in an iterative process so that over
-time we can become more confident in size and shape of clusters.
+proposed clusters. Using this process, we can also iteratively update clusters
+so that we become more confident about the shape and size of the clusters.
 
 
 # Believing in clusters

From 89f7d9b715fd4a8bbb1229fcdcf87f2c3e231f4c Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:22:47 +0000
Subject: [PATCH 02/11] move believing in clusters to after methodology, task 4

think it's clearer to explain believing in clusters after fully describing what clusters are
---
 _episodes_rmd/06-k-means.Rmd | 90 ++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 5a8ef56d..0f53af6f 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -63,51 +63,6 @@ proposed clusters. Using this process, we can also iteratively update clusters
 so that we become more confident about the shape and size of the clusters.
 
 
-# Believing in clusters
-
-When using clustering, it's important to realise that data may seem to
-group together even when these groups are created randomly. It's especially
-important to remember this when making plots that add extra visual aids to
-distinguish clusters.
-For example, if we cluster data from a single 2D normal distribution and draw
-ellipses around the points, these clusters suddenly become almost visually
-convincing. This is a somewhat extreme example, since there is genuinely no
-heterogeneity in the data, but it does reflect what can happen if you allow
-yourself to read too much into faint signals.
-
-Let's explore this further using an example. We create two columns of data
-('x' and 'y') and partition these data into three groups ('a', 'b', 'c')
-according to data values. We then plot these data and their allocated clusters
-and put ellipses around the clusters using the `stat_ellipse` function
-in `ggplot`.
-
-```{r fake-cluster, echo = FALSE}
-set.seed(11)
-library("MASS")
-library("ggplot2")
-data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2))
-data <- as.data.frame(data)
-colnames(data) <- c("x", "y")
-
-data$cluster <- ifelse(
-  data$y < (data$x * -0.06 + 0.9),
-  "a",
-  ifelse(
-    data$y < 1.15,
-    "b",
-    "c"
-  )
-)
-ggplot(data, aes(x, y, colour = cluster)) +
-  geom_point() +
-  stat_ellipse()
-```
-The randomly created data used here appear to form three clusters when we
-plot the data. Putting ellipses around the clusters can further convince us
-that the clusters are 'real'. But how do we tell if clusters identified
-visually are 'real'?
-
-
 # What is K-means clustering?
 
 **K-means clustering** is a clustering method which groups data points into a
@@ -155,6 +110,51 @@ number of clusters that the data should be partitioned into.
 > {: .callout}
 
 
+
+# Believing in clusters
+
+When using clustering, it's important to realise that data may seem to
+group together even when these groups are created randomly. It's especially
+important to remember this when making plots that add extra visual aids to
+distinguish clusters.
+For example, if we cluster data from a single 2D normal distribution and draw
+ellipses around the points, these clusters suddenly become almost visually
+convincing. This is a somewhat extreme example, since there is genuinely no
+heterogeneity in the data, but it does reflect what can happen if you allow
+yourself to read too much into faint signals.
+
+Let's explore this further using an example. We create two columns of data
+('x' and 'y') and partition these data into three groups ('a', 'b', 'c')
+according to data values. We then plot these data and their allocated clusters
+and put ellipses around the clusters using the `stat_ellipse` function
+in `ggplot`.
+
+```{r fake-cluster, echo = FALSE}
+set.seed(11)
+library("MASS")
+library("ggplot2")
+data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2))
+data <- as.data.frame(data)
+colnames(data) <- c("x", "y")
+
+data$cluster <- ifelse(
+  data$y < (data$x * -0.06 + 0.9),
+  "a",
+  ifelse(
+    data$y < 1.15,
+    "b",
+    "c"
+  )
+)
+ggplot(data, aes(x, y, colour = cluster)) +
+  geom_point() +
+  stat_ellipse()
+```
+The randomly created data used here appear to form three clusters when we
+plot the data. Putting ellipses around the clusters can further convince us
+that the clusters are 'real'. But how do we tell if clusters identified
+visually are 'real'?
+
 # K-means clustering applied to single-cell RNAseq data
 
 Let's carry out K-means clustering in `R` using some real high-dimensional data.

From e9e81cdc1a660e89cade66f02b9e0202787d250c Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:29:06 +0000
Subject: [PATCH 03/11] rewrite initial description of k means clustering, tasks 5 and 6

---
 _episodes_rmd/06-k-means.Rmd | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 0f53af6f..b0356dcb 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -65,12 +65,16 @@ so that we become more confident about the shape and size of the clusters.
 
 # What is K-means clustering?
 
-**K-means clustering** is a clustering method which groups data points into a
-user-defined number of distinct non-overlapping clusters. In K-means clustering
-we are interested in minimising the *within-cluster variation*. This is the amount that
-data points within a cluster differ from each other. In K-means clustering, the distance
-between data points within a cluster is used as a measure of within-cluster variation.
-Using a specified clustering algorithm like K-means clustering increases our confidence
-Using a specified clustering algorithm like K-means clustering increases our confidence +**K-means clustering** groups data points into a +user-defined number of distinct, non-overlapping clusters. +To create clusters of 'similar' data points, K-means +clustering creates clusters that minimise the +within-cluster variation adn thus the amount that +data points within a cluster differ from each other. +The distance between data points within a cluster is +used as a measure of within-cluster variation. +Using a specified clustering algorithm like K-means clustering +increases our confidence that our data can be partitioned into groups. To carry out K-means clustering, we first pick $k$ initial points as centres or From c2f865307c63f3b9a96d67f3df0b2867970c3017 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Wed, 6 Mar 2024 10:30:03 +0000 Subject: [PATCH 04/11] remove final sentence from intro to method, task 7 unclear what a specified clustering algorithm is and how this increases our confidence that data can be partitioned into groups at this stage --- _episodes_rmd/06-k-means.Rmd | 3 --- 1 file changed, 3 deletions(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index b0356dcb..af767ad5 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -73,9 +73,6 @@ within-cluster variation adn thus the amount that data points within a cluster differ from each other. The distance between data points within a cluster is used as a measure of within-cluster variation. -Using a specified clustering algorithm like K-means clustering -increases our confidence -that our data can be partitioned into groups. To carry out K-means clustering, we first pick $k$ initial points as centres or "centroids" of our clusters. 
There are a few ways to choose these initial "centroids", From 0506bf18c8bb64b264096465b63c0d5604ed7150 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Wed, 6 Mar 2024 10:33:56 +0000 Subject: [PATCH 05/11] remove mention of random initialisation in the method and clarify what convergence looks like, tasks 8 and 9 Picking initial points randomly here may be misleading for someone just looking up the method from this section. Have simply omitted and said that this is discussed below. Also, have removed the word convergence in favour of a description of what convergence looks like --- _episodes_rmd/06-k-means.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index af767ad5..d815fd0a 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -75,9 +75,9 @@ The distance between data points within a cluster is used as a measure of within-cluster variation. To carry out K-means clustering, we first pick $k$ initial points as centres or -"centroids" of our clusters. There are a few ways to choose these initial "centroids", -but for simplicity let's imagine we just pick three random co-ordinates. -We then follow these two steps until convergence: +"centroids" of our clusters. There are a few ways to choose these initial "centroids" +and this is discussed below. Once we have picked intitial points, we then follow +these two steps until appropriate clusters have been formed: 1. Assign each data point to the cluster with the closest centroid 2. 
Update centroid positions as the average of the points in that cluster From ba029bae7b427be2db0dd0b5c2f15f7f934e0d11 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 12:19:55 +0000 Subject: [PATCH 06/11] add here to k means Co-authored-by: Alan O'Callaghan --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 60b7cca7..b53830cf 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed: We can see this process in action in this animation: ```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics("../fig/kmeans.gif") +knitr::include_graphics(here::here("fig/kmeans.gif")) ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, namely difficulties in identifying From a1103da687c4f778234abc1afd99260c58fbe863 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:31:43 +0000 Subject: [PATCH 07/11] Update _episodes_rmd/06-k-means.Rmd --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index b53830cf..60b7cca7 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed: We can see this process in action in this animation: ```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics(here::here("fig/kmeans.gif")) +knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, namely difficulties in identifying From fe13a414e8acefa0c913149a3bc7992ea21edc95 Mon Sep 17 
00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:30:37 +0000 Subject: [PATCH 08/11] add full stop Co-authored-by: Alan O'Callaghan --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 60b7cca7..390f68c9 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -80,7 +80,7 @@ and this is discussed below. Once we have picked intitial points, we then follow these two steps until appropriate clusters have been formed: 1. Assign each data point to the cluster with the closest centroid -2. Update centroid positions as the average of the points in that cluster +2. Update centroid positions as the average of the points in that cluster. We can see this process in action in this animation: From 7f7b8ffc8ecdc15249121c8e270b16b4e0929b15 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:31:40 +0000 Subject: [PATCH 09/11] typo fix adn --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 390f68c9..2fd2607f 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -69,7 +69,7 @@ so that we become more confident about the shape and size of the clusters. user-defined number of distinct, non-overlapping clusters. To create clusters of 'similar' data points, K-means clustering creates clusters that minimise the -within-cluster variation adn thus the amount that +within-cluster variation and thus the amount that data points within a cluster differ from each other. The distance between data points within a cluster is used as a measure of within-cluster variation. 

From ee8917a682c861a54c6e3637f0f63eeaf2b13901 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 25 Mar 2024 15:32:32 +0000
Subject: [PATCH 10/11] namely to particularly

---
 _episodes_rmd/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 2fd2607f..071c5e7e 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -88,7 +88,7 @@ We can see this process in action in this animation:
 knitr::include_graphics("../fig/kmeans.gif")
 ```
 While K-means has some advantages over other clustering methods (easy to implement and
-to understand), it does have some disadvantages, namely difficulties in identifying
+to understand), it does have some disadvantages, particularly difficulties in identifying
 initial clusters which observations belong to and the need for the user to specifiy the
 number of clusters that the data should be partitioned into.
 

From 67a85aa860213df33559d419b7168a08846e9156 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 25 Mar 2024 15:33:08 +0000
Subject: [PATCH 11/11] specifiy typo fix

---
 _episodes_rmd/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 071c5e7e..1ea085e5 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -89,7 +89,7 @@ knitr::include_graphics("../fig/kmeans.gif")
 ```
 While K-means has some advantages over other clustering methods (easy to implement and
 to understand), it does have some disadvantages, particularly difficulties in identifying
-initial clusters which observations belong to and the need for the user to specifiy the
+initial clusters which observations belong to and the need for the user to specify the
 number of clusters that the data should be partitioned into.
 
 > ## Initialisation