From 77bb5d814c1f46ade1549fc7c05752c4d240c3b3 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:20:57 +0000
Subject: [PATCH 01/11] rewrite introduction, tasks 1-3

mainly to motivate by clarifying differences compared to pca and fa since
these are already discussed
---
 episodes/06-k-means.Rmd | 46 ++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index 0d34c971..5a8ef56d 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -32,31 +32,35 @@ knitr_fig_path("08-")

# Introduction

-High-dimensional data, especially in biological settings, has
-many sources of heterogeneity. Some of these are stochastic variation
-arising from measurement error or random differences between organisms.
-In some cases, a known grouping causes this heterogeneity (sex, treatment
-groups, etc). In other cases, this heterogeneity arises from the presence of
-unknown subgroups in the data. **Clustering** is a set of techniques that allows
-us to discover unknown groupings like this, which we can often use to
-discover the nature of the heterogeneity we're investigating.
-
-**Cluster analysis** involves finding groups of observations that are more
-similar to each other (according to some feature) than they are to observations
-in other groups. Cluster analysis is a useful statistical tool for exploring
-high-dimensional datasets as
-visualising data with large numbers of features is difficult. It is commonly
-used in fields such as bioinformatics, genomics, and image processing in which
-large datasets that include many features are often produced. Once groups
-(or clusters) of observations have been identified using cluster analysis,
-further analyses or interpretation can be carried out on the groups, for
-example, using metadata to further explore groups.
+As we saw in previous episodes, visualising high-dimensional
+data with a large amount of features is difficult and can
+limit our understanding of the data and associated processes.
+In some cases, a known grouping causes this heterogeneity
+(sex, treatment groups, etc). In other cases, heterogeneity
+may arise from the presence of unknown subgroups in the data.
+While PCA can be used to reduce the dimension of the dataset
+into a smaller set of uncorrelated variables and factor analysis
+can be used to identify underlying factors, clustering is a set
+of techniques that allow us to discover unknown groupings.
+
+Cluster analysis involves finding groups of observations that
+are more similar to each other (according to some feature)
+than they are to observations in other groups and are thus
+likely to represent the same source of heterogeneity.
+Once groups (or clusters) of observations have been identified
+using cluster analysis, further analyses or interpretation can be
+carried out on the groups, for example, using metadata to further
+explore groups.
+
+Cluster analysis is commonly used to discover unknown groupings
+in fields such as bioinformatics, genomics, and image processing,
+in which large datasets that include many features are often produced.

There are various ways to look for clusters of observations in a dataset using
different *clustering algorithms*. One way of clustering data is to minimise
distance between observations within a cluster and maximise distance between
-proposed clusters. Clusters can be updated in an iterative process so that over
-time we can become more confident in size and shape of clusters.
+proposed clusters. Using this process, we can also iteratively update clusters
+so that we become more confident about the shape and size of the clusters.

# Believing in clusters

From 8917bae3b14fc4d7cbc19c21dfe5d7f8521fe421 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:22:47 +0000
Subject: [PATCH 02/11] move believing in clusters to after methodology, task 4

think it's clearer to explain believing in clusters after fully describing
what clusters are
---
 episodes/06-k-means.Rmd | 90 ++++++++++++++++++++---------------------
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index 5a8ef56d..0f53af6f 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -63,51 +63,6 @@ proposed clusters. Using this process, we can also iteratively update clusters
so that we become more confident about the shape and size of the clusters.


-# Believing in clusters
-
-When using clustering, it's important to realise that data may seem to
-group together even when these groups are created randomly. It's especially
-important to remember this when making plots that add extra visual aids to
-distinguish clusters.
-For example, if we cluster data from a single 2D normal distribution and draw
-ellipses around the points, these clusters suddenly become almost visually
-convincing. This is a somewhat extreme example, since there is genuinely no
-heterogeneity in the data, but it does reflect what can happen if you allow
-yourself to read too much into faint signals.
-
-Let's explore this further using an example. We create two columns of data
-('x' and 'y') and partition these data into three groups ('a', 'b', 'c')
-according to data values. We then plot these data and their allocated clusters
-and put ellipses around the clusters using the `stat_ellipse` function
-in `ggplot`.
-
-```{r fake-cluster, echo = FALSE}
-set.seed(11)
-library("MASS")
-library("ggplot2")
-data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2))
-data <- as.data.frame(data)
-colnames(data) <- c("x", "y")
-
-data$cluster <- ifelse(
-  data$y < (data$x * -0.06 + 0.9),
-  "a",
-  ifelse(
-    data$y < 1.15,
-    "b",
-    "c"
-  )
-)
-ggplot(data, aes(x, y, colour = cluster)) +
-  geom_point() +
-  stat_ellipse()
-```
-The randomly created data used here appear to form three clusters when we
-plot the data. Putting ellipses around the clusters can further convince us
-that the clusters are 'real'. But how do we tell if clusters identified
-visually are 'real'?
-
-
# What is K-means clustering?

**K-means clustering** is a clustering method which groups data points into a
@@ -155,6 +110,51 @@ number of clusters that the data should be partitioned into.
>
{: .callout}


+
+# Believing in clusters
+
+When using clustering, it's important to realise that data may seem to
+group together even when these groups are created randomly. It's especially
+important to remember this when making plots that add extra visual aids to
+distinguish clusters.
+For example, if we cluster data from a single 2D normal distribution and draw
+ellipses around the points, these clusters suddenly become almost visually
+convincing. This is a somewhat extreme example, since there is genuinely no
+heterogeneity in the data, but it does reflect what can happen if you allow
+yourself to read too much into faint signals.
+ +Let's explore this further using an example. We create two columns of data +('x' and 'y') and partition these data into three groups ('a', 'b', 'c') +according to data values. We then plot these data and their allocated clusters +and put ellipses around the clusters using the `stat_ellipse` function +in `ggplot`. + +```{r fake-cluster, echo = FALSE} +set.seed(11) +library("MASS") +library("ggplot2") +data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2)) +data <- as.data.frame(data) +colnames(data) <- c("x", "y") + +data$cluster <- ifelse( + data$y < (data$x * -0.06 + 0.9), + "a", + ifelse( + data$y < 1.15, + "b", + "c" + ) +) +ggplot(data, aes(x, y, colour = cluster)) + + geom_point() + + stat_ellipse() +``` +The randomly created data used here appear to form three clusters when we +plot the data. Putting ellipses around the clusters can further convince us +that the clusters are 'real'. But how do we tell if clusters identified +visually are 'real'? + # K-means clustering applied to single-cell RNAseq data Let's carry out K-means clustering in `R` using some real high-dimensional data. From 26e91616fd95d6163a1445c524c54bc173bef7f3 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Wed, 6 Mar 2024 10:29:06 +0000 Subject: [PATCH 03/11] rewrite initial description of k means clustering, tasks 5 and 6 --- episodes/06-k-means.Rmd | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd index 0f53af6f..b0356dcb 100644 --- a/episodes/06-k-means.Rmd +++ b/episodes/06-k-means.Rmd @@ -65,12 +65,16 @@ so that we become more confident about the shape and size of the clusters. # What is K-means clustering? -**K-means clustering** is a clustering method which groups data points into a -user-defined number of distinct non-overlapping clusters. In K-means clustering -we are interested in minimising the *within-cluster variation*. This is the amount that -data points within a cluster differ from each other. In K-means clustering, the distance -between data points within a cluster is used as a measure of within-cluster variation. -Using a specified clustering algorithm like K-means clustering increases our confidence +**K-means clustering** groups data points into a +user-defined number of distinct, non-overlapping clusters. +To create clusters of 'similar' data points, K-means +clustering creates clusters that minimise the +within-cluster variation adn thus the amount that +data points within a cluster differ from each other. +The distance between data points within a cluster is +used as a measure of within-cluster variation. +Using a specified clustering algorithm like K-means clustering +increases our confidence that our data can be partitioned into groups. 
To carry out K-means clustering, we first pick $k$ initial points as centres or

From 10428c81c7e27764da4fe297657377219065f656 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:30:03 +0000
Subject: [PATCH 04/11] remove final sentence from intro to method, task 7

unclear what a specified clustering algorithm is and how this increases our
confidence that data can be partitioned into groups at this stage
---
 episodes/06-k-means.Rmd | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index b0356dcb..af767ad5 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -73,9 +73,6 @@ within-cluster variation adn thus the amount that
data points within a cluster differ from each other.
The distance between data points within a cluster is
used as a measure of within-cluster variation.
-Using a specified clustering algorithm like K-means clustering
-increases our confidence
-that our data can be partitioned into groups.

To carry out K-means clustering, we first pick $k$ initial points as centres or
"centroids" of our clusters. There are a few ways to choose these initial "centroids",

From d5bfa40cc83a27ca65e2b055ace671bb1c8a6e0e Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:33:56 +0000
Subject: [PATCH 05/11] remove mention of random initialisation in the method
 and clarify what convergence looks like, tasks 8 and 9

Picking initial points randomly here may be misleading for someone just
looking up the method from this section. Have simply omitted and said that
this is discussed below.
Also, have removed the word convergence in favour of a description of what
convergence looks like
---
 episodes/06-k-means.Rmd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index af767ad5..d815fd0a 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -75,9 +75,9 @@ The distance between data points within a cluster is
used as a measure of within-cluster variation.

To carry out K-means clustering, we first pick $k$ initial points as centres or
-"centroids" of our clusters. There are a few ways to choose these initial "centroids",
-but for simplicity let's imagine we just pick three random co-ordinates.
-We then follow these two steps until convergence:
+"centroids" of our clusters. There are a few ways to choose these initial "centroids"
+and this is discussed below. Once we have picked initial points, we then follow
+these two steps until appropriate clusters have been formed:

1. Assign each data point to the cluster with the closest centroid
2. Update centroid positions as the average of the points in that cluster

From c9f0fbcd38f7b30a382de734956db16e01d84dee Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 12:19:55 +0000
Subject: [PATCH 06/11] add here to k means

Co-authored-by: Alan O'Callaghan
---
 episodes/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index 60b7cca7..b53830cf 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed:
We can see this process in action in this animation:

```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"}
-knitr::include_graphics("../fig/kmeans.gif")
+knitr::include_graphics(here::here("fig/kmeans.gif"))
```
While K-means has some advantages over other clustering methods (easy to implement and
to understand), it does have some disadvantages, namely difficulties in identifying

From 460f348d2c7cc42e4416c52421f9e156c81fe04e Mon Sep 17 00:00:00 2001
From: Alan O'Callaghan
Date: Mon, 18 Mar 2024 20:31:43 +0000
Subject: [PATCH 07/11] Update _episodes_rmd/06-k-means.Rmd

---
 episodes/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index b53830cf..60b7cca7 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed:
We can see this process in action in this animation:

```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"}
-knitr::include_graphics(here::here("fig/kmeans.gif"))
+knitr::include_graphics("../fig/kmeans.gif")
```
While K-means has some advantages over other clustering methods (easy to implement and
to understand), it does have some disadvantages, namely difficulties in identifying

From f7d19b202725f93d672f7d8e929525606209f48a Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 25 Mar 2024 15:30:37 +0000
Subject: [PATCH 08/11] add full stop

Co-authored-by: Alan O'Callaghan
---
 episodes/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index 60b7cca7..390f68c9 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -80,7 +80,7 @@ and this is discussed below. Once we have picked initial points, we then follow
these two steps until appropriate clusters have been formed:

1. Assign each data point to the cluster with the closest centroid
-2. Update centroid positions as the average of the points in that cluster
+2. Update centroid positions as the average of the points in that cluster.

We can see this process in action in this animation:

From b3688a5f8b615ea5be35600ab99953ad81be5ade Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 25 Mar 2024 15:31:40 +0000
Subject: [PATCH 09/11] typo fix adn

---
 episodes/06-k-means.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd
index 390f68c9..2fd2607f 100644
--- a/episodes/06-k-means.Rmd
+++ b/episodes/06-k-means.Rmd
@@ -69,7 +69,7 @@ so that we become more confident about the shape and size of the clusters.
user-defined number of distinct, non-overlapping clusters.
To create clusters of 'similar' data points, K-means clustering creates clusters that minimise the -within-cluster variation adn thus the amount that +within-cluster variation and thus the amount that data points within a cluster differ from each other. The distance between data points within a cluster is used as a measure of within-cluster variation. From 48cf286f282703f48d94dd0318cf1690f64e2457 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:32:32 +0000 Subject: [PATCH 10/11] namely to particularly --- episodes/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd index 2fd2607f..071c5e7e 100644 --- a/episodes/06-k-means.Rmd +++ b/episodes/06-k-means.Rmd @@ -88,7 +88,7 @@ We can see this process in action in this animation: knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and -to understand), it does have some disadvantages, namely difficulties in identifying +to understand), it does have some disadvantages, particularly difficulties in identifying initial clusters which observations belong to and the need for the user to specifiy the number of clusters that the data should be partitioned into. From d700befc6e3c955b0eed8a8e545c72164671f9a9 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:33:08 +0000 Subject: [PATCH 11/11] specifiy typo fix --- episodes/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/06-k-means.Rmd b/episodes/06-k-means.Rmd index 071c5e7e..1ea085e5 100644 --- a/episodes/06-k-means.Rmd +++ b/episodes/06-k-means.Rmd @@ -89,7 +89,7 @@ knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, particularly difficulties in identifying -initial clusters which observations belong to and the need for the user to specifiy the +initial clusters which observations belong to and the need for the user to specify the number of clusters that the data should be partitioned into. > ## Initialisation
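For reference, the two-step procedure described in these patches (pick $k$ initial centroids, then alternate the assignment and update steps) can be written out directly in R. The sketch below is illustrative rather than lesson code: it assumes Euclidean distance and a fixed number of iterations in place of a formal convergence check, and the names `simple_kmeans`, `centroids` and `assignment` are invented for this example.

```r
# A minimal sketch of the two-step K-means iteration (illustrative only)
simple_kmeans <- function(x, k, iterations = 10) {
  x <- as.matrix(x)
  # Pick k distinct data points to act as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iterations)) {
    # Step 1: assign each data point to the cluster with the closest centroid
    dists <- as.matrix(dist(rbind(centroids, x)))[-seq_len(k), seq_len(k)]
    assignment <- apply(dists, 1, which.min)
    # Step 2: update each centroid as the average of the points in its cluster
    # (for simplicity, this sketch does not guard against empty clusters)
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}
```

Because the result depends on where the centroids start, different initial points can produce different final clusterings, which is what the `Initialisation` callout above addresses.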
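In practice the iteration is rarely hand-coded: R's built-in `kmeans()` function (in the `stats` package) provides it. Its `algorithm = "Lloyd"` option corresponds to the plain assign-and-update procedure, the default Hartigan-Wong method adds refinements, and `nstart` re-runs the algorithm from several random initialisations and keeps the result with the lowest within-cluster variation. Below is a usage sketch on simulated data; the seed, `centers = 3` and `nstart = 10` are arbitrary illustrative choices, not values from the lesson.

```r
# Illustrative use of the built-in implementation on simulated data
set.seed(42)
x <- matrix(rnorm(200 * 2), ncol = 2)

fit <- kmeans(x, centers = 3, nstart = 10, iter.max = 100, algorithm = "Lloyd")

table(fit$cluster)  # how many observations were assigned to each cluster
fit$centers         # final centroid positions
fit$tot.withinss    # total within-cluster variation, which K-means minimises
```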