From c55220044e2f1fb6b87511a7909b3e55b1162179 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:18:24 +0000 Subject: [PATCH 001/119] remove "imagine", task 1 more consistent with high-dim data definition at the start also --- _episodes_rmd/04-principal-component-analysis.Rmd | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index a1d54d0c..26353334 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -46,12 +46,11 @@ knitr_fig_path("05-") # Introduction -Imagine a dataset which contains many variables ($p$), close to the total number -of rows in the dataset ($n$). Some of these variables are highly correlated and -several form groups which you might expect to represent the same overall effect. -Such datasets are challenging to analyse for several reasons, with the main -problem being how to reduce dimensionality in the dataset while retaining the -important features. +If a dataset contains many variables ($p$), it is likely that some of these +variables will be highly correlated. Variables may even be so highly correlated +that they represent the same overall effect. Such datasets are challenging +to analyse for several reasons, with the main problem being how to reduce +dimensionality in the dataset while retaining the important features. In this episode we will explore *principal component analysis* (PCA) as a popular method of analysing high-dimensional data. PCA is an unsupervised From 6c8b768f09f006ca6dfebc8954298e1a199ebfb7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:30:37 +0000 Subject: [PATCH 002/119] rewording end of introduction, task 2 --- _episodes_rmd/04-principal-component-analysis.Rmd | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 26353334..5493956a 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -56,13 +56,13 @@ In this episode we will explore *principal component analysis* (PCA) as a popular method of analysing high-dimensional data. PCA is an unsupervised statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that -explain most of the variability in the original dataset. This is useful, -for example, during initial data exploration as it allows correlations among -data points to be observed and principal components to be calculated for -inclusion in further analysis (e.g. linear regression). An example of PCA might -be reducing several variables representing aspects of patient health -(blood pressure, heart rate, respiratory rate) into a single feature. - +explain most of the variability in the original dataset. As an example, +PCA might reduce several variables representing aspects of patient health +(blood pressure, heart rate, respiratory rate) into a single feature capturing +an overarching "patient health" effect. This is useful from an exploratory point +of view, discovering how variables might be associated and combined, but the +associated principal component can also be used as an effect in further analysis +(e.g. linear regression). 
From 87e86664b00b04dcf1197e8da2ff02af2e553001 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:31:36 +0000 Subject: [PATCH 003/119] minor edits to end of introduction, task 2 --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 5493956a..7eee5cdb 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -60,8 +60,8 @@ explain most of the variability in the original dataset. As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect. This is useful from an exploratory point -of view, discovering how variables might be associated and combined, but the -associated principal component can also be used as an effect in further analysis +of view, discovering how variables might be associated and combined. The the +associated principal component could also be used as an effect in further analysis (e.g. linear regression). From e0e1118b513b2f497de97a7f1ec8fc540136cba1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:36:56 +0000 Subject: [PATCH 004/119] remove all reference to supervised learning It is referenced in the callout after its first and only mention in the introduction (and not referenced in any other episodes as far as I can tell). I think it's probably not necessary and maybe even distracting from a cognitive overload point of view. If we want to integrate this with some ML jargon, I would suggest it's included much earlier in the episodes and the terminology is used throughout. --- .../04-principal-component-analysis.Rmd | 24 +------------------ 1 file changed, 1 insertion(+), 23 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 7eee5cdb..fcb9e107 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -53,7 +53,7 @@ to analyse for several reasons, with the main problem being how to reduce dimensionality in the dataset while retaining the important features. In this episode we will explore *principal component analysis* (PCA) as a -popular method of analysing high-dimensional data. PCA is an unsupervised +popular method of analysing high-dimensional data. PCA is a statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that explain most of the variability in the original dataset. As an example, @@ -90,28 +90,6 @@ Disadvantages: regression). -> ## Supervised vs unsupervised learning -> Most statistical problems fall into one of two categories: supervised or -> unsupervised learning. -> Examples of supervised learning problems include linear regression and include -> analyses in which each observation has both at least one independent variable -> ($x$) as well as a dependent variable ($y$). In supervised learning problems -> the aim is to predict the value of the response given future observations or -> to understand the relationship between the dependent variable and the -> predictors. In unsupervised learning for each observation there is no -> dependent variable ($y$), but only -> a series of independent variables. 
In this situation there is no need for -> prediction, as there is no dependent variable to predict (hence the analysis -> can be thought as being unsupervised by the dependent variable). Instead -> statistical analysis can be used to understand relationships between the -> independent variables or between observations themselves. Unsupervised -> learning problems often occur when analysing high-dimensional datasets in -> which there is no obvious dependent variable to be -> predicted, but the analyst would like to understand more about patterns -> between groups of observations or reduce dimensionality so that a supervised -> learning process may be used. -{: .callout} - > ## Challenge 1 > From 8515fc7a42552d2da72e0237f8199c13e2a3046c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:39:30 +0000 Subject: [PATCH 005/119] move advantages and disadvantages to after description of PCA, task 3 propose that this is difficult to understand without first really understanding what PCA is --- .../04-principal-component-analysis.Rmd | 50 +++++++++---------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index fcb9e107..33acc5f2 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -66,31 +66,6 @@ associated principal component could also be used as an effect in further analys -# Advantages and disadvantages of PCA - -Advantages: -* It is a relatively easy to use and popular method. -* Various software/packages are available to run a PCA. -* The calculations used in a PCA are easy to understand for statisticians and - non-statisticians alike. - -Disadvantages: -* It assumes that variables in a dataset are correlated. -* It is sensitive to the scale at which input variables are measured. - If input variables are measured at different scales, the variables - with large variance relative to the scale of measurement will have - greater impact on the principal components relative to variables with smaller - variance. In many cases, this is not desirable. -* It is not robust against outliers, meaning that very large or small data - points can have a large effect on the output of the PCA. -* PCA assumes a linear relationship between variables which is not always a - realistic assumption. -* It can be difficult to interpret the meaning of the principal components, - especially when including them in further analyses (e.g. inclusion in a linear - regression). - - - > ## Challenge 1 > > Descriptions of three datasets and research questions are given below. For @@ -439,6 +414,31 @@ depending on the PCA implementation you use. Here are some examples: +# Advantages and disadvantages of PCA + +Advantages: +* It is a relatively easy to use and popular method. +* Various software/packages are available to run a PCA. +* The calculations used in a PCA are easy to understand for statisticians and + non-statisticians alike. + +Disadvantages: +* It assumes that variables in a dataset are correlated. +* It is sensitive to the scale at which input variables are measured. + If input variables are measured at different scales, the variables + with large variance relative to the scale of measurement will have + greater impact on the principal components relative to variables with smaller + variance. In many cases, this is not desirable. 
+* It is not robust against outliers, meaning that very large or small data + points can have a large effect on the output of the PCA. +* PCA assumes a linear relationship between variables which is not always a + realistic assumption. +* It can be difficult to interpret the meaning of the principal components, + especially when including them in further analyses (e.g. inclusion in a linear + regression). + + + # Using PCA to analyse gene expression data In this section you will carry out your own PCA using the Bioconductor package **`PCAtools`** From 7ecc5f2f3011d49bd690dd70144fa0048a455e4d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:06:46 +0000 Subject: [PATCH 006/119] edit advantages and disadvantages, task 4 make PCA being easy statement relative, rather than saying it's generally easy for everyone --- _episodes_rmd/04-principal-component-analysis.Rmd | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 109236f3..37aee14d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -417,8 +417,7 @@ depending on the PCA implementation you use. Here are some examples: Advantages: * It is a relatively easy to use and popular method. * Various software/packages are available to run a PCA. -* The calculations used in a PCA are easy to understand for statisticians and - non-statisticians alike. +* The calculations used in a PCA are simple to understand compared to other methods for dimension reduction. Disadvantages: * It assumes that variables in a dataset are correlated. From 576d05636d1684faea4831172b72dcda951426eb Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:08:38 +0000 Subject: [PATCH 007/119] edit PCA section title I think this should be called PCA for signposting that this presents the whole method essentially --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 37aee14d..0eda584e 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -105,7 +105,7 @@ associated principal component could also be used as an effect in further analys {: .challenge} -# What is a principal component? 
+# Principal component analysis ```{r, eval=FALSE, echo=FALSE} From 9db44fa3f41ab25f6323623d1dede1a152f34f09 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:09:01 +0000 Subject: [PATCH 008/119] complete task 5 --- _episodes_rmd/04-principal-component-analysis.Rmd | 1 - 1 file changed, 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 0eda584e..aed0ef20 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -107,7 +107,6 @@ associated principal component could also be used as an effect in further analys # Principal component analysis - ```{r, eval=FALSE, echo=FALSE} # A PCA is carried out by calculating a matrix of Pearson's correlations from # the original dataset which shows how each of the variables in the dataset From ccd9762946d3905110f5a39a5544bd86fe75e7a9 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:17:16 +0000 Subject: [PATCH 009/119] describe what a principal component is first, task 5 --- _episodes_rmd/04-principal-component-analysis.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index aed0ef20..6b1bb2e0 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,6 +113,7 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` +PCA transforms data to new uncorrelated variables called "principal components". The first principal component is the direction of the data along which the observations vary the most. The second principal component is the direction of the data along which the observations show the next highest amount of variation. From 930cd0b1b487855bc9b748b36133d07ae51b4ce1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:43:13 +0000 Subject: [PATCH 010/119] reorder, rewrite description of pca and remove repetition --- .../04-principal-component-analysis.Rmd | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 6b1bb2e0..438d7a6e 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -114,10 +114,16 @@ associated principal component could also be used as an effect in further analys ``` PCA transforms data to new uncorrelated variables called "principal components". -The first principal component is the direction of the data along which the -observations vary the most. The second principal component is the direction of -the data along which the observations show the next highest amount of variation. -For example, Figure 1 shows biodiversity index versus percentage area left +Each principal component is a linear combination of the variables in the data +set. The first principal component is the direction of the data along which the +observations vary the most. In other words, the first principal component +explains the largest amount of the variability in the underlying dataset. +The second principal component is the direction of +the data along which the observations show the next highest amount of variation +(and explains the second largest amount of variability in the dataset). 
+ + +Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which there is greatest variability in the data. Projecting points onto this line @@ -126,11 +132,6 @@ vector of points with the greatest possible variance. The next highest amount of variability in the data is represented by the line perpendicular to first regression line which represents the second principal component (green line). -The second principal component is a linear combination of the variables that -is uncorrelated with the first principal component. There are as many principal -components as there are variables in your dataset, but as we'll see, some are -more useful at explaining your data than others. By definition, the first -principal component explains more variation than other principal components. ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) From 84eacc8e1ca0204af3bcce1b5416b9cd775dcf15 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:49:12 +0000 Subject: [PATCH 011/119] avoid talking about projections This may need further editing as I'm not sure it's easy to understand yet --- _episodes_rmd/04-principal-component-analysis.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 438d7a6e..dc4fdf0d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -126,11 +126,11 @@ the data along which the observations show the next highest amount of variation Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which -there is greatest variability in the data. Projecting points onto this line -(i.e. by finding the location on the line closest to the point) would give a -vector of points with the greatest possible variance. The next highest amount +there is greatest variability in the data. Finding the location on the line +closest to a given data point would yield a vector of points with the +greatest possible variance. The next highest amount of variability in the data is represented by the line perpendicular to first -regression line which represents the second principal component (green line). +regression line, which represents the uncorrelated second principal component (green line). ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} From 4385f8a292f7abbf5656ae9908d2340bb42cfe35 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:59:13 +0000 Subject: [PATCH 012/119] mathematical description sooner and link to initial description, task 6 --- .../04-principal-component-analysis.Rmd | 24 ++++++++++++------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index dc4fdf0d..3c246bff 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -120,8 +120,20 @@ observations vary the most. 
In other words, the first principal component explains the largest amount of the variability in the underlying dataset. The second principal component is the direction of the data along which the observations show the next highest amount of variation -(and explains the second largest amount of variability in the dataset). +(and explains the second largest amount of variability in the dataset), and so on. +Mathematically, the first principal component values or _scores_, $Z_1$, are a linear combination +of variables in the dataset, $X_1...X_p$: + +$$ + Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, +$$ + +where $a_{11}...a_{p1}$ represent principal component _loadings_, +which can be thought of as the degree to +which each variable contributes to the calculation of the principal component. +The values of $a_{11}...a_{p1}$ are found so that the principal component (scores), $Z_1$, +explain most of the variation in the dataset. Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first @@ -129,7 +141,7 @@ principal component direction of the data, which is the direction along which there is greatest variability in the data. Finding the location on the line closest to a given data point would yield a vector of points with the greatest possible variance. The next highest amount -of variability in the data is represented by the line perpendicular to first +of variability in the data is represented by the line perpendicular to the first regression line, which represents the uncorrelated second principal component (green line). @@ -153,15 +165,9 @@ knitr::include_graphics("../fig/pendulum.gif") ``` -The first principal component's scores ($Z_1$) are calculated using the equation: -$$ - Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p -$$ -$X_1...X_p$ represents variables in the original dataset and $a_{11}...a_{p1}$ -represent principal component loadings, which can be thought of as the degree to -which each variable contributes to the calculation of the principal component. + We will come back to principal component scores and loadings further below. # How do we perform a PCA? From 5709dec112677046f2ca282d39a59ccc26582643 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:10:26 +0000 Subject: [PATCH 013/119] simplify description of pca --- .../04-principal-component-analysis.Rmd | 21 ++++++++----------- 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 3c246bff..aa6ba6cd 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -131,18 +131,17 @@ $$ where $a_{11}...a_{p1}$ represent principal component _loadings_, which can be thought of as the degree to -which each variable contributes to the calculation of the principal component. -The values of $a_{11}...a_{p1}$ are found so that the principal component (scores), $Z_1$, -explain most of the variation in the dataset. +which each variable contributes to the calculation of the principal component scores. +The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, +explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. 
+To see what these new principal component variables may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first -principal component direction of the data, which is the direction along which -there is greatest variability in the data. Finding the location on the line -closest to a given data point would yield a vector of points with the -greatest possible variance. The next highest amount -of variability in the data is represented by the line perpendicular to the first -regression line, which represents the uncorrelated second principal component (green line). +principal component scores, which pass through the points with the greatest +variability. The points along this line give the first principal component scores. +The second principal component scores explain the next highest amount of variability +in the data and are represented by the line perpendicular to the first (green line). ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} @@ -150,7 +149,7 @@ regression line, which represents the uncorrelated second principal component (g knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") ``` -The animation below illustrates how principal components are calculated from +The animation below illustrates how principal components are calculated iteratively from data. You can imagine that the black line is a rod and each red dashed line is a spring. The energy of each spring is proportional to its squared length. The direction of the first principal component is the one that minimises the total @@ -166,8 +165,6 @@ knitr::include_graphics("../fig/pendulum.gif") - - We will come back to principal component scores and loadings further below. # How do we perform a PCA? From baee36e30656fc2b32a26daa2e2cf64810497fc1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:14:25 +0000 Subject: [PATCH 014/119] use "here" for file paths --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index aa6ba6cd..07749a02 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -146,7 +146,7 @@ in the data and are represented by the line perpendicular to the first (green li ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) -knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") +knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png")) ``` The animation below illustrates how principal components are calculated iteratively from @@ -160,7 +160,7 @@ principal component. This is explained in more detail on [this Q&A website](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues). 
```{r pendulum, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics("../fig/pendulum.gif") +knitr::include_graphics(here("fig/pendulum.gif")) ``` @@ -470,7 +470,7 @@ associated metadata, downloaded from the ```{r se} library("SummarizedExperiment") -cancer <- readRDS(here::here("data/cancer_expression.rds")) +cancer <- readRDS(here("data/cancer_expression.rds") mat <- assay(cancer) metadata <- colData(cancer) ``` From 6f077ce512825ed77f6f2718b0878210529a325c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:15:05 +0000 Subject: [PATCH 015/119] add close bracket for file paths --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 07749a02..627f914d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -470,7 +470,7 @@ associated metadata, downloaded from the ```{r se} library("SummarizedExperiment") -cancer <- readRDS(here("data/cancer_expression.rds") +cancer <- readRDS(here("data/cancer_expression.rds")) mat <- assay(cancer) metadata <- colData(cancer) ``` From a31c0047341119486bd7a4244735e3531296bea7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:22:29 +0000 Subject: [PATCH 016/119] remove all mention of directions in pca description I think talk of projections/directions is slightly confusing and possibly unnecessary here --- .../04-principal-component-analysis.Rmd | 24 ++++++++----------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 627f914d..e1d183b1 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,25 +113,21 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms data to new uncorrelated variables called "principal components". -Each principal component is a linear combination of the variables in the data -set. The first principal component is the direction of the data along which the -observations vary the most. In other words, the first principal component -explains the largest amount of the variability in the underlying dataset. -The second principal component is the direction of -the data along which the observations show the next highest amount of variation -(and explains the second largest amount of variability in the dataset), and so on. - -Mathematically, the first principal component values or _scores_, $Z_1$, are a linear combination -of variables in the dataset, $X_1...X_p$: +PCA transforms a dataset into a new set of uncorrelated variables called "principal components". +The first principal component is derived to explain the largest amount of the variability +in the underlying dataset. The second principal component is derived to explain the second largest amount of variability in the dataset, and so on. + +Mathematically, each principal component is a linear combination of the variables in the data +set. 
That is, the first principal component values or _scores_, $Z_1$, are a linear combination +of variables in the dataset, $X_1...X_p$, given by $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -where $a_{11}...a_{p1}$ represent principal component _loadings_, -which can be thought of as the degree to -which each variable contributes to the calculation of the principal component scores. +where $a_{11}...a_{p1}$ represent principal component _loadings_. These loadings can +be thought of as the degree to which each original variable contributes to +the calculation of the principal component scores. The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. From 375f92bae3196afcfb6bff0704fa76e1d9a14c87 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:26:10 +0000 Subject: [PATCH 017/119] add foreshadowing --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index e1d183b1..c498ff23 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -159,9 +159,7 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics(here("fig/pendulum.gif")) ``` - - -We will come back to principal component scores and loadings further below. +In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. # How do we perform a PCA? From 24101a439c246392833c685082d9e9b37b39bd31 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:27:02 +0000 Subject: [PATCH 018/119] remove "The the" --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c498ff23..60a2a520 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -60,7 +60,7 @@ explain most of the variability in the original dataset. As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect. This is useful from an exploratory point -of view, discovering how variables might be associated and combined. The the +of view, discovering how variables might be associated and combined. The associated principal component could also be used as an effect in further analysis (e.g. linear regression). 
From 0fad8019abc03b1062545ceb60a6a2fc1233a7df Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:37:57 +0000 Subject: [PATCH 019/119] separate mathematical description I don't think it's necessary if focusing on practical description (not used later as far as I can see) --- .../04-principal-component-analysis.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 60a2a520..c65c25b7 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,10 +113,10 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms a dataset into a new set of uncorrelated variables called "principal components". -The first principal component is derived to explain the largest amount of the variability -in the underlying dataset. The second principal component is derived to explain the second largest amount of variability in the dataset, and so on. +PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. +callout Mathematically, each principal component is a linear combination of the variables in the data set. That is, the first principal component values or _scores_, $Z_1$, are a linear combination of variables in the dataset, $X_1...X_p$, given by @@ -125,13 +125,13 @@ $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -where $a_{11}...a_{p1}$ represent principal component _loadings_. These loadings can +where $a_{11}...a_{p1}$ represent principal component _loadings_. + +In summary, the principal components values are called _scores_. The loadings can be thought of as the degree to which each original variable contributes to -the calculation of the principal component scores. -The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, -explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. +the principal component scores. -To see what these new principal component variables may look like, +To see what these new principal component variables (scores) may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. 
The red line represents the first principal component scores, which pass through the points with the greatest From 7babb16f4749c244221cf1c8f464ac7b43615a5a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:41:23 +0000 Subject: [PATCH 020/119] change mathematical description to callout --- .../04-principal-component-analysis.Rmd | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c65c25b7..5f3d2ba9 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -116,16 +116,14 @@ associated principal component could also be used as an effect in further analys PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. -callout -Mathematically, each principal component is a linear combination of the variables in the data -set. That is, the first principal component values or _scores_, $Z_1$, are a linear combination -of variables in the dataset, $X_1...X_p$, given by - -$$ - Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, -$$ - -where $a_{11}...a_{p1}$ represent principal component _loadings_. +> ## Mathematical description of PCA +> Mathematically, each principal component is a linear combination +> of the variables in the dataset. That is, the first principal +> component values or _scores_, $Z_1$, are a linear combination +> of variables in the dataset, $X_1...X_p$, given by +> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ +> where $a_{11}...a_{p1}$ represent principal component _loadings_. +{: .callout} In summary, the principal components values are called _scores_. The loadings can be thought of as the degree to which each original variable contributes to From 10907425f411d978c6e6fbf12713e9b1b1bae221 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:44:50 +0000 Subject: [PATCH 021/119] remove comma from description of PCA --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 5f3d2ba9..d16199a7 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -114,7 +114,7 @@ associated principal component could also be used as an effect in further analys ``` PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability -in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. 
Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. > ## Mathematical description of PCA > Mathematically, each principal component is a linear combination From 29d5c0bc7f4d0b2f3de34f220a13597b8d21b108 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:47:46 +0000 Subject: [PATCH 022/119] add justification for low dimensional dataset --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index d16199a7..c8c8cefa 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,9 +161,9 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -## A prostate cancer dataset +## Prostate cancer dataset -The `prostate` dataset represents data from 97 +To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. From c21187dd4d0704464baa2a1418e7ff7393e95b2d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:49:09 +0000 Subject: [PATCH 023/119] change title to Loadings and principal component scores already explained what they are and this section doesn't really explain what they are --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c8c8cefa..00059b69 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -326,7 +326,7 @@ explain >70% of variance in the data. But what do these two principal components mean? -## What are loadings and principal component scores? +## Loadings and principal component scores Most PCA functions will produce two main output matrices: the *principal component scores* and the *loadings*. 
The matrix of principal component scores From 0614955d3112778f2f81cd3f00a6eb30548d866b Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:51:00 +0000 Subject: [PATCH 024/119] remove prostate data set title if using this as a first example, task 7 and 8 --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 -- 1 file changed, 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 00059b69..bf127831 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,8 +161,6 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -## Prostate cancer dataset - To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of From 443adb6d3228791d62439350b42922a682c0a83e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:53:13 +0000 Subject: [PATCH 025/119] move information about continuous variables to early text, task 9 --- _episodes_rmd/04-principal-component-analysis.Rmd | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index bf127831..f0daa642 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,7 +113,7 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability +PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. > ## Mathematical description of PCA @@ -183,8 +183,7 @@ Here we will calculate principal component scores for each of the rows in this dataset, using five principal components (one for each variable included in the PCA). We will include five clinical variables in our PCA, each of the continuous variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. Standard PCAs are carried -out using continuous variables only. +representing clinical markers of cancer progression. 
First, we will examine the `prostate` dataset (originally part of the **`lasso2`** package): From fc6771df60b895a76517bef5295eeea95b1dee19 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:56:55 +0000 Subject: [PATCH 026/119] simplify example motivation --- _episodes_rmd/04-principal-component-analysis.Rmd | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index f0daa642..26bd4f85 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -179,11 +179,8 @@ Columns include: - `lpsa` (log-tranformed prostate specific antigen; level of PSA in blood). - `age` (patient age in years). -Here we will calculate principal component scores for each of the rows in this -dataset, using five principal components (one for each variable included in the -PCA). We will include five clinical variables in our PCA, each of the continuous -variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. +We will perform PCA on the five continuous clinical variables in our dataset +so that we can create fewer variables representing clinical markers of cancer progression. First, we will examine the `prostate` dataset (originally part of the **`lasso2`** package): From 7e378baf60b5bdb9ddf827818c5c4c2db18afb3d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:15:22 +0000 Subject: [PATCH 027/119] add reason for standardisation in this section, task 10 --- _episodes_rmd/04-principal-component-analysis.Rmd | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 26bd4f85..bbce63cf 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,7 +161,7 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 +To illustrate how to perform PCA initially, we start with a low-dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. @@ -205,7 +205,9 @@ head(pros2) ## Do we need to standardise the data? -Now we compare the variances between variables in the dataset. +Since PCA derives principal components based on the variance they explain in the data, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. + +For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. 
Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: ```{r var-hist, fig.cap="Caption", fig.cap="Alt"} apply(pros2, 2, var) @@ -216,8 +218,8 @@ hist(pros2$lbph, breaks = "FD") Note that variance is greatest for `lbph` and lowest for `lweight`. It is clear from this output that we need to scale each of these variables before including -them in a PCA analysis to ensure that differences in variances between variables -do not drive the calculation of principal components. In this example we +them in a PCA analysis to ensure that differences in variances +do not drive the calculation of principal components. In this example, we standardise all five variables to have a mean of 0 and a standard deviation of 1. From d02fde36a679aefe525e3e2bf975619015cad5b0 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:17:54 +0000 Subject: [PATCH 028/119] clarify why we have concluded that we need to scale, task 11 --- _episodes_rmd/04-principal-component-analysis.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index bbce63cf..c91787e4 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -205,7 +205,7 @@ head(pros2) ## Do we need to standardise the data? -Since PCA derives principal components based on the variance they explain in the data, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. +PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: @@ -216,9 +216,9 @@ hist(pros2$lweight, breaks = "FD") hist(pros2$lbph, breaks = "FD") ``` -Note that variance is greatest for `lbph` and lowest for `lweight`. It is clear -from this output that we need to scale each of these variables before including -them in a PCA analysis to ensure that differences in variances +Note that variance is greatest for `lbph` and lowest for `lweight`. Since we +want each of the variables to be treated equally in our PCA, but there are large differences in the variances of the variables, we need to scale each of the variables before including +them in a PCA to ensure that differences in variances do not drive the calculation of principal components. In this example, we standardise all five variables to have a mean of 0 and a standard deviation of 1. 
From 2ff628c371717d0be60f210b11d77556f490db75 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:20:38 +0000 Subject: [PATCH 029/119] standardise -> scale for consistency --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c91787e4..b6215dd1 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -203,11 +203,11 @@ pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) ``` -## Do we need to standardise the data? +## Do we need to scale the data? -PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. +PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA. -For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: +For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first: ```{r var-hist, fig.cap="Caption", fig.cap="Alt"} apply(pros2, 2, var) From 5ba37e784272bfde9ba7db94441f515ae0de6cc4 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:23:01 +0000 Subject: [PATCH 030/119] swap back mathematical description --- .../04-principal-component-analysis.Rmd | 32 +++++++++---------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index b6215dd1..c1ea5855 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -116,25 +116,12 @@ associated principal component could also be used as an effect in further analys PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. 
Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. -> ## Mathematical description of PCA -> Mathematically, each principal component is a linear combination -> of the variables in the dataset. That is, the first principal -> component values or _scores_, $Z_1$, are a linear combination -> of variables in the dataset, $X_1...X_p$, given by -> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -> where $a_{11}...a_{p1}$ represent principal component _loadings_. -{: .callout} - -In summary, the principal components values are called _scores_. The loadings can -be thought of as the degree to which each original variable contributes to -the principal component scores. - -To see what these new principal component variables (scores) may look like, +To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component scores, which pass through the points with the greatest -variability. The points along this line give the first principal component scores. -The second principal component scores explain the next highest amount of variability +variability. The points along this line give the first principal component. +The second principal component explains the next highest amount of variability in the data and are represented by the line perpendicular to the first (green line). @@ -157,7 +144,18 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics(here("fig/pendulum.gif")) ``` -In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. +> ## Mathematical description of PCA +> Mathematically, each principal component is a linear combination +> of the variables in the dataset. That is, the first principal +> component values or _scores_, $Z_1$, are a linear combination +> of variables in the dataset, $X_1...X_p$, given by +> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ +> where $a_{11}...a_{p1}$ represent principal component _loadings_. +{: .callout} + +In summary, the principal components values are called _scores_. The loadings can +be thought of as the degree to which each original variable contributes to +the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. # How do we perform a PCA? From 4ac16c07b32280009c14331fb097177292d9fd5e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:26:46 +0000 Subject: [PATCH 031/119] explain center=TRUE, task 14 scale=TRUE doesn't change the mean I think? --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c1ea5855..393266ae 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -205,7 +205,7 @@ head(pros2) PCA derives principal components based on the variance they explain in the data. 
Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA. -For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first: +In this example, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first: ```{r var-hist, fig.cap="Caption", fig.cap="Alt"} apply(pros2, 2, var) @@ -257,8 +257,8 @@ deviation of 1. {: .challenge} Next we will carry out a PCA using the `prcomp()` function in base R. The input -data (`pros2`) is in the form of a matrix. Note that the `scale = TRUE` argument -is used to standardise the variables to have a mean 0 and standard deviation of +data (`pros2`) is in the form of a matrix. Note that the `center = TRUE` and `scale = TRUE` arguments +are used to standardise the variables to have a mean 0 and standard deviation of 1. ```{r prcomp} From e6c0f76114ff7af2f741036dfd95376339ca608d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:27:57 +0000 Subject: [PATCH 032/119] task 15 --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 393266ae..de826127 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -272,7 +272,7 @@ We have calculated one principal component for each variable in the original dataset. How do we choose how many of these are necessary to represent the true variation in the data, without having extra components that are unnecessary? -Let's look at the relative importance of each component using `summary`. +Let's look at the relative importance of (variance explained by) each component using `summary`. ```{r summ} summary(pca.pros) From 37071bcbb48af7d35392241f037e63d05f15930c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:39:01 +0000 Subject: [PATCH 033/119] rewording to avoid repeating "also called", task 16 --- _episodes_rmd/04-principal-component-analysis.Rmd | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index de826127..88d89fa4 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -290,10 +290,7 @@ This returns the proportion of variance in the data explained by each of the PC3 a further `r prop.var[[3]]`%, PC4 approximately `r prop.var[[4]]`% and PC5 around `r prop.var[[5]]`%. -Let us visualise this. A plot of the amount of variance accounted for by each PC -is also called a scree plot. 
Note that the amount of variance accounted for by a principal
-component is also called eigenvalue and thus the y-axis in scree plots if often
-labelled “eigenvalue”.
+Let us visualise this. A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by "eigenvalues". Thus, the y-axis in scree plots is often labelled "eigenvalue".
 
 Often, scree plots show a characteristic pattern where initially, the variance drops
 rapidly with each additional principal component. But then there is an “elbow” after which the

From 21f4858d62f3df5f7a0ad71b5ffb4d5bb889079e Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:20:57 +0000
Subject: [PATCH 034/119] rewrite introduction, tasks 1-3

mainly to motivate by clarifying differences compared to pca and fa since
these are already discussed
---
 _episodes_rmd/06-k-means.Rmd | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 0d34c971..5a8ef56d 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -32,31 +32,35 @@ knitr_fig_path("08-")
 
 # Introduction
 
-High-dimensional data, especially in biological settings, has
-many sources of heterogeneity. Some of these are stochastic variation
-arising from measurement error or random differences between organisms.
-In some cases, a known grouping causes this heterogeneity (sex, treatment
-groups, etc). In other cases, this heterogeneity arises from the presence of
-unknown subgroups in the data. **Clustering** is a set of techniques that allows
-us to discover unknown groupings like this, which we can often use to
-discover the nature of the heterogeneity we're investigating.
-
-**Cluster analysis** involves finding groups of observations that are more
-similar to each other (according to some feature) than they are to observations
-in other groups. Cluster analysis is a useful statistical tool for exploring
-high-dimensional datasets as
-visualising data with large numbers of features is difficult. It is commonly
-used in fields such as bioinformatics, genomics, and image processing in which
-large datasets that include many features are often produced. Once groups
-(or clusters) of observations have been identified using cluster analysis,
-further analyses or interpretation can be carried out on the groups, for
-example, using metadata to further explore groups.
+As we saw in previous episodes, visualising high-dimensional
+data with a large number of features is difficult and can
+limit our understanding of the data and associated processes.
+In some cases, a known grouping causes this heterogeneity
+(sex, treatment groups, etc). In other cases, heterogeneity
+may arise from the presence of unknown subgroups in the data.
+While PCA can be used to reduce the dimension of the dataset
+into a smaller set of uncorrelated variables and factor analysis
+can be used to identify underlying factors, clustering is a set
+of techniques that allow us to discover unknown groupings.
+
+Cluster analysis involves finding groups of observations that
+are more similar to each other (according to some feature)
+than they are to observations in other groups and are thus
+likely to represent the same source of heterogeneity. 
+Once groups (or clusters) of observations have been identified +using cluster analysis, further analyses or interpretation can be +carried out on the groups, for example, using metadata to further +explore groups. + +Cluster analysis is commonly used to discover unknown groupings +in fields such as bioinformatics, genomics, and image processing, +in which large datasets that include many features are often produced. There are various ways to look for clusters of observations in a dataset using different *clustering algorithms*. One way of clustering data is to minimise distance between observations within a cluster and maximise distance between -proposed clusters. Clusters can be updated in an iterative process so that over -time we can become more confident in size and shape of clusters. +proposed clusters. Using this process, we can also iteratively update clusters +so that we become more confident about the shape and size of the clusters. # Believing in clusters From 89f7d9b715fd4a8bbb1229fcdcf87f2c3e231f4c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Wed, 6 Mar 2024 10:22:47 +0000 Subject: [PATCH 035/119] move believing in clusters to after methodology, task 4 think it's clearer to explain believing in clusters after fully describing what clusters are --- _episodes_rmd/06-k-means.Rmd | 90 ++++++++++++++++++------------------ 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 5a8ef56d..0f53af6f 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -63,51 +63,6 @@ proposed clusters. Using this process, we can also iteratively update clusters so that we become more confident about the shape and size of the clusters. -# Believing in clusters - -When using clustering, it's important to realise that data may seem to -group together even when these groups are created randomly. It's especially -important to remember this when making plots that add extra visual aids to -distinguish clusters. -For example, if we cluster data from a single 2D normal distribution and draw -ellipses around the points, these clusters suddenly become almost visually -convincing. This is a somewhat extreme example, since there is genuinely no -heterogeneity in the data, but it does reflect what can happen if you allow -yourself to read too much into faint signals. - -Let's explore this further using an example. We create two columns of data -('x' and 'y') and partition these data into three groups ('a', 'b', 'c') -according to data values. We then plot these data and their allocated clusters -and put ellipses around the clusters using the `stat_ellipse` function -in `ggplot`. - -```{r fake-cluster, echo = FALSE} -set.seed(11) -library("MASS") -library("ggplot2") -data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2)) -data <- as.data.frame(data) -colnames(data) <- c("x", "y") - -data$cluster <- ifelse( - data$y < (data$x * -0.06 + 0.9), - "a", - ifelse( - data$y < 1.15, - "b", - "c" - ) -) -ggplot(data, aes(x, y, colour = cluster)) + - geom_point() + - stat_ellipse() -``` -The randomly created data used here appear to form three clusters when we -plot the data. Putting ellipses around the clusters can further convince us -that the clusters are 'real'. But how do we tell if clusters identified -visually are 'real'? - - # What is K-means clustering? 
 **K-means clustering** is a clustering method which groups data points into a
@@ -155,6 +110,51 @@ number of clusters that the data should be partitioned into.
 > 
 {: .callout}
 
+
+# Believing in clusters
+
+When using clustering, it's important to realise that data may seem to
+group together even when these groups are created randomly. It's especially
+important to remember this when making plots that add extra visual aids to
+distinguish clusters.
+For example, if we cluster data from a single 2D normal distribution and draw
+ellipses around the points, these clusters suddenly become almost visually
+convincing. This is a somewhat extreme example, since there is genuinely no
+heterogeneity in the data, but it does reflect what can happen if you allow
+yourself to read too much into faint signals.
+
+Let's explore this further using an example. We create two columns of data
+('x' and 'y') and partition these data into three groups ('a', 'b', 'c')
+according to data values. We then plot these data and their allocated clusters
+and put ellipses around the clusters using the `stat_ellipse` function
+in `ggplot`.
+
+```{r fake-cluster, echo = FALSE}
+set.seed(11)
+library("MASS")
+library("ggplot2")
+data <- mvrnorm(n = 200, mu = rep(1, 2), Sigma = matrix(runif(4), ncol = 2))
+data <- as.data.frame(data)
+colnames(data) <- c("x", "y")
+
+data$cluster <- ifelse(
+  data$y < (data$x * -0.06 + 0.9),
+  "a",
+  ifelse(
+    data$y < 1.15,
+    "b",
+    "c"
+  )
+)
+ggplot(data, aes(x, y, colour = cluster)) +
+  geom_point() +
+  stat_ellipse()
+```
+The randomly created data used here appear to form three clusters when we
+plot the data. Putting ellipses around the clusters can further convince us
+that the clusters are 'real'. But how do we tell if clusters identified
+visually are 'real'?
+
 # K-means clustering applied to single-cell RNAseq data
 
 Let's carry out K-means clustering in `R` using some real high-dimensional data.

From c2f865307c63f3b9a96d67f3df0b2867970c3017 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:29:06 +0000
Subject: [PATCH 036/119] rewrite initial description of k means clustering,
 tasks 5 and 6

---
 _episodes_rmd/06-k-means.Rmd | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index 0f53af6f..b0356dcb 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -65,12 +65,16 @@ so that we become more confident about the shape and size of the clusters.
 
 # What is K-means clustering?
 
-**K-means clustering** is a clustering method which groups data points into a
-user-defined number of distinct non-overlapping clusters. In K-means clustering
-we are interested in minimising the *within-cluster variation*. This is the amount that
-data points within a cluster differ from each other. In K-means clustering, the distance
-between data points within a cluster is used as a measure of within-cluster variation.
-Using a specified clustering algorithm like K-means clustering increases our confidence
+**K-means clustering** groups data points into a
+user-defined number of distinct, non-overlapping clusters.
+To create clusters of 'similar' data points, K-means
+clustering creates clusters that minimise the
+within-cluster variation and thus the amount that
+data points within a cluster differ from each other.
+The distance between data points within a cluster is
+used as a measure of within-cluster variation. 
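To make the idea of within-cluster variation concrete, here is a minimal sketch using simulated data and base R's `kmeans()` (the objects `pts` and `fit` are invented for illustration only); the quantity reported by `tot.withinss` is the total within-cluster variation that K-means tries to minimise:

```r
## Sketch: the within-cluster sum of squared distances that K-means minimises.
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))
fit <- kmeans(pts, centers = 2, nstart = 10)
fit$withinss      # within-cluster sum of squares, one value per cluster
fit$tot.withinss  # total within-cluster variation across all clusters
```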
+Using a specified clustering algorithm like K-means clustering
+increases our confidence
 that our data can be partitioned into groups.
 
 To carry out K-means clustering, we first pick $k$ initial points as centres or

From 0506bf18c8bb64b264096465b63c0d5604ed7150 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:30:03 +0000
Subject: [PATCH 037/119] remove final sentence from intro to method, task 7

unclear what a specified clustering algorithm is and how this increases our
confidence that data can be partitioned into groups at this stage
---
 _episodes_rmd/06-k-means.Rmd | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index b0356dcb..af767ad5 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -73,9 +73,6 @@ within-cluster variation and thus the amount that
 data points within a cluster differ from each other.
 The distance between data points within a cluster is
 used as a measure of within-cluster variation. 
-Using a specified clustering algorithm like K-means clustering
-increases our confidence
-that our data can be partitioned into groups.
 
 To carry out K-means clustering, we first pick $k$ initial points as centres or
 "centroids" of our clusters. There are a few ways to choose these initial "centroids",

From 3aa5fbdb1b989076923cc73dd0df348039e6f8d3 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Wed, 6 Mar 2024 10:33:56 +0000
Subject: [PATCH 038/119] remove mention of random initialisation in the method
 and clarify what convergence looks like, tasks 8 and 9

Picking initial points randomly here may be misleading for someone just
looking up the method from this section. Have simply omitted and said that
this is discussed below. Also, have removed the word convergence in favour of
a description of what convergence looks like
---
 _episodes_rmd/06-k-means.Rmd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd
index af767ad5..d815fd0a 100644
--- a/_episodes_rmd/06-k-means.Rmd
+++ b/_episodes_rmd/06-k-means.Rmd
@@ -75,9 +75,9 @@ The distance between data points within a cluster is
 used as a measure of within-cluster variation.
 
 To carry out K-means clustering, we first pick $k$ initial points as centres or
-"centroids" of our clusters. There are a few ways to choose these initial "centroids",
-but for simplicity let's imagine we just pick three random co-ordinates.
-We then follow these two steps until convergence:
+"centroids" of our clusters. There are a few ways to choose these initial "centroids"
+and this is discussed below. Once we have picked initial points, we then follow
+these two steps until appropriate clusters have been formed:
 
 1. Assign each data point to the cluster with the closest centroid
 2. 
Update centroid positions as the average of the points in that cluster From 51d6b2d46b5c1ad25553f80b2c5b6160f34efc0a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:11:53 +0000 Subject: [PATCH 039/119] differentiate between general clustering and hierarchical, task 1 since the original text re hierarchical clustering is true for general clustering algorithms --- _episodes_rmd/07-hierarchical.Rmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 1c39c57f..f64e1b23 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -40,14 +40,15 @@ knitr_fig_path("09-") When analysing high-dimensional data in the life sciences, it is often useful to identify groups of similar data points to understand more about the relationships -within the dataset. In *hierarchical clustering* an algorithm groups similar +within the dataset. General clustering algorithms group similar data points (or observations) into groups (or clusters). This results in a set of clusters, where each cluster is distinct, and the data points within each cluster have similar characteristics. The clustering algorithm works by iteratively grouping data points so that different clusters may exist at different stages of the algorithm's progression. -Unlike K-means clustering, *hierarchical clustering* does not require the +Here, we describe *hierarchical clustering*. Unlike K-means clustering, +hierarchical clustering does not require the number of clusters $k$ to be specified by the user before the analysis is carried out. Hierarchical clustering also provides an attractive *dendrogram*, a tree-like diagram showing the degree of similarity between clusters. From eafad99a2c963c8d370d4f9fb716a1e34bbaf826 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:13:19 +0000 Subject: [PATCH 040/119] move dendogram description to paragraph where discussed, task 2 --- _episodes_rmd/07-hierarchical.Rmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index f64e1b23..cf862eb9 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -50,9 +50,10 @@ of the algorithm's progression. Here, we describe *hierarchical clustering*. Unlike K-means clustering, hierarchical clustering does not require the number of clusters $k$ to be specified by the user before the analysis is carried -out. Hierarchical clustering also provides an attractive *dendrogram*, a -tree-like diagram showing the degree of similarity between clusters. +out. +Hierarchical clustering also provides an attractive *dendrogram*, a +tree-like diagram showing the degree of similarity between clusters. The dendrogram is a key feature of hierarchical clustering. This tree-shaped graph allows the similarity between data points in a dataset to be visualised and the arrangement of clusters produced by the analysis to be illustrated. 
Dendrograms are created From 4c4821beddce107b4fbed41d9dd1920ded746720 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:14:04 +0000 Subject: [PATCH 041/119] change -2 to minus 2, task 3 may be confused for [hyphen 2] --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index cf862eb9..ccb7526f 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -85,7 +85,7 @@ To start with, we measure distance of the dendrogram, each observation is considered to be in its own individual cluster. We start the clustering procedure by fusing the two observations that are most similar according to a distance matrix. Next, the next-most similar observations are fused -so that the total number of clusters is *number of observations* - 2 (see +so that the total number of clusters is *number of observations* minus 2 (see panel below). Groups of observations may then be merged into a larger cluster (see next panel below, green box). This process continues until all the observations are included in a single cluster. From c2d8f9dda3f963d81b7b943099a185b1325e2551 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:15:13 +0000 Subject: [PATCH 042/119] clarify patterns in heat map, task 4 --- _episodes_rmd/07-hierarchical.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index ccb7526f..78c05f04 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -117,7 +117,8 @@ methyl <- readRDS(here("data/methylation.rds")) methyl_mat <- t(assay(methyl)) ``` -Looking at a heatmap of these data, we may spot some patterns -- many columns +Looking at a heatmap of these data, we may spot some patterns -- looking at the +vertical stripes, many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. From 78dc9f4c6254c8295a9be51cf8fbd39d6d60096e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:15:43 +0000 Subject: [PATCH 043/119] move description of heat map to after plot, task 5 --- _episodes_rmd/07-hierarchical.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 78c05f04..ec60b21d 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -117,11 +117,6 @@ methyl <- readRDS(here("data/methylation.rds")) methyl_mat <- t(assay(methyl)) ``` -Looking at a heatmap of these data, we may spot some patterns -- looking at the -vertical stripes, many columns -appear to have a similar methylation levels across all rows. However, they are -all quite jumbled at the moment, so it's hard to tell how many line up exactly. - ```{r heatmap-noclust, echo=FALSE} Heatmap(methyl_mat, @@ -132,6 +127,11 @@ Heatmap(methyl_mat, ) ``` +Looking at a heatmap of these data, we may spot some patterns -- looking at the +vertical stripes, many columns +appear to have a similar methylation levels across all rows. However, they are +all quite jumbled at the moment, so it's hard to tell how many line up exactly. + We can order these data to make the patterns more clear using hierarchical clustering. 
To do this, we can change the arguments we pass to `Heatmap()` from the **`ComplexHeatmap`** package. `Heatmap()` From d99d034e8baaf3d81b053a8f44f3e6166b457589 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:20:15 +0000 Subject: [PATCH 044/119] motivate that number of clusters also unknown from initial heat map, task 6 --- _episodes_rmd/07-hierarchical.Rmd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index ec60b21d..eac64bd9 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -131,6 +131,8 @@ Looking at a heatmap of these data, we may spot some patterns -- looking at the vertical stripes, many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. +In addition, it is challenging to tell how many groups containing similar methylation +levels we may have or what the similarities and differences are between groups. We can order these data to make the patterns more clear using hierarchical clustering. To do this, we can change the arguments we pass to From 8b313c3519595c678fc35a3a60ecb08c0d7e2c33 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:22:52 +0000 Subject: [PATCH 045/119] add title to signpost initial description of hierarchical clustering, part task 7 --- _episodes_rmd/07-hierarchical.Rmd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index eac64bd9..28e61d21 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -161,6 +161,8 @@ cause of this is -- it could be a batch effect, or a known grouping (e.g., old vs young samples). However, clustering like this can be a useful part of exploratory analysis of data to build hypotheses. +# Hierarchical clustering + Now, let's cover the inner workings of hierarchical clustering in more detail. There are two things to consider before carrying out clustering: * how to define dissimilarity between observations using a distance matrix, and From 52e60159e0d4f75878ebd67b00738db61b6e591b Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:24:49 +0000 Subject: [PATCH 046/119] add short description of hierarchical clustering to methodology, task 7 --- _episodes_rmd/07-hierarchical.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 28e61d21..39caedb6 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -164,7 +164,8 @@ exploratory analysis of data to build hypotheses. # Hierarchical clustering Now, let's cover the inner workings of hierarchical clustering in more detail. -There are two things to consider before carrying out clustering: +Hierarchical clustering is a type of clustering that also allows us to estimate the number +of clusters. There are two things to consider before carrying out clustering: * how to define dissimilarity between observations using a distance matrix, and * how to define dissimilarity between clusters and when to fuse separate clusters. 
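Both of these choices map directly onto base R functions: the distance metric is fixed when the distance matrix is built with `dist()`, and the linkage method is an argument to `hclust()`. The following is only an illustrative sketch on simulated data (object names invented here), not part of the episode's own examples:

```r
## Sketch: the two decisions made before hierarchical clustering --
## the distance metric (in dist()) and the linkage method (in hclust()).
set.seed(123)
x <- matrix(rnorm(20 * 2), ncol = 2)
d_euclidean <- dist(x, method = "euclidean")
d_manhattan <- dist(x, method = "manhattan")            # an alternative metric
h_complete <- hclust(d_euclidean, method = "complete")  # default linkage
h_average  <- hclust(d_euclidean, method = "average")   # an alternative linkage
par(mfrow = c(1, 2))
plot(h_complete, main = "Complete linkage")
plot(h_average,  main = "Average linkage")
```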
From 2a18084e18d3b4072e5256534e6bac4d925d1475 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:28:14 +0000 Subject: [PATCH 047/119] add section titles in line with two steps of clustering, tasks 8 and 9 clarifies that linkage etc is used to address the two steps in the preceeding bullet points --- _episodes_rmd/07-hierarchical.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 39caedb6..9e6c1850 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -169,7 +169,7 @@ of clusters. There are two things to consider before carrying out clustering: * how to define dissimilarity between observations using a distance matrix, and * how to define dissimilarity between clusters and when to fuse separate clusters. -# Creating the distance matrix +# Defining the dissimilarity between observations: creating the distance matrix Agglomerative hierarchical clustering is performed in two steps: calculating the distance matrix (containing distances between pairs of observations) and iteratively grouping observations into clusters using this matrix. @@ -192,7 +192,7 @@ clustering can have a big effect on the resulting tree. The decision of which distance matrix to use before carrying out hierarchical clustering depends on the type of data and question to be addressed. -# Linkage methods +# Defining the dissimilarity between clusters: Linkage methods The second step in performing hierarchical clustering after defining the distance matrix (or another function defining similarity between data points) From b445ea6a2d6618b462f9f3f38fe92e2727337e99 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:30:37 +0000 Subject: [PATCH 048/119] remove d notation, task 11 Only used once after --- _episodes_rmd/07-hierarchical.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 9e6c1850..5227454a 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -214,9 +214,9 @@ method used. Complete linkage (the default in `hclust()`) works by computing all pairwise dissimilarities between data points in different clusters. For each pair of two clusters, -it sets their dissimilarity ($d$) to the maximum dissimilarity value observed +it sets their dissimilarity to the maximum dissimilarity value observed between any of these clusters' constituent points. The two clusters -with smallest value of $d$ are then fused. +with smallest dissimilarity value are then fused. # Computing a dendrogram From 17f09d343a5d0653e3734b32f124323c3edb32e3 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:31:49 +0000 Subject: [PATCH 049/119] description of dendogram to definition, task 12 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 5227454a..57fd77d9 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -220,7 +220,7 @@ with smallest dissimilarity value are then fused. # Computing a dendrogram -Dendograms are useful tools to visualise the grouping of points and clusters into bigger clusters. +Dendograms are useful tools that plot the grouping of points and clusters into bigger clusters. 
We can create and plot dendrograms in R using `hclust()` which takes a distance matrix as input and creates the associated tree using hierarchical clustering. Here we create some example data to carry out hierarchical From 8a6d24401d5afd99fe9e524b923c3b17014df5c6 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:32:22 +0000 Subject: [PATCH 050/119] the associated tree to a tree, task 13 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 57fd77d9..cd7d8263 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -222,7 +222,7 @@ with smallest dissimilarity value are then fused. Dendograms are useful tools that plot the grouping of points and clusters into bigger clusters. We can create and plot dendrograms in R using `hclust()` which takes -a distance matrix as input and creates the associated tree using hierarchical +a distance matrix as input and creates a tree using hierarchical clustering. Here we create some example data to carry out hierarchical clustering. From 3aa5fbdb1b989076923cc73dd0df348039e6f8d3 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:33:10 +0000 Subject: [PATCH 051/119] clarify random data generation, task 14 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index cd7d8263..3994252e 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -227,7 +227,7 @@ clustering. Here we create some example data to carry out hierarchical clustering. Let's generate 20 data points in 2D space. Each -point belongs to one of three classes. Suppose we did not know which class +point is generated to belong to one of three classes/groups. Suppose we did not know which class data points belonged to and we want to identify these via cluster analysis. Hierarchical clustering carried out on the data can be used to produce a dendrogram showing how the data is partitioned into clusters. But how do we From 02d1f976b25cc43225d42e3cb150d4d805e6c4e9 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:39:25 +0000 Subject: [PATCH 052/119] move dendogram description to where this is actually generated and clarify that plot is random data, tasks 15 and 16 currently feels as though the plot of the data with numbered points is the dendogram and is a little confusing --- _episodes_rmd/07-hierarchical.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 3994252e..c8ecd1f6 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -228,10 +228,7 @@ clustering. Let's generate 20 data points in 2D space. Each point is generated to belong to one of three classes/groups. Suppose we did not know which class -data points belonged to and we want to identify these via cluster analysis. -Hierarchical clustering carried out on the data can be used to produce a -dendrogram showing how the data is partitioned into clusters. But how do we -interpret this dendrogram? Let's explore this using our example data. +data points belonged to and we want to identify these via cluster analysis. 
Let's first generate and plot our data: ```{r plotexample} @@ -255,6 +252,9 @@ text( dist_m <- dist(example_data, method = "euclidean") ``` +Hierarchical clustering carried out on the data can be used to produce a +dendrogram showing how the data is partitioned into clusters. But how do we interpret this dendrogram? Let's explore this using our example data in a Challenge. + > ## Challenge 1 > > Use `hclust()` to implement hierarchical clustering using the From ccdf908099da560ef5ca22b3f563a9d14ccfc1b0 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:41:54 +0000 Subject: [PATCH 053/119] clarify that dendogram generated in challenge 1, task 17 --- _episodes_rmd/07-hierarchical.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index c8ecd1f6..0efbaaaa 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -271,7 +271,8 @@ dendrogram showing how the data is partitioned into clusters. But how do we inte > {: .solution} {: .challenge} -This dendrogram shows similarities/differences in distances between data points. +A dendrogram, such as the one generated in Challenge 1, +shows similarities/differences in distances between data points. Each leaf of the dendrogram represents one of the 20 data points. These leaves fuse into branches as the height increases. Observations that are similar fuse into the same branches. The height at which any two From a058b9c6607a92bf6df723bf802533b6da99d23c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:44:25 +0000 Subject: [PATCH 054/119] clarify what leaves and branches are and look like from the plot, task 19 --- _episodes_rmd/07-hierarchical.Rmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 0efbaaaa..2c30770f 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -273,8 +273,9 @@ dendrogram showing how the data is partitioned into clusters. But how do we inte A dendrogram, such as the one generated in Challenge 1, shows similarities/differences in distances between data points. -Each leaf of the dendrogram represents one of the 20 data points. These leaves -fuse into branches as the height increases. Observations that are similar fuse into +Each vertical line at the bottom of the dendogram ('leaf') represents +one of the 20 data points. These leaves +fuse into fewer vertical lines ('branches') as the height increases. Observations that are similar fuse into the same branches. The height at which any two data points fuse indicates how different these two points are. 
Points that fuse at the top of the tree are very different from each other compared with two From bdd2985f3e840f1959ffec60b2de7a7091cebbbf Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:50:41 +0000 Subject: [PATCH 055/119] back reference cluster counting to plot in challenge, task 20 also quote figures consistently in numerics (rather than strings), use 5 as the height in the second example rather than 4 because 4 isn't labelled on y axis of dendogram, and make it clear that you can just count the number of times a vertical line crosses the horizontal cut --- _episodes_rmd/07-hierarchical.Rmd | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 2c30770f..38a76185 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -286,9 +286,10 @@ the scatterplot with their position on the tree. # Identifying clusters based on the dendrogram To do this, we can make a horizontal cut through the dendrogram at a user-defined height. -The sets of observations beneath this cut can be thought of as distinct clusters. For -example, a cut at height 10 produces two downstream clusters while a cut at -height 4 produces six downstream clusters. +The sets of observations beneath this cut can be thought of as distinct clusters. Equivalently, +we can count the vertical lines we encounter crossing the horizontal cut. For +example, a cut at height 10 produces 2 downstream clusters for the dendogram in Challenge 1, +while a cut at height 5 produces 5 downstream clusters. We can cut the dendrogram to determine number of clusters at different heights using `cutree()`. This function cuts a dendrogram into several From c44ea828e6654d025e3003b10a1134989fd2b750 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:51:54 +0000 Subject: [PATCH 056/119] typo fix dentrogram, task 23 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 38a76185..b373c7d5 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -356,7 +356,7 @@ downstream of the cut). # Highlighting dendrogram branches In addition to visualising cluster identity in scatter plots, it is also possible to -highlight branches in dentrograms. In this example, we calculate a distance matrix between +highlight branches in dendograms. In this example, we calculate a distance matrix between samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. ```{r plot-clust-method} From c21c7dacb33488f6f6e24fe1dff4d27c7c264f6c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:55:16 +0000 Subject: [PATCH 057/119] separate visualisation techniques, part task 21 --- _episodes_rmd/07-hierarchical.Rmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index b373c7d5..236e21ca 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -283,7 +283,7 @@ points that fuse at the bottom of the tree, which are quite similar. You can see this by comparing the position of similar/dissimilar points according to the scatterplot with their position on the tree. 
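The fusion heights that the dendrogram displays are also stored directly in the `hclust()` output. As a small sketch, assuming the `clust` object built from the example data in Challenge 1, the heights and a candidate horizontal cut can be inspected like this:

```r
## Sketch, assuming clust <- hclust(dist_m, method = "complete") from Challenge 1.
clust$height             # dissimilarities at which points/clusters fuse, in increasing order
max(clust$height)        # the final fusion, joining the last two clusters
plot(clust)
abline(h = 10, lty = 2)  # a horizontal cut like the ones discussed below
```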
-# Identifying clusters based on the dendrogram +# Identifying the number of clusters To do this, we can make a horizontal cut through the dendrogram at a user-defined height. The sets of observations beneath this cut can be thought of as distinct clusters. Equivalently, @@ -291,6 +291,7 @@ we can count the vertical lines we encounter crossing the horizontal cut. For example, a cut at height 10 produces 2 downstream clusters for the dendogram in Challenge 1, while a cut at height 5 produces 5 downstream clusters. +## Numerical visualisation We can cut the dendrogram to determine number of clusters at different heights using `cutree()`. This function cuts a dendrogram into several groups (or clusters) where the number of desired groups is controlled by the @@ -353,7 +354,7 @@ downstream of the cut). > {: .solution} {: .challenge} -# Highlighting dendrogram branches +# Dendogram visualisation In addition to visualising cluster identity in scatter plots, it is also possible to highlight branches in dendograms. In this example, we calculate a distance matrix between From 7c17b6e2a4bc23c2710b784195b6a457f9591766 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 11:59:48 +0000 Subject: [PATCH 058/119] reorder numerical and dendogram visualisation, part task 21 For flow, it feels as though colouring the dendogram should come first (also post-hoc numerical evaluation is probably most valuable after?) --- _episodes_rmd/07-hierarchical.Rmd | 69 ++++++++++++++++--------------- 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 236e21ca..f624c845 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -291,8 +291,43 @@ we can count the vertical lines we encounter crossing the horizontal cut. For example, a cut at height 10 produces 2 downstream clusters for the dendogram in Challenge 1, while a cut at height 5 produces 5 downstream clusters. + + +# Dendogram visualisation + +We can first visualise cluster membership by highlight branches in dendograms. +In this example, we calculate a distance matrix between +samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. + +```{r plot-clust-method} +## create a distance matrix using euclidean method +distmat <- dist(methyl_mat) +## hierarchical clustering using complete method +clust <- hclust(distmat) +## plot resulting dendrogram +plot(clust) + +## draw border around three clusters +rect.hclust(clust, k = 3, border = 2:6) +## draw border around two clusters +rect.hclust(clust, k = 2, border = 2:6) +``` +We can also colour clusters downstream of a specified cut using `color_branches()` +from the **`dendextend`** package. + +```{r plot-coloured-branches} +## cut tree at height = 4 +cut <- cutree(clust, h = 50) + +library("dendextend") +avg_dend_obj <- as.dendrogram(clust) +## colour branches of dendrogram depending on clusters +plot(color_branches(avg_dend_obj, h = 50)) +``` + ## Numerical visualisation -We can cut the dendrogram to determine number of clusters at different heights +In addition to visualising clusters directly on the dendogram, we can cut +the dendrogram to determine number of clusters at different heights using `cutree()`. This function cuts a dendrogram into several groups (or clusters) where the number of desired groups is controlled by the user, by defining either `k` (number of groups) or `h` (height at which tree is @@ -354,38 +389,6 @@ downstream of the cut). 
> {: .solution} {: .challenge} -# Dendogram visualisation - -In addition to visualising cluster identity in scatter plots, it is also possible to -highlight branches in dendograms. In this example, we calculate a distance matrix between -samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. - -```{r plot-clust-method} -## create a distance matrix using euclidean method -distmat <- dist(methyl_mat) -## hierarchical clustering using complete method -clust <- hclust(distmat) -## plot resulting dendrogram -plot(clust) - -## draw border around three clusters -rect.hclust(clust, k = 3, border = 2:6) -## draw border around two clusters -rect.hclust(clust, k = 2, border = 2:6) -``` -We can also colour clusters downstream of a specified cut using `color_branches()` -from the **`dendextend`** package. - -```{r plot-coloured-branches} -## cut tree at height = 4 -cut <- cutree(clust, h = 50) - -library("dendextend") -avg_dend_obj <- as.dendrogram(clust) -## colour branches of dendrogram depending on clusters -plot(color_branches(avg_dend_obj, h = 50)) -``` - # The effect of different linkage methods Now let us look into changing the default behaviour of `hclust()`. Imagine we have two crescent-shaped point clouds as shown below. ```{r crescents} From 06a6ca49c17633bfffd099b052041e88d8a46c69 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:26:38 +0000 Subject: [PATCH 059/119] redefine clust for reordered section, task 21 --- _episodes_rmd/07-hierarchical.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index f624c845..0dc3b37a 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -334,6 +334,7 @@ user, by defining either `k` (number of groups) or `h` (height at which tree is cut). ```{r cutree} +clust <- hclust(dist_m, method = "complete") ## k is a user defined parameter determining ## the desired number of clusters at which to cut the treee cutree(clust, k = 3) From 812e09a8008fe56b5d01451ba40e8b8be01b8e30 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:29:58 +0000 Subject: [PATCH 060/119] rewording in reordered sections, task 21 --- _episodes_rmd/07-hierarchical.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 0dc3b37a..00689eb7 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -285,11 +285,11 @@ the scatterplot with their position on the tree. # Identifying the number of clusters -To do this, we can make a horizontal cut through the dendrogram at a user-defined height. +To identify the number of clusters, we can make a horizontal cut through the dendrogram at a user-defined height. The sets of observations beneath this cut can be thought of as distinct clusters. Equivalently, we can count the vertical lines we encounter crossing the horizontal cut. For example, a cut at height 10 produces 2 downstream clusters for the dendogram in Challenge 1, -while a cut at height 5 produces 5 downstream clusters. +while a cut at height 4 produces 6 downstream clusters. 
From 92b5293630cff7ddcb256840e76f7660107a2d6b Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:33:36 +0000 Subject: [PATCH 061/119] clarify numerical summaries, task 22 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 00689eb7..569decb6 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -331,7 +331,7 @@ the dendrogram to determine number of clusters at different heights using `cutree()`. This function cuts a dendrogram into several groups (or clusters) where the number of desired groups is controlled by the user, by defining either `k` (number of groups) or `h` (height at which tree is -cut). +cut). The function outputs the cluster labels of each data point in order. ```{r cutree} clust <- hclust(dist_m, method = "complete") From 31cf70395a2edeec4108661b4694e4aea59ff037 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:34:38 +0000 Subject: [PATCH 062/119] elaborate on border argument, task 24 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 569decb6..9875026e 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -308,7 +308,7 @@ clust <- hclust(distmat) plot(clust) ## draw border around three clusters -rect.hclust(clust, k = 3, border = 2:6) +rect.hclust(clust, k = 3, border = 2:6) #border argument specifies the colours ## draw border around two clusters rect.hclust(clust, k = 2, border = 2:6) ``` From 63c3f34e7d8cb2116b592d679e5fb4143537b82e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:40:24 +0000 Subject: [PATCH 063/119] different to the, task 25 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 1c39c57f..d747e0ad 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -468,7 +468,7 @@ other crescent and so it splits both crescents. So far, we've been using Euclidean distance to define the dissimilarity or distance between observations. However, this isn't always the best -metric for how dissimilar different observations are. Let's make an +metric for how dissimilar the observations are. Let's make an example to demonstrate. Here, we're creating two samples each with ten observations of random noise: From 367e4c8d16a502e81c74d3b5d337f9f1deb10839 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:40:43 +0000 Subject: [PATCH 064/119] to to from, task 26 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index d747e0ad..00a4abcc 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -498,7 +498,7 @@ head(cor_example) ``` If we plot a heatmap of this, we can see that `sample_a` and `sample_b` are -grouped together because they have a small distance to each other, despite +grouped together because they have a small distance from each other, despite being quite different in their pattern across the different features. In contrast, `sample_a` and `sample_c` are very distant, despite having *exactly* the same pattern across the different features. 
From afc9a33c7f79bc7a98f9b6bab3e939290d27a9f1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:45:33 +0000 Subject: [PATCH 065/119] add axis labels, cor_example plot, task 28 --- _episodes_rmd/07-hierarchical.Rmd | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 00a4abcc..ccb58958 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -515,7 +515,9 @@ We can see that more clearly if we do a line plot: plot( 1:nrow(cor_example), rep(range(cor_example), 5), - type = "n" + type = "n", + xlab = "Feature number", + ylab = "Value" ) ## draw a red line for sample_a lines(cor_example$sample_a, col = "firebrick") From 05f3f4fea234b4e2182a2d3fce25fdb96cbad6f9 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:46:11 +0000 Subject: [PATCH 066/119] typo, remove to functions, task 29 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index ccb58958..045ded05 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -559,7 +559,7 @@ are grouped together, while `sample_b` is seen as distant because it has a different pattern, even though its values are closer to `sample_a`. Using your own distance function is often useful, especially if you have missing or unusual data. It's often possible to use correlation and other custom -distance functions to functions that perform hierarchical clustering, such as +distance functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: ```{r heatmap-cor-cor-example} From 5108de659d3c5de3af0d8fecbf22f78cb39c1025 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:50:44 +0000 Subject: [PATCH 067/119] clarify that hierarchical clustering performed by heatmap rather than that heatmap used as distance function, task 30 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 045ded05..b2375896 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -559,7 +559,7 @@ are grouped together, while `sample_b` is seen as distant because it has a different pattern, even though its values are closer to `sample_a`. Using your own distance function is often useful, especially if you have missing or unusual data. It's often possible to use correlation and other custom -distance functions that perform hierarchical clustering, such as +distance functions in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: ```{r heatmap-cor-cor-example} From d993ecaf718403967bc3f20f09e1d0c80640eefe Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:51:11 +0000 Subject: [PATCH 068/119] correct expactations, task 31 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index b2375896..deb34cbf 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -687,7 +687,7 @@ clustering results, due to its high sensitivity to noise in the dataset. An alternative is to use silhouette scores (see the k-means clustering episode). 
As we said before (see previous episode), clustering is a non-trivial task. -It is important to think about the nature of your data and your expactations +It is important to think about the nature of your data and your expectations rather than blindly using a some algorithm for clustering or cluster validation. # Further reading From 8fe71d519a59ac3b2763a13d9e0b4d301a33e1e7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:54:51 +0000 Subject: [PATCH 069/119] alt text to first two plots, remove figure indexing, part task 32 figure indexing not used anywhere else in episodes. --- _episodes_rmd/07-hierarchical.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index deb34cbf..6cf99008 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -88,12 +88,12 @@ panel below). Groups of observations may then be merged into a larger cluster (see next panel below, green box). This process continues until all the observations are included in a single cluster. -```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Figure 1a: Example data showing two clusters of observation pairs"} +```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Example data showing two clusters of observation pairs.", fig.alt="Scatter plot of observations x2 versus x1. Two clusters of pairs of observations are shown by blue and red boxes, each grouping two observations that are close in the space."} knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Figure 1b: Example data showing fusing of one observation into larger cluster"} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each and separated in the space. A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` From 08783f6432cbdcc9f4d06fa229027ea49fcaa45c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 12:59:27 +0000 Subject: [PATCH 070/119] add alt text and captions to heat map 1, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 6cf99008..3e55fe80 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -104,7 +104,7 @@ clustering is really useful, and then we can understand how to apply it in more detail. To do this, we'll return to the large methylation dataset we worked with in the regression lessons. Let's load the data and look at it. -```{r} +```{r, fig.cap="Heat map of methylation data.", fig.alt="Heat map of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4) and blue delineates low methylation levels (to around -4). 
The plot shows many vertical blue and red stripes."}
 library("minfi")
 library("here")
 library("ComplexHeatmap")

From 14bab0cdbf7fd4b4477098f0ce2a86117ea1c8a1 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 8 Mar 2024 13:03:49 +0000
Subject: [PATCH 071/119] add alt text and captions, clustered heat map, task
 32

---
 _episodes_rmd/07-hierarchical.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd
index 3e55fe80..e1a569ac 100644
--- a/_episodes_rmd/07-hierarchical.Rmd
+++ b/_episodes_rmd/07-hierarchical.Rmd
@@ -135,7 +135,7 @@ clustering. To do this, we can change the arguments we pass to
 groups features based on dissimilarity (here, Euclidean distance) and orders
 rows and columns to show clustering of features and observations.
 
-```{r heatmap-clust}
+```{r heatmap-clust, fig.cap="Heat map of methylation data clustered by methylation sites and individuals.", fig.alt="Heat map of individuals versus methylation sites, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. There are two, arguably three, white stripes towards the middle of the plot."}
 Heatmap(methyl_mat,
         name = "Methylation level",
         cluster_rows = TRUE, cluster_columns = TRUE,

From 66cd05d4db7f50c95dddb8c43eddc5bb459a4709 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 8 Mar 2024 13:07:54 +0000
Subject: [PATCH 072/119] add alt text and caption for computing dendogram
 scatter plot, task 32

---
 _episodes_rmd/07-hierarchical.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd
index e1a569ac..28eaba8f 100644
--- a/_episodes_rmd/07-hierarchical.Rmd
+++ b/_episodes_rmd/07-hierarchical.Rmd
@@ -226,7 +226,7 @@ dendrogram showing how the data is partitioned into clusters. But how do we
 interpret this dendrogram? Let's explore this using our example data.
 
 
-```{r plotexample}
+```{r plotexample, fig.cap="Scatter plot of randomly-generated data x2 versus x1.", fig.alt="A scatter plot of randomly-generated data x2 versus x1. 
The points appear fairly randomly scattered, arguably centered towards the bottom of the plot."} #First, create some example data with two variables x1 and x2 set.seed(450) example_data <- data.frame( From 8ce13aaeb70ddd1d1c8a68c3a7fabd1e132589d8 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:10:09 +0000 Subject: [PATCH 073/119] remove figure caption from challenge 1, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 28eaba8f..efe00aff 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -256,7 +256,7 @@ dist_m <- dist(example_data, method = "euclidean") > > > ## Solution: > > -> > ```{r plotclustex} +> > ```{r plotclustex, fig.cap=" "} > > clust <- hclust(dist_m, method = "complete") > > plot(clust) > > ``` From e771c397410bfaffcbdbb487b10c8369b5a15556 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:13:36 +0000 Subject: [PATCH 074/119] add alt text and caption to clustered scatter plot, also swap x and y in plot for consistency, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index efe00aff..2cae8bfe 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -286,7 +286,7 @@ groups (or clusters) where the number of desired groups is controlled by the user, by defining either `k` (number of groups) or `h` (height at which tree is cut). -```{r cutree} +```{r cutree, fig.cap="Scatter plot of data x2 versus x1, coloured by cluster.", fig.alt="A scatter plot of the example data x2 versus x1, coloured by 8 different clusters. There are two clusters in the south east of the plot, 4 clusters in the north west of the plot, and a final cluster consisting of one point in the north east of the plot."} ## k is a user defined parameter determining ## the desired number of clusters at which to cut the treee cutree(clust, k = 3) @@ -305,7 +305,7 @@ count(example_cl, cluster) #plot cluster each point belongs to on original scatterplot library(ggplot2) -ggplot(example_cl, aes(x = x2, y = x1, color = factor(cluster))) + geom_point() +ggplot(example_cl, aes(x = x1, y = x2, color = factor(cluster))) + geom_point() ``` Note that this cut produces 8 clusters (two before the cut and another six From bfca6fd07bea098e0b831cc477800ff786931a31 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:16:38 +0000 Subject: [PATCH 075/119] add alt text and captions to boxed dendrogram, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 2cae8bfe..8af071f0 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -348,7 +348,7 @@ In addition to visualising cluster identity in scatter plots, it is also possibl highlight branches in dentrograms. In this example, we calculate a distance matrix between samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. -```{r plot-clust-method} +```{r plot-clust-method, fig.cap="Dendogram with boxes around clusters.", fig.alt="A dendogram for the methyl_mat data with boxes overlain on clusters. 
There are 5 clusters in total, each delineating a separate cluster."} ## create a distance matrix using euclidean method distmat <- dist(methyl_mat) ## hierarchical clustering using complete method clust <- hclust(distmat) From 37d5e09ffb07cf5fae92dc10495907665e919a8a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:18:03 +0000 Subject: [PATCH 076/119] alt text and captions for branch coloured dendrogram, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 8af071f0..0305be9b 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -364,7 +364,7 @@ rect.hclust(clust, k = 2, border = 2:6) We can also colour clusters downstream of a specified cut using `color_branches()` from the **`dendextend`** package. -```{r plot-coloured-branches} +```{r plot-coloured-branches, fig.cap="Dendogram with coloured branches delineating different clusters.", fig.alt="A dendogram with the different clusters in 4 different colours."} ## cut tree at height = 4 cut <- cutree(clust, h = 50) From bbfeb3dae8918b0e1b9dc58ca01a6c0ef724d7b1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:20:33 +0000 Subject: [PATCH 077/119] alt text and caption for crescent simulated data, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 0305be9b..428972be 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -376,7 +376,7 @@ plot(color_branches(avg_dend_obj, h = 50)) # The effect of different linkage methods Now let us look into changing the default behaviour of `hclust()`. Imagine we have two crescent-shaped point clouds as shown below. -```{r crescents} +```{r crescents, fig.cap="Scatter plot of data simulated according to two crescent-shaped point clouds.", fig.alt="A scatter plot of data simulated to form two crescent shapes. The crescents are horizontally orientated with a a rough line of vertical symmetry."} # These two functions are to help us make crescents. Don't worry it you do not understand all this code. # The importent bit is the object "cres", which consists of two columns (x and y coordinates of two crescents). is.insideCircle <- function(co, r=0.5, offs=c(0,0)){ From cbd2a8402d3218c36fafec3dd4172d26ea80f424 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:23:54 +0000 Subject: [PATCH 078/119] alt text and captions, crescent clustering, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 428972be..e08fdbe7 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -399,7 +399,7 @@ plot(cres) We might expect that the crescents are resolved into separate clusters. But if we run hierarchical clustering with the default arguments, we get this: -```{r cresClustDefault} +```{r cresClustDefault, fig.cap="Scatter plot of crescent-shaped simulated data, coloured according to clusters calculated using Euclidean distance.", fig.alt="A scatter plot of the crescent-shaped simulated data calculated using Euclidean distance. The points are coloured in black or red according to their membership to 2 clusters. 
The points in the tails of each crescent have inherited the colour of the opposite crescent."} cresClass <- cutree(hclust(dist(cres)), k=2) # save partition for colouring plot(cres, col=cresClass) # colour scatterplot by partition ``` From 9105d07e07b0a46bcf596feccccc15921517976c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:26:34 +0000 Subject: [PATCH 079/119] alt text and caption, using different distance methods, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index e08fdbe7..d68a201b 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -503,7 +503,7 @@ being quite different in their pattern across the different features. In contrast, `sample_a` and `sample_c` are very distant, despite having *exactly* the same pattern across the different features. -```{r heatmap-cor-example} +```{r heatmap-cor-example, fig.cap="Heat map of simulated data.", fig.alt="Heat map of simulated data: feature versus sample. The grid cells of the heat map are coloured from red (high) to blue (low) according to value of the simulated data."} Heatmap(as.matrix(cor_example)) ``` From 2cf2cf9e2e6bf1ccd3d655d41da440900e2013ed Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:29:11 +0000 Subject: [PATCH 080/119] alt text and captions, using different linkage methods, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index d68a201b..dfab7717 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -508,7 +508,7 @@ Heatmap(as.matrix(cor_example)) ``` We can see that more clearly if we do a line plot: -```{r lineplot-cor-example} +```{r lineplot-cor-example, fig.cap="Line plot of simulated value versus observation number, coloured by sample.", fig.alt="A line plot of simulated value versus observation number, coloured by sample. Samples a and b are concentrated at the bottom of the plot, while sample c is concentrated at the top of the plot."} ## create a blank plot (type = "n" means don't draw anything) ## with an x range to hold the number of features we have. ## the range of y needs to be enough to show all the values for every feature From b52dda312479815ded3d0e28a24d751a52a844a7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:31:12 +0000 Subject: [PATCH 081/119] alt text and caption, using different distance methods dendrogram, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index dfab7717..ce87cd4a 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -533,7 +533,7 @@ the values, they have a high distance to each other. We can see that if we cluster and plot the data ourselves using Euclidean distance: -```{r clust-euc-cor-example} +```{r clust-euc-cor-example, fig.cap="Dendogram of the example simulated data.", fig.alt="A dendogram of the example simulated data. 
The dendogram shows that sample c definitely forms its own cluster for any cut height and sammples b and a merge into a cluster at a height of around 6."} clust_dist <- hclust(dist(t(cor_example))) plot(clust_dist) ``` From 57994b2ca2368dec0fa35a102d98f20012e9e23a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:34:56 +0000 Subject: [PATCH 082/119] edits to alt text and captions, choosing different distances, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index ce87cd4a..04f95935 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -508,7 +508,7 @@ Heatmap(as.matrix(cor_example)) ``` We can see that more clearly if we do a line plot: -```{r lineplot-cor-example, fig.cap="Line plot of simulated value versus observation number, coloured by sample.", fig.alt="A line plot of simulated value versus observation number, coloured by sample. Samples a and b are concentrated at the bottom of the plot, while sample c is concentrated at the top of the plot."} +```{r lineplot-cor-example, fig.cap="Line plot of simulated value versus observation number, coloured by sample.", fig.alt="A line plot of simulated value versus observation number, coloured by sample. Samples a and b are concentrated at the bottom of the plot, while sample c is concentrated at the top of the plot. Samples a and c have exactly the same vertical pattern."} ## create a blank plot (type = "n" means don't draw anything) ## with an x range to hold the number of features we have. ## the range of y needs to be enough to show all the values for every feature @@ -533,7 +533,7 @@ the values, they have a high distance to each other. We can see that if we cluster and plot the data ourselves using Euclidean distance: -```{r clust-euc-cor-example, fig.cap="Dendogram of the example simulated data.", fig.alt="A dendogram of the example simulated data. The dendogram shows that sample c definitely forms its own cluster for any cut height and sammples b and a merge into a cluster at a height of around 6."} +```{r clust-euc-cor-example, fig.cap="Dendogram of the example simulated data clustered according to Euclidean distance.", fig.alt="A dendogram of the example simulated data clustered according to Euclidean distance. The dendogram shows that sample c definitively forms its own cluster for any cut height and samples a and b merge into a cluster at a height of around 6."} clust_dist <- hclust(dist(t(cor_example))) plot(clust_dist) ``` @@ -548,7 +548,7 @@ we can use `1 - cor(x)` as the distance metric. The input to `hclust()` must be a `dist` object, so we also need to call `as.dist()` on it before passing it in. -```{r clust-cor-cor-example} +```{r clust-cor-cor-example, fig.cap="Dendogram of the example simulated data clustered according to correlation.", fig.alt="A dendogram of the example simulated data clustered according to correlation. 
The dendogram shows that sample b definitively forms its own cluster and samples a and c form definitively form their own cluster for any cut height."} cor_as_dist <- as.dist(1 - cor(cor_example)) clust_cor <- hclust(cor_as_dist) plot(clust_cor) From dc1def70ef262105368cbe64bbd4123ff2e78729 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:39:49 +0000 Subject: [PATCH 083/119] alt text and captions, different distances, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 04f95935..442da2df 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -562,7 +562,7 @@ or unusual data. It's often possible to use correlation and other custom distance functions in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: -```{r heatmap-cor-cor-example} +```{r heatmap-cor-cor-example, fig.cap="Heat map of features versus samples clustered in the samples according to correlation.", fig.alt="A heat map of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low (blue) values, while sample c has mostly mostly high (red) values."} ## pheatmap allows you to select correlation directly pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation") ## Using the built-in stats::heatmap From 21e56328fd2ae671be63ece59f3e424800875d18 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:41:46 +0000 Subject: [PATCH 084/119] edits to distance plots alt text and captions to describe both heat maps, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 442da2df..b59b4434 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -562,7 +562,7 @@ or unusual data. It's often possible to use correlation and other custom distance functions in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: -```{r heatmap-cor-cor-example, fig.cap="Heat map of features versus samples clustered in the samples according to correlation.", fig.alt="A heat map of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low (blue) values, while sample c has mostly mostly high (red) values."} +```{r heatmap-cor-cor-example, fig.cap="Heat maps of features versus samples clustered in the samples according to correlation.", fig.alt="Heat maps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. 
Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} ## pheatmap allows you to select correlation directly pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation") ## Using the built-in stats::heatmap From 5ae51ef39f3355dbda5028dc3fc3b4330a1bae51 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:43:48 +0000 Subject: [PATCH 085/119] alt text and caption, Dunn index, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index b59b4434..74008d9c 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -598,7 +598,7 @@ Dunn index, the better defined the clusters. Let's calculate the Dunn index for clustering carried out on the `methyl_mat` dataset using the **`clValid`** package. -```{r plot-clust-dunn} +```{r plot-clust-dunn, fig.cap="Dendogram for clustering of methylation data.", fig.alt="A dendogram for clustering of methylation data. Identical to that in the section Highlighting dendrogram branches, without the colour overlay to show clusters."} ## calculate dunn index ## (ratio of the smallest distance between obs not in the same cluster ## to the largest intra-cluster distance) From 7f353883bd53bb2af09b0222cdd851cc636c1d5a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:46:13 +0000 Subject: [PATCH 086/119] alt text and captions, dunn index scatter plot, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 74008d9c..65a74ff8 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -662,7 +662,7 @@ between sets of clusters with larger values being preferred. The figures below show in a more systematic way how changing the values of `k` and `h` using `cutree()` affect the Dunn index. -```{r hclust-fig3, echo=TRUE, fig.cap="Figure 3: Dunn index"} +```{r hclust-fig3, echo=TRUE, fig.cap="Dunn index versus cut height for methylation data.", fig.alt="Scatter plot of Dunn index versus cut height for methylation data. The Dunn index is high (around 1.6) for height values up to 20. The Dunn index drops around height 20 and the points fluctuate around 0.8 and 1 as height increases."} h_seq <- 70:10 h_dunn <- sapply(h_seq, function(x) dunn(distance = distmat, cutree(clust, h = x))) k_seq <- seq(2, 10) From 6887a390b7d588bd9cd5997c64438c867093599c Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Fri, 8 Mar 2024 13:46:26 +0000 Subject: [PATCH 087/119] Update _episodes_rmd/07-hierarchical.Rmd --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 9875026e..3e59d73e 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -300,7 +300,7 @@ In this example, we calculate a distance matrix between samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. 
```{r plot-clust-method} -## create a distance matrix using euclidean method +## create a distance matrix using euclidean distance distmat <- dist(methyl_mat) ## hierarchical clustering using complete method clust <- hclust(distmat) From 092923a7075594e5d47345f332b1c16d093129aa Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:48:28 +0000 Subject: [PATCH 088/119] alt text and captions, final plot, complete task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 65a74ff8..95ebdd68 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -675,7 +675,7 @@ is not very useful - cutting the given tree at a low `h` value like 15 leads to ending up each in its own cluster. More relevant is the second maximum in the plot, around `h=55`. Looking at the dendrogram, this corresponds to `k=4`. -```{r hclust-fig4, echo=TRUE, fig.cap="Figure 4: Dunn index continued"} +```{r hclust-fig4, echo=TRUE, fig.cap="Scatter plot of Dunn index versus the number of clusters for the methylation data.", fig.alt="A scatter plot of the Dunn index versus the number of clusters for the methylation data. The points appear randomly scattered around the space between Dunn indices of 0.77 to 0.85, apart from for 4 clusters where the Dunn index reaches just over 0.88."} plot(k_seq, k_dunn, xlab = "Number of clusters (k)", ylab = "Dunn index") grid() ``` From 45ca745d07f87e63b286a7718621d4705c97f2ca Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 13:51:43 +0000 Subject: [PATCH 089/119] add alt text and caption to first heat map, task 32 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 95ebdd68..146dd8fb 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -119,7 +119,7 @@ Looking at a heatmap of these data, we may spot some patterns -- many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. -```{r heatmap-noclust, echo=FALSE} +```{r heatmap-noclust, echo=FALSE, fig.cap="Heat map of methylation data.", fig.alt="Heat map of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. There are many vertical blue and red stripes."} Heatmap(methyl_mat, name = "Methylation level", From 34fae943217bb816600b8ceb22d8dc23f11cf621 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:37:38 +0000 Subject: [PATCH 090/119] change "the space" --- _episodes_rmd/07-hierarchical.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 146dd8fb..f854fc0f 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -88,12 +88,12 @@ panel below). Groups of observations may then be merged into a larger cluster (see next panel below, green box). This process continues until all the observations are included in a single cluster. 
-```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Example data showing two clusters of observation pairs.", fig.alt="Scatter plot of observations x2 versus x1. Two clusters of pairs of observations are shown by blue and red boxes, each grouping two observations that are close in the space."} +```{r hclustfig1, echo=FALSE, out.width="500px", fig.cap="Example data showing two clusters of observation pairs.", fig.alt="Scatter plot of observations x2 versus x1. Two clusters of pairs of observations are shown by blue and red boxes, each grouping two observations that are close in their x and y distance."} knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each and separated in the space. A third green box is shown encompassing the blue box and an additional data point."} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each and separated in their x and y distance. A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` @@ -675,7 +675,7 @@ is not very useful - cutting the given tree at a low `h` value like 15 leads to ending up each in its own cluster. More relevant is the second maximum in the plot, around `h=55`. Looking at the dendrogram, this corresponds to `k=4`. -```{r hclust-fig4, echo=TRUE, fig.cap="Scatter plot of Dunn index versus the number of clusters for the methylation data.", fig.alt="A scatter plot of the Dunn index versus the number of clusters for the methylation data. The points appear randomly scattered around the space between Dunn indices of 0.77 to 0.85, apart from for 4 clusters where the Dunn index reaches just over 0.88."} +```{r hclust-fig4, echo=TRUE, fig.cap="Scatter plot of Dunn index versus the number of clusters for the methylation data.", fig.alt="A scatter plot of the Dunn index versus the number of clusters for the methylation data. The points appear randomly scattered around the plot area between Dunn indices of 0.77 to 0.85, apart from for 4 clusters where the Dunn index reaches just over 0.88."} plot(k_seq, k_dunn, xlab = "Number of clusters (k)", ylab = "Dunn index") grid() ``` From 4adbcfdc9635c11ebc6b6f632d71b935fa6ff6e2 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:40:12 +0000 Subject: [PATCH 091/119] rephrase hclustfig2 alt text --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index f854fc0f..c2c2c952 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -93,7 +93,7 @@ knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each and separated in their x and y distance. 
A third green box is shown encompassing the blue box and an additional data point."} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each. The two boxes encompass points that are relatively far apart. A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` From a2144a74ea97e559701fb2e50a2b25603be7968c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:51:56 +0000 Subject: [PATCH 092/119] encompass to contain --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index c2c2c952..6929ec8f 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -93,7 +93,7 @@ knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each. The two boxes encompass points that are relatively far apart. A third green box is shown encompassing the blue box and an additional data point."} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each. The two boxes contain points that are relatively far apart. A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` From 024318a24379c25fba65d63c6852043ffdd244f7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:52:42 +0000 Subject: [PATCH 093/119] minor wording change hclustfig2 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 6929ec8f..8dc0892b 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -93,7 +93,7 @@ knitr::include_graphics("../fig/hierarchical_clustering_1.png") ``` -```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time, blue and red boxes containing two observations each. The two boxes contain points that are relatively far apart. A third green box is shown encompassing the blue box and an additional data point."} +```{r hclustfig2, echo=FALSE, out.width="500px", fig.cap="Example data showing fusing of one observation into larger cluster.", fig.alt="Scatter plot of observations x2 versus x1. Three boxes are shown this time. Blue and red boxes contain two observations each. The two boxes contain points that are relatively far apart. 
A third green box is shown encompassing the blue box and an additional data point."} knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` From 380c6ca9ebdfcb91b46215b9c4ce54a41e585888 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:56:29 +0000 Subject: [PATCH 094/119] heat map to heatmap throughout --- _episodes_rmd/07-hierarchical.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 8dc0892b..965767b1 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -104,7 +104,7 @@ clustering is really useful, and then we can understand how to apply it in more detail. To do this, we'll return to the large methylation dataset we worked with in the regression lessons. Let's load the data and look at it. -```{r, fig.cap="Heat map of methylation data.", fig.alt="Heat map of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4) and blue delineates low methylation levels (to around -4). The plot shows many vertical blue and red stripes."} +```{r, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4) and blue delineates low methylation levels (to around -4). The plot shows many vertical blue and red stripes."} library("minfi") library("here") library("ComplexHeatmap") @@ -119,7 +119,7 @@ Looking at a heatmap of these data, we may spot some patterns -- many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. -```{r heatmap-noclust, echo=FALSE, fig.cap="Heat map of methylation data.", fig.alt="Heat map of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. There are many vertical blue and red stripes."} +```{r heatmap-noclust, echo=FALSE, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. There are many vertical blue and red stripes."} Heatmap(methyl_mat, name = "Methylation level", @@ -135,7 +135,7 @@ clustering. To do this, we can change the arguments we pass to groups features based on dissimilarity (here, Euclidean distance) and orders rows and columns to show clustering of features and observations. -```{r heatmap-clust, fig.cap="Heat map of methylation data clustered by methylation sites and individuals.", fig.alt="Heat map of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. 
There are two, arguably three, white stripes towards the middle of the plot."} +```{r heatmap-clust, fig.cap="Heatmap of methylation data clustered by methylation sites and individuals.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. There are two, arguably three, white stripes towards the middle of the plot."} Heatmap(methyl_mat, name = "Methylation level", cluster_rows = TRUE, cluster_columns = TRUE, @@ -503,7 +503,7 @@ being quite different in their pattern across the different features. In contrast, `sample_a` and `sample_c` are very distant, despite having *exactly* the same pattern across the different features. -```{r heatmap-cor-example, fig.cap="Heat map of simulated data.", fig.alt="Heat map of simulated data: feature versus sample. The grid cells of the heat map are coloured from red (high) to blue (low) according to value of the simulated data."} +```{r heatmap-cor-example, fig.cap="Heatmap of simulated data.", fig.alt="Heatmap of simulated data: feature versus sample. The grid cells of the heatmap are coloured from red (high) to blue (low) according to value of the simulated data."} Heatmap(as.matrix(cor_example)) ``` @@ -562,7 +562,7 @@ or unusual data. It's often possible to use correlation and other custom distance functions in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: -```{r heatmap-cor-cor-example, fig.cap="Heat maps of features versus samples clustered in the samples according to correlation.", fig.alt="Heat maps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} +```{r heatmap-cor-cor-example, fig.cap="Heatmaps of features versus samples clustered in the samples according to correlation.", fig.alt="Heatmaps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} ## pheatmap allows you to select correlation directly pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation") ## Using the built-in stats::heatmap From bf9b58f7de6b7dd9020982155f2cdd1a0d047cee Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 15:57:22 +0000 Subject: [PATCH 095/119] remove alt text and caption for code block without plot --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 965767b1..04600d42 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -104,7 +104,7 @@ clustering is really useful, and then we can understand how to apply it in more detail. 
To do this, we'll return to the large methylation dataset we worked with in the regression lessons. Let's load the data and look at it. -```{r, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4) and blue delineates low methylation levels (to around -4). The plot shows many vertical blue and red stripes."} +```{r} library("minfi") library("here") library("ComplexHeatmap") From 4190e98d373c580f4497860917c2a1a2afec5563 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 16:00:02 +0000 Subject: [PATCH 096/119] heatmapclust alt text and caption edits --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 04600d42..c26783cb 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -135,7 +135,7 @@ clustering. To do this, we can change the arguments we pass to groups features based on dissimilarity (here, Euclidean distance) and orders rows and columns to show clustering of features and observations. -```{r heatmap-clust, fig.cap="Heatmap of methylation data clustered by methylation sites and individuals.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. There are two, arguably three, white stripes towards the middle of the plot."} +```{r heatmap-clust, fig.cap="Heatmap of methylation data clustered by methylation sites and individuals.", fig.alt="Heatmap of methylation level with individuals along the y axis and methylation sites along the x axis, clustered by methylation sites and individuals. Red colours indicate high methylation levels (up to around 4), blue colours indicate low methylation levels (to around -4) and white indicates methylation levels close to zero. This time, the individuals and methylation sites are clustered and the plot fades from vertical red lines on the left side to vertical blue lines on the right side. There are two, arguably three, white stripes towards the middle of the plot."} Heatmap(methyl_mat, name = "Methylation level", cluster_rows = TRUE, cluster_columns = TRUE, From a742fd8322586ec5a062136f5763e7e78db5c885 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 8 Mar 2024 16:01:03 +0000 Subject: [PATCH 097/119] edits to cutree plot --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index c26783cb..65fded0b 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -286,7 +286,7 @@ groups (or clusters) where the number of desired groups is controlled by the user, by defining either `k` (number of groups) or `h` (height at which tree is cut). -```{r cutree, fig.cap="Scatter plot of data x2 versus x1, coloured by cluster.", fig.alt="A scatter plot of the example data x2 versus x1, coloured by 8 different clusters. 
There are two clusters in the south east of the plot, 4 clusters in the north west of the plot, and a final cluster consisting of one point in the north east of the plot."} +```{r cutree, fig.cap="Scatter plot of data x2 versus x1, coloured by cluster.", fig.alt="A scatter plot of the example data x2 versus x1, coloured by 8 different clusters. There are two clusters in the bottom right of the plot, 4 clusters in the top left of the plot, and a final cluster consisting of one point in the top right of the plot."} ## k is a user defined parameter determining ## the desired number of clusters at which to cut the treee cutree(clust, k = 3) From d31570507c0f4aeb4d4be7c60ccbfbe7534a082e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 09:47:22 +0000 Subject: [PATCH 098/119] associated to resulting Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 88d89fa4..48e7b7ba 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -61,7 +61,7 @@ PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect. This is useful from an exploratory point of view, discovering how variables might be associated and combined. The -associated principal component could also be used as an effect in further analysis +resulting principal component could also be used as an effect in further analysis (e.g. linear regression). From fdc03498006ba5c29f634f572c0b7cf95bf7d340 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 09:55:47 +0000 Subject: [PATCH 099/119] plural to singular Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 48e7b7ba..18d79e49 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -122,7 +122,7 @@ fallow for 50 farms in southern England. The red line represents the first principal component scores, which pass through the points with the greatest variability. The points along this line give the first principal component. The second principal component explains the next highest amount of variability -in the data and are represented by the line perpendicular to the first (green line). +in the data and is represented by the line perpendicular to the first (green line). 
```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} From 99c64395f4e9f1065fbb537af72a249a4d6c9d3a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 09:56:09 +0000 Subject: [PATCH 100/119] remove iteratively Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 18d79e49..32369417 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -130,7 +130,7 @@ in the data and is represented by the line perpendicular to the first (green lin knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png")) ``` -The animation below illustrates how principal components are calculated iteratively from +The animation below illustrates how principal components are calculated from data. You can imagine that the black line is a rod and each red dashed line is a spring. The energy of each spring is proportional to its squared length. The direction of the first principal component is the one that minimises the total From 3d3ed3a6368c123fcc89af9ff339647750da9157 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 09:57:26 +0000 Subject: [PATCH 101/119] change to possessive Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 32369417..db2401d9 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -153,7 +153,7 @@ knitr::include_graphics(here("fig/pendulum.gif")) > where $a_{11}...a_{p1}$ represent principal component _loadings_. {: .callout} -In summary, the principal components values are called _scores_. The loadings can +In summary, the principal components' values are called _scores_. The loadings can be thought of as the degree to which each original variable contributes to the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. From e164deb39caaa421fc2d924953ea030dd2e6dd47 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 11:05:19 +0000 Subject: [PATCH 102/119] edit biodiversity explanation --- _episodes_rmd/04-principal-component-analysis.Rmd | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index db2401d9..8bbc9a9d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -118,11 +118,13 @@ in the underlying dataset. The second principal component derived explains the s To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left -fallow for 50 farms in southern England. The red line represents the first -principal component scores, which pass through the points with the greatest -variability. The points along this line give the first principal component. 
-The second principal component explains the next highest amount of variability -in the data and is represented by the line perpendicular to the first (green line). +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points called _scores_. +The red line on the plot represents the line passing through the scores (points) of the first principal component. +The angle that the first principal component line passes through the data points at is set to the direction with the highest +variability. The plotted first principal components can therefore be thought of reflecting the +effect in the data that has the highest variability. The second principal component explains the next highest amount of variability +in the data and is represented by the line perpendicular to the first (the green line). The second principal component can be thought of as +capturing the overall effect in the data that has the second-highest variability. ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} From 6f3a5a4e845cdf3535e4debbdbd64384a189eb94 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 11:07:14 +0000 Subject: [PATCH 103/119] remove echo FALSE box with only code comments --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 ------ 1 file changed, 6 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 8bbc9a9d..15f07078 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -107,12 +107,6 @@ resulting principal component could also be used as an effect in further analysi # Principal component analysis -```{r, eval=FALSE, echo=FALSE} -# A PCA is carried out by calculating a matrix of Pearson's correlations from -# the original dataset which shows how each of the variables in the dataset -# relate to each other. -``` - PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. 
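The PCA summary described in the hunk above can be sketched with base R's `prcomp()`. The short example below is not taken from any episode's source files; the simulated data and the object names (`sim`, `pca`) are made up purely for illustration, as a minimal sketch of how correlated variables collapse onto a small number of uncorrelated components that carry most of the variance.

```r
## A small, simulated illustration of PCA with base R's prcomp().
set.seed(42)
n  <- 50
x1 <- rnorm(n)                  # an underlying signal
x2 <- x1 + rnorm(n, sd = 0.2)   # strongly correlated with x1
x3 <- rnorm(n)                  # an unrelated variable
sim <- data.frame(x1, x2, x3)

## centre and scale the variables, then compute principal components
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

## proportion of variance explained by each component
summary(pca)$importance["Proportion of Variance", ]

## scores: the new, uncorrelated variables (one column per component)
head(pca$x)
round(cor(pca$x), 2)  # components are uncorrelated by construction
```

Because `x1` and `x2` are nearly collinear here, the first component typically accounts for around two thirds of the total variance, which is the sense in which a reduced set of components can stand in for the original variables in downstream analyses.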
From ba029bae7b427be2db0dd0b5c2f15f7f934e0d11 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 12:19:55 +0000 Subject: [PATCH 104/119] add here to k means Co-authored-by: Alan O'Callaghan --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 60b7cca7..b53830cf 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed: We can see this process in action in this animation: ```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics("../fig/kmeans.gif") +knitr::include_graphics(here::here("fig/kmeans.gif")) ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, namely difficulties in identifying From c43d9b8228e1271e5eefd4082a06361f493cbac6 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:30:34 +0000 Subject: [PATCH 105/119] Update _episodes_rmd/04-principal-component-analysis.Rmd --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 15f07078..dbf7e790 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -123,7 +123,7 @@ capturing the overall effect in the data that has the second-highest variability ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) -knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png")) +knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") ``` The animation below illustrates how principal components are calculated from From a1103da687c4f778234abc1afd99260c58fbe863 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:31:43 +0000 Subject: [PATCH 106/119] Update _episodes_rmd/06-k-means.Rmd --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index b53830cf..60b7cca7 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -85,7 +85,7 @@ these two steps until appropriate clusters have been formed: We can see this process in action in this animation: ```{r kmeans-animation, echo = FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics(here::here("fig/kmeans.gif")) +knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, namely difficulties in identifying From 351a52ee6fe77fcd68272b3072b5a6e2a197613b Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:32:44 +0000 Subject: [PATCH 107/119] Update _episodes_rmd/04-principal-component-analysis.Rmd --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index dbf7e790..c68168f5 100644 --- 
a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -137,7 +137,7 @@ principal component. This is explained in more detail on [this Q&A website](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues). ```{r pendulum, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics(here("fig/pendulum.gif")) +knitr::include_graphics("../fig/pendulum.gif") ``` > ## Mathematical description of PCA From b59028abf61cf38c698e3f0eb8da242fb91a9cef Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Thu, 21 Mar 2024 15:07:24 +0000 Subject: [PATCH 108/119] add that individual level scores --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c68168f5..ac6aae4f 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -112,8 +112,8 @@ in the underlying dataset. The second principal component derived explains the s To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left -fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points called _scores_. -The red line on the plot represents the line passing through the scores (points) of the first principal component. +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points for each individual observation called _scores_. +The red line on the plot represents the line passing through the scores (points) of the first principal component for each observation. The angle that the first principal component line passes through the data points at is set to the direction with the highest variability. The plotted first principal components can therefore be thought of reflecting the effect in the data that has the highest variability. The second principal component explains the next highest amount of variability From 92f16b44e45eb89c2ee52a923cfc56b10e997408 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Thu, 21 Mar 2024 16:58:37 +0000 Subject: [PATCH 109/119] Lowercase prostate (#157) * Resolve #156 * Remove ordering by effect size for toptab --- ...-introduction-to-high-dimensional-data.Rmd | 44 +++++++++---------- .../02-high-dimensional-regression.Rmd | 5 +-- 2 files changed, 24 insertions(+), 25 deletions(-) diff --git a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd index 07cb9150..cd1d8e52 100644 --- a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd +++ b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd @@ -124,11 +124,11 @@ of the challenges we are facing when working with high-dimensional data. > encountered when working with many features in a high-dimensional data set. > > First, make sure you have completed the setup instructions [here](https://carpentries-incubator.github.io/high-dimensional-stats-r/setup.html). 
-> Next, let's Load the `Prostate` dataset as follows: +> Next, let's Load the `prostate` dataset as follows: > > ```{r prostate} > library("here") -> Prostate <- readRDS(here("data/prostate.rds")) +> prostate <- readRDS(here("data/prostate.rds")) > ``` > > Examine the dataset (in which each row represents a single patient) to: @@ -142,21 +142,21 @@ of the challenges we are facing when working with high-dimensional data. > > > > > > ```{r dim-prostate, eval = FALSE} -> > dim(Prostate) #print the number of rows and columns +> > dim(prostate) #print the number of rows and columns > > ``` > > > > ```{r head-prostate, eval = FALSE} -> > names(Prostate) # examine the variable names -> > head(Prostate) #print the first 6 rows +> > names(prostate) # examine the variable names +> > head(prostate) #print the first 6 rows > > ``` > > > > ```{r pairs-prostate} -> > names(Prostate) #examine column names +> > names(prostate) #examine column names > > -> > pairs(Prostate) #plot each pair of variables against each other +> > pairs(prostate) #plot each pair of variables against each other > > ``` > > The `pairs()` function plots relationships between each of the variables in -> > the `Prostate` dataset. This is possible for datasets with smaller numbers +> > the `prostate` dataset. This is possible for datasets with smaller numbers > > of variables, but for datasets in which $p$ is larger it becomes difficult > > (and time consuming) to visualise relationships between all variables in the > > dataset. Even where visualisation is possible, fitting models to datasets @@ -211,7 +211,7 @@ explore why high correlations might be an issue in a Challenge. > ## Challenge 3 > > Use the `cor()` function to examine correlations between all variables in the -> `Prostate` dataset. Are some pairs of variables highly correlated using a threshold of +> `prostate` dataset. Are some pairs of variables highly correlated using a threshold of > 0.75 for the correlation coefficients? > > Use the `lm()` function to fit univariate regression models to predict patient @@ -224,11 +224,11 @@ explore why high correlations might be an issue in a Challenge. > > > ## Solution > > -> > Create a correlation matrix of all variables in the Prostate dataset +> > Create a correlation matrix of all variables in the `prostate` dataset > > > > ```{r cor-prostate} -> > cor(Prostate) -> > round(cor(Prostate), 2) # rounding helps to visualise the correlations +> > cor(prostate) +> > round(cor(prostate), 2) # rounding helps to visualise the correlations > > ``` > > > > As seen above, some variables are highly correlated. In particular, the @@ -238,15 +238,15 @@ explore why high correlations might be an issue in a Challenge. > > as predictors. > > > > ```{r univariate-prostate} -> > model1 <- lm(age ~ gleason, data = Prostate) -> > model2 <- lm(age ~ pgg45, data = Prostate) +> > model_gleason <- lm(age ~ gleason, data = prostate) +> > model_pgg45 <- lm(age ~ pgg45, data = prostate) > > ``` > > > > Check which covariates have a significant efffect > > > > ```{r summary-prostate} -> > summary(model1) -> > summary(model2) +> > summary(model_gleason) +> > summary(model_pgg45) > > ``` > > > > Based on these results we conclude that both `gleason` and `pgg45` have a @@ -257,8 +257,8 @@ explore why high correlations might be an issue in a Challenge. 
> > as predictors > > > > ```{r multivariate-prostate} -> > model3 <- lm(age ~ gleason + pgg45, data = Prostate) -> > summary(model3) +> > model_multivar <- lm(age ~ gleason + pgg45, data = prostate) +> > summary(model_multivar) > > ``` > > > > Although `gleason` and `pgg45` have statistically significant univariate effects, @@ -298,7 +298,7 @@ In this course, we will cover four methods that help in dealing with high-dimens (3) dimensionality reduction, and (4) clustering. Here are some examples of when each of these approaches may be used: -(1) Regression with numerous outcomes refers to situations in which there are +1. Regression with numerous outcomes refers to situations in which there are many variables of a similar kind (expression values for many genes, methylation levels for many sites in the genome) and when one is interested in assessing whether these variables are associated with a specific covariate of interest, @@ -308,7 +308,7 @@ predictor) could be fitted independently. In the context of high-dimensional molecular data, a typical example are *differential gene expression* analyses. We will explore this type of analysis in the *Regression with many outcomes* episode. -(2) Regularisation (also known as *regularised regression* or *penalised regression*) +2. Regularisation (also known as *regularised regression* or *penalised regression*) is typically used to fit regression models when there is a single outcome variable or interest but the number of potential predictors is large, e.g. there are more predictors than observations. Regularisation can help to prevent @@ -318,14 +318,14 @@ been often used when building *epigenetic clocks*, where methylation values across several thousands of genomic sites are used to predict chronological age. We will explore this in more detail in the *Regularised regression* episode. -(3) Dimensionality reduction is commonly used on high-dimensional datasets for +3. Dimensionality reduction is commonly used on high-dimensional datasets for data exploration or as a preprocessing step prior to other downstream analyses. For instance, a low-dimensional visualisation of a gene expression dataset may be used to inform *quality control* steps (e.g. are there any anomalous samples?). This course contains two episodes that explore dimensionality reduction techniques: *Principal component analysis* and *Factor analysis*. -(4) Clustering methods can be used to identify potential grouping patterns +4. Clustering methods can be used to identify potential grouping patterns within a dataset. A popular example is the *identification of distinct cell types* through clustering cells with similar gene expression patterns. The *K-means* episode will explore a specific method to perform clustering analysis. diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 2097bac7..b4ca084b 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -621,7 +621,7 @@ head(design_age) > that minimises the differences between outcome values and those values > predicted by using the covariates (or predictor variables). But how do we get > from a set of predictors and regression coefficients to predicted values? This -> is done via matrix multipliciation. The matrix of predictors is (matrix) +> is done via matrix multiplication. The matrix of predictors is (matrix) > multiplied by the vector of coefficients. 
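As a rough sketch of this multiplication on a small simulated example (the variable names and values here are invented for illustration, not taken from the methylation data):

```r
# Simulated sketch: predicted values come from multiplying the matrix of
# predictors by the vector of estimated coefficients.
set.seed(1)
age <- rnorm(10, mean = 50, sd = 5)                # a single made-up covariate
methyl_value <- 0.02 * age + rnorm(10, sd = 0.1)   # a made-up outcome
fit <- lm(methyl_value ~ age)

X <- cbind(intercept = 1, age = age)   # matrix of predictors, with a column of ones
beta <- coef(fit)                      # estimated coefficients
head(X %*% beta)                       # matches head(fitted(fit))
```

With more covariates, `X` simply gains extra columns, but the multiplication works in exactly the same way.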
That matrix is called the > **model matrix** (or design matrix). It has one row for each observation and > one column for each predictor plus (by default) one aditional column of ones @@ -669,8 +669,7 @@ of the input matrix. ```{r ebayes-toptab} toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age)) -orderEffSize <- rev(order(abs(toptab_age$logFC))) # order by effect size (absolute log-fold change) -head(toptab_age[orderEffSize, ]) +head(toptab_age) ``` The output of `topTable` includes the coefficient, here termed a log From eaa445ecc9289f95c06e2463bd4d2779ffd7a204 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Thu, 21 Mar 2024 17:04:24 +0000 Subject: [PATCH 110/119] edit figure caption 1 --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 65fded0b..18aef6b0 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -348,7 +348,7 @@ In addition to visualising cluster identity in scatter plots, it is also possibl highlight branches in dentrograms. In this example, we calculate a distance matrix between samples in the `methyl_mat` dataset. We then draw boxes round clusters obtained with `cutree`. -```{r plot-clust-method, fig.cap="Dendogram with boxes around clusters.", fig.alt="A dendogram for the methyl_mat data with boxes overlain on clusters. There are 5 clusters in total, each delineating a separate cluster."} +```{r plot-clust-method, fig.cap="Dendogram with boxes around clusters.", fig.alt="A dendogram for the methyl_mat data with boxes overlaid on clusters. There are 5 boxes in total, each indicating separate clusters."} ## create a distance matrix using euclidean method distmat <- dist(methyl_mat) ## hierarchical clustering using complete method From f58a447ae3654639735a6b5d1f2032d3ea660af8 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Thu, 21 Mar 2024 17:05:56 +0000 Subject: [PATCH 111/119] edit heatmap caption --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 18aef6b0..66531dfc 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -119,7 +119,7 @@ Looking at a heatmap of these data, we may spot some patterns -- many columns appear to have a similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it's hard to tell how many line up exactly. -```{r heatmap-noclust, echo=FALSE, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of individual versus methylation sides, coloured by methylation level. Red delineates high methylation levels (up to around 4), blue delineates low methylation levels (to around -4) and white delineates methylation levels close to zero. There are many vertical blue and red stripes."} +```{r heatmap-noclust, echo=FALSE, fig.cap="Heatmap of methylation data.", fig.alt="Heatmap of methylation level with individuals along the y axis and methylation sites along the x axis. Red colours indicate high methylation levels (up to around 4), blue colours indicate low methylation levels (to around -4) and white indicates methylation levels close to zero. 
There are many vertical blue and red stripes."} Heatmap(methyl_mat, name = "Methylation level", From 8aea374c1d404f84849479a652a8a2e07d767a9f Mon Sep 17 00:00:00 2001 From: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> Date: Fri, 22 Mar 2024 09:53:18 +0000 Subject: [PATCH 112/119] clarify pcs to keep Co-authored-by: Mary Llewellyn --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index ac6aae4f..8c80c811 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -108,7 +108,7 @@ resulting principal component could also be used as an effect in further analysi # Principal component analysis PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability -in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed into principal components, we can extract a subset of the principal components in order of the variance they explain (starting with the first principal component that by definition explains the most variability, and then the second), giving new variables that explain a lot of the variability in the original dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left From 69e2f1d4e744ea5597929a796cbe05457fc116cf Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 22 Mar 2024 10:03:08 +0000 Subject: [PATCH 113/119] clarify individual scores relationship Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com> --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 8c80c811..6cc67786 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -112,7 +112,7 @@ in the underlying dataset. The second principal component derived explains the s To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left -fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points for each individual observation called _scores_. +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points, one for each individual observation called _scores_. 
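A minimal sketch of how such scores can be computed in R (using simulated values in place of the farm data, which we do not load here):

```r
# Simulated sketch: two correlated variables measured on 50 farms,
# giving one score per farm for each principal component.
set.seed(1)
fallow <- runif(50, min = 5, max = 40)               # invented % area left fallow
biodiversity <- 0.1 * fallow + rnorm(50, sd = 0.5)   # invented biodiversity index
pca <- prcomp(cbind(fallow, biodiversity), center = TRUE, scale. = TRUE)
head(pca$x)    # the scores: one row per farm, one column per principal component
pca$sdev^2     # the variance of (i.e. explained by) each principal component
```

Here `scale. = TRUE` standardises the two variables first, which matters when variables are measured on very different scales.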
The red line on the plot represents the line passing through the scores (points) of the first principal component for each observation. The angle that the first principal component line passes through the data points at is set to the direction with the highest variability. The plotted first principal components can therefore be thought of reflecting the From 93c42e3a2387a14f4d0055a5b6e0798d0e66ceaa Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 14:58:24 +0000 Subject: [PATCH 114/119] distance to dissimilarity --- _episodes_rmd/07-hierarchical.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 188a66cc..e597f3cb 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -559,7 +559,7 @@ are grouped together, while `sample_b` is seen as distant because it has a different pattern, even though its values are closer to `sample_a`. Using your own distance function is often useful, especially if you have missing or unusual data. It's often possible to use correlation and other custom -distance functions in functions that perform hierarchical clustering, such as +dissimilarity measures in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: ```{r heatmap-cor-cor-example, fig.cap="Heatmaps of features versus samples clustered in the samples according to correlation.", fig.alt="Heatmaps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} From fe13a414e8acefa0c913149a3bc7992ea21edc95 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:30:37 +0000 Subject: [PATCH 115/119] add full stop Co-authored-by: Alan O'Callaghan --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 60b7cca7..390f68c9 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -80,7 +80,7 @@ and this is discussed below. Once we have picked intitial points, we then follow these two steps until appropriate clusters have been formed: 1. Assign each data point to the cluster with the closest centroid -2. Update centroid positions as the average of the points in that cluster +2. Update centroid positions as the average of the points in that cluster. We can see this process in action in this animation: From 7f7b8ffc8ecdc15249121c8e270b16b4e0929b15 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:31:40 +0000 Subject: [PATCH 116/119] typo fix adn --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 390f68c9..2fd2607f 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -69,7 +69,7 @@ so that we become more confident about the shape and size of the clusters. user-defined number of distinct, non-overlapping clusters. To create clusters of 'similar' data points, K-means clustering creates clusters that minimise the -within-cluster variation adn thus the amount that +within-cluster variation and thus the amount that data points within a cluster differ from each other. 
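As a brief sketch of this idea on simulated data (the points and the number of clusters are invented for illustration):

```r
# Simulated sketch: K-means tries to minimise the total within-cluster
# variation, reported by kmeans() as tot.withinss.
set.seed(1)
points <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                matrix(rnorm(100, mean = 3), ncol = 2))
fit <- kmeans(points, centers = 2, nstart = 10)
fit$cluster       # cluster assignment for each data point
fit$withinss      # within-cluster sum of squares, one value per cluster
fit$tot.withinss  # the total within-cluster variation being minimised
```

Re-running this with a different seed can give different assignments, since the initial cluster centres are chosen randomly.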
The distance between data points within a cluster is used as a measure of within-cluster variation. From ee8917a682c861a54c6e3637f0f63eeaf2b13901 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:32:32 +0000 Subject: [PATCH 117/119] namely to particularly --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 2fd2607f..071c5e7e 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -88,7 +88,7 @@ We can see this process in action in this animation: knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and -to understand), it does have some disadvantages, namely difficulties in identifying +to understand), it does have some disadvantages, particularly difficulties in identifying initial clusters which observations belong to and the need for the user to specifiy the number of clusters that the data should be partitioned into. From 67a85aa860213df33559d419b7168a08846e9156 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 25 Mar 2024 15:33:08 +0000 Subject: [PATCH 118/119] specifiy typo fix --- _episodes_rmd/06-k-means.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index 071c5e7e..1ea085e5 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -89,7 +89,7 @@ knitr::include_graphics("../fig/kmeans.gif") ``` While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, particularly difficulties in identifying -initial clusters which observations belong to and the need for the user to specifiy the +initial clusters which observations belong to and the need for the user to specify the number of clusters that the data should be partitioned into. > ## Initialisation From d831ddb23707a49001a1606516d4d45f40782860 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 25 Mar 2024 15:39:02 +0000 Subject: [PATCH 119/119] Spacing --- _episodes_rmd/07-hierarchical.Rmd | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index 04bb8caa..e35a4329 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -295,8 +295,6 @@ we can count the vertical lines we encounter crossing the horizontal cut. For example, a cut at height 10 produces 2 downstream clusters for the dendogram in Challenge 1, while a cut at height 4 produces 6 downstream clusters. - - # Dendogram visualisation We can first visualise cluster membership by highlight branches in dendograms. @@ -330,6 +328,7 @@ plot(color_branches(avg_dend_obj, h = 50)) ``` ## Numerical visualisation + In addition to visualising clusters directly on the dendogram, we can cut the dendrogram to determine number of clusters at different heights using `cutree()`. This function cuts a dendrogram into several