From c55220044e2f1fb6b87511a7909b3e55b1162179 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:18:24 +0000 Subject: [PATCH 01/44] remove "imagine", task 1 more consistent with high-dim data definition at the start also --- _episodes_rmd/04-principal-component-analysis.Rmd | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index a1d54d0c..26353334 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -46,12 +46,11 @@ knitr_fig_path("05-") # Introduction -Imagine a dataset which contains many variables ($p$), close to the total number -of rows in the dataset ($n$). Some of these variables are highly correlated and -several form groups which you might expect to represent the same overall effect. -Such datasets are challenging to analyse for several reasons, with the main -problem being how to reduce dimensionality in the dataset while retaining the -important features. +If a dataset contains many variables ($p$), it is likely that some of these +variables will be highly correlated. Variables may even be so highly correlated +that they represent the same overall effect. Such datasets are challenging +to analyse for several reasons, with the main problem being how to reduce +dimensionality in the dataset while retaining the important features. In this episode we will explore *principal component analysis* (PCA) as a popular method of analysing high-dimensional data. PCA is an unsupervised From 6c8b768f09f006ca6dfebc8954298e1a199ebfb7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:30:37 +0000 Subject: [PATCH 02/44] rewording end of introduction, task 2 --- _episodes_rmd/04-principal-component-analysis.Rmd | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 26353334..5493956a 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -56,13 +56,13 @@ In this episode we will explore *principal component analysis* (PCA) as a popular method of analysing high-dimensional data. PCA is an unsupervised statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that -explain most of the variability in the original dataset. This is useful, -for example, during initial data exploration as it allows correlations among -data points to be observed and principal components to be calculated for -inclusion in further analysis (e.g. linear regression). An example of PCA might -be reducing several variables representing aspects of patient health -(blood pressure, heart rate, respiratory rate) into a single feature. - +explain most of the variability in the original dataset. As an example, +PCA might reduce several variables representing aspects of patient health +(blood pressure, heart rate, respiratory rate) into a single feature capturing +an overarching "patient health" effect. This is useful from an exploratory point +of view, discovering how variables might be associated and combined, but the +associated principal component can also be used as an effect in further analysis +(e.g. linear regression). 
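The "patient health" example in the rewritten introduction can be made concrete with a small sketch. This is illustration only, not lesson code: the data are simulated and the variable names (`bp`, `hr`, `resp`) are invented here.

```r
# Simulated illustration: three correlated vital signs driven by one
# latent "patient health" effect, summarised by a single principal component.
set.seed(1)
health <- rnorm(100)                                  # latent effect
vitals <- data.frame(
  bp   = 120 + 10 * health + rnorm(100, sd = 3),      # blood pressure
  hr   =  70 +  5 * health + rnorm(100, sd = 3),      # heart rate
  resp =  16 +  2 * health + rnorm(100, sd = 1)       # respiratory rate
)
pca <- prcomp(vitals, center = TRUE, scale = TRUE)
summary(pca)             # PC1 captures most of the shared variability
cor(pca$x[, 1], health)  # PC1 tracks the latent effect (up to sign)
```

Because the three measurements share one underlying source of variation, the first principal component absorbs most of the variance on its own.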
From 87e86664b00b04dcf1197e8da2ff02af2e553001 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:31:36 +0000 Subject: [PATCH 03/44] minor edits to end of introduction, task 2 --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 5493956a..7eee5cdb 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -60,8 +60,8 @@ explain most of the variability in the original dataset. As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect. This is useful from an exploratory point -of view, discovering how variables might be associated and combined, but the -associated principal component can also be used as an effect in further analysis +of view, discovering how variables might be associated and combined. The the +associated principal component could also be used as an effect in further analysis (e.g. linear regression). From e0e1118b513b2f497de97a7f1ec8fc540136cba1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:36:56 +0000 Subject: [PATCH 04/44] remove all reference to supervised learning It is referenced in the callout after its first and only mention in the introduction (and not referenced in any other episodes as far as I can tell). I think it's probably not necessary and maybe even distracting from a cognitive overload point of view. If we want to integrate this with some ML jargon, I would suggest it's included much earlier in the episodes and the terminology is used throughout. --- .../04-principal-component-analysis.Rmd | 24 +------------------ 1 file changed, 1 insertion(+), 23 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 7eee5cdb..fcb9e107 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -53,7 +53,7 @@ to analyse for several reasons, with the main problem being how to reduce dimensionality in the dataset while retaining the important features. In this episode we will explore *principal component analysis* (PCA) as a -popular method of analysing high-dimensional data. PCA is an unsupervised +popular method of analysing high-dimensional data. PCA is a statistical method which allows large datasets of correlated variables to be summarised into smaller numbers of uncorrelated principal components that explain most of the variability in the original dataset. As an example, @@ -90,28 +90,6 @@ Disadvantages: regression). -> ## Supervised vs unsupervised learning -> Most statistical problems fall into one of two categories: supervised or -> unsupervised learning. -> Examples of supervised learning problems include linear regression and include -> analyses in which each observation has both at least one independent variable -> ($x$) as well as a dependent variable ($y$). In supervised learning problems -> the aim is to predict the value of the response given future observations or -> to understand the relationship between the dependent variable and the -> predictors. In unsupervised learning for each observation there is no -> dependent variable ($y$), but only -> a series of independent variables. 
In this situation there is no need for -> prediction, as there is no dependent variable to predict (hence the analysis -> can be thought as being unsupervised by the dependent variable). Instead -> statistical analysis can be used to understand relationships between the -> independent variables or between observations themselves. Unsupervised -> learning problems often occur when analysing high-dimensional datasets in -> which there is no obvious dependent variable to be -> predicted, but the analyst would like to understand more about patterns -> between groups of observations or reduce dimensionality so that a supervised -> learning process may be used. -{: .callout} - > ## Challenge 1 > From 8515fc7a42552d2da72e0237f8199c13e2a3046c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 1 Mar 2024 09:39:30 +0000 Subject: [PATCH 05/44] move advantages and disadvantages to after description of PCA, task 3 propose that this is difficult to understand without first really understanding what PCA is --- .../04-principal-component-analysis.Rmd | 50 +++++++++---------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index fcb9e107..33acc5f2 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -66,31 +66,6 @@ associated principal component could also be used as an effect in further analys -# Advantages and disadvantages of PCA - -Advantages: -* It is a relatively easy to use and popular method. -* Various software/packages are available to run a PCA. -* The calculations used in a PCA are easy to understand for statisticians and - non-statisticians alike. - -Disadvantages: -* It assumes that variables in a dataset are correlated. -* It is sensitive to the scale at which input variables are measured. - If input variables are measured at different scales, the variables - with large variance relative to the scale of measurement will have - greater impact on the principal components relative to variables with smaller - variance. In many cases, this is not desirable. -* It is not robust against outliers, meaning that very large or small data - points can have a large effect on the output of the PCA. -* PCA assumes a linear relationship between variables which is not always a - realistic assumption. -* It can be difficult to interpret the meaning of the principal components, - especially when including them in further analyses (e.g. inclusion in a linear - regression). - - - > ## Challenge 1 > > Descriptions of three datasets and research questions are given below. For @@ -439,6 +414,31 @@ depending on the PCA implementation you use. Here are some examples: +# Advantages and disadvantages of PCA + +Advantages: +* It is a relatively easy to use and popular method. +* Various software/packages are available to run a PCA. +* The calculations used in a PCA are easy to understand for statisticians and + non-statisticians alike. + +Disadvantages: +* It assumes that variables in a dataset are correlated. +* It is sensitive to the scale at which input variables are measured. + If input variables are measured at different scales, the variables + with large variance relative to the scale of measurement will have + greater impact on the principal components relative to variables with smaller + variance. In many cases, this is not desirable. 
+* It is not robust against outliers, meaning that very large or small data + points can have a large effect on the output of the PCA. +* PCA assumes a linear relationship between variables which is not always a + realistic assumption. +* It can be difficult to interpret the meaning of the principal components, + especially when including them in further analyses (e.g. inclusion in a linear + regression). + + + # Using PCA to analyse gene expression data In this section you will carry out your own PCA using the Bioconductor package **`PCAtools`** From 7ecc5f2f3011d49bd690dd70144fa0048a455e4d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:06:46 +0000 Subject: [PATCH 06/44] edit advantages and disadvantages, task 4 make PCA being easy statement relative, rather than saying it's generally easy for everyone --- _episodes_rmd/04-principal-component-analysis.Rmd | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 109236f3..37aee14d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -417,8 +417,7 @@ depending on the PCA implementation you use. Here are some examples: Advantages: * It is a relatively easy to use and popular method. * Various software/packages are available to run a PCA. -* The calculations used in a PCA are easy to understand for statisticians and - non-statisticians alike. +* The calculations used in a PCA are simple to understand compared to other methods for dimension reduction. Disadvantages: * It assumes that variables in a dataset are correlated. From 576d05636d1684faea4831172b72dcda951426eb Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:08:38 +0000 Subject: [PATCH 07/44] edit PCA section title I think this should be called PCA for signposting that this presents the whole method essentially --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 37aee14d..0eda584e 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -105,7 +105,7 @@ associated principal component could also be used as an effect in further analys {: .challenge} -# What is a principal component? 
+# Principal component analysis ```{r, eval=FALSE, echo=FALSE} From 9db44fa3f41ab25f6323623d1dede1a152f34f09 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:09:01 +0000 Subject: [PATCH 08/44] complete task 5 --- _episodes_rmd/04-principal-component-analysis.Rmd | 1 - 1 file changed, 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 0eda584e..aed0ef20 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -107,7 +107,6 @@ associated principal component could also be used as an effect in further analys # Principal component analysis - ```{r, eval=FALSE, echo=FALSE} # A PCA is carried out by calculating a matrix of Pearson's correlations from # the original dataset which shows how each of the variables in the dataset From ccd9762946d3905110f5a39a5544bd86fe75e7a9 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:17:16 +0000 Subject: [PATCH 09/44] describe what a principal component is first, task 5 --- _episodes_rmd/04-principal-component-analysis.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index aed0ef20..6b1bb2e0 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,6 +113,7 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` +PCA transforms data to new uncorrelated variables called "principal components". The first principal component is the direction of the data along which the observations vary the most. The second principal component is the direction of the data along which the observations show the next highest amount of variation. From 930cd0b1b487855bc9b748b36133d07ae51b4ce1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:43:13 +0000 Subject: [PATCH 10/44] reorder, rewrite description of pca and remove repetition --- .../04-principal-component-analysis.Rmd | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 6b1bb2e0..438d7a6e 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -114,10 +114,16 @@ associated principal component could also be used as an effect in further analys ``` PCA transforms data to new uncorrelated variables called "principal components". -The first principal component is the direction of the data along which the -observations vary the most. The second principal component is the direction of -the data along which the observations show the next highest amount of variation. -For example, Figure 1 shows biodiversity index versus percentage area left +Each principal component is a linear combination of the variables in the data +set. The first principal component is the direction of the data along which the +observations vary the most. In other words, the first principal component +explains the largest amount of the variability in the underlying dataset. +The second principal component is the direction of +the data along which the observations show the next highest amount of variation +(and explains the second largest amount of variability in the dataset). 
+ + +Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which there is greatest variability in the data. Projecting points onto this line @@ -126,11 +132,6 @@ vector of points with the greatest possible variance. The next highest amount of variability in the data is represented by the line perpendicular to first regression line which represents the second principal component (green line). -The second principal component is a linear combination of the variables that -is uncorrelated with the first principal component. There are as many principal -components as there are variables in your dataset, but as we'll see, some are -more useful at explaining your data than others. By definition, the first -principal component explains more variation than other principal components. ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) From 84eacc8e1ca0204af3bcce1b5416b9cd775dcf15 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:49:12 +0000 Subject: [PATCH 11/44] avoid talking about projections This may need further editing as I'm not sure it's easy to understand yet --- _episodes_rmd/04-principal-component-analysis.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 438d7a6e..dc4fdf0d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -126,11 +126,11 @@ the data along which the observations show the next highest amount of variation Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component direction of the data, which is the direction along which -there is greatest variability in the data. Projecting points onto this line -(i.e. by finding the location on the line closest to the point) would give a -vector of points with the greatest possible variance. The next highest amount +there is greatest variability in the data. Finding the location on the line +closest to a given data point would yield a vector of points with the +greatest possible variance. The next highest amount of variability in the data is represented by the line perpendicular to first -regression line which represents the second principal component (green line). +regression line, which represents the uncorrelated second principal component (green line). ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} From 4385f8a292f7abbf5656ae9908d2340bb42cfe35 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 12:59:13 +0000 Subject: [PATCH 12/44] mathematical description sooner and link to initial description, task 6 --- .../04-principal-component-analysis.Rmd | 24 ++++++++++++------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index dc4fdf0d..3c246bff 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -120,8 +120,20 @@ observations vary the most. 
In other words, the first principal component explains the largest amount of the variability in the underlying dataset. The second principal component is the direction of the data along which the observations show the next highest amount of variation -(and explains the second largest amount of variability in the dataset). +(and explains the second largest amount of variability in the dataset), and so on. +Mathematically, the first principal component values or _scores_, $Z_1$, are a linear combination +of variables in the dataset, $X_1...X_p$: + +$$ + Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, +$$ + +where $a_{11}...a_{p1}$ represent principal component _loadings_, +which can be thought of as the degree to +which each variable contributes to the calculation of the principal component. +The values of $a_{11}...a_{p1}$ are found so that the principal component (scores), $Z_1$, +explain most of the variation in the dataset. Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first @@ -129,7 +141,7 @@ principal component direction of the data, which is the direction along which there is greatest variability in the data. Finding the location on the line closest to a given data point would yield a vector of points with the greatest possible variance. The next highest amount -of variability in the data is represented by the line perpendicular to first +of variability in the data is represented by the line perpendicular to the first regression line, which represents the uncorrelated second principal component (green line). @@ -153,15 +165,9 @@ knitr::include_graphics("../fig/pendulum.gif") ``` -The first principal component's scores ($Z_1$) are calculated using the equation: -$$ - Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p -$$ -$X_1...X_p$ represents variables in the original dataset and $a_{11}...a_{p1}$ -represent principal component loadings, which can be thought of as the degree to -which each variable contributes to the calculation of the principal component. + We will come back to principal component scores and loadings further below. # How do we perform a PCA? From 5709dec112677046f2ca282d39a59ccc26582643 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:10:26 +0000 Subject: [PATCH 13/44] simplify description of pca --- .../04-principal-component-analysis.Rmd | 21 ++++++++----------- 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 3c246bff..aa6ba6cd 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -131,18 +131,17 @@ $$ where $a_{11}...a_{p1}$ represent principal component _loadings_, which can be thought of as the degree to -which each variable contributes to the calculation of the principal component. -The values of $a_{11}...a_{p1}$ are found so that the principal component (scores), $Z_1$, -explain most of the variation in the dataset. +which each variable contributes to the calculation of the principal component scores. +The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, +explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. 
+To see what these new principal component variables may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first -principal component direction of the data, which is the direction along which -there is greatest variability in the data. Finding the location on the line -closest to a given data point would yield a vector of points with the -greatest possible variance. The next highest amount -of variability in the data is represented by the line perpendicular to the first -regression line, which represents the uncorrelated second principal component (green line). +principal component scores, which pass through the points with the greatest +variability. The points along this line give the first principal component scores. +The second principal component scores explain the next highest amount of variability +in the data and are represented by the line perpendicular to the first (green line). ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} @@ -150,7 +149,7 @@ regression line, which represents the uncorrelated second principal component (g knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") ``` -The animation below illustrates how principal components are calculated from +The animation below illustrates how principal components are calculated iteratively from data. You can imagine that the black line is a rod and each red dashed line is a spring. The energy of each spring is proportional to its squared length. The direction of the first principal component is the one that minimises the total @@ -166,8 +165,6 @@ knitr::include_graphics("../fig/pendulum.gif") - - We will come back to principal component scores and loadings further below. # How do we perform a PCA? From baee36e30656fc2b32a26daa2e2cf64810497fc1 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:14:25 +0000 Subject: [PATCH 14/44] use "here" for file paths --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index aa6ba6cd..07749a02 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -146,7 +146,7 @@ in the data and are represented by the line perpendicular to the first (green li ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) -knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") +knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png")) ``` The animation below illustrates how principal components are calculated iteratively from @@ -160,7 +160,7 @@ principal component. This is explained in more detail on [this Q&A website](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues). 
```{r pendulum, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics("../fig/pendulum.gif") +knitr::include_graphics(here("fig/pendulum.gif")) ``` @@ -470,7 +470,7 @@ associated metadata, downloaded from the ```{r se} library("SummarizedExperiment") -cancer <- readRDS(here::here("data/cancer_expression.rds")) +cancer <- readRDS(here("data/cancer_expression.rds") mat <- assay(cancer) metadata <- colData(cancer) ``` From 6f077ce512825ed77f6f2718b0878210529a325c Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:15:05 +0000 Subject: [PATCH 15/44] add close bracket for file paths --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 07749a02..627f914d 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -470,7 +470,7 @@ associated metadata, downloaded from the ```{r se} library("SummarizedExperiment") -cancer <- readRDS(here("data/cancer_expression.rds") +cancer <- readRDS(here("data/cancer_expression.rds")) mat <- assay(cancer) metadata <- colData(cancer) ``` From a31c0047341119486bd7a4244735e3531296bea7 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:22:29 +0000 Subject: [PATCH 16/44] remove all mention of directions in pca description I think talk of projections/directions is slightly confusing and possibly unnecessary here --- .../04-principal-component-analysis.Rmd | 24 ++++++++----------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 627f914d..e1d183b1 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,25 +113,21 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms data to new uncorrelated variables called "principal components". -Each principal component is a linear combination of the variables in the data -set. The first principal component is the direction of the data along which the -observations vary the most. In other words, the first principal component -explains the largest amount of the variability in the underlying dataset. -The second principal component is the direction of -the data along which the observations show the next highest amount of variation -(and explains the second largest amount of variability in the dataset), and so on. - -Mathematically, the first principal component values or _scores_, $Z_1$, are a linear combination -of variables in the dataset, $X_1...X_p$: +PCA transforms a dataset into a new set of uncorrelated variables called "principal components". +The first principal component is derived to explain the largest amount of the variability +in the underlying dataset. The second principal component is derived to explain the second largest amount of variability in the dataset, and so on. + +Mathematically, each principal component is a linear combination of the variables in the data +set. 
That is, the first principal component values or _scores_, $Z_1$, are a linear combination +of variables in the dataset, $X_1...X_p$, given by $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -where $a_{11}...a_{p1}$ represent principal component _loadings_, -which can be thought of as the degree to -which each variable contributes to the calculation of the principal component scores. +where $a_{11}...a_{p1}$ represent principal component _loadings_. These loadings can +be thought of as the degree to which each original variable contributes to +the calculation of the principal component scores. The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. From 375f92bae3196afcfb6bff0704fa76e1d9a14c87 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:26:10 +0000 Subject: [PATCH 17/44] add foreshadowing --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index e1d183b1..c498ff23 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -159,9 +159,7 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics(here("fig/pendulum.gif")) ``` - - -We will come back to principal component scores and loadings further below. +In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. # How do we perform a PCA? From 24101a439c246392833c685082d9e9b37b39bd31 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:27:02 +0000 Subject: [PATCH 18/44] remove "The the" --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c498ff23..60a2a520 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -60,7 +60,7 @@ explain most of the variability in the original dataset. As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect. This is useful from an exploratory point -of view, discovering how variables might be associated and combined. The the +of view, discovering how variables might be associated and combined. The associated principal component could also be used as an effect in further analysis (e.g. linear regression). 
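The linear combination $Z_1 = a_{11}X_1 + a_{21}X_2 + ... + a_{p1}X_p$ written out in the patches above can be checked numerically. A minimal sketch on simulated data (not lesson code), relying on the fact that `prcomp()` stores the loadings in `$rotation` and the scores in `$x`:

```r
# Sketch: scores are linear combinations of the standardised variables,
# with the loadings as the coefficients.
set.seed(42)
X <- matrix(rnorm(50 * 3), ncol = 3)          # toy data: 50 rows, 3 variables
pca <- prcomp(X, center = TRUE, scale = TRUE)
loadings <- pca$rotation                      # columns hold a_1, ..., a_p
Z <- scale(X) %*% loadings                    # Z_1 = a_11*X_1 + ... + a_p1*X_p, etc.
all.equal(unname(Z), unname(pca$x))           # TRUE: reproduces prcomp's scores
```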
From 0fad8019abc03b1062545ceb60a6a2fc1233a7df Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:37:57 +0000 Subject: [PATCH 19/44] separate mathematical description I don't think it's necessary if focusing on practical description (not used later as far as I can see) --- .../04-principal-component-analysis.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 60a2a520..c65c25b7 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,10 +113,10 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms a dataset into a new set of uncorrelated variables called "principal components". -The first principal component is derived to explain the largest amount of the variability -in the underlying dataset. The second principal component is derived to explain the second largest amount of variability in the dataset, and so on. +PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. +callout Mathematically, each principal component is a linear combination of the variables in the data set. That is, the first principal component values or _scores_, $Z_1$, are a linear combination of variables in the dataset, $X_1...X_p$, given by @@ -125,13 +125,13 @@ $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -where $a_{11}...a_{p1}$ represent principal component _loadings_. These loadings can +where $a_{11}...a_{p1}$ represent principal component _loadings_. + +In summary, the principal components values are called _scores_. The loadings can be thought of as the degree to which each original variable contributes to -the calculation of the principal component scores. -The values of $a_{11}...a_{p1}$ are found so that the principal component scores, $Z_1$, -explain most of the variation in the dataset. Once we have calculated the principal component scores by finding the loadings, we can use them as new variables. +the principal component scores. -To see what these new principal component variables may look like, +To see what these new principal component variables (scores) may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. 
The red line represents the first principal component scores, which pass through the points with the greatest From 7babb16f4749c244221cf1c8f464ac7b43615a5a Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:41:23 +0000 Subject: [PATCH 20/44] change mathematical description to callout --- .../04-principal-component-analysis.Rmd | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c65c25b7..5f3d2ba9 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -116,16 +116,14 @@ associated principal component could also be used as an effect in further analys PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. -callout -Mathematically, each principal component is a linear combination of the variables in the data -set. That is, the first principal component values or _scores_, $Z_1$, are a linear combination -of variables in the dataset, $X_1...X_p$, given by - -$$ - Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, -$$ - -where $a_{11}...a_{p1}$ represent principal component _loadings_. +> ## Mathematical description of PCA +> Mathematically, each principal component is a linear combination +> of the variables in the dataset. That is, the first principal +> component values or _scores_, $Z_1$, are a linear combination +> of variables in the dataset, $X_1...X_p$, given by +> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ +> where $a_{11}...a_{p1}$ represent principal component _loadings_. +{: .callout} In summary, the principal components values are called _scores_. The loadings can be thought of as the degree to which each original variable contributes to From 10907425f411d978c6e6fbf12713e9b1b1bae221 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:44:50 +0000 Subject: [PATCH 21/44] remove comma from description of PCA --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 5f3d2ba9..d16199a7 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -114,7 +114,7 @@ associated principal component could also be used as an effect in further analys ``` PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability -in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset, and so on. 
Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. +in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. > ## Mathematical description of PCA > Mathematically, each principal component is a linear combination From 29d5c0bc7f4d0b2f3de34f220a13597b8d21b108 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:47:46 +0000 Subject: [PATCH 22/44] add justification for low dimensional dataset --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index d16199a7..c8c8cefa 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,9 +161,9 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -## A prostate cancer dataset +## Prostate cancer dataset -The `prostate` dataset represents data from 97 +To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. From c21187dd4d0704464baa2a1418e7ff7393e95b2d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:49:09 +0000 Subject: [PATCH 23/44] change title to Loadings and principal component scores already explained what they are and this section doesn't really explain what they are --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c8c8cefa..00059b69 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -326,7 +326,7 @@ explain >70% of variance in the data. But what do these two principal components mean? -## What are loadings and principal component scores? +## Loadings and principal component scores Most PCA functions will produce two main output matrices: the *principal component scores* and the *loadings*. 
The matrix of principal component scores From 0614955d3112778f2f81cd3f00a6eb30548d866b Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:51:00 +0000 Subject: [PATCH 24/44] remove prostate data set title if using this as a first example, task 7 and 8 --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 -- 1 file changed, 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 00059b69..bf127831 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,8 +161,6 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -## Prostate cancer dataset - To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of From 443adb6d3228791d62439350b42922a682c0a83e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:53:13 +0000 Subject: [PATCH 25/44] move information about continuous variables to early text, task 9 --- _episodes_rmd/04-principal-component-analysis.Rmd | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index bf127831..f0daa642 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -113,7 +113,7 @@ associated principal component could also be used as an effect in further analys # relate to each other. ``` -PCA transforms a dataset into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability +PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. > ## Mathematical description of PCA @@ -183,8 +183,7 @@ Here we will calculate principal component scores for each of the rows in this dataset, using five principal components (one for each variable included in the PCA). We will include five clinical variables in our PCA, each of the continuous variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. Standard PCAs are carried -out using continuous variables only. +representing clinical markers of cancer progression. 
First, we will examine the `prostate` dataset (originally part of the **`lasso2`** package): From fc6771df60b895a76517bef5295eeea95b1dee19 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 13:56:55 +0000 Subject: [PATCH 26/44] simplify example motivation --- _episodes_rmd/04-principal-component-analysis.Rmd | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index f0daa642..26bd4f85 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -179,11 +179,8 @@ Columns include: - `lpsa` (log-tranformed prostate specific antigen; level of PSA in blood). - `age` (patient age in years). -Here we will calculate principal component scores for each of the rows in this -dataset, using five principal components (one for each variable included in the -PCA). We will include five clinical variables in our PCA, each of the continuous -variables in the prostate dataset, so that we can create fewer variables -representing clinical markers of cancer progression. +We will perform PCA on the five continuous clinical variables in our dataset +so that we can create fewer variables representing clinical markers of cancer progression. First, we will examine the `prostate` dataset (originally part of the **`lasso2`** package): From 7e378baf60b5bdb9ddf827818c5c4c2db18afb3d Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:15:22 +0000 Subject: [PATCH 27/44] add reason for standardisation in this section, task 10 --- _episodes_rmd/04-principal-component-analysis.Rmd | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 26bd4f85..bbce63cf 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -161,7 +161,7 @@ In this episode, we will see how to perform PCA to summarise the information in # How do we perform a PCA? -To illustrate how to perform PCA initially, we start with a low dimensional dataset. The `prostate` dataset represents data from 97 +To illustrate how to perform PCA initially, we start with a low-dimensional dataset. The `prostate` dataset represents data from 97 men who have prostate cancer. The data come from a study which examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. @@ -205,7 +205,9 @@ head(pros2) ## Do we need to standardise the data? -Now we compare the variances between variables in the dataset. +Since PCA derives principal components based on the variance they explain in the data, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. + +For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. 
Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: ```{r var-hist, fig.cap="Caption", fig.cap="Alt"} apply(pros2, 2, var) @@ -216,8 +218,8 @@ hist(pros2$lbph, breaks = "FD") Note that variance is greatest for `lbph` and lowest for `lweight`. It is clear from this output that we need to scale each of these variables before including -them in a PCA analysis to ensure that differences in variances between variables -do not drive the calculation of principal components. In this example we +them in a PCA analysis to ensure that differences in variances +do not drive the calculation of principal components. In this example, we standardise all five variables to have a mean of 0 and a standard deviation of 1. From d02fde36a679aefe525e3e2bf975619015cad5b0 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:17:54 +0000 Subject: [PATCH 28/44] clarify why we have concluded that we need to scale, task 11 --- _episodes_rmd/04-principal-component-analysis.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index bbce63cf..c91787e4 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -205,7 +205,7 @@ head(pros2) ## Do we need to standardise the data? -Since PCA derives principal components based on the variance they explain in the data, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. +PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: @@ -216,9 +216,9 @@ hist(pros2$lweight, breaks = "FD") hist(pros2$lbph, breaks = "FD") ``` -Note that variance is greatest for `lbph` and lowest for `lweight`. It is clear -from this output that we need to scale each of these variables before including -them in a PCA analysis to ensure that differences in variances +Note that variance is greatest for `lbph` and lowest for `lweight`. Since we +want each of the variables to be treated equally in our PCA, but there are large differences in the variances of the variables, we need to scale each of the variables before including +them in a PCA to ensure that differences in variances do not drive the calculation of principal components. In this example, we standardise all five variables to have a mean of 0 and a standard deviation of 1. 
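The reason for scaling given in the patches above, that high-variance variables otherwise dominate, can be demonstrated directly. A sketch with made-up data (the variables `a`, `b` and `c` are invented for illustration and are not the prostate measurements):

```r
# Sketch: without scaling, the variable with the largest variance dominates
# PC1; after scaling, the three related variables contribute roughly equally.
set.seed(123)
n <- 100
latent <- rnorm(n)                            # shared underlying signal
dat <- data.frame(
  a = 10 * latent + rnorm(n, sd = 5),         # large variance
  b = latent + rnorm(n, sd = 0.5),            # moderate variance
  c = 0.1 * latent + rnorm(n, sd = 0.05)      # small variance
)
unscaled <- prcomp(dat, center = TRUE, scale = FALSE)
scaled   <- prcomp(dat, center = TRUE, scale = TRUE)
round(unscaled$rotation[, 1], 2)  # PC1 loads almost entirely on 'a'
round(scaled$rotation[, 1], 2)    # loadings are roughly equal in size
```

With `scale = FALSE` the loadings simply chase the largest raw variance; standardising first removes that dependence on the units of measurement.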
From 2ff628c371717d0be60f210b11d77556f490db75 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:20:38 +0000 Subject: [PATCH 29/44] standardise -> scale for consistency --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c91787e4..b6215dd1 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -203,11 +203,11 @@ pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) ``` -## Do we need to standardise the data? +## Do we need to scale the data? -PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Standardisation is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to standardise if we want variables with low variance to carry less weight in the PCA. +PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA. -For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to standardise our variables first: +For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first: ```{r var-hist, fig.cap="Caption", fig.cap="Alt"} apply(pros2, 2, var) From 5ba37e784272bfde9ba7db94441f515ae0de6cc4 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:23:01 +0000 Subject: [PATCH 30/44] swap back mathematical description --- .../04-principal-component-analysis.Rmd | 32 +++++++++---------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index b6215dd1..c1ea5855 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -116,25 +116,12 @@ associated principal component could also be used as an effect in further analys PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. 
Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. -> ## Mathematical description of PCA -> Mathematically, each principal component is a linear combination -> of the variables in the dataset. That is, the first principal -> component values or _scores_, $Z_1$, are a linear combination -> of variables in the dataset, $X_1...X_p$, given by -> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ -> where $a_{11}...a_{p1}$ represent principal component _loadings_. -{: .callout} - -In summary, the principal components values are called _scores_. The loadings can -be thought of as the degree to which each original variable contributes to -the principal component scores. - -To see what these new principal component variables (scores) may look like, +To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left fallow for 50 farms in southern England. The red line represents the first principal component scores, which pass through the points with the greatest -variability. The points along this line give the first principal component scores. -The second principal component scores explain the next highest amount of variability +variability. The points along this line give the first principal component. +The second principal component explains the next highest amount of variability in the data and are represented by the line perpendicular to the first (green line). @@ -157,7 +144,18 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics(here("fig/pendulum.gif")) ``` -In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. +> ## Mathematical description of PCA +> Mathematically, each principal component is a linear combination +> of the variables in the dataset. That is, the first principal +> component values or _scores_, $Z_1$, are a linear combination +> of variables in the dataset, $X_1...X_p$, given by +> $$ Z_1 = a_{11}X_1 + a_{21}X_2 +....+a_{p1}X_p, $$ +> where $a_{11}...a_{p1}$ represent principal component _loadings_. +{: .callout} + +In summary, the principal components values are called _scores_. The loadings can +be thought of as the degree to which each original variable contributes to +the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. # How do we perform a PCA? From 4ac16c07b32280009c14331fb097177292d9fd5e Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Mon, 4 Mar 2024 14:26:46 +0000 Subject: [PATCH 31/44] explain center=TRUE, task 14 scale=TRUE doesn't change the mean I think? --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c1ea5855..393266ae 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -205,7 +205,7 @@ head(pros2) PCA derives principal components based on the variance they explain in the data. 
Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA.

-For this dataset, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first:
+In this example, we want each variable to be treated equally by the PCA since variables with lower values may be just as informative as variables with higher values. Let's therefore investigate the variables in our dataset to see if we need to scale our variables first:

```{r var-hist, fig.cap="Caption", fig.alt="Alt"}
apply(pros2, 2, var)
@@ -257,8 +257,8 @@ deviation of 1.
{: .challenge}

Next we will carry out a PCA using the `prcomp()` function in base R. The input
-data (`pros2`) is in the form of a matrix. Note that the `scale = TRUE` argument
-is used to standardise the variables to have a mean 0 and standard deviation of
+data (`pros2`) is in the form of a matrix. Note that the `center = TRUE` and `scale = TRUE` arguments
+are used to standardise the variables to have a mean of 0 and standard deviation of
1.

```{r prcomp}

From e6c0f76114ff7af2f741036dfd95376339ca608d Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 4 Mar 2024 14:27:57 +0000
Subject: [PATCH 32/44] task 15

---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 393266ae..de826127 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -272,7 +272,7 @@ We have calculated one principal component for each variable in the original
dataset. How do we choose how many of these are necessary to represent the true
variation in the data, without having extra components that are unnecessary?

-Let's look at the relative importance of each component using `summary`.
+Let's look at the relative importance of (variance explained by) each component using `summary`.

```{r summ}
summary(pca.pros)

From 37071bcbb48af7d35392241f037e63d05f15930c Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Mon, 4 Mar 2024 14:39:01 +0000
Subject: [PATCH 33/44] rewording to avoid repeating "also called", task 16

---
 _episodes_rmd/04-principal-component-analysis.Rmd | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index de826127..88d89fa4 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -290,10 +290,7 @@ This returns the proportion of variance in the data explained by each of the
PC3 a further `r prop.var[[3]]`%, PC4 approximately `r prop.var[[4]]`% and PC5
around `r prop.var[[5]]`%.

-Let us visualise this. A plot of the amount of variance accounted for by each PC
-is also called a scree plot.
Note that the amount of variance accounted for by a principal
-component is also called eigenvalue and thus the y-axis in scree plots if often
-labelled “eigenvalue”.
+Let us visualise this. A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by its "eigenvalue". Thus, the y-axis in scree plots is often labelled "eigenvalue".

Often, scree plots show a characteristic pattern where initially, the variance
drops rapidly with each additional principal component. But then there is an
“elbow” after which the

From d31570507c0f4aeb4d4be7c60ccbfbe7534a082e Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 09:47:22 +0000
Subject: [PATCH 34/44] associated to resulting

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 88d89fa4..48e7b7ba 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -61,7 +61,7 @@ PCA might reduce several variables representing aspects of patient health
(blood pressure, heart rate, respiratory rate) into a single feature capturing
an overarching "patient health" effect. This is useful from an exploratory point
of view, discovering how variables might be associated and combined. The
-associated principal component could also be used as an effect in further analysis
+resulting principal component could also be used as an effect in further analysis
(e.g. linear regression).


From fdc03498006ba5c29f634f572c0b7cf95bf7d340 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 09:55:47 +0000
Subject: [PATCH 35/44] plural to singular

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 48e7b7ba..18d79e49 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -122,7 +122,7 @@ fallow for 50 farms in southern England. The red line represents the first
principal component scores, which pass through the points with the greatest
variability. The points along this line give the first principal component.
The second principal component explains the next highest amount of variability
-in the data and are represented by the line perpendicular to the first (green line).
+in the data and is represented by the line perpendicular to the first (green line).
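The perpendicularity described here has a checkable meaning: the loading vectors are orthogonal and the resulting scores are uncorrelated. A small sketch, again on simulated data only:

```{r orthogonality-sketch}
# Illustration only: PC scores are uncorrelated and the loading
# vectors are orthonormal (mutually perpendicular, unit length)
set.seed(3)
x <- matrix(rnorm(120), ncol = 3)
pca <- prcomp(x, center = TRUE, scale = TRUE)

round(cor(pca$x), 10)                         # identity: uncorrelated scores
round(t(pca$rotation) %*% pca$rotation, 10)   # identity: orthonormal loadings
```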
```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"}

From 99c64395f4e9f1065fbb537af72a249a4d6c9d3a Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 09:56:09 +0000
Subject: [PATCH 36/44] remove iteratively

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 32369417..db2401d9 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -130,7 +130,7 @@ in the data and is represented by the line perpendicular to the first (green lin
knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png"))
```

-The animation below illustrates how principal components are calculated iteratively from
+The animation below illustrates how principal components are calculated from
data. You can imagine that the black line is a rod and each red dashed line is a
spring. The energy of each spring is proportional to its squared length. The
direction of the first principal component is the one that minimises the total

From 3d3ed3a6368c123fcc89af9ff339647750da9157 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 09:57:26 +0000
Subject: [PATCH 37/44] change to possessive

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 32369417..db2401d9 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -153,7 +153,7 @@ knitr::include_graphics(here("fig/pendulum.gif"))
> where $a_{11}...a_{p1}$ represent principal component _loadings_.
{: .callout}

-In summary, the principal components values are called _scores_. The loadings can
+In summary, the principal components' values are called _scores_. The loadings can
be thought of as the degree to which each original variable contributes to
the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets.

From e164deb39caaa421fc2d924953ea030dd2e6dd47 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 15 Mar 2024 11:05:19 +0000
Subject: [PATCH 38/44] edit biodiversity explanation

---
 _episodes_rmd/04-principal-component-analysis.Rmd | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index db2401d9..8bbc9a9d 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -118,11 +118,13 @@ in the underlying dataset. The second principal component derived explains the s

 To see what these new principal components may look like,
 Figure 1 shows biodiversity index versus percentage area left
-fallow for 50 farms in southern England. The red line represents the first
-principal component scores, which pass through the points with the greatest
-variability. The points along this line give the first principal component.
-The second principal component explains the next highest amount of variability
-in the data and is represented by the line perpendicular to the first (green line).
+fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points called _scores_.
+The red line on the plot represents the line passing through the scores (points) of the first principal component.
+The angle at which the first principal component line passes through the data points is set to the direction with the highest
+variability. The plotted first principal components can therefore be thought of as reflecting the
+effect in the data that has the highest variability. The second principal component explains the next highest amount of variability
-The second principal component explains the next highest amount of variability -in the data and is represented by the line perpendicular to the first (green line). +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points called _scores_. +The red line on the plot represents the line passing through the scores (points) of the first principal component. +The angle that the first principal component line passes through the data points at is set to the direction with the highest +variability. The plotted first principal components can therefore be thought of reflecting the +effect in the data that has the highest variability. The second principal component explains the next highest amount of variability +in the data and is represented by the line perpendicular to the first (the green line). The second principal component can be thought of as +capturing the overall effect in the data that has the second-highest variability. ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} From 6f3a5a4e845cdf3535e4debbdbd64384a189eb94 Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Fri, 15 Mar 2024 11:07:14 +0000 Subject: [PATCH 39/44] remove echo FALSE box with only code comments --- _episodes_rmd/04-principal-component-analysis.Rmd | 6 ------ 1 file changed, 6 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 8bbc9a9d..15f07078 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -107,12 +107,6 @@ resulting principal component could also be used as an effect in further analysi # Principal component analysis -```{r, eval=FALSE, echo=FALSE} -# A PCA is carried out by calculating a matrix of Pearson's correlations from -# the original dataset which shows how each of the variables in the dataset -# relate to each other. -``` - PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. 
From c43d9b8228e1271e5eefd4082a06361f493cbac6 Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:30:34 +0000 Subject: [PATCH 40/44] Update _episodes_rmd/04-principal-component-analysis.Rmd --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index 15f07078..dbf7e790 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -123,7 +123,7 @@ capturing the overall effect in the data that has the second-highest variability ```{r fig1, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} # ![Figure 1: Biodiversity index and percentage area fallow PCA](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Bio index vs percentage fallow.png) -knitr::include_graphics(here("fig/bio_index_vs_percentage_fallow.png")) +knitr::include_graphics("../fig/bio_index_vs_percentage_fallow.png") ``` The animation below illustrates how principal components are calculated from From 351a52ee6fe77fcd68272b3072b5a6e2a197613b Mon Sep 17 00:00:00 2001 From: Alan O'Callaghan Date: Mon, 18 Mar 2024 20:32:44 +0000 Subject: [PATCH 41/44] Update _episodes_rmd/04-principal-component-analysis.Rmd --- _episodes_rmd/04-principal-component-analysis.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index dbf7e790..c68168f5 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -137,7 +137,7 @@ principal component. This is explained in more detail on [this Q&A website](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues). ```{r pendulum, echo=FALSE, fig.cap="Cap", fig.alt="Alt"} -knitr::include_graphics(here("fig/pendulum.gif")) +knitr::include_graphics("../fig/pendulum.gif") ``` > ## Mathematical description of PCA From b59028abf61cf38c698e3f0eb8da242fb91a9cef Mon Sep 17 00:00:00 2001 From: Mary Llewellyn Date: Thu, 21 Mar 2024 15:07:24 +0000 Subject: [PATCH 42/44] add that individual level scores --- _episodes_rmd/04-principal-component-analysis.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index c68168f5..ac6aae4f 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -112,8 +112,8 @@ in the underlying dataset. The second principal component derived explains the s To see what these new principal components may look like, Figure 1 shows biodiversity index versus percentage area left -fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points called _scores_. -The red line on the plot represents the line passing through the scores (points) of the first principal component. +fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points for each individual observation called _scores_. +The red line on the plot represents the line passing through the scores (points) of the first principal component for each observation. 
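A line of this kind can be sketched from `prcomp()` output. The example below invents stand-in variables (`fallow` and `biodiversity`), since the farm data themselves are not loaded in this episode:

```{r pc1-line-sketch}
# Hypothetical stand-ins for the farm variables, illustration only
set.seed(5)
fallow <- runif(50, 0, 60)
biodiversity <- 2 + 0.05 * fallow + rnorm(50, sd = 0.5)

pca <- prcomp(cbind(fallow, biodiversity), center = TRUE, scale = TRUE)
dir1 <- pca$rotation[, 1]   # loading vector of the first principal component

# In centred, scaled coordinates the first PC line passes through the
# origin, with slope given by the ratio of the two loadings
plot(scale(fallow), scale(biodiversity))
abline(a = 0, b = dir1["biodiversity"] / dir1["fallow"], col = "red")
```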
The angle at which the first principal component line passes through the data points is set to the direction with the highest
variability. The plotted first principal components can therefore be thought of as reflecting the
effect in the data that has the highest variability. The second principal component explains the next highest amount of variability

From 8aea374c1d404f84849479a652a8a2e07d767a9f Mon Sep 17 00:00:00 2001
From: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
Date: Fri, 22 Mar 2024 09:53:18 +0000
Subject: [PATCH 43/44] clarify pcs to keep

Co-authored-by: Mary Llewellyn
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index ac6aae4f..8c80c811 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -108,7 +108,7 @@ resulting principal component could also be used as an effect in further analysi
# Principal component analysis

PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability
-in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed to principal components, we can extract a subset of the principal components as new variables that sufficiently explain the variability in the dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset.
+in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed into principal components, we can extract a subset of the principal components in order of the variance they explain (starting with the first principal component that by definition explains the most variability, and then the second), giving new variables that explain a lot of the variability in the original dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset.

From 69e2f1d4e744ea5597929a796cbe05457fc116cf Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Fri, 22 Mar 2024 10:03:08 +0000
Subject: [PATCH 44/44] clarify individual scores relationship

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 _episodes_rmd/04-principal-component-analysis.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd
index 8c80c811..6cc67786 100644
--- a/_episodes_rmd/04-principal-component-analysis.Rmd
+++ b/_episodes_rmd/04-principal-component-analysis.Rmd
@@ -112,7 +112,7 @@ in the underlying dataset. The second principal component derived explains the s

 To see what these new principal components may look like,
 Figure 1 shows biodiversity index versus percentage area left
-fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points for each individual observation called _scores_.
+fallow for 50 farms in southern England. Principal components are a collection of new, artificial data points, one for each individual observation, called _scores_.
The red line on the plot represents the line passing through the scores (points) of the first principal component for each observation.
The angle at which the first principal component line passes through the data points is set to the direction with the highest
variability. The plotted first principal components can therefore be thought of as reflecting the
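Finally, the score and loading relationship running through these patches can be checked by hand; a short sketch under the same simulated-data caveat as above:

```{r scores-from-loadings-sketch}
# Illustration only: the scores equal the centred, scaled data multiplied
# by the loadings, matching Z_1 = a_11 X_1 + ... + a_p1 X_p above
set.seed(6)
x <- matrix(rnorm(150), ncol = 3)
pca <- prcomp(x, center = TRUE, scale = TRUE)

z <- scale(x) %*% pca$rotation
all.equal(z, pca$x, check.attributes = FALSE)   # TRUE
```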