changes to episode 5, all tasks (#146)
* add required library(here) before first visible code

I think this is required outside of the include=FALSE chunk, since it is needed by the first visible code.

* add comment about FA and interpretability, tasks 1, 3, 5

to differentiate from PCA
also added latent variable definition and corrected "a-priori"

* brackets around first latent ref

* task 2

* remove data-driven, task 4

both CFA and EFA are data driven.

* change title of student scores section, task 6

for consistency with other episodes

* move advantages and disadvantages to the end, task 7

hard to understand without fully understanding FA

* add reasoning for low-dim data set again, task 7

same as changes to other episodes

* remove initially

we don't use a high-dimensional dataset after

* complete task 8

* elaborate that both use linear combinations and reason for interpretability

do you agree?

* add how many factors title for consistency with pca episode, task 10

* in practise to in practice, task 11

* previous commit task 12

* minor rewording to how to find the number of factors, task 13

* add "then", task 14

* rewording around hypothesis test, task 15

mostly careful wording, since we cannot accept a null hypothesis, and removing the hard threshold for the significance level

* typo fix, task 16

* add consider for individual learners, task 18

* fig caption and alt text for first table, part task 19

* remove View(prostate)

already have head()

* alt text and captions for biplot, complete task 19

* emphasise a priori

Co-authored-by: Alan O'Callaghan <[email protected]>

---------

Co-authored-by: Alan O'Callaghan <[email protected]>
mallewellyn and alanocallaghan authored Mar 6, 2024
1 parent f84ad67 commit ca0d385
Showing 1 changed file with 59 additions and 47 deletions.
106 changes: 59 additions & 47 deletions _episodes_rmd/05-factor-analysis.Rmd
@@ -31,16 +31,22 @@ knitr_fig_path("06-")

Biologists often encounter high-dimensional datasets from which they wish
to extract underlying features – they need to carry out dimensionality
reduction. The last episode dealt with one method to achieve this this,
called principal component analysis (PCA). Here, we introduce more general
set of methods called factor analysis (FA).

reduction. The last episode dealt with one method to achieve this,
called principal component analysis (PCA), which expressed new dimension-reduced components
as linear combinations of the original features in the dataset. Principal components can therefore
be difficult to interpret. Here, we introduce a related but more interpretable
method called factor analysis (FA), which constructs new components, called _factors_,
that explicitly represent underlying _(latent)_ constructs in our data. Like PCA, FA is
based on linear combinations, but here the observed features are modelled as linear
combinations of the latent constructs. FA is therefore often more interpretable and
useful when we would like to extract meaning from our dimension-reduced set of variables.

There are two types of FA, called exploratory and confirmatory factor analysis
(EFA and CFA). Both EFA and CFA aim to reproduce the observed relationships
among a group of features with a smaller set of latent variables. EFA
is used in a descriptive, data-driven manner to uncover which
is used in a descriptive (exploratory) manner to uncover which
measured variables are reasonable indicators of the various latent dimensions.
In contrast, CFA is conducted in an a-priori,
In contrast, CFA is conducted in an _a priori_,
hypothesis-testing manner that requires strong empirical or theoretical foundations.
We will mainly focus on EFA here, which is used to group features into a specified
number of latent factors.
@@ -51,7 +57,7 @@ exploratory data analysis methods (including PCA) to provide an initial estimate
of how many factors adequately explain the variation observed in a dataset.
In practice, a range of different values is usually tested.
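As an illustrative sketch of that trial-and-error approach (using simulated data with made-up object names, not the episode's dataset):

```r
# Sketch: fit FA models with different numbers of factors and compare
# the fit p-values (simulated data; variable names are hypothetical).
set.seed(1)
scores <- matrix(rnorm(200 * 6), ncol = 6)
colnames(scores) <- paste0("subject", 1:6)
for (k in 1:2) {
  fit <- factanal(scores, factors = k)
  cat(k, "factor(s): p-value =", signif(fit$PVAL, 3), "\n")
}
```

With real data, one would typically keep increasing `factors` until the fit test no longer rejects.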

## An example
## Motivating example: student scores

One scenario for using FA is testing whether student scores in different subjects
can be summarised by certain subject categories. Take a look at the hypothetical
@@ -60,7 +66,7 @@ can be summarised well by two factors, which we can then interpret. We have
labelled these hypothetical factors “mathematical ability” and “writing ability”.


```{r table, echo = FALSE}
```{r table, echo = FALSE, fig.cap="Student scores data across several subjects with hypothesised factors.", fig.alt="A table displaying data of student scores across several subjects. Each row displays the scores across different subjects for a given individual. The table is annotated at the top with a curly bracket labelled Factor 1: mathematical ability, encompassing the data for the subjects Arithmetic, Algebra, Geometry, and Statistics. Similarly, the subjects Creative Writing, Literature, and Spelling/Grammar are encompassed by a different curly bracket with label Factor 2: writing ability."}
knitr::include_graphics("../fig/table_for_fa.png")
# ![Figure 1: Student exam scores per subject. Subjects can be split into two factors representing mathematical ability and writing ability](D:/Statistical consultancy/Consultancy/Grant applications/UKRI teaching grant 2021/Working materials/Table for FA.png)
```
@@ -71,38 +77,16 @@ as many principal components as there are features in the dataset, each
component representing a different linear combination of features. The principal
components are ordered by the amount of variance they account for.
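A brief sketch of that idea with simulated data (the object names here are illustrative only):

```r
# Sketch: each principal component is a linear combination of the original
# features, and components are ordered by the variance they explain.
set.seed(1)
x <- matrix(rnorm(100 * 4), ncol = 4)
pc <- prcomp(x, scale. = TRUE)
pc$rotation[, 1]   # weights defining the first linear combination
pc$sdev^2          # component variances, in decreasing order
```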

# Advantages and disadvantages of Factor Analysis

There are several advantages and disadvantages of using FA as a
dimensionality reduction method.

Advantages:

* FA is a useful way of combining different groups of data into known
representative factors, thus reducing dimensionality in a dataset.
* FA can take into account researchers' expert knowledge when choosing
the number of factors to use, and can be used to identify latent or hidden
variables which may not be apparent from using other analysis methods.
* It is easy to implement with many software tools available to carry out FA.
* Confirmatory FA can be used to test hypotheses.

Disadvantages:

* Justifying the choice of
number of factors to use may be difficult if little is known about the
structure of the data before analysis is carried out.
* Sometimes, it can be difficult to interpret what factors mean after
analysis has been completed.
* Like PCA, standard methods of carrying out FA assume that input variables
are continuous, although extensions to FA allow ordinal and binary
variables to be included (after transforming the input matrix).

# Prostate cancer patient data

The prostate dataset represents data from 97 men who have prostate cancer.
The data come from a study which examined the correlation between the level
of prostate specific antigen and a number of clinical measures in men who were
about to receive a radical prostatectomy. The data have 97 rows and 9 columns.
Although not strictly a high-dimensional dataset, as with other episodes,
we use this dataset to explore the method.


Columns are:

@@ -129,12 +113,10 @@ Let's subset the data to just include the log-transformed clinical variables
for the purposes of this episode:

```{r prostate}
library("here")
prostate <- readRDS(here("data/prostate.rds"))
```

```{r view, eval=FALSE}
View(prostate)
```

```{r dims}
nrow(prostate)
@@ -193,18 +175,23 @@ factors, while negative values show a negative relationship between variables
and factors. Loading values are missing for some variables because R does not
print loadings less than 0.1.
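If you want to inspect the hidden values, the print method for loadings takes a `cutoff` argument. A sketch with simulated data (in the episode you would apply this to the fitted model's loadings):

```r
# Sketch: loadings below the cutoff are only hidden when printing;
# a cutoff of zero displays them all (simulated data).
set.seed(1)
dat <- matrix(rnorm(200 * 6), ncol = 6)
fa_fit <- factanal(dat, factors = 2)
print(fa_fit$loadings, cutoff = 0)  # show every loading, however small
```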
# How many factors do we need?
There are numerous ways to select the “best” number of factors. One is to use
the minimum number of features that does not leave a significant amount of
variance unaccounted for. In practise, we repeat the factor
analysis using different values in the `factors` argument. If we have an
idea of how many factors there will be before analysis, we can start with
that number. The final section of the analysis output shows the results of
variance unaccounted for. In practice, we repeat the factor
analysis for different numbers of factors (by specifying different values
in the `factors` argument). If we have an idea of how many factors there
will be before analysis, we can start with that number. The final
section of the analysis output then shows the results of
a hypothesis test in which the null hypothesis is that the number of factors
used in the model is sufficient to capture most of the variation in the
dataset. If the p-value is less than 0.05, we reject the null hypothesis
and accept that the number of factors included is too small. If the p-value
is greater than 0.05, we accept the null hypothesis that the number of
factors used captures variation in the data.
dataset. If the p-value is less than our significance level (for example 0.05),
we reject the null hypothesis that the number of factors is sufficient and we repeat the analysis with
more factors. When the p-value is greater than our significance level, we do not reject
the null hypothesis that the number of factors used captures variation
in the data. We may therefore conclude that
this number of factors is sufficient.
As with PCA, the fewer factors that can explain most of the variation in the
dataset, the better. It is easier to explore and interpret results using a
@@ -219,7 +206,7 @@ for by the FA model.
*Uniqueness* is the opposite of communality and represents the amount of
variation in a variable that is not accounted for by the FA model. Uniqueness is
calculated by subtracting the communality value from 1. If uniqueness is high for
a given variable, that means this variable is not well explaind/accounted for
a given variable, that means this variable is not well explained/accounted for
by the factors identified.
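The relationship between the two quantities can be checked directly. A sketch with simulated data (the episode's own chunk computes these for the fitted prostate model):

```r
# Sketch: factanal() returns uniquenesses; communality is the sum of
# squared loadings, and the two sum to (approximately) one per variable.
set.seed(1)
dat <- matrix(rnorm(200 * 6), ncol = 6)
fit2 <- factanal(dat, factors = 2)
communality <- rowSums(fit2$loadings^2)
round(fit2$uniquenesses + communality, 2)  # close to 1 for each variable
```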
```{r common-unique}
@@ -232,7 +219,7 @@ Similar to a biplot as we produced in the PCA episode, we can “plot the
loadings”. This shows how each original variable contributes to each of
the factors we chose to visualise.

```{r biplot}
```{r biplot, fig.cap = "Factor 2 loadings versus factor 1 loadings for each feature.", fig.alt="A scatter plot of the factor 2 loadings for each feature versus the factor 1 loadings for each feature. The lpsa, lcavol and lcp feature points are located in the east of the plot, indicating a high loading on factor 1 and close to zero loading on factor 2. The lbph and lweight features are located in the north of the plot, indicating a close to zero loading on factor 1 and a high loading on factor 2."}
#First, carry out factor analysis using two factors
pros_fa <- factanal(pros2, factors = 2)
@@ -264,7 +251,7 @@ text(
> the results of your analysis.
>
> What variables are most important in explaining each factor? Do you think
> this makes sense biologically? Discuss in groups.
> this makes sense biologically? Consider or discuss in groups.
>
> > ## Solution
> >
@@ -282,6 +269,31 @@ text(
> {: .solution}
{: .challenge}

# Advantages and disadvantages of Factor Analysis

There are several advantages and disadvantages of using FA as a
dimensionality reduction method.

Advantages:

* FA is a useful way of combining different groups of data into known
representative factors, thus reducing dimensionality in a dataset.
* FA can take into account researchers' expert knowledge when choosing
the number of factors to use, and can be used to identify latent or hidden
variables which may not be apparent from using other analysis methods.
* It is easy to implement with many software tools available to carry out FA.
* Confirmatory FA can be used to test hypotheses.

Disadvantages:

* Justifying the choice of
number of factors to use may be difficult if little is known about the
structure of the data before analysis is carried out.
* Sometimes, it can be difficult to interpret what factors mean after
analysis has been completed.
* Like PCA, standard methods of carrying out FA assume that input variables
are continuous, although extensions to FA allow ordinal and binary
variables to be included (after transforming the input matrix).



