Merge pull request #144 from mallewellyn/mary-suggestions-tasks28plus-ep4

changes to episode 4, tasks 28-29
alanocallaghan authored Mar 25, 2024
2 parents c6e2ed0 + 30b2846 commit e57d5d4
Showing 1 changed file with 96 additions and 89 deletions.
185 changes: 96 additions & 89 deletions _episodes_rmd/04-principal-component-analysis.Rmd
> {: .solution}
{: .challenge}

## Performing PCA

Throughout this episode, we will carry out PCA using the Bioconductor package **`PCAtools`**.
This package provides several functions for exploring data via PCA and for
producing figures and other analysis tools. The package is made for the somewhat unusual
Bioconductor style of data tables (observations in columns, features in rows). When
using Bioconductor data sets and **`PCAtools`**, it is thus not necessary to transpose the data.
You can use the help files in PCAtools to find out about the `pca()`
function (type `help("pca")` or `?pca` in R). Note that, although we focus on PCA implementation using **`PCAtools`**, there are several other
options for performing PCA, including a base R function, `prcomp()`. Equivalent functions and
output labels for the common implementations are summarised below.

| `library::command()` | PC scores  | Loadings    |
|----------------------|------------|-------------|
| `stats::prcomp()`    | `$x`       | `$rotation` |
| `stats::princomp()`  | `$scores`  | `$loadings` |
| `PCAtools::pca()`    | `$rotated` | `$loadings` |
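As a quick sketch of this correspondence (hedged: simulated data and illustrative object names `m`, `res1` and `res2`, not part of the prostate analysis; it assumes **`PCAtools`** is installed), the same PCA can be run through both interfaces:

```{r compare-pca-implementations}
# simulated data: 10 observations (rows) and 5 features (columns)
set.seed(1)
m <- matrix(rnorm(50), nrow = 10,
            dimnames = list(paste0("obs", 1:10), paste0("feat", 1:5)))
res1 <- prcomp(m, center = TRUE, scale. = TRUE)           # base R: observations in rows
res2 <- PCAtools::pca(t(m), center = TRUE, scale = TRUE)  # PCAtools: features in rows
head(res1$x)       # PC scores from prcomp()
head(res2$rotated) # PC scores from pca(); may differ from res1$x by sign flips per component
```

Loadings are then available as `res1$rotation` and `res2$loadings`, as in the table above.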


Let's use **`PCAtools`** to perform PCA on the scaled `prostate` data. First, let's load the package.

```{r pcatools}
library("PCAtools")
```


The input data (`pros2`) is in the form of a matrix and we need to take the transpose of this
matrix and convert it to a data frame for use within the **`PCAtools`** package. We set the `scale = TRUE` and `center = TRUE` arguments to standardise the variables to have a mean of 0 and standard deviation of 1, as above.


```{r pca-prostate}
pros2t <- data.frame(t(pros2)) # transpose the data and convert to a data frame
colnames(pros2t) <- seq_len(ncol(pros2t)) # rename columns for use within PCAtools
# create placeholder metadata for use with PCAtools
pmetadata <- data.frame(M = rep(1, ncol(pros2t)), row.names = colnames(pros2t))
# implement PCA
pca.pros <- pca(pros2t, scale = TRUE, center = TRUE, metadata = pmetadata)
pca.pros$loadings # return the PCA loadings
```

# How many principal components do we need?

We have calculated one principal component for each variable in the original
dataset. How do we choose how many of these are necessary to represent the true
variation in the data, without having extra components that are unnecessary?

Let's look at the relative importance of (the percentage of variance explained by) the components:

```{r summ}
prop.var <- round(pca.pros$variance, 4) # round to 4 decimal places
prop.var
```

This returns the proportion of variance in the data explained by each of the
(p = 5) principal components. In this example, the first principal component, PC1, explains approximately
`r round(prop.var[[1]])`% of variance in the data, PC2 `r round(prop.var[[2]])`% of variance,
PC3 a further `r round(prop.var[[3]])`%, PC4 approximately `r round(prop.var[[4]])`% and PC5
around `r round(prop.var[[5]])`%.

Let's visualise this. A plot of the amount of variance accounted for by each PC
is called a scree plot. Often, scree plots show a characteristic pattern:
initially, the variance drops rapidly with each additional principal component,
but then there is an “elbow” after which the variance decreases more slowly. The
total variance explained up to the elbow point is sometimes interpreted as
structural variance that is relevant and should be retained, while variance
after the elbow is regarded as noise that may be discarded.

We can create a scree plot using the **`PCAtools`** function `screeplot()`:

```{r varexp, fig.cap="Scree plot showing the percentage of the variance explained by the principal components calculated from the prostate data.", fig.alt="A scree plot showing the percentage of variance explained by each principal component versus the principal component number. The points are joined by lines to indicate where the elbow of the scree plot occurs."}
# add a line to the scree plot to help visualise the elbow
screeplot(pca.pros, axisLabSize = 5, titleLabSize = 8,
          drawCumulativeSumLine = FALSE, drawCumulativeSumPoints = FALSE) +
  geom_line(aes(x = seq_along(pca.pros$components), y = as.numeric(pca.pros$variance)))
```

The scree plot shows that the first principal component explains most of the
based on what is deemed a sufficient level of information retention for a specific
dataset and question.



## Loadings and principal component scores

Most PCA functions will produce two main output matrices: the principal
component scores and the loadings.

The matrix of loadings (also called rotation matrix) has as many rows as there
are features in the original data. It contains information about how the
(usually centered and scaled) original data relate to the principal component scores.

We can print the loadings from the **`PCAtools`** implementation using:

```{r pca-pros}
pca.pros$loadings
```


The principal component scores are obtained by carrying out matrix multiplication of the
(usually centered and scaled) original data times the loadings. The following
callout demonstrates this.
> my.pros2.pcs <- pros2.scaled %*% pros2.eigen$vectors
> # compare results
> par(mfrow=c(1,2))
> plot(pca.pros$rotated[,1:2], main="PCAtools")
> abline(h=0, v=0, lty=2)
> plot(my.pros2.pcs[,1:2], main="\"By hand\"", xlab="PC1", ylab="PC2")
> abline(h=0, v=0, lty=2)
is by creating a biplot. Biplots usually show two principal components plotted
against each other. Observations are sometimes labelled with numbers. The
contribution of each original variable to the principal components displayed
is then shown by arrows (generated from those two columns of the rotation matrix that
correspond to the principal components shown). See `?PCAtools::biplot` for
arguments and their meaning; for instance, `lab` or `colBy` may be useful.
Note that there are several biplot
implementations in different R libraries. It is thus a good idea to specify
the desired package when calling `biplot()`. A biplot of the first two principal
components can be generated as follows:
```{r pcatools-biplot, fig.cap="Scatter plot of the first two principal components with observations numerically labelled.", fig.alt="Scatter plot of the second principal component versus the first principal component. Observations as points, numerically labelled."}
PCAtools::biplot(pca.pros)
```
This biplot shows the position of each patient on a 2-dimensional plot where
on the top and right of the plot are used to interpret the loadings, where
loadings are scaled by the standard deviation of the principal components
(`pca.pros$sdev`) times the square root of the number of observations.
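The arrow scaling described above can be sketched as follows (hedged: simulated data and illustrative names `x`, `p` and `arrows`; this mirrors the conventional scaling rather than any specific package's internals):

```{r biplot-scaling-sketch}
set.seed(2)
x <- scale(matrix(rnorm(60), nrow = 12)) # 12 observations, 5 scaled features
p <- prcomp(x)
n <- nrow(x)
# loadings scaled by the component standard deviations times sqrt(n)
arrows <- sweep(p$rotation[, 1:2], 2, p$sdev[1:2] * sqrt(n), `*`)
head(arrows) # coordinates at which the variable arrows would be drawn
```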


# Advantages and disadvantages of PCA

Disadvantages:
especially when including them in further analyses (e.g. inclusion in a linear
regression).
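As a sketch of this last point (hedged: simulated data and outcome, with illustrative names `dat`, `pcs` and `y`, not from the lesson), including PC scores in a regression yields coefficients for composite axes rather than for any single original variable:

```{r pc-regression-sketch}
set.seed(42)
dat <- matrix(rnorm(200), nrow = 40)               # 40 observations, 5 features
pcs <- prcomp(dat, center = TRUE, scale. = TRUE)$x # PC scores
y <- rnorm(40)                                     # placeholder outcome for illustration only
fit <- lm(y ~ pcs[, 1] + pcs[, 2])
coef(fit) # effects of PC1 and PC2, not of the original variables
```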


# Using PCA to analyse gene expression data


## A gene expression dataset of cancer patients

The dataset we will be analysing in this lesson includes two subsets of data:
* metadata associated with the gene expression results detailing information
from patients from whom samples were taken.

In this section, we will work through challenges in which you will apply your own PCA to the high-dimensional gene expression data using what we have learnt so far.

We will first load the microarray breast cancer gene expression data and
associated metadata, downloaded from the
representing groups of genes would help visualise the data and address
research questions regarding the effect different groups of genes have on
disease progression.

Using **`PCAtools`**, we will perform PCA on the cancer
gene expression data, plot the amount of variation in the data explained by
each principal component and plot the most important principal components
against each other, as well as explore what each principal component
represents.
> ## Challenge 3
>
> Apply a PCA to the cancer gene expression data using the `pca()` function from
> **`PCAtools`**.
>
> Let us assume we only care about the principal components accounting for the top
> 80% of the variance in the dataset. Use the `removeVar` argument in `pca()` to remove
> > ```{r pca-ex}
> > pc <- pca(mat, metadata = metadata) # implement a PCA
> > # We can check the scree plot to see that many principal components explain a very small amount of
> > # the total variance in the data
> >
> > #Let's remove the principal components with lowest 20% of the variance
> > pc <- pca(mat, metadata = metadata, removeVar = 0.2)
> > pc$loadings[1:5, 1:5] # obtain the first 5 principal component loadings for the first 5 features
> >
> > which.max(pc$loadings[, 1]) # index of the maximal loading for the first principal component
> > # principal component loadings for the feature
> > # with the maximal loading on the first principal component:
> > pc$loadings[which.max(pc$loadings[, 1]), ]
> >
> > which.max(pc$loadings[, 2]) # index of the maximal loading for the second principal component
> > # principal component loadings for the feature
> > # with the maximal loading on the second principal component:
> > pc$loadings[which.max(pc$loadings[, 2]), ]
> > ```
> > The function `pca()` is used to perform PCA, and uses as input a matrix
> > (`mat`) containing continuous numerical data
amount of the variation. The proportion of variance explained should sum to one.
>
> Using the `screeplot()` function in **`PCAtools`**, create a scree plot to show
> proportion of variance explained by each principal component. Explain the
> output of the scree plot in terms of the proportion of variance in the data explained
> by each principal component.
>
> > ## Solution
> >
> > ```{r scree-ex, fig.cap="Scree plot showing the percentage of variance explained by each principal component in the gene expression data.", fig.alt="A scree plot showing the percentage of variance explained by each principal component versus the principal component number, with points joined by a line to indicate where the elbow occurs."}
> > pc <- pca(mat, metadata = metadata)
> > # add a line to the scree plot to visualise the elbow
> > screeplot(pc, axisLabSize = 5, titleLabSize = 8,
> >           drawCumulativeSumLine = FALSE, drawCumulativeSumPoints = FALSE) +
> >   geom_line(aes(x = seq_along(pc$components), y = as.numeric(pc$variance)))
> > ```
> >
> > The first principal component explains around 33% of the
> > variance in the microarray data, the first 4 principal components explain
> > around 50%, and 20 principal components explain around 75%. Many principal
> > components explain very little variation. The
> > 'elbow' appears to be around 4-5 principal components, indicating that this
> > may be a suitable number of principal components. However, these principal components
> > cumulatively explain only 51-55% of the variance in the dataset. Although being able to
> > summarise most of the information in this complex dataset with 4-5 principal components
> > is a useful result, we may opt to retain more principal
> > components (for example, 20) to capture more of the variability
> > in the dataset, depending on the research question.
> {: .solution}
{: .challenge}

## Investigating the principal components

Once the most important principal components have been identified using
`screeplot()`, these can be explored in more detail by plotting principal components
against each other and highlighting points based on variables in the metadata.
This will allow any potential clustering of points according to demographic or
phenotypic variables to be seen.
