Review comments: Episode 4 - principal component analysis #117

mallewellyn · 2024-02-21T13:42:31Z

I really like this practical presentation of PCA - I can see this being genuinely very useful to someone actually wanting to implement it. I have made some comments below, with minor comments written at the bottom.

Again, where possible, I will submit pull requests for these changes.

Line 49/Introduction: propose a minor re-wording here just for clarity (also, if learners have completed previous episodes, they'll have a good idea what this looks like - "imagine" leads me to believe you're talking about something different). Something like:

"Suppose a dataset contains many variables ($p$), close to the total number of rows in the dataset ($n$). It is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

Also, just checking this - the gene expression example later has p>>n. Maybe it's better to say something more vague here about the use cases of PCA so it's consistent with this. Something like:

"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

Line 65: Could add a small extension to this sentence just to make it clear that this single feature is capturing the overall effect of the previous 3 variables, just to reinforce this is intuitively the goal of PCA. Something like:

"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."

Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.
Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".
Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down to "principal components" could also help with this.
Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).
Line 216/A prostate cancer dataset: This prostate data is used throughout the episodes where it's perhaps more informative to demonstrate the methods on a non high-dimensional data set. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method (as in episode 1)) could be included to avoid confusion. Could even say that we apply the method to a (very!!) high-dimensional data set later (the gene expression data).

Also, I'd be tempted to remove this title because there's no text between that and the title before. Could be combined into the title "How do we perform PCA" or removed since the subsequent text is clear that this is the data set.

Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back reference the section on PCA for all examples of this section/their own examples.
Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would standardise data for PCA would help the subsequent explanation and the justification for not standardising in the next example. It may also help someone practically implement PCA on a new data set.

Something like:

"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables
in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful
if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."
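
A minimal sketch of this point, using simulated data and not the episode's prostate data (the variable names `small` and `large` are hypothetical). Without scaling, the first principal component is dominated by the variable with the larger variance; with scaling, both variables contribute.

```r
set.seed(42)
n <- 200
# Two independent simulated variables on very different scales (hypothetical data)
dat <- data.frame(
  small = rnorm(n, mean = 0, sd = 1),
  large = rnorm(n, mean = 0, sd = 100)
)

# Variances differ by several orders of magnitude
apply(dat, 2, var)

# PCA on the centred but unscaled data
pca_raw <- prcomp(dat, center = TRUE, scale. = FALSE)
# PCA on the standardised data
pca_std <- prcomp(dat, center = TRUE, scale. = TRUE)

# Loadings on PC1: almost entirely `large` without scaling
abs(pca_raw$rotation[, "PC1"])
# With scaling, both loadings have the same magnitude
# (this is always the case with exactly two standardised variables)
abs(pca_std$rotation[, "PC1"])
```

The equal loadings in the scaled case are a property of the two-variable toy example; with more variables the loadings would generally differ, but no variable would dominate purely because of its units.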

Line 277/Do we need to standardise the data: "It is clear from this output that we need to scale each.." would suggest removing "It is clear" as it may not be.

If editing this section as per the previous comment, could rewrite to "Since we want each of these variables to contribute equally to our analysis, but there are large differences in variance, we need to scale each of these variables before including them in the PCA. In this example, we standardise all five variables to have a mean of 0 and a standard..."

Then the challenge just reinforces this.

Line 318/A prostate cancer dataset: Query - why is a different package for PCA used now?
Line 324/A prostate cancer dataset: I don't think the scale=TRUE argument changes the mean - perhaps should say
"Note that the [center = TRUE and] scale = TRUE arguments are used to standardise the variables to have a mean 0 and standard deviation of 1."
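
To make the distinction concrete, a small sketch on simulated data (the matrix `x` is hypothetical), assuming the call in question is base R's `prcomp()`. There, `center` controls subtraction of the column means and `scale.` controls division by the column standard deviations, with defaults `center = TRUE` and `scale. = FALSE`.

```r
set.seed(1)
# Hypothetical data: two columns with non-zero means and different spreads
x <- cbind(a = rnorm(50, mean = 10, sd = 2),
           b = rnorm(50, mean = -5, sd = 20))

# Centring alone gives mean 0 but leaves the standard deviations unchanged
centred <- scale(x, center = TRUE, scale = FALSE)
round(colMeans(centred), 10)
apply(centred, 2, sd)

# Centring and scaling together give mean 0 and standard deviation 1
standardised <- scale(x, center = TRUE, scale = TRUE)
round(colMeans(standardised), 10)
apply(standardised, 2, sd)

# prcomp() records what it subtracted and what it divided by
pc <- prcomp(x, center = TRUE, scale. = TRUE)
pc$center  # column means of x
pc$scale   # column standard deviations of x
```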
Line 373/How many principal components do we need?: Adding lines to this scree plot would really help in visualising the elbow.
Line 380/How many principal components do we need?: Could add a brief sentence explaining how many PCs we would choose from this scree plot, as we haven't addressed this yet despite the section heading.
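
One possible way to draw such a plot, sketched in base R on the built-in `USArrests` data as a stand-in (the episode's own data and plotting package may differ). Joining the points with lines makes any elbow easier to see.

```r
# Stand-in example data shipped with R (not the episode's prostate data)
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance of each PC (the eigenvalues) and the proportion of total variance
eigenvalues <- pc$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)

# type = "b" draws both the points and the connecting lines
plot(seq_along(prop_var), prop_var, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained",
     xaxt = "n")
axis(1, at = seq_along(prop_var))

# A built-in alternative is screeplot(pc, type = "lines")
```

In a plot like this, a common heuristic is to keep the PCs before the point where the curve flattens out, although where that point lies is a judgement call.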
Line 467/Using PCA to analyse gene expression data: It's not clear why we're using another package again here.
Line 527/A gene expression dataset of cancer patients: I think swapping the order of the first two points in this paragraph may help with flow.
I think it needs to be stated somewhere that choosing <p (or <n if high-dim) PCs results in loss of information from the model/data set.
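
A quick way to demonstrate this, again sketched on the built-in `USArrests` data as a stand-in: rebuild the standardised data from the first k PCs and measure the reconstruction error, which only reaches zero when all PCs are kept. The helper `reconstruction_error()` is hypothetical.

```r
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)
x_std <- scale(USArrests)  # the standardised data that prcomp() decomposed

# Mean squared error when rebuilding x_std from the first k PCs
reconstruction_error <- function(k) {
  rebuilt <- pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
  mean((x_std - rebuilt)^2)
}

errors <- sapply(seq_len(ncol(pc$x)), reconstruction_error)
round(errors, 4)  # decreases with k; effectively zero when k = p
```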
Line 656/Challenge 4: "...and suggest an appropriate number of principal components." to test how well people have understood?

Minor changes

Line 278/Do we need to standardise the data: "In this example ..." -> "In this example, ..."
Line 334/A prostate cancer dataset: "importance of each component" -> "importance of (variance explained by) each component"
Line 354/A prostate cancer dataset: repetition of "also called". Could reword as "A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by "eigenvalues". Thus, the y-axis in scree plots is often labelled “eigenvalue”."
Line 376/How many principal components do we need?: "scree plot" -> "screeplot".
Line 529/A gene expression dataset of cancer patients: "high dimensional data" -> "high-dimensional data".
Line 751: "prooces" -> "produces"
Line 768/Principal component regression: Repetition of "This is called PC regression"
Captions/alt text to be filled.

mallewellyn · 2024-03-01T09:14:28Z

Task list:

1. Line 49/Introduction: propose a minor re-wording here just for clarity (also, if learners have completed previous episodes, they'll have a good idea what this looks like - "imagine" leads me to believe you're talking about something different).

Something like:
"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

2. Line 65: Could add a small extension to this sentence just to make it clear that this single feature is capturing the overall effect of the previous 3 variables, just to reinforce this is intuitively the goal of PCA.

Something like:
"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."

3. Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.
4. Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".
5. Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down to "principal components" could also help with this.
6. Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).
7. Line 216/A prostate cancer dataset: This prostate data is used throughout the episodes where it's perhaps more informative to demonstrate the methods on a non high-dimensional data set. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method (as in episode 1)) could be included to avoid confusion. Could even say that we apply the method to a (very!!) high-dimensional data set later (the gene expression data).
8. Line 216/A prostate cancer dataset: I'd be tempted to remove this title because there's no text between that and the title before. Could be combined into the title "How do we perform PCA" or removed since the subsequent text is clear that this is the data set.
9. Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back reference the section on PCA for all examples of this section/their own examples.
10. Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would standardise data for PCA would help the subsequent explanation and the justification for not standardising in the next example. It may also help someone practically implement PCA on a new data set.

Something like:
"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables
in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful
if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."

alanocallaghan · 2024-03-01T10:59:14Z

The View calls here should be replaced by head:

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Line 507 in 57f2f5b

View(metadata)
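
For context, `View()` opens an interactive data viewer, so it generally does not produce useful output in a rendered lesson, whereas `head()` prints the first rows. A minimal sketch with a made-up stand-in for `metadata` (the column names are hypothetical):

```r
# Hypothetical stand-in for the episode's `metadata` object
metadata <- data.frame(
  sample_id = paste0("S", 1:10),
  group = rep(c("A", "B"), each = 5)
)

# Prints the first six rows, so the output appears in the rendered lesson
head(metadata)

# A different number of rows can be requested with `n`
head(metadata, n = 3)
```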

alanocallaghan · 2024-03-01T11:06:23Z

The code here:

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Lines 569 to 583 in 57f2f5b

    
    > > ```{r pca-ex}
    > > pc <- pca(mat, metadata = metadata)
    > > #Many PCs explain a very small amount of the total variance in the data
    > > #Remove the lower 20% of PCs with lower variance
    > > pc <- pca(mat, metadata = metadata, removeVar = 0.2)
    > > #Explore other arguments provided in pca
    > > pc$rotated[1:5, 1:5]
    > > pc$loadings[1:5, 1:5]
    > >
    > > which.max(pc$loadings[, 1])
    > > pc$loadings[49, ]
    > >
    > > which.max(pc$loadings[, 2])
    > > pc$loadings[27, ]
    > > ```

just does a bunch of stuff without explaining it
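
As a rough guide to what an annotated version could say, here is a base R sketch of the same steps on simulated data. It deliberately uses `prcomp()` in place of the `pca()` call in the chunk so that it runs without extra packages. The matrix `mat` is simulated, and the variance filter is written by hand as an assumed analogue of `removeVar = 0.2`, which, as far as the PCAtools documentation describes it, drops the 20% of variables with the lowest variance (not the lowest-variance PCs).

```r
set.seed(123)
# Simulated expression-like matrix: 100 genes (rows) by 20 samples (columns)
mat <- matrix(rnorm(100 * 20), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("sample", 1:20)))
# Give genes different spreads so that the variance filter has an effect
mat <- mat * runif(100, min = 0.5, max = 3)

# Drop the 20% of genes with the lowest variance across samples
gene_var <- apply(mat, 1, var)
keep <- gene_var > quantile(gene_var, probs = 0.2)
mat_filtered <- mat[keep, ]

# PCA with samples as observations and genes as variables
pc <- prcomp(t(mat_filtered), center = TRUE, scale. = FALSE)

# PC scores: one row per sample, one column per PC
pc$x[1:5, 1:5]
# Loadings: one row per gene, one column per PC
pc$rotation[1:5, 1:5]

# Gene with the largest (most positive) loading on PC1 and on PC2
top_pc1 <- names(which.max(pc$rotation[, 1]))
top_pc2 <- names(which.max(pc$rotation[, 2]))
pc$rotation[c(top_pc1, top_pc2), 1:2]
```

In this sketch `pc$x` and `pc$rotation` play the roles of `pc$rotated` and `pc$loadings` in the chunk. Looking genes up by name avoids hard-coded row numbers such as 49 and 27, and `which.max()` returns the most positive loading, not the largest in absolute value.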

alanocallaghan · 2024-03-01T11:17:44Z

Will rewrite this section to be more clear:

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Lines 603 to 613 in 57f2f5b

    
    > ## Scaling variables for PCA
    >
    > When running `pca()` above, we kept the default setting, `scale=FALSE`. That means genes with higher variation in
    > their expression levels should have higher loadings, which is what we are interested in.
    > Whether or not to scale variables for PCA will depend on your data and research question.
    >
    > Note that this is different from normalising gene expression data. Gene expression
    > data have to be normalised before donwstream analyses can be
    > carried out. This is to reduce to effect technical and other potentially confounding
    > factors. We assume that the expression data we use had been noralised previously.
    {: .callout}

noralised -> normalised

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Line 612 in 57f2f5b

    
           > factors. We assume that the expression data we use had been noralised previously.

alanocallaghan · 2024-03-01T11:52:48Z

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Lines 650 to 652 in 57f2f5b

    
    > > data. This is not an unusual result for complex biological datasets
    > > including genetic information as clear relationships between groups are
    > > sometimes difficult to observe in the data. The screeplot shows that using

I don't know about this phrasing. Is 18 a lot of PCs to summarise 75% of the variation in like 20k genes?

Also not clear why we're cutting off at 75%, seems mega arbitrary
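
For reference, the number of PCs needed to reach a given share of the variance can be read off the cumulative proportions, which also shows how sensitive that number is to the chosen cut-off. A sketch on simulated data (not the gene expression data; the thresholds are arbitrary illustrations):

```r
set.seed(2024)
# Simulated data: 50 observations of 30 variables
x <- matrix(rnorm(50 * 30), nrow = 50)
pc <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k PCs
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Smallest number of PCs whose cumulative proportion reaches each threshold
thresholds <- c(0.5, 0.75, 0.9)
n_pcs <- sapply(thresholds, function(th) which(cum_var >= th)[1])
data.frame(threshold = thresholds, n_pcs = n_pcs)
```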

alanocallaghan · 2024-03-01T11:57:38Z

Typo, colby not colBy

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Line 676 in 57f2f5b

> arguments and their meaning. For instance, `lab` or `colBy` may be useful.

mallewellyn · 2024-03-04T14:23:44Z

Just to query - why are different packages for PCA used throughout this episode?

alanocallaghan · 2024-03-04T14:27:20Z

Gail wrote the episode, so I'm mostly going from memory of what we discussed in meetings at the time, but the stats implementation (prcomp) is used because it's the in-built and probably most widely used version.

PCAtools is used because it provides a bunch of nice options (eg removeVar) and plots

Might be simpler to just use PCAtools and then explain the corresponding aspects of the stats implementation(s).

mallewellyn · 2024-03-04T14:30:02Z

Ah I see. That makes sense. I'll have a look and see if I can maybe streamline how they're used a bit!

mallewellyn · 2024-03-04T18:31:45Z

Have made all the changes above in the pull requests above, apart from:

28. Re-phrase as below

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Lines 650 to 652 in 57f2f5b

> > data. This is not an unusual result for complex biological datasets
> > including genetic information as clear relationships between groups are
> > sometimes difficult to observe in the data. The screeplot shows that using

I don't know about this phrasing. Is 18 a lot of PCs to summarise 75% of the variation in like 20k genes?

Also not clear why we're cutting off at 75%, seems mega arbitrary

29. streamline package use
30. Clarify scaling for gene expression data as below

Will rewrite this section to be more clear:

high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd

Lines 603 to 613 in 57f2f5b

> ## Scaling variables for PCA
>
> When running `pca()` above, we kept the default setting, `scale=FALSE`. That means genes with higher variation in
> their expression levels should have higher loadings, which is what we are interested in.
> Whether or not to scale variables for PCA will depend on your data and research question.
>
> Note that this is different from normalising gene expression data. Gene expression
> data have to be normalised before donwstream analyses can be
> carried out. This is to reduce to effect technical and other potentially confounding
> factors. We assume that the expression data we use had been noralised previously.
{: .callout}

Note that I think the text I've added re scaling for the prostate data may help with this point. Perhaps requires less explanation and can possibly just reference back to the prostate example.

alanocallaghan · 2024-03-04T18:32:45Z

Sounds good, thanks!

mallewellyn · 2024-03-05T18:08:32Z

31. Edit alt text and captions for new PCAtools figures in prostate example

mallewellyn · 2024-03-05T18:10:14Z

Just to check - do the pngs in the figs directory 'appear' when rendering the website? Introducing PCAtools throughout the episode will change a lot of the plots.

alanocallaghan · 2024-03-05T22:41:03Z

Yeah, running `make site` re-generates all the figures automatically. Would ideally happen with every push but unsure if this is working still

mallewellyn · 2024-03-06T09:19:53Z

Ok great! I can try to upload manually if all else fails.

Could be a little more reliable if we convert to workbench?

alanocallaghan · 2024-03-06T09:25:31Z

Yes hopefully, I had some success in demoing the transition in #139, happy to repeat when there's not big open PRs that would need redoing or to walk you through how I got there

mallewellyn · 2024-03-06T09:55:19Z

Sounds good and fair point re merging the many open pull requests after the current workshop delivery (sorry about that!). Would be good to work through it at some point for sure.

alanocallaghan · 2024-03-06T10:01:52Z

The open PRs aren't a problem for the next workshop as I'm sure they'll all improve the lessons! More so that I wouldn't want to start translating the site to a new build system just before a delivery in case it ends up broken

This was referenced Mar 4, 2024

changes to episode 4, tasks 1-16 #141 (Merged)

changes to episode 4, tasks 17-27 #143 (Merged)

mallewellyn mentioned this issue Mar 5, 2024

changes to episode 4, tasks 28-29 #144 (Merged)

mallewellyn closed this as completed Mar 26, 2024
