Review comments: Episode 4 - principal component analysis #117

Closed
mallewellyn opened this issue Feb 21, 2024 · 18 comments

Comments

@mallewellyn
Contributor

Episode 4

I really like this practical presentation of PCA - I can see this being genuinely very useful to someone actually wanting to implement it. I have made some comments below, with minor comments written at the bottom.

Again, where possible, I will submit pull requests for these changes.

  • Line 49/Introduction: propose a minor re-wording here just for clarity (also, if learners have completed previous episodes, they'll have a good idea what this looks like - "imagine" leads me to believe you're talking about something different). Something like:

"Suppose a dataset contains many variables ($p$), close to the total number of rows in the dataset ($n$). It is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

Also, just checking this - the gene expression example later has p>>n. Maybe it's better to say something more vague here about the use cases of PCA so it's consistent with this. Something like:

"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

  • Line 65: Could add a small extension to this sentence just to make it clear that this single feature is capturing the overall
    effect of the previous 3 variables, just to reinforce this is intuitively the goal of PCA. Something like:

"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."

  • Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.

  • Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".

  • Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely
    understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down to "principal components" could also help with this.

  • Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).

  • Line 216/A prostate cancer dataset: This prostate data is used throughout the episodes where it's perhaps more informative to demonstrate the methods on a non high-dimensional data set. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method
    (as in episode 1)) could be included to avoid confusion. Could even say that we apply the method to a (very!!) high-dimensional data set later (the gene expression data).

Also, I'd be tempted to remove this title because there's no text between that and the title before. Could be combined into the title "How do we perform PCA" or removed since the subsequent text is clear that this is the data set

  • Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
    I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back reference the section on PCA for all examples of this section/their own examples.

  • Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would
    standardise data for PCA would help the subsequent explanation and the justification for not standardising
    in the next example. It may also help someone practically implement PCA on a new data set.

Something like:

"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables
in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful
if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."
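
To illustrate the point in the proposed sentence, here is a minimal base R sketch. It uses the built-in `USArrests` data purely as a stand-in (an assumption for illustration, not the lesson's prostate data) and compares the first principal component with and without scaling.

```r
# Built-in data used purely for illustration; not the lesson's prostate data.
dat <- USArrests  # 50 US states, 4 numeric variables

# The column variances differ by orders of magnitude.
vars <- apply(dat, 2, var)
print(round(vars, 1))

# Without scaling, PC1 is driven by the variable with the largest variance.
pc_raw <- prcomp(dat, center = TRUE, scale. = FALSE)

# With scaling, every variable enters the PCA with unit variance.
pc_std <- prcomp(dat, center = TRUE, scale. = TRUE)

# Compare absolute PC1 loadings (the sign of a PC is arbitrary).
print(round(abs(pc_raw$rotation[, 1]), 2))
print(round(abs(pc_std$rotation[, 1]), 2))
```

In the unscaled fit the loading of the highest-variance variable dwarfs the others, whereas the scaled fit spreads the loadings more evenly. This is the behaviour the sentence is trying to warn learners about.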

  • Line 277/Do we need to standardise the data: "It is clear from this output that we need to scale each.." I would suggest removing "It is clear", as it may not be.

If editing this section as per the previous comment, could rewrite to "Since we want each of these variables to contribute equally to our analysis, but there are large differences in variance, we need to scale each of these variables before including them in the PCA. In this example, we standardise all five variables to have a mean of 0 and a standard..."

Then the challenge just reinforces this.

  • Line 318/A prostate cancer dataset: Query - why is a different package for PCA used now?

  • Line 324/A prostate cancer dataset: I don't think the scale=TRUE argument changes the mean - perhaps should say
    "Note that the [center = TRUE and] scale = TRUE arguments are used to standardise the variables to have a mean 0 and standard deviation of 1."

  • Line 373/How many principal components do we need?: Adding lines to this scree plot would really help in visualising the elbow.

  • Line 380/How many principal components do we need?: Could add a brief sentence explaining how many PCs we would choose from this scree plot, as we haven't addressed this yet despite the section heading.

  • Line 467/Using PCA to analyse gene expression data: It's not clear why we're using another package again here.

  • Line 527/A gene expression dataset of cancer patients: I think swapping the order of the first two points in this paragraph may help with flow.

  • I think it needs to be stated somewhere that choosing <p (or <n if high-dim) PCs results in loss of information from the model/data set.

  • Line 656/Challenge 4: Could add "...and suggest an appropriate number of principal components." to test how well learners have understood?
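
On the Line 324 point above (centring versus scaling), a small base R sketch may help check the wording. It assumes the lesson's call is to `prcomp()` and uses the built-in `USArrests` data as a stand-in for the prostate data.

```r
# Built-in data for illustration only; not the lesson's prostate data.
x <- as.matrix(USArrests)

# Scaling without centring: each column is divided by its root mean square
# (not its SD), and the column means stay away from 0.
only_scaled <- scale(x, center = FALSE, scale = TRUE)
print(round(colMeans(only_scaled), 2))

# Centring and scaling together give mean 0 and standard deviation 1.
standardised <- scale(x, center = TRUE, scale = TRUE)
print(round(colMeans(standardised), 2))
print(apply(standardised, 2, sd))

# prcomp() centres by default (center = TRUE), so scale. = TRUE on its own
# gives the same component standard deviations as PCA on standardised data.
pc_a <- prcomp(x, scale. = TRUE)
pc_b <- prcomp(standardised)
print(all.equal(pc_a$sdev, pc_b$sdev))
```

So the scaling argument only controls the standard deviation; the mean of 0 comes from centring, which `prcomp()` applies unless it is switched off.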

Minor changes

  • Line 278/Do we need to standardise the data: "In this example ..." -> "In this example, ..."

  • Line 334/A prostate cancer dataset: "importance of each component" -> "importance of (variance explained by) each component"

  • Line 354/A prostate cancer dataset: repetition of "also called". Could reword as "A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by "eigenvalues". Thus, the y-axis in scree plots is often labelled “eigenvalue”."

  • Line 376/How many principal components do we need?: "scree plot" -> "screeplot".

  • Line 529/A gene expression dataset of cancer patients: "high dimensional data" -> "high-dimensional data".

  • Line 751: "prooces" -> "produces"

  • Line 768/Principal component regression: Repetition of "This is called PC regression"

  • Captions/alt text to be filled.
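
For the Line 354 and Line 373 points (eigenvalues as variance explained, and joining the points of the scree plot with lines), a minimal base R sketch is below. The built-in `USArrests` data and the axis labels are assumptions for illustration, not the lesson's own data or figure.

```r
# Built-in data for illustration only; not the lesson's prostate data.
x <- USArrests
pc <- prcomp(x, center = TRUE, scale. = TRUE)

# Variance accounted for by each PC: the squared component standard deviations.
eig_prcomp <- pc$sdev^2

# The same numbers are the eigenvalues of the correlation matrix,
# because the variables were standardised before the PCA.
eig_direct <- eigen(cor(x))$values
print(all.equal(eig_prcomp, eig_direct))

# Proportion of total variance per PC.
prop_var <- eig_prcomp / sum(eig_prcomp)
print(round(prop_var, 3))

# Scree plot with points joined by lines (type = "b") to make any elbow visible.
plot(seq_along(eig_prcomp), eig_prcomp, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue (variance explained)")
```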

@mallewellyn
Contributor Author

mallewellyn commented Mar 1, 2024

Task list:

  • 1. Line 49/Introduction: propose a minor re-wording here just for clarity (also, if learners have completed previous episodes, they'll have a good idea what this looks like - "imagine" leads me to believe you're talking about something different).

Something like:
"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."

  • 2. Line 65: Could add a small extension to this sentence just to make it clear that this single feature is capturing the overall effect of the previous 3 variables, just to reinforce this is intuitively the goal of PCA.

Something like:
"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."

  • 3. Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.

  • 4. Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".

  • 5. Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down to "principal components" could also help with this.

  • 6. Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).

  • 7. Line 216/A prostate cancer dataset: This prostate data is used throughout the episodes where it's perhaps more informative to demonstrate the methods on a non high-dimensional data set. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method (as in episode 1)) could be included to avoid confusion. Could even say that we apply the method to a (very!!) high-dimensional data set later (the gene expression data).

  • 8. Line 216/A prostate cancer dataset: I'd be tempted to remove this title because there's no text between that and the title before. Could be combined into the title "How do we perform PCA" or removed since the subsequent text is clear that this is the data set

  • 9. Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
    I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back reference the section on PCA for all examples of this section/their own examples.

  • 10. Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would standardise data for PCA would help the subsequent explanation and the justification for not standardising
    in the next example. It may also help someone practically implement PCA on a new data set.

Something like:
"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables
in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful
if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."

  • 11. Line 277/Do we need to standardise the data: "It is clear from this output that we need to scale each.." I would suggest removing "It is clear", as it may not be. If editing this section as per the previous comment, could rewrite to "Since we want each of these variables to contribute equally to our analysis, but there are large differences in variance, we need to scale each of these variables before including them in the PCA. In this example, we standardise all five variables to have a mean of 0 and a standard..." Then the challenge just reinforces this.

  • 12. Line 278/Do we need to standardise the data: "In this example ..." -> "In this example, ..."

  • 13. Line 318/A prostate cancer dataset: Query - why is a different package for PCA used now?

  • 14. Line 324/A prostate cancer dataset: I don't think the scale=TRUE argument changes the mean - perhaps should say
    "Note that the [center = TRUE and] scale = TRUE arguments are used to standardise the variables to have a mean 0 and standard deviation of 1."

  • 15. Line 334/A prostate cancer dataset: "importance of each component" -> "importance of (variance explained by) each component"

  • 16. Line 354/A prostate cancer dataset: repetition of "also called". Could reword as "A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by "eigenvalues". Thus, the y-axis in scree plots is often labelled “eigenvalue”."

  • 17. Line 373/How many principal components do we need?: Adding lines to this scree plot would really help in visualising the elbow.

  • 18. Line 376/How many principal components do we need?: "scree plot" -> "screeplot".

  • 19. Line 380/How many principal components do we need?: Could add a brief sentence explaining how many PCs we would choose from this scree plot, as we haven't addressed this yet despite the section heading.

  • 20. Line 467/Using PCA to analyse gene expression data: It's not clear why we're using another package again here.

  • 21. Line 527/A gene expression dataset of cancer patients: I think swapping the order of the first two points in this paragraph may help with flow.

  • 22. Line 529/A gene expression dataset of cancer patients: "high dimensional data" -> "high-dimensional data".

  • 23. I think it needs to be stated somewhere that choosing <p (or <n if high-dim) PCs results in loss of information from the model/data set.

  • 24. Line 656/Challenge 4: Could add "...and suggest an appropriate number of principal components." to test how well learners have understood?

  • 25. Line 751: "prooces" -> "produces"

  • 26. Line 768/Principal component regression: Repetition of "This is called PC regression"

  • 27. Captions/alt text.

@alanocallaghan
Collaborator

The `View` calls here should be replaced by `head`:

@alanocallaghan
Collaborator

The code here:

> > ```{r pca-ex}
> > pc <- pca(mat, metadata = metadata)
> > #Many PCs explain a very small amount of the total variance in the data
> > #Remove the lower 20% of PCs with lower variance
> > pc <- pca(mat, metadata = metadata, removeVar = 0.2)
> > #Explore other arguments provided in pca
> > pc$rotated[1:5, 1:5]
> > pc$loadings[1:5, 1:5]
> >
> > which.max(pc$loadings[, 1])
> > pc$loadings[49, ]
> >
> > which.max(pc$loadings[, 2])
> > pc$loadings[27, ]
> > ```

just does a bunch of stuff without explaining it
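
As a rough idea of the kind of explanation that could accompany it, here is a commented base R analogue. It is only a sketch: it uses `prcomp()` on the built-in `USArrests` data rather than `pca()` on the lesson's `mat`, so `pc$x` and `pc$rotation` here play the role that `pc$rotated` and `pc$loadings` play in the PCAtools code above.

```r
# Built-in data for illustration only; not the lesson's gene expression data.
x <- USArrests
pc <- prcomp(x, center = TRUE, scale. = TRUE)

# Scores: the position of each observation on each principal component.
print(head(pc$x[, 1:2], 5))

# Loadings: the weight of each original variable in each principal component.
print(pc$rotation[, 1:2])

# Variable with the largest (most positive) loading on PC1.
top_pos <- which.max(pc$rotation[, 1])

# The sign of a PC is arbitrary, so the largest absolute loading is often
# the more useful summary of which variable contributes most.
top_abs <- which.max(abs(pc$rotation[, 1]))
print(names(top_abs))

# Inspect that variable's loadings across all components.
print(pc$rotation[top_abs, ])
```

Looking up the index returned by `which.max()` rather than hard-coding it (as with the `49` and `27` in the quoted code) would also keep the example valid if the data or filtering change.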

@alanocallaghan
Collaborator

Will rewrite this section to be more clear:

> ## Scaling variables for PCA
>
> When running `pca()` above, we kept the default setting, `scale=FALSE`. That means genes with higher variation in
> their expression levels should have higher loadings, which is what we are interested in.
> Whether or not to scale variables for PCA will depend on your data and research question.
>
> Note that this is different from normalising gene expression data. Gene expression
> data have to be normalised before donwstream analyses can be
> carried out. This is to reduce to effect technical and other potentially confounding
> factors. We assume that the expression data we use had been noralised previously.
{: .callout}

noralised -> normalised

> factors. We assume that the expression data we use had been noralised previously.

@alanocallaghan
Collaborator

alanocallaghan commented Mar 1, 2024

> > data. This is not an unusual result for complex biological datasets
> > including genetic information as clear relationships between groups are
> > sometimes difficult to observe in the data. The screeplot shows that using

I don't know about this phrasing. Is 18 a lot of PCs to summarise 75% of the variation in like 20k genes?

Also not clear why we're cutting off at 75%, seems mega arbitrary
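
One way to make the trade-off behind any cut-off explicit is to show the cumulative variance alongside what is lost when only the first k PCs are kept. A minimal base R sketch follows; the built-in `USArrests` data, the 0.75 threshold and the helper `reconstruct()` are assumptions for illustration only.

```r
# Built-in data for illustration only; standardised before PCA.
x <- scale(USArrests)
pc <- prcomp(x)

# Cumulative proportion of variance explained by the first k PCs.
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
print(round(cum_var, 3))

# Smallest k reaching a chosen threshold; 0.75 is an arbitrary choice here.
k75 <- which(cum_var >= 0.75)[1]
print(k75)

# Hypothetical helper: rebuild the (centred) data from the first k PCs.
reconstruct <- function(pc, k) {
  pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
}

# Squared reconstruction error for each k: the information that is lost.
err <- sapply(seq_len(ncol(x)), function(k) sum((x - reconstruct(pc, k))^2))

# The relative error equals 1 minus the cumulative variance explained.
print(round(err / sum(x^2), 3))
```

Keeping fewer than all of the PCs always discards some variance, and the threshold only sets how much. Presenting it that way may read as less arbitrary than a fixed 75%.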

@alanocallaghan
Copy link
Collaborator

Typo, colby not colBy

> arguments and their meaning. For instance, `lab` or `colBy` may be useful.

@mallewellyn
Contributor Author

Just to query - why are different packages for PCA used throughout this episode?

@alanocallaghan
Collaborator

Gail wrote the episode, so I'm mostly going from memory of what we discussed in meetings at the time, but the stats implementation (prcomp) is used because it's the in-built and probably most widely used version.

PCAtools is used because it provides a bunch of nice options (eg removeVar) and plots

Might be simpler to just use PCAtools and then explain the corresponding aspects of the stats implementation(s).

@mallewellyn
Contributor Author

Ah I see. That makes sense. I'll have a look and see if I can maybe streamline how they're used a bit!

@mallewellyn
Contributor Author

mallewellyn commented Mar 4, 2024

Have made all the changes above in the pull requests above, apart from:

  • 28. Re-phrase as below

> > data. This is not an unusual result for complex biological datasets
> > including genetic information as clear relationships between groups are
> > sometimes difficult to observe in the data. The screeplot shows that using

I don't know about this phrasing. Is 18 a lot of PCs to summarise 75% of the variation in like 20k genes?

Also not clear why we're cutting off at 75%, seems mega arbitrary

  • 29. streamline package use
  • 30. Clarify scaling for gene expression data as below

Will rewrite this section to be more clear:

> ## Scaling variables for PCA
>
> When running `pca()` above, we kept the default setting, `scale=FALSE`. That means genes with higher variation in
> their expression levels should have higher loadings, which is what we are interested in.
> Whether or not to scale variables for PCA will depend on your data and research question.
>
> Note that this is different from normalising gene expression data. Gene expression
> data have to be normalised before donwstream analyses can be
> carried out. This is to reduce to effect technical and other potentially confounding
> factors. We assume that the expression data we use had been noralised previously.
{: .callout}

Note that I think the text I've added re scaling for the prostate data may help with this point. Perhaps requires less explanation and can possibly just reference back to the prostate example.

@alanocallaghan
Collaborator

Sounds good, thanks!

@mallewellyn
Contributor Author

  • 31. Edit alt text and captions for new PCAtools figures in prostate example

@mallewellyn
Contributor Author

Just to check - do the pngs in the figs directory 'appear' when rendering the website? Introducing PCAtools throughout the episode will change a lot of the plots.

@alanocallaghan
Collaborator

Yeah, running make site re-generates all the figures automatically. Would ideally happen with every push but unsure if this is working still

@mallewellyn
Contributor Author

Ok great! I can try to upload manually if all else fails.

Could be a little more reliable if we convert to workbench?

@alanocallaghan
Collaborator

Yes hopefully, I had some success in demoing the transition in #139, happy to repeat when there's not big open PRs that would need redoing or to walk you through how I got there

@mallewellyn
Contributor Author

Sounds good and fair point re merging the many open pull requests after the current workshop delivery (sorry about that!). Would be good to work through it at some point for sure.

@alanocallaghan
Collaborator

The open PRs aren't a problem for the next workshop as I'm sure they'll all improve the lessons! More so that I wouldn't want to start translating the site to a new build system just before a delivery in case it ends up broken
