Review comments: Episode 4 - principal component analysis #117
The code here (high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd, lines 569 to 583 in 57f2f5b) just does a bunch of stuff without explaining it.
Will rewrite this section to be clearer: high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd, lines 603 to 613 in 57f2f5b
noralised -> normalised
high-dimensional-stats-r/_episodes_rmd/04-principal-component-analysis.Rmd, lines 650 to 652 in 57f2f5b
I don't know about this phrasing. Is 18 a lot of PCs to summarise 75% of the variation in like 20k genes? Also not clear why we're cutting off at 75%, seems mega arbitrary.
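To make the cut-off point concrete, something like the following could help. This is only a minimal sketch on simulated data (not the lesson's gene expression data), using base R's `prcomp`. The 0.75 threshold is simply the value quoted in the episode, and the object names are made up for illustration.

```r
set.seed(42)

# Simulated stand-in for an expression matrix:
# 30 samples (rows) and 200 features (columns), so p > n
n <- 30
p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC, and the running total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)

# Smallest number of PCs whose cumulative variance reaches the threshold
threshold <- 0.75
n_pcs <- which(cum_var >= threshold)[1]
n_pcs
```

Changing `threshold` and seeing how `n_pcs` moves makes the point that the number of PCs retained depends directly on an analyst's choice rather than on anything intrinsic to the data.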
Typo,
Just to query - why are different packages for PCA used throughout this episode?
Gail wrote the episode, so I'm mostly going from memory of what we discussed in meetings at the time, but the stats implementation (prcomp) is used because it's the in-built and probably most widely used version. PCAtools is used because it provides a bunch of nice options (e.g. removeVar) and plots. Might be simpler to just use PCAtools and then explain the corresponding aspects of the stats implementation(s).
Ah I see. That makes sense. I'll have a look and see if I can maybe streamline how they're used a bit!
Have made all the changes above in the pull requests above, apart from:
Note that I think the text I've added re scaling for the prostate data may help with this point. Perhaps requires less explanation and can possibly just reference back to the prostate example.
Sounds good, thanks!
Just to check - do the pngs in the figs directory 'appear' when rendering the website? Introducing PCAtools throughout the episode will change a lot of the plots.
Yeah, running
Ok great! I can try to upload manually if all else fails. Could be a little more reliable if we convert to workbench?
Yes hopefully, I had some success in demoing the transition in #139, happy to repeat when there's not big open PRs that would need redoing or to walk you through how I got there
Sounds good and fair point re merging the many open pull requests after the current workshop delivery (sorry about that!). Would be good to work through it at some point for sure.
The open PRs aren't a problem for the next workshop as I'm sure they'll all improve the lessons! More so that I wouldn't want to start translating the site to a new build system just before a delivery in case it ends up broken
Episode 4
I really like this practical presentation of PCA - I can see this being genuinely very useful to someone actually wanting to implement it. I have made some comments below, with minor comments written at the bottom.
Again, where possible, I will submit pull requests for these changes.
"Suppose a dataset contains many variables ($p$), close to the total number of rows in the dataset ($n$). It is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."
Also, just checking this - the gene expression example later has p>>n. Maybe it's better to say something more vague here about the use cases of PCA so it's consistent with this. Something like:
"If a dataset contains many variables ($p$), it is likely that some of these variables are highly correlated. Variables may even be so highly correlated that they represent the same overall effect."
effect of the previous 3 variables, just to reinforce this is intuitively the goal of PCA. Something like:
"As an example, PCA might reduce several variables representing aspects of patient health (blood pressure, heart rate, respiratory rate) into a single feature capturing an overarching "patient health" effect."
Line 70/Advantages and disadvantages of PCA: I like this summary of the advantages and disadvantages, but I would propose moving this to the end of the episode as it's quite difficult to understand without first understanding what PCA is.
Line 75/Advantages and disadvantages of PCA: I would propose rewording "The calculations used in a PCA are easy to understand for statisticians and non-statisticians alike" as "The calculations used by PCA are simple to understand compared to other methods for dimension reduction".
Line 156/What is a principal component?: I really appreciate this description of PCA - I think it explains PCA in an extremely understandable way and avoids the temptation to just present the maths. As such, I think this section deserves to be called "Principal component analysis" for signposting as it describes the whole process. A short sentence at the start saying that PCA describes the data by breaking it down to "principal components" could also help with this.
Line 203/What is a principal component: I think this formula could be linked with the description of the first PC above just to make it absolutely clear how this mathematical description comes about and how these two parts are linked (and what the PC "scores" are in the example above).
Line 216/A prostate cancer dataset: This prostate data is used throughout the episodes where it's perhaps more informative to demonstrate the methods on a non-high-dimensional data set. I don't have a problem with this per se, but I think a brief statement making it clear that the data are not technically high-dimensional (and are simply used to illustrate the method, as in episode 1) could be included to avoid confusion. Could even say that we apply the method to a (very!!) high-dimensional data set later (the gene expression data).
Also, I'd be tempted to remove this title because there's no text between that and the title before. Could be combined into the title "How do we perform PCA" or removed since the subsequent text is clear that this is the data set.
Line 240/A prostate cancer dataset: "Standard PCAs are carried out using continuous variables only."
I think this sort of information is better given in the section above explaining PCA. It may get lost in the example here. I'm thinking that people may back reference the section on PCA for all examples of this section/their own examples.
Line 264/Do we need to standardise the data: I think a brief sentence at the start of this section about why you would standardise data for PCA would help the subsequent explanation and the justification for not standardising in the next example. It may also help someone practically implement PCA on a new data set.
Something like:
"Since PCA derives principal components based on the variance they explain in the data, we may need to scale variables in our data set if we want to ensure that each variable is considered equally in the PCA. This is particularly useful if we don't want the PCA to ignore variables that may be important to our analysis just because they have low variance."
If editing this section as per the previous comment, could rewrite to "Since we want each of these variables to contribute equally to our analysis, but there are large differences in variance, we need to scale each of these variables before including them in the PCA. In this example, we standardise all five variables to have a mean of 0 and a standard..."
Then the challenge just reinforces this.
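A small demonstration might make the case for scaling more tangible. Something like the sketch below, which uses simulated variables (hypothetical names, not the prostate data) and base R's `prcomp` and `scale`. It shows PC1 being dominated by the high-variance variable when the data are not scaled, and checks that centring plus scaling gives each variable a mean of 0 and a standard deviation of 1.

```r
set.seed(1)
n <- 100

# Hypothetical example data: related variables on very different scales
small <- rnorm(n, mean = 5, sd = 1)
large <- 100 * small + rnorm(n, mean = 0, sd = 200)
other <- rnorm(n, mean = 0, sd = 1)
x <- cbind(small = small, large = large, other = other)

# Without scaling, PC1 is driven almost entirely by the high-variance variable
pca_raw <- prcomp(x, center = TRUE, scale. = FALSE)
round(abs(pca_raw$rotation[, "PC1"]), 3)

# With centring and scaling, each variable has mean 0 and standard deviation 1
pca_std <- prcomp(x, center = TRUE, scale. = TRUE)
x_std <- scale(x, center = TRUE, scale = TRUE)
round(colMeans(x_std), 10)
apply(x_std, 2, sd)

# The PC scores are the standardised data multiplied by the loadings
scores <- x_std %*% pca_std$rotation
all.equal(unname(scores), unname(pca_std$x))
```

The last two lines also tie the formula for a principal component back to the "scores": each score is a weighted sum of the standardised variables, with the loadings as weights.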
Line 318/A prostate cancer dataset: Query - why is a different package for PCA used now?
Line 324/A prostate cancer dataset: I don't think the scale=TRUE argument changes the mean - perhaps should say "Note that the [center = TRUE and] scale = TRUE arguments are used to standardise the variables to have a mean 0 and standard deviation of 1."
Line 373/How many principal components do we need?: Adding lines to this scree plot would really help in visualising the elbow.
Line 380/How many principal components do we need?: A brief sentence explaining how many PCs we would choose from this scree plot would help, as we haven't addressed this yet despite the section heading.
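For the scree plot with lines, something like the sketch below might work. It uses simulated data with two underlying factors (an assumption made purely for illustration, not the prostate data) so that an elbow is visible, and base R plotting rather than whichever package the episode ends up using.

```r
set.seed(7)
n <- 100

# Hypothetical data: two latent factors drive five observed variables,
# and a sixth variable is pure noise
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(
  f1 + rnorm(n, sd = 0.3), f1 + rnorm(n, sd = 0.3), f1 + rnorm(n, sd = 0.3),
  f2 + rnorm(n, sd = 0.3), f2 + rnorm(n, sd = 0.3), rnorm(n)
)

pca <- prcomp(x, center = TRUE, scale. = TRUE)
var_exp <- pca$sdev^2 / sum(pca$sdev^2) * 100

# type = "b" draws both points and connecting lines,
# which makes the elbow easier to see than bars alone
plot(
  seq_along(var_exp), var_exp,
  type = "b",
  xlab = "Principal component",
  ylab = "Variance explained (%)"
)
```

Base R's `screeplot(pca, type = "lines")` gives a similar line-based plot of the variances directly from the `prcomp` object.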
Line 467/Using PCA to analyse gene expression data: It's not clear why we're using another package again here.
Line 527/A gene expression dataset of cancer patients: I think swapping the order of the first two points in this paragraph may help with flow.
I think it needs to be stated somewhere that choosing <p (or <n if high-dim) PCs results in loss of information from the model/data set.
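A short illustration of that loss of information could look something like this. It is a minimal sketch on simulated data using base R's `prcomp`; `reconstruction_error` is a hypothetical helper written for this example, not a function from any package.

```r
set.seed(3)
n <- 50
p <- 5

# Simulated data with correlated columns
x <- matrix(rnorm(n * p), nrow = n) %*% matrix(rnorm(p * p), nrow = p)
pca <- prcomp(x, center = TRUE, scale. = FALSE)
x_centred <- scale(x, center = TRUE, scale = FALSE)

# Rebuild the centred data from the first k PCs and measure what is lost
# (sum of squared differences between the data and the reconstruction)
reconstruction_error <- function(k) {
  x_hat <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
  sum((x_centred - x_hat)^2)
}

errors <- sapply(1:p, reconstruction_error)
errors
```

The error shrinks as more PCs are kept and only reaches (numerically) zero when all `p` components are used, which is the sense in which keeping fewer PCs discards information.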
Line 656/Challenge 4: Could add "...and suggest an appropriate number of principal components." to test how well people have understood?
Minor changes
Line 278/Do we need to standardise the data: "In this example ..." -> "In this example, ..."
Line 334/A prostate cancer dataset: "importance of each component" -> "importance of (variance explained by) each component"
Line 354/A prostate cancer dataset: repetition of "also called". Could reword as "A plot of the amount of variance accounted for by each PC is called a scree plot. Note that the amount of variance accounted for by a principal component is given by "eigenvalues". Thus, the y-axis in scree plots is often labelled “eigenvalue”."
Line 376/How many principal components do we need?: "scree plot" -> "screeplot".
Line 529/A gene expression dataset of cancer patients: "high dimensional data" -> "high-dimensional data".
Line 751: "prooces" -> "produces"
Line 768/Principal component regression: Repetition of "This is called PC regression"
Captions/alt text to be filled.