Third delivery suggested changes #64

Closed · 61 of 68 tasks
ailithewing opened this issue May 20, 2022 · 7 comments

ailithewing (Collaborator) commented May 20, 2022

A list of proposed changes following the May delivery of HDS

These are in addition to the changes in the pull request ailith_delivery3 and to the changes that Hannes made that have yet to be pushed to the main course materials.

Throughout

  • bold package names and include () for functions

Intro

  • Change high-dimensional data definition
  • Switch out prostate dataset or make it much clearer that it's a toy dataset for the purposes of explanation
  • Change view() to head() and dim()
  • Expand challenge 1 solution
  • Ask a more specific question than "examine the dataset" in challenge 2 (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • Check how we're referring to figures, e.g. not by number if there's no number
  • Could add a challenge question to show what happens with correlated variables (see Emma's review in Review comments: Introduction to high-dimensional data #39)
  • Take out the Bioconductor intro as we never teach it (maybe condense it and put it in a callout box?)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • Explain why you are using here? (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • STRUCTURAL: Focus the challenges section on two things: (a) an ill-defined model (more predictors than observations), possibly with a figure containing only one dot, and (b) correlated predictors, perhaps with code showing unstable coefficient estimates (see the sketch after this list).
  • STRUCTURAL: Rewrite the section on which statistical methods are used so that it gives an overview of the course. Focus on the problems and on which analysis is used when: exploring one outcome with many similar features (methylation/expression); predicting outcomes with more features than observations; reducing dimensionality / grouping / making sense of similar predictors; clustering observations.
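For point (b) of the challenges item, a minimal base-R sketch on simulated data (illustrative only, not taken from the lesson) that could be adapted for the materials:

```r
# Two nearly collinear predictors make individual coefficient estimates unstable.
set.seed(66)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

coef(lm(y ~ x1 + x2))           # the effect is split arbitrarily between x1 and x2

# Refitting on a bootstrap resample gives very different coefficients
idx <- sample(n, replace = TRUE)
coef(lm(y[idx] ~ x1[idx] + x2[idx]))
```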

Regression with many features (many outcomes)

  • rank results in topTable() by effect size (see the sketch after this list)
  • include small intro to feature selection to motivate why these techniques are useful as we took the feature selection lesson out of the 2-day course.
  • check exercises aren't introducing new concepts
  • check direction of smoker is consistent between model and plot
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • Explore whether the episode can be made shorter or divided (from Emma's review in Review comments: Regression with many features #47)
  • Add a reference for the source of the methylation data
  • Change title to regression with many outcomes and add a brief comment to distinguish between dealing with many outcomes and/or many features (we can mention that the regularisation episode will address that). Potentially, we can create a separate episode Regression in high-dimensional settings where we introduce the methylation data and the two different types of problems. However, this is outside the scope for this round of changes. Creating this separate episode would also address some of Emma's concerns.
  • Add mention of dream() from VariancePartition which is similar to limma but can handle grouping (random effects)
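For the topTable() ranking item above, a hedged sketch using simulated data; the object names (meth, smoker) are placeholders rather than the lesson's own objects:

```r
library("limma")

set.seed(1)
meth   <- matrix(rnorm(100 * 20), nrow = 100)  # 100 features, 20 samples
smoker <- rep(c(0, 1), each = 10)              # placeholder covariate
design <- model.matrix(~ smoker)

fit <- lmFit(meth, design)   # fit the same linear model to every feature
fit <- eBayes(fit)

# sort.by = "logFC" orders the results table by effect size rather than p-value
topTable(fit, coef = "smoker", sort.by = "logFC", number = 10)
```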

Regularisation

  • needs to be split up into:
    • motivation & rationale - in expanded intro
    • intro to model selection/cross validation
    • what is regularisation in general?
    • ridge and lasso
  • more explanation of Horvath
  • more explanation of the figures in the materials
  • fix overuse of Xi
  • more detail on extracting coefficients and model interpretation
  • glossary of jargon
  • add link to ML course for related materials (from Self-review notes #7)

CAV (20220206) Link added to episode 1 instead as it's general across different types of ML approaches.

CAV (20220206) I can't recall what the specific issue was, but the episode has been extensively revised and labels look ok.

CAV (20220206) Paragraph was revised, so hopefully OK now.

CAV (20220206) Notation review.

  • in exercise 2, maybe ask why mean squared error rather than sum of squared errors is used (from Self-review notes #7)
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • move up the section "Using regularisation to improve generalisability"
  • add reason for training and test intro, like: "Before we move on to regularised regression, we have to introduce..."
  • when talking about elastic net, say we've been using it all along: lasso (alpha = 1) and ridge (alpha = 0) are special cases
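For the elastic-net item, a small glmnet() sketch on simulated data (illustrative only), showing lasso and ridge as the alpha = 1 and alpha = 0 special cases:

```r
library("glmnet")

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- rnorm(100)

fit_ridge <- glmnet(x, y, alpha = 0)    # ridge: pure L2 penalty
fit_lasso <- glmnet(x, y, alpha = 1)    # lasso: pure L1 penalty (glmnet's default)
fit_enet  <- glmnet(x, y, alpha = 0.5)  # elastic net: a mixture of the two
```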

PCA

  • consider removing scaling from the gene expression PCA (include a box about gene expression normalisation to emphasise that that's not what we're talking about); see the sketch after this list
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in Review comments: Introduction to high-dimensional data #39)
  • Is the equation halfway down needed at all (the one which refers to the original example)?
  • add a note that PCAtools takes data in the Bioconductor orientation
  • STRUCTURAL: add a table comparing the terms used for loadings and scores in different packages
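A hedged sketch (simulated data, not the lesson's gene expression object) touching both the scaling item and the loadings/scores terminology item; the package-specific names in the comments should be double-checked against the versions used in the lesson:

```r
set.seed(1)
mat <- matrix(rnorm(50 * 10), nrow = 50)

pca_unscaled <- prcomp(mat, scale. = FALSE)  # variables kept on their original scale
pca_scaled   <- prcomp(mat, scale. = TRUE)   # variables standardised to unit variance

# Terminology differs between functions:
#   prcomp():        loadings in $rotation, scores in $x
#   princomp():      loadings in $loadings, scores in $scores
#   PCAtools::pca(): loadings in $loadings, scores in $rotated
head(pca_unscaled$rotation)
head(pca_unscaled$x)
```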

FA

  • move the advantages and disadvantages of FA up so they appear in the introduction
  • more detail on communality and uniqueness (see the sketch after this list)
  • mention confirmatory factor analysis
  • discuss ways of determining number of factors
  • Add brackets for function names in text, e.g. pairs() (from Emma's review in Review comments: Introduction to high-dimensional data #39)
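For the communality/uniqueness item, a minimal factanal() sketch on simulated data with a two-factor structure (illustrative only, not the lesson's dataset):

```r
set.seed(1)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- cbind(v1 = f1 + rnorm(n, sd = 0.5),
             v2 = f1 + rnorm(n, sd = 0.5),
             v3 = f1 + rnorm(n, sd = 0.5),
             v4 = f2 + rnorm(n, sd = 0.5),
             v5 = f2 + rnorm(n, sd = 0.5),
             v6 = f2 + rnorm(n, sd = 0.5))

fa <- factanal(dat, factors = 2)

# factanal() reports uniquenesses; communality is the complement, i.e. the
# proportion of each variable's variance explained by the common factors.
uniqueness  <- fa$uniquenesses
communality <- 1 - uniqueness
communality
```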

K means

Hierarchical clusters

Other

ailithewing (Collaborator, Author):

@catavallejos @nathansam @hwarden162 @alanocallaghan Please add any additional things that I've missed.

nathansam (Contributor) commented May 20, 2022

K-means: set a seed for the heatmap code chunk starting with library("pheatmap") (which might be covered by the coloured-blocks to-do)
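A hedged sketch of what that might look like, assuming the chunk uses pheatmap's k-means row aggregation (kmeans_k); the actual lesson chunk may differ:

```r
library("pheatmap")

set.seed(42)  # k-means starts from random centres, so fix the seed for reproducibility
mat <- matrix(rnorm(200 * 10), nrow = 200)
pheatmap(mat, kmeans_k = 5)  # rows aggregated into 5 k-means clusters
```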

hannesbecher (Collaborator):

Challenge 1 in episode 1: I'm not sure about question 4. Is this a good example of high-dimensional data, given that it is one observation with so many features?

  1. Predicting probability of a patient's cancer progressing using gene
    expression data from 20,000 genes, as well as data associated with general patient health
    (age, weight, BMI, blood pressure) and cancer growth (tumour size,
    localised spread, blood test results).

alanocallaghan (Collaborator):

Changing that challenge from a single patient to plural patients would also be good, to avoid implying high precision from generic prediction models (i.e. precision-medicine hype).

hannesbecher (Collaborator):

I think the current uniqueness/communality explanation contradicts Wikipedia: https://en.wikipedia.org/wiki/Factor_analysis#Terminology

alanocallaghan (Collaborator):

One way of reducing the number of dependency packages is to move all the data-wrangling code to a data package and then just remotes::install_github() it (see the sketch below).
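Roughly like this; both the repository and package names below are hypothetical placeholders for a data package that does not yet exist:

```r
# install.packages("remotes")  # if not already installed
remotes::install_github("our-org/hds-data")  # hypothetical repository name
library("hdsdata")                           # hypothetical data package name
```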

hannesbecher (Collaborator):

Glossary still open, but covered by issue #89
