Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review comments: Episode 3 - regularised regression #115

Closed
mallewellyn opened this issue Feb 20, 2024 · 5 comments
Closed

Review comments: Episode 3 - regularised regression #115

mallewellyn opened this issue Feb 20, 2024 · 5 comments

Comments

@mallewellyn
Copy link
Contributor

Episode 3

Another nice episode. Although it's quite long, I think it covers what are often quite challenging ideas in a very approachable way. Most of the comments I have relate to how regularisation is motivated and some minor re-wording to clarify. I do, however, have a more challenging query about the placement of the linear regression background information. I have highlighted this in bold below!

Again, I will submit pull requests where possible and very happy to discuss anything.

  • Line 22/Key points: I can't see where this last key point re computational speed is discussed. Potentially omit or add a brief comment about this in the text?

  • Line 46/Introduction: The end of this sentence could clarify the differences between the approach of this episode and the previous one. Something like: "that instead [what regularisation does and how it differs from information sharing]".

  • Line 79/Introduction: I like the example to show what happens when you fit a model using all features. Could add a sentence like "We explain what singularities are and why they appear when fitting high-dimensional data below" to just clarify that this is addressed!

Also, a brief explanation of why some effect sizes are very high as this doesn't seem to be addressed.

  • Line 84/Singularities: I think it needs to be clearer why high-dimensional data result in singularities (i.e., explicitly saying that determinant of matrix is zero for high-dimensional data, thus singularities are always present when fitting models naively to high-dimensional data and R often cannot fit the model).

  • Line 141/Correlated features -- common in high-dimensional data: Perhaps this summary of the challenges could be placed outside of the pinned box. I think it deserves more attention as it succinctly summarises both of the above points and motivates the need for regularisation.

Also, the fact that p>n is problematic is discussed in a lot of detail (resulting in singularities), and so I think should be added to this point to justify the use of regularisation (or else the discussion on high-dimensional data being problematic by its very size could be removed and just discussion of correlations retained)!

Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".

  • Line 152/Challenge 1: "Discuss in groups:" could be "Consider or discuss in groups:" to make suitable for an individual learner.

  • Lines 175 & Line 311: Query - it feels, from the section above, as though you are about to describe regularisation as this is now fully motivated. These regression sections therefore confused me a bit. There's a similar interlude in the previous episode. Could it be possible to cover this as preliminary material at the start of the episode, reference another tutorial on linear regression + training and testing models and reference this throughout all episodes, or use this as a "bridging" episode at the start of the formal set of episodes and back-reference?

Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.

  • Line 515/Using regularisation to improve generalisability: A brief few words about the trade-off here with the extent of regularisation would help to clarify.

  • Line 516/Why would we want to restrict our model?: "This type of regularisation is called ridge regression" makes it sound as though ridge regression=OLS. Slight re-wording to tie ridge regression to the non-zero penalty may help.

  • Line 518: Somewhere in this section, could possibly add a title to improve signposting of the ridge method (since there's one for lasso).

  • Line 520: These problems already feel explained and thus I would propose a minor re-wording to make this a recap and demonstration of the fact that regularisation actually works.

  • Line 742/Cross-validation to find the best value of $\lambda$: Haven't defined what a 'good' value for lambda looks like (to define what the 'best' looks like)

  • Line 749: Cross-validation to find the best value of $\lambda$: "Cross-validation is a really deep topic that we're not going to cover in more detail today, though!" possibly a slight re-wording as the subsequent analysis seems to cover cross-validation in a bit more detail.

  • Line 795/Blending ridge regression and the LASSO - elastic nets: Would possibly re-order this title to put elastic nets first if also using lasso and ridge titles (just for signposting).

  • Line 1019/Other types of outcomes: It is a little unclear to me that the intercept-only model is selected. Could annotate the code to demonstrate how this happened.

Minor comments

  • Line 87/Singularities: remove comma before "and why are they..."

  • Line 441/Using regularisation to impove generalisability: "This can be done with regularisation. The idea to add another condition to the problem we’re solving with linear regression." -> "This can be done with regularisation: adding another condition to the problem we’re solving with linear regression."

  • Line 512/Using regularisation to impove generalisability: "punished" -> "punishes"

  • Line 548/Why would we want to restrict our model?: "trend" -> "tend".

  • Line 756/Cross-validation to find the best value of $\lambda$: "We can use this new idea to choose a lambda value, by" -> "We can use this new idea to choose a lambda value by"

  • Line 798/Blending ridge regression and the LASSO - elastic nets: "So far, we've used ridge regression, where alpha = 0, and LASSO regression, where alpha = 1." -> "So far, we've used ridge regression (where alpha = 0) and LASSO regression (where alpha = 1)."

@mallewellyn
Copy link
Contributor Author

mallewellyn commented Feb 29, 2024

Task list:

  • 1. Line 22/Key points: I can't see where this last key point re computational speed is discussed. Potentially omit or add a brief comment about this in the text?

  • 2. Line 46/Introduction: The end of this sentence could clarify the differences between the approach of this episode and the previous one. Something like: "that instead [what regularisation does and how it differs from information sharing]".

  • 3. Line 79/Introduction: I like the example to show what happens when you fit a model using all features. Could add a sentence like "We explain what singularities are and why they appear when fitting high-dimensional data below" to just clarify that this is addressed!

  • 4. Line 79/Introduction: Also, a brief explanation of why some effect sizes are very high as this doesn't seem to be addressed.

  • 5. Line 84/Singularities: I think it needs to be clearer why high-dimensional data result in singularities (i.e., explicitly saying that determinant of matrix is zero for high-dimensional data, thus singularities are always present when fitting models naively to high-dimensional data and R often cannot fit the model).

  • 6. Line 87/Singularities: remove comma before "and why are they..."

  • 7. Line 141/Correlated features -- common in high-dimensional data: Perhaps this summary of the challenges could be placed outside of the pinned box. I think it deserves more attention as it succinctly summarises both of the above points and motivates the need for regularisation.

  • 8. Lin 141/Correlated features: Also, the fact that p>n is problematic is discussed in a lot of detail (resulting in singularities), and so I think should be added to this point to justify the use of regularisation (or else the discussion on high-dimensional data being problematic by its very size could be removed and just discussion of correlations retained)!

Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".

  • 9. Line 152/Challenge 1: "Discuss in groups:" could be "Consider or discuss in groups:" to make suitable for an individual learner.

  • 10. Lines 175 & Line 311: Query - it feels, from the section above, as though you are about to describe regularisation as this is now fully motivated. These regression sections therefore confused me a bit. There's a similar interlude in the previous episode. Could it be possible to cover this as preliminary material at the start of the episode, reference another tutorial on linear regression + training and testing models and reference this throughout all episodes, or use this as a "bridging" episode at the start of the formal set of episodes and back-reference?

Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.

  • 11. Line 441/Using regularisation to impove generalisability: "This can be done with regularisation. The idea to add another condition to the problem we’re solving with linear regression." -> "This can be done with regularisation: adding another condition to the problem we’re solving with linear regression."

  • 12. Line 512/Using regularisation to impove generalisability: "punished" -> "punishes"

  • 13. Line 515/Using regularisation to improve generalisability: A brief few words about the trade-off here with the extent of regularisation would help to clarify.

  • 14. Line 516/Why would we want to restrict our model?: "This type of regularisation is called ridge regression" makes it sound as though ridge regression=OLS. Slight re-wording to tie ridge regression to the non-zero penalty may help.

  • 15. Line 518: Somewhere in this section, could possibly add a title to improve signposting of the ridge method (since there's one for lasso).

  • 16. Line 520: These problems already feel explained and thus I would propose a minor re-wording to make this a recap and demonstration of the fact that regularisation actually works.

  • 17. Line 548/Why would we want to restrict our model?: "trend" -> "tend".

  • 18. Line 742/Cross-validation: Haven't defined what a 'good' value for lambda looks like (to define what the 'best' looks like)

  • 19. Line 749: Cross-validation: "Cross-validation is a really deep topic that we're not going to cover in more detail today, though!" possibly a slight re-wording as the subsequent analysis seems to cover cross-validation in a bit more detail.

  • 20. Line 756/Cross-validation: "We can use this new idea to choose a lambda value, by" -> "We can use this new idea to choose a lambda value by"

  • 21. Line 795/Blending ridge regression and the LASSO - elastic nets: Would possibly re-order this title to put elastic nets first if also using lasso and ridge titles (just for signposting).

  • 22. Line 798/Blending ridge regression and the LASSO - elastic nets: "So far, we've used ridge regression, where alpha = 0, and LASSO regression, where alpha = 1." -> "So far, we've used ridge regression (where alpha = 0) and LASSO regression (where alpha = 1)."

  • 23. Line 1019/Other types of outcomes: It is a little unclear to me that the intercept-only model is selected. Could annotate the code to demonstrate how this happened.

@alanocallaghan

This comment was marked as resolved.

@alanocallaghan
Copy link
Collaborator

Going by carpentries principles, challenge 3 here:

> ## Challenge 3
>
> 1. Split the methylation data matrix and the age vector
> into training and test sets.
> 2. Fit a model on the training data matrix and training age
> vector.
> 3. Check the mean squared error on this model.
>
>
> > ## Solution
> >
> > 1. Splitting the data involves using our index to split up the matrix and
> > the age vector into two each. We can use a negative subscript to create
> > the test data.
> >
> > ```{r testsplit}
> > train_mat <- horvath_mat[train_ind, ]
> > train_age <- age[train_ind]
> > test_mat <- horvath_mat[-train_ind, ]
> > test_age <- age[-train_ind]
> > ```
> > The solution to this exercise is important because the generated objects
> > (`train_mat`, `train_age`, `test_mat` and `test_age`) will be used later in
> > this episode. Please make sure that you use the same object names.
> >
> > 2. To
> >
> > ```{r trainfit}
> > # as.data.frame() converts train_mat into a data.frame
> > fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
> > ```
> >
> > Using the `.` syntax above together with a `data` argument will lead to
> > the same result as usign `train_age ~ tran_mat`: R will fit a multivariate
> > regression model in which each of the colums in `train_mat` is used as
> > a predictor. We opted to use the `.` syntax because it will help us to
> > obtain model predictions using the `predict()` function.
> >
> > 3. The mean squared error of the model is the mean of the square of the
> > residuals. This seems very low here -- on average we're only off by
> > about a year!
> > ```{r testerror}
> > mean(residuals(fit_horvath)^2)
> > ```
> >
> {: .solution}
{: .challenge}

Shouldn't ask learners to do something new. Instead, we should show them how to calculate MSE on the training data, by moving the MSE code from here:

```{r}
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat))
err_lm <- mse(test_age, pred_lm)
err_lm
```

to the previous code along block here:

```{r coefhorvath}
coef_horvath <- readRDS(here::here("data/coefHorvath.rds"))
methylation <- readRDS(here::here("data/methylation.rds"))
library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))
coef_horvath <- coef_horvath[1:20, ]
features <- coef_horvath$CpGmarker
horvath_mat <- methyl_mat[, features]
## Generate an index to split the data
set.seed(42)
train_ind <- sample(nrow(methyl_mat), 25)
```

then, in the exercise, ask them to do the same things they've done, but with the test data. Otherwise this challenge will have low completion rate and lead to cognitive overload/muddled learning outcomes

@alanocallaghan
Copy link
Collaborator

And this code block could be split up to be explained more thoroughly:

```{r pred-ridge-lm, fig.width = 10, fig.cap="Cap", fig.alt="Alt"}
pred_ridge <- predict(ridge_fit, newx = test_mat)
err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col))
min(err_ridge)
err_lm
which_min_err <- which.min(err_ridge)
min_err_ridge <- min(err_ridge)
pred_min_ridge <- pred_ridge[, which_min_err]
```

as could this one:

```{r lasso-cv, fig.cap="Cross-validated mean squared error for different values of lambda under a LASSO penalty.", fig.alt="Alt"}
lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)
plot(lasso)
coefl <- coef(lasso, lasso$lambda.min)
selected_coefs <- as.matrix(coefl)[which(coefl != 0), 1]
## load the horvath signature to compare features
coef_horvath <- readRDS(here::here("data/coefHorvath.rds"))
## We select some of the same features! Hooray
intersect(names(selected_coefs), coef_horvath$CpGmarker)
```

mallewellyn added a commit to mallewellyn/high-dimensional-stats-r that referenced this issue Mar 1, 2024
mallewellyn added a commit to mallewellyn/high-dimensional-stats-r that referenced this issue Mar 1, 2024
mallewellyn added a commit to mallewellyn/high-dimensional-stats-r that referenced this issue Mar 1, 2024
@mallewellyn
Copy link
Contributor Author

I've proposed some changes in response to both of these comments in the commits above :)

alanocallaghan pushed a commit to alanocallaghan/high-dimensional-stats-r that referenced this issue Apr 16, 2024
alanocallaghan pushed a commit to alanocallaghan/high-dimensional-stats-r that referenced this issue Apr 16, 2024
alanocallaghan pushed a commit to alanocallaghan/high-dimensional-stats-r that referenced this issue Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants