Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to episode 3 #100

Merged
merged 13 commits into from
Feb 6, 2023
12 changes: 12 additions & 0 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -393,5 +393,17 @@ episode will explore a specific method to perform clustering analysis.
- [Buhlman, P., Kalisch, M. & Meier, L. (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application](https://doi.org/10.1146/annurev-statistics-022513-115545).
- Johnstone, I.M. & Titterington, D.M. (2009) Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A 367:4237-4253.
- [Bioconductor ethylation array analysis vignette](https://www.bioconductor.org/packages/release/workflows/vignettes/methylationArrayAnalysis/inst/doc/methylationArrayAnalysis.html).
- The *Introduction to Machine Learning with Python* course covers additional
methods that could be used to analyse high-dimensional data. See
[Introduction to machine learning](https://carpentries-incubator.github.io/machine-learning-novice-python/),
[Tree models](https://carpentries-incubator.github.io/machine-learning-trees-python/) and
[Neural networks](https://carpentries-incubator.github.io/machine-learning-neural-python/).
Some related (an important!) content is also available in
[Responsible machine learning](https://carpentries-incubator.github.io/machine-learning-responsible-python/).

# Other resources suggested by former students

- [Josh Starmer's](https://www.youtube.com/c/joshstarmer) youtube channel.


{% include links.md %}
134 changes: 68 additions & 66 deletions _episodes_rmd/03-regression-regularisation.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,11 @@ knitr_fig_path("03-")

# Introduction


This episode is about **regularisation**, also called **regularised regression**
or **penalised regression**. This approach can be used for prediction and for
feature selection and it is particularly useful when dealing with high-dimensional data.

One reason that we need special statistical tools for high-dimensional data is
that standard linear models cannot handle high-dimensional data sets -- one cannot fit
a linear model where there are more features (predictor variables) than there are observations
Expand Down Expand Up @@ -103,60 +108,28 @@ than observations.
> det(xtx)
> ```



> ## Correlated features -- common in high-dimensional data
>
> So, we can't fit a standard linear model to high-dimensional data. But there
> is another issue. In high-dimensional datasets, there
> are often multiple features that contain redundant information (correlated features).
>
> We have seen in the first episode that correlated features can make it hard
> (or impossible) to correctly infer parameters. If we visualise the level of
> correlation between sites in the methylation dataset, we can see that many
> of the features essentially represent the same information - there are many
> off-diagonal cells, which are deep red or blue. For example, the following
> heatmap visualises the correlations for the first 500 features in the
> `methylation` dataset (we selected 500 features only as it can be hard to
> visualise patterns when there are too many features!).
>
> To investigate the level of correlation between features, we can compute correlation coefficients
> for all pairs of features in a dataset, for instance using Pearson's correlation
> coefficient. This measures how similar the patterns of variation in these features.
> We can see that in the `prostate` data (originally from the *`lasso2`* package),
> there are relatively few features and they are largely independent.
>
> ```{r corr-mat-prostate, echo = FALSE, fig.cap="Cap", fig.alt="Alt"}
> library("ComplexHeatmap")
> prostate <- readRDS(here("data/prostate.rds"))
>
> cor_mat <- cor(Prostate[, -1])
> col <- circlize::colorRamp2(
> breaks = seq(-1, 1, length.out = 9),
> colors = rev(RColorBrewer::brewer.pal(9, "RdYlBu"))
> )
> Heatmap(cor_mat,
> column_title = "Feature-feature correlation in prostate data",
> name = "Pearson correlation",
> col = col,
> # cluster_rows = FALSE, cluster_columns = FALSE,
> show_row_dend = FALSE, show_column_dend = FALSE,
> show_row_names = FALSE, show_column_names = FALSE
> )
> ```
>
> If we do the same for the methylation dataset, in contrast, we can see that many
> of the features essentially represent the same information (there are many off-diagonal
> cells which are deep red or blue). This can present problems for a lot of the mathematical
> techniques we will use to calculate a linear regression model.
>
> ```{r corr-mat-meth, echo = FALSE, fig.cap="Cap", fig.alt="Alt"}
> library("minfi")
> library("here")
> library("ComplexHeatmap")
>
> methylation <- readRDS(here("data/methylation.rds"))
>
> age <- methylation$Age
> methyl_mat <- t(assay(methylation))
> ```{r corr-mat-meth, fig.cap="Cap", fig.alt="Alt"}
> small <- methyl_mat[, 1:500]
> cor_mat <- cor(small)
> Heatmap(cor_mat,
> column_title = "Feature-feature correlation in methylation data",
> name = "Pearson correlation",
> col = col,
> # cluster_rows = FALSE, cluster_columns = FALSE,
> show_row_dend = FALSE, show_column_dend = FALSE,
> show_row_names = FALSE, show_column_names = FALSE
> )
Expand All @@ -171,7 +144,7 @@ than observations.



> ## Exercise
> ## Challenge 1
>
> Discuss in groups:
>
Expand Down Expand Up @@ -279,22 +252,33 @@ points(
Mathematically, we can write the sum of squared residuals as

$$
\sum_{i=1}^N ( y_i-X_i\beta)^2
\sum_{i=1}^N ( y_i-x'_i\beta)^2
$$

where $\hat{y}_i$ is the predicted y value for each input data
point $X_i$, and $y_i$ is the true observed value.
This line is the line of best fit through our data when considering this
goal of minimising the sum of squared error. However, it is not the only
where $\beta$ is a vector of (unknown) covariate effects which we want to learn
by fitting a regression model: the $j$-th element of $\beta$, which we denote as
$\beta_j$ quantifies the effect of the $j$-th covariate. For each individual
$i$, $x_i$ is a vector of covariate values, $y_i$ is the true observed value for
the outcome and $\hat{y}_i$ is the predicted outcome value. The notation
$x'_i\beta$ indicates matrix multiplication. In this case, the result is
equivalent to multiplying each element of $x_i$ by its corresponding element in
$\beta$ and then calculating the sum across all of those values. To quantify
the total error across all individuals, we sum the square residuals
$( y_i-x'_i\beta)^2$ across all the individuals in our data.


Finding the value of $\beta$ that minimises
the sum above is the line of best fit through our data when considering
this goal of minimising the sum of squared error. However, it is not the only
possible line we could use! For example, we might want to err on the side of
caution when estimating effect sizes (coefficients). That is, we might want to avoid estimating
very large effect sizes. This can help us to create *generalisable*
caution when estimating effect sizes (coefficients). That is, we might want to
avoid estimating very large effect sizes. This can help us to create *generalisable*
models. This is important when models that are fitted (trained) on one dataset
and then used to predict outcomes from a new dataset. Restricting parameter
estimates is particularly important when analysing high-dimensional data.


> ## Exercise
> ## Challenge 2
>
> Discuss in groups:
>
Expand Down Expand Up @@ -345,7 +329,7 @@ coefficient values minimise the training error, but they don't minimise the
test error on unseen data. First, we'll go through an example of what exactly
this means.

For the next few exercises, we'll work with a set of features
For the next few challenges, we'll work with a set of features
known to be associated with age from a paper by Horvath et al.[^2]. Horvath et al
use methylation markers alone to predict the biological age of an individual.
This is useful in studying age-related disease amongst many other things.
Expand All @@ -367,13 +351,14 @@ set.seed(42)
train_ind <- sample(nrow(methyl_mat), 25)
```

> ## Exercise
> ## Challenge 3
>
> 1. Split the methylation data matrix and the age vector
> into training and test sets.
> 2. Fit a model on the training data matrix and training age
> vector.
> 3. Check the mean squared error on this model.
>
>
> > ## Solution
> >
Expand All @@ -387,13 +372,23 @@ train_ind <- sample(nrow(methyl_mat), 25)
> > test_mat <- horvath_mat[-train_ind, ]
> > test_age <- age[-train_ind]
> > ```
> > The solution to this exercise is important because the generated objects
> > (`train_mat`, `train_age`, `test_mat` and `test_age`) will be used later in
> > this episode. Please make sure that you use the same object names.
> >
> > 2. To
> >
> > ```{r trainfit}
> > # as.data.frame() converts train_mat into a data.frame
> > fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
> > ```
> >
> > Using the `.` syntax above together with a `data` argument will lead to
> > the same result as usign `train_age ~ tran_mat`: R will fit a multivariate
> > regression model in which each of the colums in `train_mat` is used as
> > a predictor. We opted to use the `.` syntax because it will help us to
> > obtain model predictions using the `predict()` function.
> >
> > 3. The mean squared error of the model is the mean of the square of the
> > residuals. This seems very low here -- on average we're only off by
> > about a year!
Expand Down Expand Up @@ -430,16 +425,22 @@ plot(test_age, pred_lm, pch = 19)
abline(coef = 0:1, lty = "dashed")
```

This figure shows the predicted ages obtained from a linear model fit plotted
against the true ages, which we kept in the test dataset. If the prediction were
good, the dots should follow a line. Regularisation can help us to make the
model more generalisable, improving predictions for the test dataset (or any
other dataset that is not used when fitting our model).

# Using regularisation to impove generalisability

As stated above, restricting model parameter estimates can improve a model's
generalisability. This can be done with *regularisation*. The idea to add another
condition to the problem we're solving with linear regression. This condition
controls the total size of the coefficients that come out.
For example, we might say that the point representing the slope and intercept
must fall within a certain distance of the origin, $(0, 0)$. Note that we are still trying to
solve for the line that minimises the square of the residuals; we are just
adding this extra constraint to our solution.
must fall within a certain distance of the origin, $(0, 0)$. Note that we are
still trying to solve for the line that minimises the square of the residuals;
we are just adding this extra constraint to our solution.

For the 2-parameter model (slope and intercept), we could
visualise this constraint as a circle with a given radius. We
Expand Down Expand Up @@ -497,7 +498,7 @@ we try to minimise a function that includes our $L^2$ norm scaled by a
factor that is usually written $\lambda$.

$$
\sum_{i=1}^N \biggl( y_i - X_i\beta\biggr)^2 + \lambda \left\lVert \beta \right\lVert_2 ^2
\sum_{i=1}^N \biggl( y_i - x'_i\beta\biggr)^2 + \lambda \left\lVert \beta \right\lVert_2 ^2
$$

Another way of thinking about this is that when finding the best model, we're
Expand All @@ -510,12 +511,13 @@ just using ordinary least squares. This type of regularisation is called *ridge

# Why would we want to restrict our model?

It may seem an odd thing to do, to restrict the possible values of our model
parameters! Why would we want to do this? Well firstly, when we have many
correlated features our model estimates can be very unstable or even difficult
to calculate. Secondly, this type of approach can make our model more
generalisable. To show this,
we'll fit a model using the Horvath methylation predictors, using both
It may seem an odd thing to do: to restrict the possible values of our model
parameters! Why would we want to do this? Firstly, as discussed earlier, our
model estimates can be very unstable or even difficult to calculate when we have
many correlated features. Secondly, this type of approach can make our model more
generalisable to new data. To show this, we'll fit a model using the same set
of 20 features (stored as `features`) selected earlier in this episode (these
are a subset of the features identified by Horvarth et al), using both
regularised and ordinary least squares.

```{r plot-ridge, fig.cap="Cap", fig.alt="Alt"}
Expand Down Expand Up @@ -572,7 +574,7 @@ abline(v = log(chosen_lambda), lty = "dashed")
```


> ## Exercise
> ## Challenge 4
>
> 1. Which performs better, ridge or OLS?
> 2. Plot predicted ages for each method against the true ages.
Expand Down Expand Up @@ -697,7 +699,7 @@ drawplot()
```


> ## Exercise
> ## Challenge 5
>
> 1. Use `glmnet` to fit a LASSO model (hint: set `alpha = 1`).
> 2. Plot the model object. Remember that for ridge regression,
Expand Down Expand Up @@ -881,7 +883,7 @@ plot_elastic(0.5)
plot_elastic(0.75)
```

> ## Exercise
> ## Challenge 6
>
> 1. Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model
> object.
Expand Down