From 0630b194d82c4741d17d9e6c6a076be70e9ff4c1 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 08:31:54 +0000
Subject: [PATCH 01/15] add computational benefits, task 1

---
 episodes/03-regression-regularisation.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 7e767e60..d820348c 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -46,7 +46,8 @@ a linear model where there are more features (predictor variables) than there ar
 models for each feature and sharing information among these models. Now we will
 take a look at an alternative approach called regularisation. Regularisation can be used to
 stabilise coefficient estimates (and thus to fit models with more features than observations)
-and even to select a subset of relevant features.
+and even to select a subset of relevant features. In addition, regularisation is often very fast
+computationally and is thus practically useful.
 
 First, let us check out what happens if we try to fit a linear model to high-dimensional data!
 We start by reading in the data from the last lesson:

From fe486966f82a02a67fe5c89e2c57dea4f200708c Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 08:41:04 +0000
Subject: [PATCH 02/15] reorder introductory paragraph to clarify differences, task 2

---
 episodes/03-regression-regularisation.Rmd | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index d820348c..1820ca91 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -42,12 +42,12 @@
 One reason that we need special statistical tools for high-dimensional data is that standard
 linear models cannot handle high-dimensional data sets -- one cannot fit
 a linear model where there are more features (predictor variables) than there are observations
-(data points). In the previous lesson we dealt with this problem by fitting individual
+(data points). In the previous lesson, we dealt with this problem by fitting individual
 models for each feature and sharing information among these models. Now we will
-take a look at an alternative approach called regularisation. Regularisation can be used to
-stabilise coefficient estimates (and thus to fit models with more features than observations)
-and even to select a subset of relevant features. In addition, regularisation is often very fast
-computationally and is thus practically useful.
+take a look at an alternative approach that can be used to fit models with more
+features than observations by stabilising coefficient estimates. This approach is called
+regularisation. Compared to many other methods, regularisation is also often very fast
+and can therefore be extremely useful in practice.
 
 First, let us check out what happens if we try to fit a linear model to high-dimensional data!
 We start by reading in the data from the last lesson:
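Aside (not one of the patches): the hunk above introduces the episode's running experiment, fitting a linear model to high-dimensional data. A minimal R sketch of the failure it refers to, using simulated stand-in data rather than the lesson's methylation objects (all object names below are ours):

```r
## More features (50) than observations (10), as in the lesson's setting
set.seed(42)
x <- matrix(rnorm(10 * 50), nrow = 10)
y <- rnorm(10)
fit <- lm(y ~ x)
## lm() can estimate at most as many coefficients as there are observations;
## the remainder come back NA ("not defined because of singularities")
sum(is.na(coef(fit)))
```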
From bf96ccbe9243a463612347d8ac4c83b2db88a1ac Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 08:45:09 +0000
Subject: [PATCH 03/15] add sentence to motivate discussion of singularities, task 3

---
 episodes/03-regression-regularisation.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 1820ca91..a7173c1b 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -79,7 +79,8 @@ high! The summary also says that we were unable to estimate
 effect sizes for `r format(sum(is.na(coef(fit))), big.mark=",")` features
 because of "singularities". What this means is that R couldn't find a way to
 perform the calculations necessary due to the fact that we have more features
-than observations.
+than observations. We explain what singularities are and why they appear when fitting
+models to high-dimensional data below.
 
 > ## Singularities
 >

From 4be70e113a9049b325dc81c901b475097aa5 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 08:59:01 +0000
Subject: [PATCH 04/15] clarify reason for large effect sizes, task 4

Is this a fair summary?
---
 episodes/03-regression-regularisation.Rmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index a7173c1b..93c0ac8c 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -75,7 +75,9 @@ summary(fit)
 ```
 
 You can see that we're able to get some effect size estimates, but they seem very
-high! The summary also says that we were unable to estimate
+high! This is common when fitting a linear regression model with a large number of features,
+often since the model cannot distinguish between the effects of many, correlated features.
+The summary also says that we were unable to estimate
 effect sizes for `r format(sum(is.na(coef(fit))), big.mark=",")` features
 because of "singularities". What this means is that R couldn't find a way to
 perform the calculations necessary due to the fact that we have more features
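Aside (not one of the patches): patch 04 attributes the inflated effect sizes to correlated features. A toy R illustration of that instability, again on simulated data rather than the lesson's:

```r
## Two nearly identical features: least squares cannot separate their effects
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # almost an exact copy of x1
y  <- x1 + rnorm(100)
coef(lm(y ~ x1 + x2))
## The individual estimates are highly unstable and can be very large with
## opposite signs, even though their sum stays near the true combined effect of 1
```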
From 4eac9e31a90ad4b425069b6e4fa4774468c15e46 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:11:24 +0000
Subject: [PATCH 05/15] clarify why large effect sizes and singularities, tasks 4 and 5

Do you agree?
---
 episodes/03-regression-regularisation.Rmd | 15 +++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 93c0ac8c..8a972f31 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -75,15 +75,14 @@ summary(fit)
 ```
 
 You can see that we're able to get some effect size estimates, but they seem very
-high! This is common when fitting a linear regression model with a large number of features,
-often since the model cannot distinguish between the effects of many, correlated features.
-The summary also says that we were unable to estimate
+high! The summary also says that we were unable to estimate
 effect sizes for `r format(sum(is.na(coef(fit))), big.mark=",")` features
-because of "singularities". What this means is that R couldn't find a way to
-perform the calculations necessary due to the fact that we have more features
-than observations. We explain what singularities are and why they appear when fitting
-models to high-dimensional data below.
-
+because of "singularities". We clarify what singularities are in the note below
+but this essentially means that R couldn't find a way to
+perform the calculations necessary to fit the model. Large effect sizes and singularities are common
+when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data),
+often since the model cannot distinguish between the effects of many, correlated features and
+when we have more features than observations.
 
 > ## Singularities
 >

From ebc6fc1431c0621b96907d6494399024f05f9f52 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:16:25 +0000
Subject: [PATCH 06/15] reframe correlated features section

doesn't the previous example show this too? Collinearity isn't a distinct
issue to having singularities?
---
 episodes/03-regression-regularisation.Rmd | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 8a972f31..24851f61 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -116,12 +116,9 @@ when we have more features than observations.
 > ## Correlated features -- common in high-dimensional data
 >
-> So, we can't fit a standard linear model to high-dimensional data. But there
-> is another issue. In high-dimensional datasets, there
+> In high-dimensional datasets, there
 > are often multiple features that contain redundant information (correlated features).
->
-> We have seen in the first episode that correlated features can make it hard
-> (or impossible) to correctly infer parameters. If we visualise the level of
+> If we visualise the level of
 > correlation between sites in the methylation dataset, we can see that many
 > of the features essentially represent the same information - there are many
 > off-diagonal cells, which are deep red or blue. For example, the following
 > heatmap visualises the correlations for the first 500 features in the
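Aside (not one of the patches): the next patch rewrites the singularities note around the calculation $(X^TX)^{-1}X^Ty$. A small R sketch of that calculation, and of where it breaks down, on simulated data:

```r
## Closed-form least squares on a well-posed problem:
## 10 observations, an intercept plus 2 features
set.seed(2)
X <- cbind(1, matrix(rnorm(10 * 2), nrow = 10))
y <- rnorm(10)
solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y

## With more features than observations, t(X) %*% X is singular: its
## determinant is (numerically) zero, and solve() on it would fail with
## "system is computationally singular"
X_wide <- matrix(rnorm(10 * 50), nrow = 10)
det(t(X_wide) %*% X_wide)
```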
From 515cbc805d28f46795904bc26fa243c850afb267 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:39:48 +0000
Subject: [PATCH 07/15] rewrite singularities description

check this is correct
---
 episodes/03-regression-regularisation.Rmd | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 24851f61..1b349098 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -89,24 +89,22 @@ when we have more features than observations.
 > The message that `lm` produced is not necessarily the most intuitive. What
 > are "singularities", and why are they an issue? A singular matrix
 > is one that cannot be
-> [inverted](https://en.wikipedia.org/wiki/Invertible_matrix).
-> The inverse of an $n \times n$ square matrix $A$ is the matrix $B$ for which
-> $AB = BA = I_n$, where $I_n$ is the $n \times n$ identity matrix.
->
-> Why is the inverse important? Well, to find the
-> coefficients of a linear model of a matrix of predictor features $X$ and an
-> outcome vector $y$, we may perform the calculation
+> [inverted](https://en.wikipedia.org/wiki/Invertible_matrix). R uses
+> inverse operations to fit linear models (finds the coefficients) using:
 >
 > $$
-> (X^TX)^{-1}X^Ty
+> (X^TX)^{-1}X^Ty,
 > $$
 >
-> You can see that, if we're unable to find the inverse of the matrix $X^TX$,
-> then we'll be unable to find the regression coefficients.
+> where $X$ is a matrix of predictor features and $y$ is the outcome vector.
+> Thus, if the matrix $X^TX$ cannot be inverted to give $(X^TX)^{-1}$, R
+> cannot fit the model and returns the singularities error.
 >
-> Why might this be the case?
+> Why might R be unable to calculate $(X^TX)^{-1}$ and return singularities errors?
 > Well, when the [determinant](https://en.wikipedia.org/wiki/Determinant)
-> of the matrix is zero, we are unable to find its inverse.
+> of the matrix is zero, we are unable to find its inverse. The determinant
+> of the matrix is zero when there are more features than observations or when
+> the features are highly correlated.
 >
 > ```{r determinant}
 > xtx <- t(methyl_mat) %*% methyl_mat

From 90155d7d1bd749fdc12a505b0db254e167761dff Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:42:49 +0000
Subject: [PATCH 08/15] minor wording change, singularities, task 6

---
 episodes/03-regression-regularisation.Rmd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 1b349098..1ec473f7 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -87,10 +87,10 @@ when we have more features than observations.
 > ## Singularities
 >
 > The message that `lm` produced is not necessarily the most intuitive. What
-> are "singularities", and why are they an issue? A singular matrix
+> are "singularities" and why are they an issue? A singular matrix
 > is one that cannot be
 > [inverted](https://en.wikipedia.org/wiki/Invertible_matrix). R uses
-> inverse operations to fit linear models (finds the coefficients) using:
+> inverse operations to fit linear models (find the coefficients) using:
 >
 > $$
 > (X^TX)^{-1}X^Ty,
@@ -103,7 +103,7 @@ when we have more features than observations.
 > Why might R be unable to calculate $(X^TX)^{-1}$ and return singularities errors?
 > Well, when the [determinant](https://en.wikipedia.org/wiki/Determinant)
 > of the matrix is zero, we are unable to find its inverse. The determinant
-> of the matrix is zero when there are more features than observations or when
+> of the matrix is zero when there are more features than observations or often when
 > the features are highly correlated.
 >
 > ```{r determinant}
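Aside (not one of the patches): patches 09 to 11 below debate where to first say that regularisation helps with correlated features. A sketch of what regularisation buys, assuming the glmnet package (the package and calls here are our assumption, not something these patches introduce); ridge regression fits all features at once even when they outnumber observations:

```r
library("glmnet")                       # assumed available
set.seed(3)
x <- matrix(rnorm(10 * 50), nrow = 10)  # 10 observations, 50 features
y <- rnorm(10)
ridge <- glmnet(x, y, alpha = 0)        # alpha = 0 requests a ridge (L2) penalty
## Every feature receives a finite, shrunken estimate: no NAs, no singularities
head(coef(ridge, s = min(ridge$lambda)))
```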
From 53772595a817541577148d63c8d70030e7220f4c Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:49:29 +0000
Subject: [PATCH 09/15] remove text at the end of correlation section

if keeping the other changes
---
 episodes/03-regression-regularisation.Rmd | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 1ec473f7..cfc81f50 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -136,10 +136,7 @@ library("ComplexHeatmap")
 > )
 > ```
 >
-> Correlation between features can be problematic for technical reasons. If it is
-> very severe, it may even make it impossible to fit a model! This is in addition to
-> the fact that with more features than observations, we can't even estimate
-> the model properly. Regularisation can help us to deal with correlated features.
+Regularisation can help us to deal with correlated features.
 {: .callout}

From 856eaf06e2304fd09aacbde5f08c02b8b6d294d5 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:51:41 +0000
Subject: [PATCH 10/15] tasks 7 and 8 addressed

if keeping other changes
---
 episodes/03-regression-regularisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index cfc81f50..cdfae8ee 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -136,7 +136,7 @@ library("ComplexHeatmap")
 > )
 > ```
 >
-Regularisation can help us to deal with correlated features.
+> Regularisation can help us to deal with correlated features.
 {: .callout}

From 3d28a550ad4c2b9cc8c95258f8767fd14c0c23b9 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 09:56:00 +0000
Subject: [PATCH 11/15] remove sentence until we're about to discuss regularisation

---
 episodes/03-regression-regularisation.Rmd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index cdfae8ee..2ac63621 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -136,7 +136,6 @@ library("ComplexHeatmap")
 > )
 > ```
 >
-> Regularisation can help us to deal with correlated features.
 {: .callout}

From ea3e28ffcdb9cbe1f80dd0c8c17916127b81b55e Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Thu, 29 Feb 2024 10:10:25 +0000
Subject: [PATCH 12/15] change "singularities errors"

---
 episodes/03-regression-regularisation.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 2ac63621..2d61f00d 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -98,9 +98,9 @@ when we have more features than observations.
 >
 > where $X$ is a matrix of predictor features and $y$ is the outcome vector.
 > Thus, if the matrix $X^TX$ cannot be inverted to give $(X^TX)^{-1}$, R
-> cannot fit the model and returns the singularities error.
+> cannot fit the model and returns the error that there are singularities.
 >
-> Why might R be unable to calculate $(X^TX)^{-1}$ and return singularities errors?
+> Why might R be unable to calculate $(X^TX)^{-1}$ and return the error that there are singularities?
 > Well, when the [determinant](https://en.wikipedia.org/wiki/Determinant)
 > of the matrix is zero, we are unable to find its inverse. The determinant
 > of the matrix is zero when there are more features than observations or often when
 > the features are highly correlated.
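Aside (not one of the patches): the reworded sentence that patch 13 below touches lists two causes of singularities, correlated features and having more features than observations. Both show up as rank deficiency of the design matrix, which a quick simulated check makes concrete:

```r
## More features than observations: rank is capped by the number of rows
set.seed(4)
X_wide <- matrix(rnorm(10 * 50), nrow = 10)
qr(X_wide)$rank                               # 10, far below the 50 columns

## Perfectly correlated features: rank falls below the number of columns
x1 <- rnorm(100)
X_dup <- cbind(x1, x2 = rnorm(100), x3 = x1)  # x3 duplicates x1 exactly
qr(X_dup)$rank                                # 2, not 3
```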
From 2fa3452a658da876763f03aef877f57124fc92e8 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Tue, 12 Mar 2024 17:59:25 +0000
Subject: [PATCH 13/15] and to or

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 episodes/03-regression-regularisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 2d61f00d..d6ebf86d 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -81,7 +81,7 @@ because of "singularities". We clarify what singularities are in the note below
 but this essentially means that R couldn't find a way to
 perform the calculations necessary to fit the model. Large effect sizes and singularities are common
 when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data),
-often since the model cannot distinguish between the effects of many, correlated features and
+often since the model cannot distinguish between the effects of many, correlated features or
 when we have more features than observations.
 
 > ## Singularities

From e7d80c2462dcf1933930e9944d4c9ab9fec9d0a8 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Tue, 12 Mar 2024 17:59:54 +0000
Subject: [PATCH 14/15] remove essentially 1

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 episodes/03-regression-regularisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index d6ebf86d..14a1d71d 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -78,7 +78,7 @@ You can see that we're able to get some effect size estimates, but they seem ver
 high! The summary also says that we were unable to estimate
 effect sizes for `r format(sum(is.na(coef(fit))), big.mark=",")` features
 because of "singularities". We clarify what singularities are in the note below
-but this essentially means that R couldn't find a way to
+but this means that R couldn't find a way to
 perform the calculations necessary to fit the model. Large effect sizes and singularities are common
 when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data),
 often since the model cannot distinguish between the effects of many, correlated features or

From 3f9d51f3b0c0d3000ccd59595f01c235acbc42e1 Mon Sep 17 00:00:00 2001
From: Mary Llewellyn
Date: Tue, 12 Mar 2024 18:00:22 +0000
Subject: [PATCH 15/15] remove essentially 2

Co-authored-by: Ailith Ewing <54178580+ailithewing@users.noreply.github.com>
---
 episodes/03-regression-regularisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/03-regression-regularisation.Rmd b/episodes/03-regression-regularisation.Rmd
index 14a1d71d..d7b51e94 100644
--- a/episodes/03-regression-regularisation.Rmd
+++ b/episodes/03-regression-regularisation.Rmd
@@ -118,7 +118,7 @@ when we have more features than observations.
 > are often multiple features that contain redundant information (correlated features).
 > If we visualise the level of
 > correlation between sites in the methylation dataset, we can see that many
-> of the features essentially represent the same information - there are many
+> of the features represent the same information - there are many
 > off-diagonal cells, which are deep red or blue. For example, the following
 > heatmap visualises the correlations for the first 500 features in the
 > `methylation` dataset (we selected 500 features only as it can be hard to
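Aside (not one of the patches): the final hunk's context describes the correlation heatmap of the first 500 methylation features. A scaled-down stand-in, simulating a handful of features that share a common signal; the ComplexHeatmap package is the one the lesson's own hunks load via `library("ComplexHeatmap")`:

```r
library("ComplexHeatmap")
set.seed(5)
signal <- rnorm(50)
## 10 features, each a noisy copy of the same underlying signal
mat <- sapply(1:10, function(i) signal + rnorm(50, sd = 0.5))
colnames(mat) <- paste0("feature_", 1:10)
Heatmap(cor(mat), name = "correlation")  # many strong off-diagonal cells
```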