diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 807bb93a..24ebccc4 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -23,9 +23,11 @@ keypoints: - "Multiple testing methods can be more conservative or more liberal, depending on our goals." math: yes +editor_options: + markdown: + wrap: 72 --- - ```{r settings, include=FALSE} library("here") source(here("bin/chunk-options.R")) @@ -34,14 +36,13 @@ knitr_fig_path("02-") # DNA methylation data -For the following few episodes, we will be working with human -DNA methylation data from flow-sorted blood samples. -DNA methylation assays measure, for many sites in the genome, -the proportion of DNA that carries a methyl mark. -In this case, the methylation data come in the form of a matrix -of normalised methylation levels, where negative values correspond to -unmethylated DNA and positive values correspond to methylated DNA. -Along with this, we have a number of sample phenotypes +For the following few episodes, we will be working with human DNA +methylation data from flow-sorted blood samples. DNA methylation assays +measure, for each of many sites in the genome, the proportion of DNA +that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in +the form of a matrix of normalised methylation levels (M-values), where negative +values correspond to unmethylated DNA and positive values correspond to +methylated DNA. Along with this, we have a number of sample phenotypes (eg, age in years, BMI). Let's read in the data for this episode: @@ -52,30 +53,30 @@ library("minfi") methylation <- readRDS(here("data/methylation.rds")) ``` -This `methylation` object is a `GenomicRatioSet`, a Bioconductor data object -derived from the `SummarizedExperiment` class. -These `SummarizedExperiment` objects contain `assay`s, in this case methylation -levels, and optional sample-level `colData` and feature-level `metadata`. -These objects are very convenient to contain all of the information about -a dataset in a high-throughput context. If you would like more detail -on these objects it may be useful to consult the -[vignettes on Bioconductor](https://www.bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). +This `methylation` object is a `GenomicRatioSet`, a Bioconductor data +object derived from the `SummarizedExperiment` class. These +`SummarizedExperiment` objects contain `assay`s, in this case +normalised methylation levels, and optional sample-level `colData` and +feature-level `metadata`. These objects are very convenient to contain +all of the information about a dataset in a high-throughput context. If +you would like more detail on these objects it may be useful to consult +the [vignettes on +Bioconductor](https://www.bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). ```{r showmethy} methylation ``` -You can see in this output that this object has a `dim` of -$`r nrow(methylation)` \times `r ncol(methylation)`$, -meaning it has `r nrow(methylation)` features and `r ncol(methylation)` columns. -To extract the matrix of methylation values (M-values), we can use the `assay` function. 
-One thing to bear in mind with these objects (and data -structures for computational biology in R generally) is that -in the matrix of methylation data, samples or observations -are stored as columns, while features (in this case, CpG sites) -are stored as rows. This is in contrast to usual tabular data, -where features or variables are stored as columns and -observations are stored as rows. +You can see in this output that this object has a `dim()` of +$`r nrow(methylation)` \times `r ncol(methylation)`$, meaning it has +`r nrow(methylation)` features and `r ncol(methylation)` columns. To +extract the matrix of methylation M-values, we can use the +`assay()` function. One thing to bear in mind with these objects (and +data structures for computational biology in R generally) is that in the +matrix of methylation data, samples or observations are stored as +columns, while features (in this case, sites in the genome) are stored as rows. +This is in contrast to usual tabular data, where features or variables +are stored as columns and observations are stored as rows. ```{r grabx} methyl_mat <- assay(methylation) @@ -87,24 +88,24 @@ The distribution of these M-values looks like this: hist(methyl_mat, breaks = "FD", xlab = "M-value") ``` -You can see that there are two peaks in this distribution, corresponding to -features which are largely unmethylated and methylated, respectively. +You can see that there are two peaks in this distribution, corresponding +to features which are largely unmethylated and methylated, respectively. -Similarly, we can examine the `colData`, which represents the sample-level -metadata we have relating to these data. -In this case, the metadata, phenotypes, and groupings in the `colData` -look like this for the first 6 samples: +Similarly, we can examine the `colData()`, which represents the +sample-level metadata we have relating to these data. In this case, the +metadata, phenotypes, and groupings in the `colData` look like this for +the first 6 samples: ```{r datatable} knitr::kable(head(colData(methylation)), row.names = FALSE) ``` -In this episode, we will focus on the association between age and methylation. -The association between age and methylation status in blood samples has been -studied extensively, and is actually a well-known application of some -high-dimensional regression techniques, with the idea of "epigenetic age" -having been derived using these techniques. The methylation levels for these -data can be presented in a heatmap: +In this episode, we will focus on the association between age and +methylation. The association between age and methylation status in blood +samples has been studied extensively, and is actually a well-known +application of some high-dimensional regression techniques, with the +idea of "epigenetic age" having been derived using these techniques. The +methylation levels for these data can be presented in a heatmap: ```{r heatmap, echo=FALSE, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."} age <- methylation$Age @@ -135,54 +136,46 @@ Heatmap(methyl_mat_ord, ) ``` -We would like to identify features that are related to our outcome of interest -(age). It's clear from the heatmap that there are too many features to do so -manually, even with this reduced number of features - the original dataset -contained over 800,000! 
+We would like to identify features that are related to our outcome of
+interest (age). It is clear from the heatmap that there are too many
+features to do so manually.

> ## Exercise
->
+>
> Why can we not just fit linear regression models of all of the columns
-> in the `colData` above against all of the features in the matrix of assays,
-> and choose all of the significant results at a p-value of 0.05?
->
+> in the `colData` above against all of the features in the matrix of
+> assays, and choose all of the significant results at a p-value of
+> 0.05?
+>
> > ## Solution
-> >
+> >
> > There are a number of problems that this kind of approach presents.
-> > For example:
-> > 1. Without a research question in mind when creating a model, it's not clear
-> > how we can interpret each model, and rationalising the results after the
-> > fact can be dangerous; it's easy to make up a "story" that isn't
-> > grounded in anything but the fact that we have signif findings.
-> > 2. We may not have a representative sample for each of these covariates. For
-> > example, we may have very small sample sizes for some ethnicities,
-> > leading to spurious findings.
-> > 3. If we perform
-> > `r nrow(methylation)` tests for each of `r ncol(colData(methylation))`
-> > variables, even if there were no true associations in the data, we'd be
-> > likely to observe some strong spurious associations that arise just from
-> > random noise.
-> >
+> > For example:
+> >
+> > 1. Without a research question in mind when creating a model, it's
+> >    not clear how we can interpret each model, and rationalising the
+> >    results after the fact can be dangerous; it's easy to make up a
+> >    "story" that isn't grounded in anything but the fact that we have
+> >    significant findings.
+> > 2. We may not have a representative sample for each of these
+> >    covariates. For example, we may have very small sample sizes for
+> >    some ethnicities, leading to spurious findings.
+> > 3. If we perform `r nrow(methylation)` tests for each of
+> >    `r ncol(colData(methylation))` variables, even if there were no
+> >    true associations in the data, we'd be likely to observe some
+> >    strong spurious associations that arise just from random noise.
> >
-> {: .solution}
-{: .challenge}
-
+> {: .solution}
+{: .challenge}
+
> ## Measuring DNA Methylation
->
-> DNA methylation is an epigenetic modification of DNA.
-> Generally, we are interested in the proportion of
-> methylation at many sites or regions in the genome.
-> DNA methylation microarrays, as we are using here,
-> measure DNA methylation using two-channel microarrays,
-> where one channel captures signal from methylated
-> DNA and the other captures unmethylated signal.
-> These data can be summarised
-> as "Beta values" ($\beta$ values), which is the ratio
-> of the methylated signal to the total signal
-> (methylated plus unmethylated).
-> The $\beta$ value for site $i$ is calculated as
->
+>
+> DNA methylation is an epigenetic modification of DNA. Generally, we
+> are interested in the proportion of methylation at many sites or
+> regions in the genome. DNA methylation microarrays, as we are using
+> here, measure DNA methylation using two-channel microarrays, where one
+> channel captures signal from methylated DNA and the other captures
+> unmethylated signal. These data can be summarised as "Beta values"
+> ($\beta$ values), which is the ratio of the methylated signal to the
+> total signal (methylated plus unmethylated). 
The $\beta$ value for +> site $i$ is calculated as +> > $$ > \beta_i = \frac{ > m_i @@ -190,61 +183,56 @@ contained over 800,000! > u_{i} + m_{i} > } > $$ -> -> where $m_i$ is the methylated signal for site $i$ and -> $u_i$ is the unmethylated signal for site $i$. -> $\beta$ values take on a value in the range -> $[0, 1]$, with 0 representing a completely unmethylated -> site and 1 representing a completely methylated site. -> -> The M-values we use here are the $\log_2$ ratio of -> methylated versus unmethylated signal: +> +> where $m_i$ is the methylated signal for site $i$ and $u_i$ is the +> unmethylated signal for site $i$. $\beta$ values take on a value in +> the range $[0, 1]$, with 0 representing a completely unmethylated site +> and 1 representing a completely methylated site. +> +> The M-values we use here are the $\log_2$ ratio of methylated versus +> unmethylated signal: > > $$ > M_i = \log_2\left(\frac{m_i}{u_i}\right) > $$ -> -> M-values are not bounded to an interval as Beta values -> are, and therefore can be easier to work with in statistical models. -{: .callout} - +> +> M-values are not bounded to an interval as Beta values are, and +> therefore can be easier to work with in statistical models. {: +> .callout} # Identifying associations using linear regression -In high-throughput studies, it's common to have one or more -phenotypes or groupings that we want to relate to features of -interest (eg, gene expression, DNA methylation levels). -In general, we want to identify differences in the -features of interest that are related to a phenotype or grouping of our samples. -Identifying features of interest that vary along with -phenotypes or groupings can allow us to understand how +In high-throughput studies, it is common to have one or more phenotypes +or groupings that we want to relate to features of interest (eg, gene +expression, DNA methylation levels). In general, we want to identify +differences in the features of interest that are related to a phenotype +or grouping of our samples. Identifying features of interest that vary +along with phenotypes or groupings can allow us to understand how phenotypes arise or manifest. -For example, we might want to identify genes that are -expressed at a higher level in mutant mice relative -to wild-type mice to understand the effect -of a mutation on cellular phenotypes. -Alternatively, we might have -samples from a set of patients, and wish to identify -epigenetic features that are different in young patients -relative to old patients, to help us understand how aging -manifests. - -Using linear regression, it's possible to identify differences -like these . However, high-dimensional data like the ones we're -working with require some special considerations. A primary consideration, -as we saw above, is that there are far too many features to fit each one-by-one -as we might using a more focused dataset, for example using `lm` on each -one and checking the linear model assumptions. -A secondary consideration is that statistical approaches may behave slightly -differently in very high-dimensional data, compared to low-dimensional -data. A tertiary consideration is the speed at which -we can actually compute statistics for data this large -- methods optimised -for low-dimensional data may be very slow when applied to high-dimensional data. - +For example, we might want to identify genes that are expressed at a +higher level in mutant mice relative to wild-type mice to understand the +effect of a mutation on cellular phenotypes. 
Alternatively, we might +have samples from a set of patients, and wish to identify epigenetic +features that are different in young patients relative to old patients, +to help us understand how ageing manifests. + +Using linear regression, it is possible to identify differences like +these. However, high-dimensional data like the ones we're working with +require some special considerations. A primary consideration, as we saw +above, is that there are far too many features to fit each one-by-one as +we might do when analysing low-dimensional datasets (for example using +`lm` on each feature and checking the linear model assumptions). A +secondary consideration is that statistical approaches may behave +slightly differently in very high-dimensional data, compared to +low-dimensional data. A third consideration is the speed at which we can +actually compute statistics for data this large -- methods optimised for +low-dimensional data may be very slow when applied to high-dimensional +data. Ideally when performing regression, we want to identify cases like this, -where there is a clear assocation, and we probably "don't need" statistics: +where there is a clear association, and we probably "don't need" +statistics: ```{r example1, echo=FALSE, fig.cap="A scatter plot of age and a feature of interest.", fig.alt="An example of a strong linear association between a continuous phenotype (age) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. A strong linear relationship with a positive slope exists between the two.", fig.width=6, fig.height=6} library("ggplot2") @@ -275,8 +263,9 @@ ggplot() + labs(y = "DNA methylation") ``` -However, often due to small differences and small sample sizes, -the problem is a bit more difficult: +However, often due to small differences and small sample sizes, the +problem is more difficult: + ```{r example3, echo=FALSE, fig.cap="A scatter plot of a grouping and a feature of interest.", fig.alt="An example of a strong linear association between a discrete phenotype (group) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. The two groups seem to differ with respect to DNA methylation, but the relationship is weak.", fig.width=6, fig.height=6} library("ggplot2") set.seed(66) @@ -291,27 +280,26 @@ ggplot() + labs(y = "DNA methylation") ``` -And, of course, we often have an awful lot of features and need -to prioritise a subset of them! We need a rigorous way to -prioritise genes for further analysis. - +And, of course, we often have an awful lot of features and need to +prioritise a subset of them! We need a rigorous way to prioritise genes +for further analysis. # Fitting a linear model -So, in the data we've read in, we have a matrix of methylation values $X$ and a -vector of ages, $y$. -One way to model this is to see if we can use age to predict the expected -(average) methylation value for sample $j$ at a given locus $i$, which we can -write as $X_{ij}$. We can write that model as: +So, in the data we have read in, we have a matrix of methylation values +$X$ and a vector of ages, $y$. One way to model this is to see if we can +use age to predict the expected (average) methylation value for sample +$j$ at a given locus $i$, which we can write as $X_{ij}$. We can write +that model as: $$ \mathbf{E}(X_{ij}) = \beta_0 + \beta_1 \text{Age}_j $$ where $\text{Age}_j$ is the age of sample $j$. 
In this model, $\beta_1$ -represents the unit change in mean methylation level for each unit (year) -change in age. -We can fit this model and get more information from the model object: +represents the unit change in mean methylation level for each unit +(year) change in age. We can fit this model and get more information +from the model object: ```{r fit1} age <- methylation$Age @@ -319,97 +307,96 @@ lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age) lm_age_methyl1 ``` -We now have estimates for the expected methylation level when age equals 0 -(the intercept) and the change in methylation level for -a unit change in age (the slope). We could plot this linear model: +We now have estimates for the expected methylation level when age equals +0 (the intercept) and the change in methylation level for a unit change +in age (the slope). We could plot this linear model: ```{r plot-lm-methyl1} plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16) abline(lm_age_methyl1) ``` -For this feature, we can see that there isn't a strong relationship between -methylation and age. We could try to repeat this for every feature in our -dataset; however, we have a lot of features! We need an approach that allows -us to assess associations between all of these features and our outcome while -addressing the three considerations we outlined previously. Before -we introduce this approach, let's go into detail about how we generally -check whether the results of a linear model are statistically significant. +For this feature, we can see that there is no strong relationship +between methylation and age. We could try to repeat this for every +feature in our dataset; however, we have a lot of features! We need an +approach that allows us to assess associations between all of these +features and our outcome while addressing the three considerations we +outlined previously. Before we introduce this approach, let's go into +detail about how we generally check whether the results of a linear +model are statistically significant. # Hypothesis testing in linear regression -Using the linear model we defined above, we can assess if the coefficients -we've fitted are worthy of further investigation, or if they would be likely -to arise by chance. For linear regression, we usually do this -using null hypothesis significance testing. This framework compares the results -that we observed (here, linear model coefficients) to the results you would -expect under the null hypothesis. -To be more specific, null hypothesis significance testing compares our observed -results with a set of hypothetical counter-examples of what we would expect to -observe if we repeated the same experiment and analysis over and over again -under the null hypothesis. In the case of linear regression, the null -hypothesis is that there is absolutely no relationship between the predictor -variable(s) and the outcome. This may not always be the most realistic or -useful null hypothesis, but it's the one we have! - - -For this linear model, we can use `broom` to extract detailed information about -the coefficients and the associated hypothesis tests in this model: +Using the linear model we defined above, we can assess if the +coefficients we have fitted are worthy of further investigation, or if +they would be likely to arise by chance. For linear regression, we +usually do this using null hypothesis significance testing. 
This +framework compares the results that we observed (here, linear model +coefficients) to the results you would expect under the null hypothesis. +To be more specific, null hypothesis significance testing compares our +observed results with a set of hypothetical counter-examples of what we +would expect to observe if we repeated the same experiment and analysis +over and over again under the null hypothesis. In the case of linear +regression, the null hypothesis is that there is absolutely no +relationship between the predictor variable(s) and the outcome. This may +not always be the most realistic or useful null hypothesis, but it is +the one we have! + +For this linear model, we can use `broom` to extract detailed +information about the coefficients and the associated hypothesis tests +in this model: ```{r tidyfit} library("broom") tidy(lm_age_methyl1) ``` -The standard errors (`std.error`) represent the uncertainty in our effect size -estimate. The test statistics and p-values represent measures of how (un)likely -it would be to observe results like this under the "null hypothesis". - +The standard errors (`std.error`) represent the uncertainty in our +effect size estimate. The test statistics and p-values represent +measures of how (un)likely it would be to observe results like this +under the "null hypothesis". > ## Exercise -> -> In the model we fitted, the p-value for the intercept term is significant -> at $p < 0.05$. What does this mean? -> +> +> In the model we fitted, the p-value for the intercept term is +> significant at $p < 0.05$. What does this mean? +> > > ## Solution -> > -> > The intercept for this model indicates that the mean of methylation, when -> > age is zero, is 0.902. -> > However, this isn't a particularly note-worthy finding! -> > Also, regardless of the p-value, it's unlikely to be a reliable finding, as -> > we don't have any individuals with age zero, nor even any with age < 20. > > -> {: .solution} -{: .challenge} - +> > The intercept for this model indicates that the mean of methylation, +> > when age is zero, is 0.902. However, this isn't a particularly +> > note-worthy finding! Also, regardless of the p-value, it's unlikely +> > to be a reliable finding, as we don't have any individuals with age +> > zero, nor even any with age \< 20. +> > +> > {: .solution} {: .challenge} # Fitting a lot of linear models -If we wanted to repeat this process for each feature in our data, we need -to narrow our focus. In particular, -The first coefficient in a linear model like this is the intercept, -which measures the overall -offset between age and methylation levels. It's not really interesting -if this is zero or non-zero, since we probably don't care what the methylation -level is when age is zero. In fact, this question doesn't even really make -much sense! Instead, we're more -interested if there is a relationship between increasing age and methylation -levels. Therefore, we'll focus only on the second coefficient. +If we wanted to repeat this process for each feature in our data, we +need to narrow our focus. In particular, The first coefficient in a +linear model like this is the intercept, which measures the overall +offset between age and methylation levels. It is not really interesting +if this intercept is zero or not, since we probably do not care what the +methylation level is when age is zero. In fact, this question does not +even make much sense! Instead, we are more interested in whether there +is a relationship between age and methylation levels. 
Therefore, we will +focus only on the second coefficient. ```{r tidyfit2} coef_age_methyl1 <- tidy(lm_age_methyl1)[2, ] coef_age_methyl1 ``` -Now, we could do this for every feature in the dataset and rank the results. -However, fitting models in this way to `r nrow(methylation)` features -is not very computationally efficient, -and it would also be laborious to do programmatically. There are ways to get -around this, but first let's talk about what exactly we're doing when we -look at significance tests in this context. +Now, we could do this for every feature in the dataset and rank the +results. However, fitting models in this way to `r nrow(methylation)` +features is not very computationally efficient, and it would also be +laborious to do programmatically. There are ways to get around this, but +first let us talk about what exactly we are doing when we look at +significance tests in this context. ```{r rbindfits, echo=FALSE} -# Instead of repeating this model fitting and extraction focus over and over, we +# Instead of repeating this model fitting and coefficient extraction by hand, we # could write a function to fit this kind of model for any given row of the # matrix: @@ -433,45 +420,41 @@ coef_df$feature <- rownames(methyl_mat) # ) ``` - - - # How does hypothesis testing for a linear model work? -In order to decide whether a result would be unlikely -under the null hypothesis, we can calculate a test statistic. -For coefficient $k$ in a linear model (in our case, it would be the slope), -the test statistic is a t-statistic given by: +In order to decide whether a result would be unlikely under the null +hypothesis, we must calculate a test statistic. For coefficient $k$ in a +linear model (in our case, it would be the slope), the test statistic is +a t-statistic given by: $$ t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)} $$ -$SE\left(\hat{\beta}_{k}\right)$ measures the uncertainty we have in our effect -size estimate. -Knowing what distribution these t-statistics follow under the null -hypothesis allows us to determine how unlikely it would be for -us to observe what we have under those circumstances, if we repeated the -experiment and analysis over and over again. -To demonstrate, we can manually demonstrate the relationship between these -quantities (this is not important to remember). +$SE\left(\hat{\beta}_{k}\right)$ measures the uncertainty we have in our +effect size estimate. Knowing what distribution these t-statistics +follow under the null hypothesis allows us to determine how unlikely it +would be for us to observe what we have under those circumstances, if we +repeated the experiment and analysis over and over again. To +demonstrate, we can compute the t-statistics "by hand" (advanced content). ```{r simfit} table_age_methyl1 <- tidy(lm_age_methyl1) ``` -We can see that the t-statistic is just the ratio of the estimate to the -standard error: +We can see that the t-statistic is just the ratio between the coefficient estimate +and the standard error: ```{r simtval} tvals <- table_age_methyl1$estimate / table_age_methyl1$std.error all.equal(tvals, table_age_methyl1$statistic) ``` -Calculating the p-values is a bit more tricky. Specifically, it's the proportion -of the distribution of the test statistic under the null hypothesis that is -*as extreme or more extreme* than the observed value of the test statistic. -This is easy to observe visually, by plotting the distribution: +Calculating the p-values is a bit more tricky. 
Specifically, it is the +proportion of the distribution of the test statistic under the null +hypothesis that is *as extreme or more extreme* than the observed value +of the test statistic. This is easy to observe visually, by plotting the +distribution: ```{r tdist, echo = FALSE, fig.cap="The p-value for a regression coefficient represents how often it'd be observed under the null.", fig.alt="Density plot of a t-distribution showing the observed test statistics (here, t-statistics). The p-values, visualised here with shaded regions, represent the portion of the null distribution that is as extreme or more extreme as the observed test statistics, which are shown as dashed lines."} ggplot() + @@ -497,106 +480,100 @@ ggplot() + labs(x = "t-statistic", y = "Density") ``` -The red-ish shaded region represents the portion of the distribution of the test -statistic under the null hypothesis that is equal or greater to the value -we observe for the intercept term. This shaded region -is small relative to the total area of the null distribution; therefore, the -p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = 3)`$). -The blue-ish shaded region represents the same measure for the slope -term; this is larger, relative to the total area of the distribution, therefore -the p-value is larger than the one for the intercept term -($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). -You can see that the p-value is a function of the size of the effect we're -estimating and the uncertainty we have in that effect. A large effect with -large uncertainty may not lead to a small p-value, and a small effect with small -uncertainty may lead to a small p-value. - +The red-ish shaded region represents the portion of the distribution of +the test statistic under the null hypothesis that is equal or greater to +the value we observe for the intercept term. This shaded region is small +relative to the total area of the null distribution; therefore, the +p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = +3)`$). The blue-ish shaded region represents the same measure for the slope term; this is larger, relative to the total area of the distribution, therefore the p-value is larger than the one for the intercept term ($p=`r +round(table_age_methyl1$p.value[[2]], digits = 3)`$). You can see that +the p-value is a function of the size of the effect we're estimating and +the uncertainty we have in that effect. A large effect with large +uncertainty may not lead to a small p-value, and a small effect with +small uncertainty may lead to a small p-value. > ## Calculating p-values from a linear model -> -> Manually calculating the p-value for a linear model is a little bit more -> complex than calculating the t-statistic. The intuition posted above is -> definitely sufficient for most cases, but for completeness, here is how -> we do it: -> -> Since the statistic in a linear model is a t-statistic, it follows a student -> t distribution under the null hypothesis, with degrees of freedom (a parameter -> of the student t distribution) given by -> the number of observations minus the number of coefficients fitted, in this -> case $`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = -> `r lm_age_methyl1$df`$. -> We want to know what portion of -> the distribution function of the test statistic is as extreme as, or more -> extreme than, the value we observed. 
The function `pt` (similar to `pnorm`,
+>
+> Manually calculating the p-value for a linear model is a little bit
+> more complex than calculating the t-statistic. The intuition presented
+> above is definitely sufficient for most cases, but for completeness,
+> here is how we do it:
+>
+> Since the statistic in a linear model is a t-statistic, it follows a
+> Student's t distribution under the null hypothesis, with degrees of
+> freedom (a parameter of the Student's t distribution) given by the
+> number of observations minus the number of coefficients fitted, in
+> this case
+> $`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = `r lm_age_methyl1$df`$.
+> We want to know what portion of the distribution function of the test
+> statistic is as extreme as, or more extreme than, the value we
+> observed. The function `pt` (similar to `pnorm`,
> etc) can give us this information.
->
-> Since we're not sure if the coefficient will be larger or smaller than zero,
-> we want to do a 2-tail test. Therefore we take the absolute value of the
-> t-statistic, and look at the upper rather than lower tail.
-> Because in a 2-tail test we're looking at "half" of the t-distribution,
-> we also multiply the p-value by 2.
->
+>
+> Since we're not sure if the coefficient will be larger or smaller than
+> zero, we want to do a 2-tail test. Therefore we take the absolute
+> value of the t-statistic, and look at the upper rather than lower
+> tail. Because in a 2-tail test we're looking at "half" of the
+> t-distribution, we also multiply the p-value by 2.
+>
> Combining all of this gives us:
>
> ```{r simpval}
> pvals <- 2 * pt(abs(tvals), df = lm_age_methyl1$df, lower.tail = FALSE)
> all.equal(table_age_methyl1$p.value, pvals)
> ```
-{: .callout}
-
+{: .callout}
+
# Sharing information between features

-Now that we understand what's going on when we perform a hypothesis test
-for a linear model, we're going to introduce an idea that allows us to take
-advantage of the fact that we're doing many tests at once on structured data.
-We can leverage this fact to *share information* between model parameters.
-The insight that we use to perform *information pooling* or sharing is derived
-from our knowledge about the structure of the data. For example, in a
-high-throughput experiment like a DNA methylation assay, we know that all of
-the features were measured simultaneously, using the same technique. This means
-that generally, we expect the base-level variability for each feature to be
-broadly similar.
-
-This can enable us to get a better estimate of the uncertainty of model
+Now that we understand how hypothesis tests work in the linear model
+framework, we are going to introduce an idea that allows us to take
+advantage of the fact that we carry out many tests at once on
+structured data. We can leverage this fact to *share information*
+between model parameters. The insight that we use to perform
+*information pooling* or sharing is derived from our knowledge about the
+structure of the data. For example, in a high-throughput experiment like
+a DNA methylation assay, we know that all of the features were measured
+simultaneously, using the same technique. This means that generally, we
+expect the base-level variability for each feature to be broadly
+similar.
+
+This can enable us to get a better estimate of the uncertainty of model
parameters than we could get if we consider each feature in isolation.
-This enables us to share information between genes to get more robust
-estimators.
-Remember that the t-statistic for coefficient $\beta_k$
-in a linear model is the ratio of the coefficient value to the standard error:
+Sharing information between features in this way allows us to get more
+robust estimators. Remember that the t-statistic for coefficient
+$\beta_k$ in a linear model is the ratio between the coefficient
+estimate and its standard error:

$$
t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)}
$$

-It's clear that large effect sizes will likely lead to small p-values,
-as long as the standard error for the coefficent is not large.
-However, the standard error is affected by the strength of noise,
-as we saw earlier.
-If we have a small number of observations, it's common for the noise for some
-features to be extremely small simply by chance. This will lead to a very small
-p-value, which may give us unwarranted confidence in the level of certainty
-we have in the results.
-
-There are many statistical methods in genomics that use this type of approach
-to get better estimates by pooling information between features that were
-measured simultaneously using the same techniques.
-Here we will focus on the package `limma`, which is a relatively old software
+It is clear that large effect sizes will likely lead to small p-values,
+as long as the standard error for the coefficient is not large. However,
+the standard error is affected by the amount of noise, as we saw
+earlier. If we have a small number of observations, it is common for the
+noise for some features to be extremely small simply by chance. This, in
+turn, causes small p-values for these features, which may give us
+unwarranted confidence in the level of certainty we have in the results
+(false positives).
+
+There are many statistical methods in genomics that use this type of
+approach to get better estimates by pooling information between features
+that were measured simultaneously using the same techniques. Here we
+will focus on the package **`limma`**, which is an established software
package used to fit linear models, originally for the gene expression
micro-arrays that were common in the 2000s, but which is still in use in
-RNAseq experiments,among others.
-The authors of `limma` made some assumptions about the distributions that these
-follow, and pool information across genes to get a better estimate of the
-uncertainty in effect size estimates. It uses the idea that noise levels should
-be similar between features to *moderate* the estimates of the test statistic
-by shrinking the estimates of standard errors towards a common value. This
-results in a *moderated t-statistic*.
-
-The process of running a model in `limma` is a bit different to what you may
-have seen when running linear models. Here, we define a *model matrix* or
-*design matrix*, which is a way of representing the coefficients that should be
-fit in each linear model. These are used in similar ways in a lot of different
-modelling libraries.
+RNAseq experiments, among others. The authors of **`limma`** made some
+assumptions about the distributions that these follow, and pool
+information across genes to get a better estimate of the uncertainty in
+effect size estimates. It uses the idea that noise levels should be
+similar between features to *moderate* the estimates of the test
+statistic by shrinking the estimates of standard errors towards a common
+value. This results in a *moderated t-statistic*.
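+
+To make this moderation a little more concrete, one optional way to see
+it in action (a sketch, using the `methyl_mat` and `age` objects defined
+above, and `limma`'s `squeezeVar()` function, which `eBayes()` uses
+internally) is to shrink the per-feature residual variances towards a
+common value ourselves and compare them to the raw values:
+
+```{r squeeze-var-sketch, fig.cap="Per-feature residual variances are shrunk towards a common value.", fig.alt="Scatter plot of moderated (shrunken) variances against raw per-feature residual variances. Points are pulled away from the dashed identity line towards a horizontal line at the common prior variance."}
+library("limma")
+
+## Residual variance of each feature after regressing methylation on age.
+## lm() fits all features in one call when the response is a matrix.
+residuals_all <- residuals(lm(t(methyl_mat) ~ age))
+df_residual <- ncol(methyl_mat) - 2
+s2 <- colSums(residuals_all^2) / df_residual
+
+## squeezeVar() shrinks these variances towards a common ("prior") value,
+## mirroring what eBayes() does before computing moderated t-statistics.
+shrunk <- squeezeVar(s2, df = df_residual)
+plot(s2, shrunk$var.post,
+    xlab = "Per-feature residual variance",
+    ylab = "Moderated (shrunken) variance",
+    pch = 16
+)
+abline(0, 1, lty = "dashed")
+abline(h = shrunk$var.prior, col = "firebrick", lty = "dotted")
+```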
+ +The process of running a model in **`limma`** is somewhat different to what you +may have seen when running linear models. Here, we define a *model +matrix* or *design matrix*, which is a way of representing the +coefficients that should be fit in each linear model. These are used in +similar ways in many different modelling libraries. ```{r design-age} library("limma") @@ -605,59 +582,61 @@ dim(design_age) head(design_age) ``` -As you can see, the design matrix has the same number of rows as our methylation -data has samples. It also has two columns - one for the intercept, similar to -the linear model we fit above, and one for age. This happens "under the hood" -when fitting a linear model with `lm`, but here we have to specify it directly. -The [limma user manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) has more detail on how to make design matrices for different -types of experimental design, but here we're going to stick with this simple -two variable case. - -We then pass our matrix of methylation values into `lmFit`, specifying the -design matrix. Internally, this function runs `lm` on each row of the data -in an efficient way. The function `eBayes`, when applied to the output of -`lmFit`, performs the pooled estimation of standard errors that results -in the moderated t-statistics and resulting p-values. +As you can see, the design matrix has the same number of rows as our +methylation data has samples. It also has two columns - one for the +intercept (similar to the linear model we fit above) and one for age. +This happens "under the hood" when fitting a linear model with `lm()`, but +here we have to specify it directly. The [limma user +manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) +has more detail on how to make design matrices for different types of +experimental design, but here we are going to stick with this simple two-variable case. + +We then pass our matrix of methylation values into `lmFit()`, specifying +the design matrix. Internally, this function runs `lm()` on each row of +the data in an efficient way. The function `eBayes()`, when applied to the +output of `lmFit()`, performs the pooled estimation of standard errors +that results in the moderated t-statistics and resulting p-values. ```{r lmfit-age} fit_age <- lmFit(methyl_mat, design = design_age) fit_age <- eBayes(fit_age) ``` -To obtain the results of the linear models, we can use the `topTable` function. -By default, this returns results for the first coefficient in the model. As -we saw above when using `lm`, and when we defined `design_age` above, -the first coefficient relates to the intercept term, which we are not -particularly interested in here; therefore we specify `coef = 2`. -Further, `topTable` by default only returns the top 10 results. To see all -of the results in the data, we specify `number = nrow(fit_age)` to ensure -that it returns a row for every row of the input matrix. +To obtain the results of the linear models, we can use the `topTable()` +function. By default, this returns results for the first coefficient in +the model. As we saw above when using `lm()`, and when we defined +`design_age` above, the first coefficient relates to the intercept term, +which we are not particularly interested in here; therefore we specify +`coef = 2`. Further, `topTable()` by default only returns the top 10 +results. 
To see all of the results in the data, we specify +`number = nrow(fit_age)` to ensure that it returns a row for every row +of the input matrix. ```{r ebayes-toptab} toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age)) -head(toptab_age) +orderEffSize <- rev(order(abs(toptab_age$logFC))) # order by effect size (absolute log-fold change) +head(toptab_age[orderEffSize, ]) ``` -The output of `topTable` includes the coefficient, here termed a log fold change -`logFC`, the average level (`aveExpr`), the t-statistic `t`, the p-value -(`P.Value`), and the *adjusted* p-value (`adj.P.Val`). We'll cover what -an adjusted p-value is very shortly. The table also includes `B`, -which represents the log-odds that a feature is signficantly different, -which we won't cover here, but which will generally be a 1-1 transformation -of the p-value. -The coefficient estimates here are termed `logFC` for legacy reasons relating -to how microarray experiments were traditionally performed. -There are more details on this topic in many places, for example -[this tutorial by Kasper D. Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) - -Now we have estimates of effect sizes and p-values for the association between -methylation level at each locus and age for our 37 samples. It's useful to -create a plot of effect size estimates (model coefficients) against -p-values for each of these linear models, to visualise the magnitude of effects -and the statistical significance of each. -These plots are often called "volcano plots", because they -resemble an eruption. - +The output of `topTable` includes the coefficient, here termed a log +fold change `logFC`, the average level (`aveExpr`), the t-statistic `t`, +the p-value (`P.Value`), and the *adjusted* p-value (`adj.P.Val`). We'll +cover what an adjusted p-value is very shortly. The table also includes +`B`, which represents the log-odds that a feature is signficantly +different, which we won't cover here, but which will generally be a 1-1 +transformation of the p-value. The coefficient estimates here are termed +`logFC` for legacy reasons relating to how microarray experiments were +traditionally performed. There are more details on this topic in many +places, for example [this tutorial by Kasper D. +Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) + +Now we have estimates of effect sizes and p-values for the association +between methylation level at each locus and age for our 37 samples. It's +useful to create a plot of effect size estimates (model coefficients) +against p-values for each of these linear models, to visualise the +magnitude of effects and the statistical significance of each. These +plots are often called "volcano plots", because they resemble an +eruption. ```{r limmavolc1, fig.cap="Plotting p-values against effect sizes using limma; the results are similar to a standard linear model.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of age against methylation using limma."} plot(toptab_age$logFC, -log10(toptab_age$P.Value), @@ -668,48 +647,49 @@ plot(toptab_age$logFC, -log10(toptab_age$P.Value), In this figure, every point represents a feature of interest. The x-axis represents the effect size observed for that feature in a linear model, -while the y-axis is the $-\log_{10}(\text{p-value})$, where larger values -indicate increasing statistical evidence of a non-zero effect size. 
A positive effect size represents increasing methylation with
+increasing age, and a negative effect size represents decreasing
+methylation with increasing age. Points higher on the y-axis represent
+features for which we think the results we observed would be very
+unlikely under the null hypothesis.
+
+Since we want to identify features that have different methylation
+levels in different age groups, in an ideal case there would be clear
+separation between "null" and "non-null" features. However, usually we
+observe results as we do here: there is a continuum of effect sizes and
+p-values, with no clear separation between these two classes of
+features. While statistical methods exist to derive insights from
+continuous measures like these, it is often convenient to obtain a list
+of features which we are confident have non-zero effect sizes. This is
+made more difficult by the number of tests we perform.

> ## Exercise
->
-> The effect size estimates are very small, and yet many of the p-values are
-> well below a usual significance level of p < 0.05. Why is this?
->
+>
+> The effect size estimates are very small, and yet many of the p-values
+> are well below a usual significance level of p < 0.05. Why is this?
+>
> > ## Solution
-> >
-> > Because age has a much larger range than methylation levels, the unit change
-> > in methylation level even for a strong relationship is very small!
-> >
-> > As we mentioned, the p-value is a function of both the effect size estimate
-> > and the uncertainty (standard error) of that estimate. Because the
-> > uncertainty in our estimates is much smaller than the estimates themselves,
-> > the p-values are also small.
-> >
-> > If we predicted age using methylation level, it's likely we'd see much
-> > larger coefficients, though broadly similar p-values!
-> >
-> {: .solution}
-{: .challenge}
+> >
+> > Because age has a much larger range than methylation levels, the
+> > unit change in methylation level even for a strong relationship is
+> > very small!
+> >
+> > As we mentioned, the p-value is a function of both the effect size
+> > estimate and the uncertainty (standard error) of that estimate.
+> > Because the uncertainty in our estimates is much smaller than the
+> > estimates themselves, the p-values are also small.
+> >
+> > If we predicted age using methylation level, it's likely we'd see
+> > much larger coefficients, though broadly similar p-values!
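+> >
+> > For example, for the first feature we could compare the two slopes
+> > directly (an optional check; `lm_age_methyl1` is the model we fitted
+> > earlier):
+> >
+> > ```{r age-methyl-slopes}
+> > coef(lm_age_methyl1)[2]              # M-value change per year of age
+> > coef(lm(age ~ methyl_mat[1, ]))[2]   # years of age per unit M-value
+> > ```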
+> > +> > {: .solution} {: .challenge} It's worthwhile considering what exactly the effect of the *moderation* -or information sharing that `limma` performs has on our results. To do this, -let's compare the effect sizes estimates and p-values from the two approaches. +or information sharing that `limma` performs has on our results. To do +this, let's compare the effect sizes estimates and p-values from the two +approaches. ```{r plot-limma-lm-effect, echo = FALSE} plot( @@ -723,12 +703,12 @@ plot( abline(0:1, lty = "dashed") ``` -These are exactly identical! This is because `limma` isn't performing any -sharing of information when estimating effect sizes. This is in contrast to -similar packages that apply shrinkage to the effect size estimates, like -`DESeq2`. These often use information sharing to shrink or moderate the -effect size estimates, in the case of DESeq2 by again sharing information -between features about sample-to-sample variability. +These are exactly identical! This is because `limma` isn't performing +any sharing of information when estimating effect sizes. This is in +contrast to similar packages that apply shrinkage to the effect size +estimates, like `DESeq2`. These often use information sharing to shrink +or moderate the effect size estimates, in the case of DESeq2 by again +sharing information between features about sample-to-sample variability. In contrast, let's look at the p-values from `limma` and `lm`: ```{r plot-limma-lm-pval, echo = FALSE} @@ -744,61 +724,60 @@ plot( abline(0:1, lty = "dashed") ``` -You can see that for the vast majority of features, the results are broadly -similar. There seems to be a minor general tendency for `limma` to produce -smaller p-values, but for several features, the p-values from limma are -considerably larger than the p-values from `lm`. -This is because the information sharing tends to shrink large standard error -estimates downwards and small estimates upwards. -When the degree of statistical significance is due to an abnormally small -standard error rather than a large effect, this effect results in this prominent -reduction in statistical significance, which has been shown to perform well -in case studies. The degree of shrinkage generally depends on -the amount of pooled information and the strength of the -evidence independent of pooling. For example, with very few samples and many -features, information sharing has a larger effect, because there are a lot -of genes that can be used to provide pooled estimates, and the evidence from -the data that this is weighed against is relatively sparse. In contrast, -when there are many samples and few features, there is not much opportunity -to generate pooled estimates, and the evidence of the data can easily outweigh -the pooling. - -Shrinkage methods like these ones can be complex to implement and understand, -but it's good to understand why these approaches may be more precise -and sensitive than the naive approach of fitting a model to each feature -separately. +You can see that for the vast majority of features, the results are +broadly similar. There seems to be a minor general tendency for `limma` +to produce smaller p-values, but for several features, the p-values from +limma are considerably larger than the p-values from `lm`. This is +because the information sharing tends to shrink large standard error +estimates downwards and small estimates upwards. 
When the degree of +statistical significance is due to an abnormally small standard error +rather than a large effect, this effect results in this prominent +reduction in statistical significance, which has been shown to perform +well in case studies. The degree of shrinkage generally depends on the +amount of pooled information and the strength of the evidence +independent of pooling. For example, with very few samples and many +features, information sharing has a larger effect, because there are a +lot of genes that can be used to provide pooled estimates, and the +evidence from the data that this is weighed against is relatively +sparse. In contrast, when there are many samples and few features, there +is not much opportunity to generate pooled estimates, and the evidence +of the data can easily outweigh the pooling. + +Shrinkage methods like these ones can be complex to implement and +understand, but it's good to understand why these approaches may be more +precise and sensitive than the naive approach of fitting a model to each +feature separately. > ## Exercise -> -> 1. Try to run the same kind of linear model with smoking -> status as covariate instead of age, and making a volcano -> plot. -> *Note: smoking status is stored as `methylation$smoker`.* -> 2. We saw in the example in the lesson that this information sharing -> can lead to larger p-values. Why might this be preferable? -> +> +> 1. Try to run the same kind of linear model with smoking status as +> covariate instead of age, and making a volcano plot. *Note: +> smoking status is stored as `methylation$smoker`.* +> 2. We saw in the example in the lesson that this information sharing +> can lead to larger p-values. Why might this be preferable? > > > ## Solution -> > -> > 1. The following code runs the same type of model with smoking status: -> > ```{r limmavolc2, fig.cap="A plot of significance against effect size for a regression of smoking against methylation.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of smoking status against methylation using limma."} -> > design_smoke <- model.matrix(~methylation$smoker) -> > fit_smoke <- lmFit(methyl_mat, design = design_smoke) -> > fit_smoke <- eBayes(fit_smoke) -> > toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke)) -> > plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value), -> > xlab = "Effect size", ylab = bquote(-log[10](p)), -> > pch = 19 -> > ) -> > ``` -> > 2. Being a bit more conservative when identifying features can help to avoid -> > false discoveries. Furthermore, when rejecting the null hypothesis is -> > based more on a small standard error resulting from abnormally low levels -> > of variability for a given feature, we might want to be a bit more -> > conservative in our expectations. -> {: .solution} -{: .challenge} - +> > +> > 1. 
The following code runs the same type of model with smoking +> > status: +> > +> > ```{r limmavolc2, fig.cap="A plot of significance against effect size for a regression of smoking against methylation.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of smoking status against methylation using limma."} +> > design_smoke <- model.matrix(~methylation$smoker) +> > fit_smoke <- lmFit(methyl_mat, design = design_smoke) +> > fit_smoke <- eBayes(fit_smoke) +> > toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke)) +> > plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value), +> > xlab = "Effect size", ylab = bquote(-log[10](p)), +> > pch = 19 +> > ) +> > ``` +> > +> > 2. Being a bit more conservative when identifying features can help +> > to avoid false discoveries. Furthermore, when rejecting the null +> > hypothesis is based more on a small standard error resulting +> > from abnormally low levels of variability for a given feature, +> > we might want to be a bit more conservative in our expectations. +> > {: .solution} {: .challenge} ```{r limma-app-ex, echo = FALSE, eval = FALSE} > ## Exercise @@ -828,47 +807,42 @@ separately. ``` > ## Shrinkage -> -> Shrinkage is an intuitive term for an effect -> of information sharing, and is something observed -> in a broad range of statistical models. -> Often, shrinkage is induced by a *multilevel* -> modelling approach or by *Bayesian* methods. -> -> The general idea is that these models incorporate -> information about the structure of the -> data into account when fitting the parameters. -> We can share information between features -> because of our knowledge about the data structure; -> this generally requires careful consideration about -> how the data were generated and the relationships within. > -> An example people often use is estimating the effect -> of attendance on grades in several schools. We can -> assume that this effect is similar in different schools -> (but maybe not identical), so we can *share information* -> about the effect size between schools and shink our +> Shrinkage is an intuitive term for an effect of information sharing, +> and is something observed in a broad range of statistical models. +> Often, shrinkage is induced by a *multilevel* modelling approach or by +> *Bayesian* methods. +> +> The general idea is that these models incorporate information about +> the structure of the data into account when fitting the parameters. We +> can share information between features because of our knowledge about +> the data structure; this generally requires careful consideration +> about how the data were generated and the relationships within. +> +> An example people often use is estimating the effect of attendance on +> grades in several schools. We can assume that this effect is similar +> in different schools (but maybe not identical), so we can *share +> information* about the effect size between schools and shink our > estimates towards a common value. -> -> For example in `DESeq2`, the authors used the observation -> that genes with similar expression counts in RNAseq data -> have similar *dispersion*, and a better estimate of -> these dispersion parameters makes estimates of -> fold changes much more stable. -> Similarly, in `limma` the authors made the assumption that -> in the absence of biological effects, we can often expect the -> technical variation in the measurement of the expression of each of the -> genes to be broadly similar. 
-> Again, better estimates of variability allow us to
-> prioritise genes in a more reliable way.
->
+>
+> For example in `DESeq2`, the authors used the observation that genes
+> with similar expression counts in RNAseq data have similar
+> *dispersion*, and a better estimate of these dispersion parameters
+> makes estimates of fold changes much more stable. Similarly, in
+> `limma` the authors made the assumption that in the absence of
+> biological effects, we can often expect the technical variation in
+> the measurement of the expression of each of the genes to be broadly
+> similar. Again, better estimates of variability allow us to
+> prioritise genes in a more reliable way.
+>
> There are many good resources to learn about this type of approach,
> including:
->
-> - [a blog post by TJ Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
-> - [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
-> - [a (relatively technical) book by Gelman and Hill](http://www.stat.columbia.edu/~gelman/arm/)
-{: .callout}
+>
+> - [a blog post by TJ
+>   Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
+> - [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
+> - [a (relatively technical) book by Gelman and
+>   Hill](http://www.stat.columbia.edu/~gelman/arm/)
+{: .callout}

```{r, eval=FALSE, echo=FALSE}
# todo: callout box explaining DESeq2
```

# The problem of multiple tests

-With such a large number of features, it would be useful
-to decide which features are "interesting" or "significant"
-for further study. However, if we were to apply a normal significance threshold
-of 0.05, it's likely that we'd end up with a lot of false positives. That's
-because a p-value threshold like this represents a $\frac{1}{20}$ chance that
-we'd observe results as extreme or more extreme under the null hypothesis
-(that there is no assocation between age and methylation level). If we do
-many more than 20 such tests, we can expect that for a lot of the tests, the
-null hypothesis is true, but we will still observe extreme results.
-To demonstrate this, it's useful to see what happens if
-we scramble age and run the same test again:
+With such a large number of features, it would be useful to decide which
+features are "interesting" or "significant" for further study. However,
+if we were to apply a normal significance threshold of 0.05, it's likely
+that we'd end up with a lot of false positives. That's because a p-value
+threshold like this represents a $\frac{1}{20}$ chance that we'd observe
+results as extreme or more extreme under the null hypothesis (that there
+is no association between age and methylation level). If we do many more
+than 20 such tests, we can expect that for a lot of the tests, the null
+hypothesis is true, but we will still observe extreme results. To
+demonstrate this, it's useful to see what happens if we scramble age and
+run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. 
A dashed line represents a 0.05 significance level."} @@ -904,94 +878,92 @@ plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value), abline(h = -log10(0.05), lty = "dashed", col = "red") ``` -Since we've generated a random sequence of ages, we have no reason to suspect -that there is a true association between methylation levels and this sequence -of random numbers. However, you can see that the p-value for many features is -still lower than a traditional significance level of $p=0.05$. In fact, here -`r sum(toptab_age_perm$P.Value < 0.05)` features are significant at p < 0.05. -If we were to use this fixed threshold in a real experiment, it's likely that -we'd identify many features as associated with age, when the results we're -observing are simply due to chance. +Since we've generated a random sequence of ages, we have no reason to +suspect that there is a true association between methylation levels and +this sequence of random numbers. However, you can see that the p-value +for many features is still lower than a traditional significance level +of $p=0.05$. In fact, here `r sum(toptab_age_perm$P.Value < 0.05)` +features are significant at p \< 0.05. If we were to use this fixed +threshold in a real experiment, it's likely that we'd identify many +features as associated with age, when the results we're observing are +simply due to chance. > ## Exercise > -> -> 1. If we run `r nrow(methylation)` tests under the null hypothesis, -> how many of them (on average) will be statistically -> significant at a threshold of $p < 0.05$? -> 2. Why would we want to be conservative in labelling features -> as significantly different? -> By conservative, we mean to err towards labelling true -> differences as "not significant" rather than vice versa. -> 3. How could we account for a varying number of tests to -> ensure "significant" changes are truly different? -> +> 1. If we run `r nrow(methylation)` tests under the null hypothesis, +> how many of them (on average) will be statistically significant at +> a threshold of $p < 0.05$? +> 2. Why would we want to be conservative in labelling features as +> significantly different? By conservative, we mean to err towards +> labelling true differences as "not significant" rather than vice +> versa. +> 3. How could we account for a varying number of tests to ensure +> "significant" changes are truly different? +> > > ## Solution -> > 1. By default we expect -> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$ -> > features to be statistically significant under the null hypothesis, -> > because p-values should always be uniformly distributed under -> > the null hypothesis. -> > 2. Features that we label as "significantly different" will often -> > be reported in manuscripts. We may also spend time and money -> > investigating them further, computationally or in the lab. -> > Therefore, spurious results have a real cost for ourselves and -> > for others. -> > 3. One approach to controlling for the number of tests is to -> > divide our significance threshold by the number of tests -> > performed. This is termed "Bonferroni correction" and -> > we'll discuss this further now. -> {: .solution} -{: .challenge} - +> > +> > 1. By default we expect +> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$ +> > features to be statistically significant under the null +> > hypothesis, because p-values should always be uniformly +> > distributed under the null hypothesis. +> > 2. 
Features that we label as "significantly different" will often
+> > be reported in manuscripts. We may also spend time and money
+> > investigating them further, computationally or in the lab.
+> > Therefore, spurious results have a real cost for ourselves and
+> > for others.
+> > 3. One approach to controlling for the number of tests is to divide
+> > our significance threshold by the number of tests performed.
+> > This is termed "Bonferroni correction" and we'll discuss this
+> > further now.
+> {: .solution}
+{: .challenge}

# Adjusting for multiple tests

-When performing many statistical tests to
-categorise features, we're effectively classifying
-features as "significant" - meaning those for which we reject the null
-hypothesis - and "non-significant". We also generally hope that there is a
-subset of features for which the null hypothesis is truly false, as well as many
-for which the null truly does hold. We hope that for all features for which the
-null hypothesis is true, we accept it, and for all features for which the null
-hypothesis is not true, we reject it. As we showed in the example with permuted
-age, with a large number of tests it's inevitable that we'll get some of these
-wrong.
-
-We can think of these features as being
-"truly different" or "not truly different"[^2].
-Using this idea, we can see that each
-categorisation we make falls into four categories:
-
-
-| |Label as different |Label as not different|
-|-------------------:|--------------------:|---------------------:|
-|Truly different |True positive |False negative |
-|Truly not different |False positive |True negative |
-
-
-If the null hypothesis was true for every feature, then as we perform more and
-more tests we'd tend to correctly categorise most
-results as negative. However, since p-values
-are uniformly distributed under the null,
-at a significance level of 5%, 5% of all
-results will be "significant" even though we would expect to see these results,
-given the null hypothesis is true, simply by chance.
-These would fall under the label "false positives" in the table
-above, and are also termed "false discoveries."
-
-There are two common ways of controlling these
-false discoveries.
-The first is to say, when we're doing $n$ tests,
-that we want to have the same certainty of making
-one false discovery with $n$ tests as we have if we're only doing
-one test. This is "Bonferroni" correction,[^3] which
-divides the significance level by the number of
-tests performed, $n$. Equivalently, we can use the
-non-transformed p-value threshold but multiply
-our p-values by the number of tests.
-This is often very conservative, especially
-with a lot of features!
+When performing many statistical tests to categorise features, we're
+effectively classifying features as "significant" - meaning those for
+which we reject the null hypothesis - and "non-significant". We also
+generally hope that there is a subset of features for which the null
+hypothesis is truly false, as well as many for which the null truly does
+hold. We hope that for all features for which the null hypothesis is
+true, we do not reject it, and for all features for which the null
+hypothesis is not true, we reject it. As we showed in the example with
+permuted age, with a large number of tests it's inevitable that we'll
+get some of these wrong.
+
+We can think of these features as being "truly different" or "not truly
+different"[^1]. Using this idea, we can see that each categorisation we
+make falls into four categories:
+
+[^1]: "True difference" is a hard category to rigidly define. As we've
+    seen, with a lot of data, we can detect tiny differences, and with
+    little data, we can't detect large differences. However, both can be
+    argued to be "true".
+
+| | Label as different | Label as not different |
+|--------------------:|-------------------:|-----------------------:|
+| Truly different | True positive | False negative |
+| Truly not different | False positive | True negative |
+
+If the null hypothesis were true for every feature, then as we perform
+more and more tests we'd tend to correctly categorise most results as
+negative. However, since p-values are uniformly distributed under the
+null, at a significance level of 5%, 5% of all results will be
+"significant" even though we would expect to see these results, given
+the null hypothesis is true, simply by chance. These would fall under
+the label "false positives" in the table above, and are also termed
+"false discoveries."
+
+There are two common ways of controlling these false discoveries. The
+first is to say, when we're doing $n$ tests, that we want to have the
+same certainty of making one false discovery with $n$ tests as we have
+if we're only doing one test. This is "Bonferroni" correction,[^2] which
+divides the significance level by the number of tests performed, $n$.
+Equivalently, we can use the non-transformed p-value threshold but
+multiply our p-values by the number of tests. This is often very
+conservative, especially with a lot of features!
+
+[^2]: Bonferroni correction is also termed "family-wise" error rate
+    control.

```{r p-fwer, fig.cap="Bonferroni correction often produces very large p-values, especially with low sample sizes.", fig.alt="Plot of Bonferroni-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
p_raw <- toptab_age$P.Value
@@ -1007,109 +979,109 @@ ggplot() +
  labs(x = "Raw p-value", y = "Bonferroni p-value")
```

-You can see that the p-values are exactly one for the vast majority of tests
-we performed! This is not ideal sometimes, because unfortunately we usually
-don't have very large sample sizes in health sciences.
-
-The second main way of controlling for multiple tests
-is to control the *false discovery rate*.[^4]
-This is the proportion of false positives, or false discoveries,
-we'd expect to get each time if we repeated
-the experiment over and over.
-
-1. Rank the p-values
-2. Assign each a rank (1 is smallest)
-3. Calculate the critical value
-   $$
-   q = \left(\frac{i}{m}\right)Q
-   $$,
-   where $i$ is rank, $m$ is the number of tests, and $Q$ is the
-   false discovery rate we want to target.[^5]
-4. Find the largest p-value less than the critical value.
-   All smaller than this are significant.
-
-
-|FWER|FDR|
-|:-------------|:--------------|
-|+ Controls probability of identifying a false positive|+ Controls rate of false discoveries|
-|+ Strict error rate control |+ Allows error control with less stringency|
-|- Very conservative |- Does not control probability of making errors|
-|- Requires larger statistical power|- May result in false discoveries|
+You can see that the p-values are exactly one for the vast majority of
+tests we performed! This is often not ideal, because sample sizes in
+health sciences are usually modest, and such a strict correction can
+leave us unable to detect genuine effects.
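+
+As a quick check, we can count how many features would be called
+significant at a 0.05 threshold before and after this correction. This
+is a minimal sketch, assuming the `p_raw` vector of raw p-values created
+in the chunk above; the chunk label and the object name `p_fwer_check`
+are new here and only used for this check.
+
+```{r p-fwer-count}
+## Bonferroni adjustment multiplies each p-value by the number of tests
+## (capping the result at 1).
+p_fwer_check <- p.adjust(p_raw, method = "bonferroni")
+
+## Features passing a 0.05 threshold before and after adjustment
+sum(p_raw < 0.05)
+sum(p_fwer_check < 0.05)
+```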
+
+The second main way of controlling for multiple tests is to control the
+*false discovery rate*.[^3] This is the proportion of false positives,
+or false discoveries, we'd expect to get each time if we repeated the
+experiment over and over.
+
+[^3]: This is often called "Benjamini-Hochberg" adjustment.
+
+1. Sort the p-values from smallest to largest.
+2. Assign each one a rank $i$ (1 for the smallest).
+3. Calculate the critical value $$
+    q = \left(\frac{i}{m}\right)Q
+    $$, where $i$ is rank, $m$ is the number of tests, and $Q$ is the
+    false discovery rate we want to target.[^4]
+4. Find the largest p-value that is less than or equal to its critical
+    value. This p-value, and all p-values smaller than it, are
+    considered significant. (A small worked sketch of this procedure
+    follows below.)
+
+[^4]: People often perform extra controls on FDR-adjusted p-values,
+    ensuring that ranks don't change and the critical value is never
+    smaller than the original p-value.
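+
+To make this procedure concrete, here is a minimal sketch of it in R. It
+assumes the `p_raw` vector of raw p-values computed earlier; the object
+names (`m`, `ranks`, `critical`, `n_sig`) are just for illustration, and
+the exercise below explores the equivalent built-in approach.
+
+```{r fdr-by-hand}
+Q <- 0.05                        # target false discovery rate
+m <- length(p_raw)               # number of tests
+p_sorted <- sort(p_raw)          # step 1: sort the p-values
+ranks <- seq_len(m)              # step 2: rank 1 (smallest) to m
+critical <- (ranks / m) * Q      # step 3: critical value for each rank
+
+## step 4: the largest rank whose p-value is at or below its critical
+## value; every feature with a smaller p-value is also significant
+passing <- which(p_sorted <= critical)
+n_sig <- if (length(passing) > 0) max(passing) else 0
+n_sig
+```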
+
+| FWER | FDR |
+|:----------------------------------|:------------------------------------|
+| \+ Controls probability of identifying a false positive | \+ Controls rate of false discoveries |
+| \+ Strict error rate control | \+ Allows error control with less stringency |
+| \- Very conservative | \- Does not control probability of making errors |
+| \- Requires larger statistical power | \- May result in false discoveries |

> ## Exercise
>
-> 1. At a significance level of 0.05, with 100 tests
-> performed, what is the Bonferroni significance
-> threshold?
-> 2. In a gene expression experiment, after FDR
-> correction we observe 500 significant genes.
-> What proportion of these genes are truly
-> different?
-> 3. Try running FDR correction on the `p_raw` vector.
-> *Hint: check `help("p.adjust")` to see what the method
-> is called*.
-> Compare these values to the raw p-values
-> and the Bonferroni p-values.
->
+> 1. At a significance level of 0.05, with 100 tests performed, what is
+> the Bonferroni significance threshold?
+> 2. In a gene expression experiment, after FDR correction we observe
+> 500 significant genes. What proportion of these genes are truly
+> different?
+> 3. Try running FDR correction on the `p_raw` vector. *Hint: check
+> `help("p.adjust")` to see what the method is called*.\
+> Compare these values to the raw p-values and the Bonferroni
+> p-values.
+>
> > ## Solution
-> >
-> > 1. The Bonferroni threshold for this significance
-> > threshold is
-> > $$
-> > \frac{0.05}{100} = 0.0005
-> > $$
-> > 2. Trick question! We can't say what proportion
-> > of these genes are truly different. However, if
-> > we repeated this experiment and statistical test
-> > over and over, on average 5% of the results from
-> > each run would be false discoveries.
-> > 3. The following code runs FDR correction and compares it to
-> > non-corrected values and to Bonferroni:
-> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
-> > p_fdr <- p.adjust(p_raw, method = "BH")
-> > ggplot() +
-> > aes(p_raw, p_fdr) +
-> > geom_point() +
-> > scale_x_log10() + scale_y_log10() +
-> > geom_abline(slope = 1, linetype = "dashed") +
-> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> > labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
-> > ```
-> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
-> > ggplot() +
-> > aes(p_fdr, p_fwer) +
-> > geom_point() +
-> > scale_x_log10() + scale_y_log10() +
-> > geom_abline(slope = 1, linetype = "dashed") +
-> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
-> > ```
-> {: .solution}
-{: .challenge}
+> >
+> > 1. The Bonferroni threshold for this significance level is $$
+> > \frac{0.05}{100} = 0.0005
+> > $$
+> >
+> > 2. Trick question! We can't say what proportion of these genes are
+> > truly different. However, if we repeated this experiment and
+> > statistical test over and over, on average 5% of the results
+> > from each run would be false discoveries.
+> >
+> > 3. The following code runs FDR correction and compares it to
+> > non-corrected values and to Bonferroni:
+> >
+> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
+> > p_fdr <- p.adjust(p_raw, method = "BH")
+> > ggplot() +
+> > aes(p_raw, p_fdr) +
+> > geom_point() +
+> > scale_x_log10() + scale_y_log10() +
+> > geom_abline(slope = 1, linetype = "dashed") +
+> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
+> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
+> > labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
+> > ```
+> >
+> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
+> > ggplot() +
+> > aes(p_fdr, p_fwer) +
+> > geom_point() +
+> > scale_x_log10() + scale_y_log10() +
+> > geom_abline(slope = 1, linetype = "dashed") +
+> > geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
+> > geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
+> > labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
+> > ```
+> {: .solution}
+{: .challenge}

-## Further reading
+> ## Feature selection
+>
+> In this episode, we have focussed on regression in a setting where there are more
+> features than observations. This approach is relevant if we are interested in the
+> association of each feature with some outcome or if we want to screen for features
+> that have a strong association with an outcome. 
If, however, we are interested in +> predicting an outcome or if we want to know which features explain the variation +> in the outcome, we may want to restrict ourselves to a subset of relevant features. +> One way of doing this is called *regularisation*, and this is the topic of the next episode. +> An alternative is called *feature selection*. This is covered in the subsequent (optional) episode. +{: .callout} -- [limma tutorial by Kasper D. Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) -- [limma user manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf). +## Further reading +- [limma tutorial by Kasper D. + Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) +- [limma user + manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf). ## Footnotes -[^1]: It's not hugely problematic if the assumption of normal residuals is violated. It mainly affects our ability to accurately predict responses for new, unseen observations. - -[^2]: "True difference" is a hard category to rigidly define. As we've seen, with a lot of data, we can detect tiny differences, and with little data, we can't detect large differences. However, both can be argued to be "true". - -[^3]: Bonferroni correction is also termed "family-wise" error rate control. - -[^4]: This is often called "Benjamini-Hochberg" adjustment. - -[^5]: People often perform extra controls on FDR-adjusted p-values, ensuring that ranks don't change and the critical value is never smaller than the original p-value. - - - - {% include links.md %}