From 0c29ca81a84987a0ba3d00bdb3061601dda5332b Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Tue, 7 Jun 2022 14:51:18 +0100 Subject: [PATCH 1/8] Update 02-high-dimensional-regression.Rmd --- .../02-high-dimensional-regression.Rmd | 1289 ++++++++--------- 1 file changed, 631 insertions(+), 658 deletions(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 807bb93a..519770b2 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -23,9 +23,11 @@ keypoints: - "Multiple testing methods can be more conservative or more liberal, depending on our goals." math: yes +editor_options: + markdown: + wrap: 72 --- - ```{r settings, include=FALSE} library("here") source(here("bin/chunk-options.R")) @@ -34,14 +36,13 @@ knitr_fig_path("02-") # DNA methylation data -For the following few episodes, we will be working with human -DNA methylation data from flow-sorted blood samples. -DNA methylation assays measure, for many sites in the genome, -the proportion of DNA that carries a methyl mark. -In this case, the methylation data come in the form of a matrix -of normalised methylation levels, where negative values correspond to -unmethylated DNA and positive values correspond to methylated DNA. -Along with this, we have a number of sample phenotypes +For the following few episodes, we will be working with human DNA +methylation data from flow-sorted blood samples. DNA methylation assays +measure, for each of many sites in the genome, the proportion of DNA +that carries a methyl mark. In this case, the methylation data come in +the form of a matrix of normalised methylation levels, where negative +values correspond to unmethylated DNA and positive values correspond to +methylated DNA. Along with this, we have a number of sample phenotypes (eg, age in years, BMI). Let's read in the data for this episode: @@ -52,30 +53,30 @@ library("minfi") methylation <- readRDS(here("data/methylation.rds")) ``` -This `methylation` object is a `GenomicRatioSet`, a Bioconductor data object -derived from the `SummarizedExperiment` class. -These `SummarizedExperiment` objects contain `assay`s, in this case methylation -levels, and optional sample-level `colData` and feature-level `metadata`. -These objects are very convenient to contain all of the information about -a dataset in a high-throughput context. If you would like more detail -on these objects it may be useful to consult the -[vignettes on Bioconductor](https://www.bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). +This `methylation` object is a `GenomicRatioSet`, a Bioconductor data +object derived from the `SummarizedExperiment` class. These +`SummarizedExperiment` objects contain `assay`s, in this case +methylation levels, and optional sample-level `colData` and +feature-level `metadata`. These objects are very convenient to contain +all of the information about a dataset in a high-throughput context. If +you would like more detail on these objects it may be useful to consult +the [vignettes on +Bioconductor](https://www.bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). ```{r showmethy} methylation ``` -You can see in this output that this object has a `dim` of -$`r nrow(methylation)` \times `r ncol(methylation)`$, -meaning it has `r nrow(methylation)` features and `r ncol(methylation)` columns. 
-To extract the matrix of methylation values (M-values), we can use the `assay` function. -One thing to bear in mind with these objects (and data -structures for computational biology in R generally) is that -in the matrix of methylation data, samples or observations -are stored as columns, while features (in this case, CpG sites) -are stored as rows. This is in contrast to usual tabular data, -where features or variables are stored as columns and -observations are stored as rows. +You can see in this output that this object has a `dim()` of +$`r nrow(methylation)` \times `r ncol(methylation)`$, meaning it has +`r nrow(methylation)` features and `r ncol(methylation)` columns. To +extract the matrix of methylation values (M-values), we can use the +`assay()` function. One thing to bear in mind with these objects (and +data structures for computational biology in R generally) is that in the +matrix of methylation data, samples or observations are stored as +columns, while features (in this case, CpG sites) are stored as rows. +This is in contrast to usual tabular data, where features or variables +are stored as columns and observations are stored as rows. ```{r grabx} methyl_mat <- assay(methylation) @@ -87,24 +88,24 @@ The distribution of these M-values looks like this: hist(methyl_mat, breaks = "FD", xlab = "M-value") ``` -You can see that there are two peaks in this distribution, corresponding to -features which are largely unmethylated and methylated, respectively. +You can see that there are two peaks in this distribution, corresponding +to features which are largely unmethylated and methylated, respectively. -Similarly, we can examine the `colData`, which represents the sample-level -metadata we have relating to these data. -In this case, the metadata, phenotypes, and groupings in the `colData` -look like this for the first 6 samples: +Similarly, we can examine the `colData()`, which represents the +sample-level metadata we have relating to these data. In this case, the +metadata, phenotypes, and groupings in the `colData` look like this for +the first 6 samples: ```{r datatable} knitr::kable(head(colData(methylation)), row.names = FALSE) ``` -In this episode, we will focus on the association between age and methylation. -The association between age and methylation status in blood samples has been -studied extensively, and is actually a well-known application of some -high-dimensional regression techniques, with the idea of "epigenetic age" -having been derived using these techniques. The methylation levels for these -data can be presented in a heatmap: +In this episode, we will focus on the association between age and +methylation. The association between age and methylation status in blood +samples has been studied extensively, and is actually a well-known +application of some high-dimensional regression techniques, with the +idea of "epigenetic age" having been derived using these techniques. The +methylation levels for these data can be presented in a heatmap: ```{r heatmap, echo=FALSE, fig.cap="Visualising the data as a heatmap, it's clear that there's too many models to fit 'by hand'.", fig.alt="Heatmap of methylation values across all features. Samples are ordered according to age."} age <- methylation$Age @@ -135,54 +136,47 @@ Heatmap(methyl_mat_ord, ) ``` -We would like to identify features that are related to our outcome of interest -(age). 
It's clear from the heatmap that there are too many features to do so
-manually, even with this reduced number of features - the original dataset
-contained over 800,000!
+We would like to identify features that are related to our outcome of
+interest (age). It is clear from the heatmap that there are too many
+features to do so manually, even with this reduced number of features -
+the original dataset contained over 800,000!

> ## Exercise
>
> Why can we not just fit linear regression models of all of the columns
-> in the `colData` above against all of the features in the matrix of assays,
-> and choose all of the significant results at a p-value of 0.05?
->
+> in the `colData` above against all of the features in the matrix of
+> assays, and choose all of the significant results at a p-value of
+> 0.05?
+>
> > ## Solution
> >
> > There are a number of problems that this kind of approach presents.
-> > For example:
-> > 1. Without a research question in mind when creating a model, it's not clear
-> > how we can interpret each model, and rationalising the results after the
-> > fact can be dangerous; it's easy to make up a "story" that isn't
-> > grounded in anything but the fact that we have signif findings.
-> > 2. We may not have a representative sample for each of these covariates. For
-> > example, we may have very small sample sizes for some ethnicities,
-> > leading to spurious findings.
-> > 3. If we perform
-> > `r nrow(methylation)` tests for each of `r ncol(colData(methylation))`
-> > variables, even if there were no true associations in the data, we'd be
-> > likely to observe some strong spurious associations that arise just from
-> > random noise.
-> >
+> > For example:
+> >
+> > 1.  Without a research question in mind when creating a model, it's
+> >     not clear how we can interpret each model, and rationalising the
+> >     results after the fact can be dangerous; it's easy to make up a
+> >     "story" that isn't grounded in anything but the fact that we
+> >     have significant findings.
+> > 2.  We may not have a representative sample for each of these
+> >     covariates. For example, we may have very small sample sizes for
+> >     some ethnicities, leading to spurious findings.
+> > 3.  If we perform `r nrow(methylation)` tests for each of
+> >     `r ncol(colData(methylation))` variables, even if there were no
+> >     true associations in the data, we'd be likely to observe some
+> >     strong spurious associations that arise just from random noise.
> >
-> {: .solution}
-{: .challenge}
-
+> {: .solution}
+{: .challenge}

> ## Measuring DNA Methylation
>
-> DNA methylation is an epigenetic modification of DNA.
-> Generally, we are interested in the proportion of
-> methylation at many sites or regions in the genome.
-> DNA methylation microarrays, as we are using here,
-> measure DNA methylation using two-channel microarrays,
-> where one channel captures signal from methylated
-> DNA and the other captures unmethylated signal.
-> These data can be summarised
-> as "Beta values" ($\beta$ values), which is the ratio
-> of the methylated signal to the total signal
-> (methylated plus unmethylated).
-> The $\beta$ value for site $i$ is calculated as
->
+>
+> DNA methylation is an epigenetic modification of DNA. Generally, we
+> are interested in the proportion of methylation at many sites or
+> regions in the genome. DNA methylation microarrays, as we are using
+> here, measure DNA methylation using two-channel microarrays, where one
+> channel captures signal from methylated DNA and the other captures
+> unmethylated signal. These data can be summarised as "Beta values"
+> ($\beta$ values), which is the ratio of the methylated signal to the
+> total signal (methylated plus unmethylated). The $\beta$ value for
+> site $i$ is calculated as
+>
> $$
> \beta_i = \frac{
>     m_i
> }{
>     u_{i} + m_{i}
> }
> $$
->
-> where $m_i$ is the methylated signal for site $i$ and
-> $u_i$ is the unmethylated signal for site $i$.
-> $\beta$ values take on a value in the range
-> $[0, 1]$, with 0 representing a completely unmethylated
-> site and 1 representing a completely methylated site.
->
-> The M-values we use here are the $\log_2$ ratio of
-> methylated versus unmethylated signal:
+>
+> where $m_i$ is the methylated signal for site $i$ and $u_i$ is the
+> unmethylated signal for site $i$. $\beta$ values take on a value in
+> the range $[0, 1]$, with 0 representing a completely unmethylated site
+> and 1 representing a completely methylated site.
+>
+> The M-values we use here are the $\log_2$ ratio of methylated versus
+> unmethylated signal:
>
> $$
> M_i = \log_2\left(\frac{m_i}{u_i}\right)
> $$
->
-> M-values are not bounded to an interval as Beta values
-> are, and therefore can be easier to work with in statistical models.
-{: .callout}
-
+>
+> M-values are not bounded to an interval as Beta values are, and
+> therefore can be easier to work with in statistical models.
+{: .callout}

# Identifying associations using linear regression

-In high-throughput studies, it's common to have one or more
-phenotypes or groupings that we want to relate to features of
-interest (eg, gene expression, DNA methylation levels).
-In general, we want to identify differences in the
-features of interest that are related to a phenotype or grouping of our samples.
-Identifying features of interest that vary along with
-phenotypes or groupings can allow us to understand how
+In high-throughput studies, it is common to have one or more phenotypes
+or groupings that we want to relate to features of interest (eg, gene
+expression, DNA methylation levels). In general, we want to identify
+differences in the features of interest that are related to a phenotype
+or grouping of our samples. Identifying features of interest that vary
+along with phenotypes or groupings can allow us to understand how
phenotypes arise or manifest.

-For example, we might want to identify genes that are
-expressed at a higher level in mutant mice relative
-to wild-type mice to understand the effect
-of a mutation on cellular phenotypes.
-Alternatively, we might have
-samples from a set of patients, and wish to identify
-epigenetic features that are different in young patients
-relative to old patients, to help us understand how aging
-manifests.
-
-Using linear regression, it's possible to identify differences
-like these . However, high-dimensional data like the ones we're
-working with require some special considerations. A primary consideration,
-as we saw above, is that there are far too many features to fit each one-by-one
-as we might using a more focused dataset, for example using `lm` on each
-one and checking the linear model assumptions.
-A secondary consideration is that statistical approaches may behave slightly
-differently in very high-dimensional data, compared to low-dimensional
-data. 
A tertiary consideration is the speed at which -we can actually compute statistics for data this large -- methods optimised -for low-dimensional data may be very slow when applied to high-dimensional data. - +For example, we might want to identify genes that are expressed at a +higher level in mutant mice relative to wild-type mice to understand the +effect of a mutation on cellular phenotypes. Alternatively, we might +have samples from a set of patients, and wish to identify epigenetic +features that are different in young patients relative to old patients, +to help us understand how ageing manifests. + +Using linear regression, it is possible to identify differences like +these. However, high-dimensional data like the ones we're working with +require some special considerations. A primary consideration, as we saw +above, is that there are far too many features to fit each one-by-one as +we might do when analysing low-dimensional datasets (for example using +`lm` on each feature and checking the linear model assumptions). A +secondary consideration is that statistical approaches may behave +slightly differently in very high-dimensional data, compared to +low-dimensional data. A third consideration is the speed at which we can +actually compute statistics for data this large -- methods optimised for +low-dimensional data may be very slow when applied to high-dimensional +data. Ideally when performing regression, we want to identify cases like this, -where there is a clear assocation, and we probably "don't need" statistics: +where there is a clear association, and we probably "don't need" +statistics: ```{r example1, echo=FALSE, fig.cap="A scatter plot of age and a feature of interest.", fig.alt="An example of a strong linear association between a continuous phenotype (age) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. A strong linear relationship with a positive slope exists between the two.", fig.width=6, fig.height=6} library("ggplot2") @@ -275,8 +264,9 @@ ggplot() + labs(y = "DNA methylation") ``` -However, often due to small differences and small sample sizes, -the problem is a bit more difficult: +However, often due to small differences and small sample sizes, the +problem is more difficult: + ```{r example3, echo=FALSE, fig.cap="A scatter plot of a grouping and a feature of interest.", fig.alt="An example of a strong linear association between a discrete phenotype (group) on the x-axis and a feature of interest (DNA methylation at a given locus) on the y-axis. The two groups seem to differ with respect to DNA methylation, but the relationship is weak.", fig.width=6, fig.height=6} library("ggplot2") set.seed(66) @@ -291,27 +281,26 @@ ggplot() + labs(y = "DNA methylation") ``` -And, of course, we often have an awful lot of features and need -to prioritise a subset of them! We need a rigorous way to -prioritise genes for further analysis. - +And, of course, we often have an awful lot of features and need to +prioritise a subset of them! We need a rigorous way to prioritise genes +for further analysis. # Fitting a linear model -So, in the data we've read in, we have a matrix of methylation values $X$ and a -vector of ages, $y$. -One way to model this is to see if we can use age to predict the expected -(average) methylation value for sample $j$ at a given locus $i$, which we can -write as $X_{ij}$. We can write that model as: +So, in the data we have read in, we have a matrix of methylation values +$X$ and a vector of ages, $y$. 
One way to model this is to see if we can +use age to predict the expected (average) methylation value for sample +$j$ at a given locus $i$, which we can write as $X_{ij}$. We can write +that model as: $$ \mathbf{E}(X_{ij}) = \beta_0 + \beta_1 \text{Age}_j $$ where $\text{Age}_j$ is the age of sample $j$. In this model, $\beta_1$ -represents the unit change in mean methylation level for each unit (year) -change in age. -We can fit this model and get more information from the model object: +represents the unit change in mean methylation level for each unit +(year) change in age. We can fit this model and get more information +from the model object: ```{r fit1} age <- methylation$Age @@ -319,97 +308,96 @@ lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age) lm_age_methyl1 ``` -We now have estimates for the expected methylation level when age equals 0 -(the intercept) and the change in methylation level for -a unit change in age (the slope). We could plot this linear model: +We now have estimates for the expected methylation level when age equals +0 (the intercept) and the change in methylation level for a unit change +in age (the slope). We could plot this linear model: ```{r plot-lm-methyl1} plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16) abline(lm_age_methyl1) ``` -For this feature, we can see that there isn't a strong relationship between -methylation and age. We could try to repeat this for every feature in our -dataset; however, we have a lot of features! We need an approach that allows -us to assess associations between all of these features and our outcome while -addressing the three considerations we outlined previously. Before -we introduce this approach, let's go into detail about how we generally -check whether the results of a linear model are statistically significant. +For this feature, we can see that there is no strong relationship +between methylation and age. We could try to repeat this for every +feature in our dataset; however, we have a lot of features! We need an +approach that allows us to assess associations between all of these +features and our outcome while addressing the three considerations we +outlined previously. Before we introduce this approach, let's go into +detail about how we generally check whether the results of a linear +model are statistically significant. # Hypothesis testing in linear regression -Using the linear model we defined above, we can assess if the coefficients -we've fitted are worthy of further investigation, or if they would be likely -to arise by chance. For linear regression, we usually do this -using null hypothesis significance testing. This framework compares the results -that we observed (here, linear model coefficients) to the results you would -expect under the null hypothesis. -To be more specific, null hypothesis significance testing compares our observed -results with a set of hypothetical counter-examples of what we would expect to -observe if we repeated the same experiment and analysis over and over again -under the null hypothesis. In the case of linear regression, the null -hypothesis is that there is absolutely no relationship between the predictor -variable(s) and the outcome. This may not always be the most realistic or -useful null hypothesis, but it's the one we have! 
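As an aside, base R's `summary()` already reports these hypothesis tests
for the model we fitted; the `broom` package used below simply returns
the same information as a tidy data frame. A quick sketch:

```{r summary-coefs}
# Base R's coefficient table: estimates, standard errors, t-statistics
# and p-values for each coefficient of the fitted model
summary(lm_age_methyl1)$coefficients
```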
-
-
-For this linear model, we can use `broom` to extract detailed information about
-the coefficients and the associated hypothesis tests in this model:
+Using the linear model we defined above, we can assess if the
+coefficients we have fitted are worthy of further investigation, or if
+they would be likely to arise by chance. For linear regression, we
+usually do this using null hypothesis significance testing. This
+framework compares the results that we observed (here, linear model
+coefficients) to the results we would expect under the null hypothesis.
+To be more specific, null hypothesis significance testing compares our
+observed results with a set of hypothetical counter-examples of what we
+would expect to observe if we repeated the same experiment and analysis
+over and over again under the null hypothesis. In the case of linear
+regression, the null hypothesis is that there is absolutely no
+relationship between the predictor variable(s) and the outcome. This may
+not always be the most realistic or useful null hypothesis, but it is
+the one we have!
+
+For this linear model, we can use `broom` to extract detailed
+information about the coefficients and the associated hypothesis tests
+in this model:

```{r tidyfit}
library("broom")
tidy(lm_age_methyl1)
```

-The standard errors (`std.error`) represent the uncertainty in our effect size
-estimate. The test statistics and p-values represent measures of how (un)likely
-it would be to observe results like this under the "null hypothesis".
-
+The standard errors (`std.error`) represent the uncertainty in our
+effect size estimate. The test statistics and p-values represent
+measures of how (un)likely it would be to observe results like this
+under the "null hypothesis".

> ## Exercise
->
-> In the model we fitted, the p-value for the intercept term is significant
-> at $p < 0.05$. What does this mean?
->
+>
+> In the model we fitted, the p-value for the intercept term is
+> significant at $p < 0.05$. What does this mean?
+>
> > ## Solution
-> >
-> > The intercept for this model indicates that the mean of methylation, when
-> > age is zero, is 0.902.
-> > However, this isn't a particularly note-worthy finding!
-> > Also, regardless of the p-value, it's unlikely to be a reliable finding, as
-> > we don't have any individuals with age zero, nor even any with age < 20.
> >
-> {: .solution}
-{: .challenge}
-
+> > The intercept for this model indicates that the mean of methylation,
+> > when age is zero, is 0.902. However, this isn't a particularly
+> > note-worthy finding! Also, regardless of the p-value, it's unlikely
+> > to be a reliable finding, as we don't have any individuals with age
+> > zero, nor even any with age \< 20.
+> >
+> {: .solution}
+{: .challenge}

# Fitting a lot of linear models

+If we want to repeat this process for each feature in our data, we
+need to narrow our focus. 
In particular, The first coefficient in a +linear model like this is the intercept, which measures the overall +offset between age and methylation levels. It is not really interesting +if this intercept is zero or not, since we probably do not care what the +methylation level is when age is zero. In fact, this question does not +even make much sense! Instead, we are more interested in whether there +is a relationship between age and methylation levels. Therefore, we will +focus only on the second coefficient. ```{r tidyfit2} coef_age_methyl1 <- tidy(lm_age_methyl1)[2, ] coef_age_methyl1 ``` -Now, we could do this for every feature in the dataset and rank the results. -However, fitting models in this way to `r nrow(methylation)` features -is not very computationally efficient, -and it would also be laborious to do programmatically. There are ways to get -around this, but first let's talk about what exactly we're doing when we -look at significance tests in this context. +Now, we could do this for every feature in the dataset and rank the +results. However, fitting models in this way to `r nrow(methylation)` +features is not very computationally efficient, and it would also be +laborious to do programmatically. There are ways to get around this, but +first let us talk about what exactly we are doing when we look at +significance tests in this context. ```{r rbindfits, echo=FALSE} -# Instead of repeating this model fitting and extraction focus over and over, we +# Instead of repeating this model fitting and coefficient extraction by hand, we # could write a function to fit this kind of model for any given row of the # matrix: @@ -433,45 +421,41 @@ coef_df$feature <- rownames(methyl_mat) # ) ``` - - - # How does hypothesis testing for a linear model work? -In order to decide whether a result would be unlikely -under the null hypothesis, we can calculate a test statistic. -For coefficient $k$ in a linear model (in our case, it would be the slope), -the test statistic is a t-statistic given by: +In order to decide whether a result would be unlikely under the null +hypothesis, we must calculate a test statistic. For coefficient $k$ in a +linear model (in our case, it would be the slope), the test statistic is +a t-statistic given by: $$ t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)} $$ -$SE\left(\hat{\beta}_{k}\right)$ measures the uncertainty we have in our effect -size estimate. -Knowing what distribution these t-statistics follow under the null -hypothesis allows us to determine how unlikely it would be for -us to observe what we have under those circumstances, if we repeated the -experiment and analysis over and over again. -To demonstrate, we can manually demonstrate the relationship between these -quantities (this is not important to remember). +$SE\left(\hat{\beta}_{k}\right)$ measures the uncertainty we have in our +effect size estimate. Knowing what distribution these t-statistics +follow under the null hypothesis allows us to determine how unlikely it +would be for us to observe what we have under those circumstances, if we +repeated the experiment and analysis over and over again. To +demonstrate, we can compute the t-statistics "by hand" (advanced content). 
```{r simfit} table_age_methyl1 <- tidy(lm_age_methyl1) ``` -We can see that the t-statistic is just the ratio of the estimate to the -standard error: +We can see that the t-statistic is just the ratio between the coefficient estimate +and the standard error: ```{r simtval} tvals <- table_age_methyl1$estimate / table_age_methyl1$std.error all.equal(tvals, table_age_methyl1$statistic) ``` -Calculating the p-values is a bit more tricky. Specifically, it's the proportion -of the distribution of the test statistic under the null hypothesis that is -*as extreme or more extreme* than the observed value of the test statistic. -This is easy to observe visually, by plotting the distribution: +Calculating the p-values is a bit more tricky. Specifically, it is the +proportion of the distribution of the test statistic under the null +hypothesis that is *as extreme or more extreme* than the observed value +of the test statistic. This is easy to observe visually, by plotting the +distribution: ```{r tdist, echo = FALSE, fig.cap="The p-value for a regression coefficient represents how often it'd be observed under the null.", fig.alt="Density plot of a t-distribution showing the observed test statistics (here, t-statistics). The p-values, visualised here with shaded regions, represent the portion of the null distribution that is as extreme or more extreme as the observed test statistics, which are shown as dashed lines."} ggplot() + @@ -497,106 +481,100 @@ ggplot() + labs(x = "t-statistic", y = "Density") ``` -The red-ish shaded region represents the portion of the distribution of the test -statistic under the null hypothesis that is equal or greater to the value -we observe for the intercept term. This shaded region -is small relative to the total area of the null distribution; therefore, the -p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = 3)`$). -The blue-ish shaded region represents the same measure for the slope -term; this is larger, relative to the total area of the distribution, therefore -the p-value is larger than the one for the intercept term -($p=`r round(table_age_methyl1$p.value[[2]], digits = 3)`$). -You can see that the p-value is a function of the size of the effect we're -estimating and the uncertainty we have in that effect. A large effect with -large uncertainty may not lead to a small p-value, and a small effect with small -uncertainty may lead to a small p-value. - +The red-ish shaded region represents the portion of the distribution of +the test statistic under the null hypothesis that is equal or greater to +the value we observe for the intercept term. This shaded region is small +relative to the total area of the null distribution; therefore, the +p-value is small ($p=`r round(table_age_methyl1$p.value[[1]], digits = +3)`$). The blue-ish shaded region represents the same measure for the slope term; this is larger, relative to the total area of the distribution, therefore the p-value is larger than the one for the intercept term ($p=`r +round(table_age_methyl1$p.value[[2]], digits = 3)`$). You can see that +the p-value is a function of the size of the effect we're estimating and +the uncertainty we have in that effect. A large effect with large +uncertainty may not lead to a small p-value, and a small effect with +small uncertainty may lead to a small p-value. > ## Calculating p-values from a linear model -> -> Manually calculating the p-value for a linear model is a little bit more -> complex than calculating the t-statistic. 
The intuition presented
> above is definitely sufficient for most cases, but for completeness,
> here is how we do it:
>
> Since the statistic in a linear model is a t-statistic, it follows a
> student t distribution under the null hypothesis, with degrees of
> freedom (a parameter of the student t distribution) given by the
> number of observations minus the number of coefficients fitted, in
> this case
> $`r ncol(methylation)` - `r length(coef(lm_age_methyl1))` = `r lm_age_methyl1$df`$.
> We want to know what portion of the distribution function of the test
> statistic is as extreme as, or more extreme than, the value we
> observed. The function `pt()` (similar to `pnorm()`, etc.) can give us
> this information.
>
> Since we're not sure if the coefficient will be larger or smaller than
> zero, we want to do a 2-tail test. Therefore we take the absolute
> value of the t-statistic, and look at the upper rather than lower
> tail. Because in a 2-tail test we're looking at "half" of the
> t-distribution, we also multiply the p-value by 2.
>
> Combining all of this gives us:
>
> ```{r simpval}
> pvals <- 2 * pt(abs(tvals), df = lm_age_methyl1$df, lower.tail = FALSE)
> all.equal(table_age_methyl1$p.value, pvals)
> ```
{: .callout}

# Sharing information between features

Now that we understand how hypothesis tests work in the linear model
framework, we are going to introduce an idea that allows us to take
advantage of the fact that we carry out many tests at once on
structured data. We can leverage this fact to *share information*
between model parameters.
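The next paragraph explains the insight behind this; before that, here
is a rough sketch of how one could eyeball, in our own data, the
assumption that features have broadly similar base-level variability:

```{r feature-variances}
# Variance of each feature (row) across samples -- a quick look at how
# the per-feature variability is distributed
feature_vars <- apply(methyl_mat, 1, var)
hist(feature_vars, breaks = "FD", xlab = "Per-feature variance")
```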
The insight that we use to perform *information pooling* or sharing is
derived from our knowledge about the structure of the data. For example,
in a high-throughput experiment like a DNA methylation assay, we know
that all of the features were measured simultaneously, using the same
technique. This means that generally, we expect the base-level
variability for each feature to be broadly similar.

This can enable us to get a better estimate of the uncertainty of model
parameters than we could get if we consider each feature in isolation.
Sharing information between features thus allows us to get more robust
estimators. Remember that the t-statistic for coefficient $\beta_k$ in a
linear model is the ratio between the coefficient estimate and its
standard error:

$$
t_{k} = \frac{\hat{\beta}_{k}}{SE\left(\hat{\beta}_{k}\right)}
$$

It is clear that large effect sizes will likely lead to small p-values,
as long as the standard error for the coefficient is not large. However,
the standard error is affected by the amount of noise, as we saw
earlier. If we have a small number of observations, it is common for the
noise for some features to be extremely small simply by chance. This, in
turn, causes small p-values for these features, which may give us
unwarranted confidence in the level of certainty we have in the results
(false positives).

There are many statistical methods in genomics that use this type of
approach to get better estimates by pooling information between features
that were measured simultaneously using the same techniques. Here we
will focus on the package **`limma`**, which is an established software
package used to fit linear models, originally for the gene expression
micro-arrays that were common in the 2000s, but which is still in use in
RNAseq experiments, among others. The authors of **`limma`** made some
assumptions about the distributions that these data follow, and pool
information across genes to get a better estimate of the uncertainty in
effect size estimates. It uses the idea that noise levels should be
similar between features to *moderate* the estimates of the test
statistic by shrinking the estimates of standard errors towards a common
value. This results in a *moderated t-statistic*.

The process of running a model in **`limma`** is somewhat different to
what you may have seen when running linear models. Here, we define a
*model matrix* or *design matrix*, which is a way of representing the
coefficients that should be fit in each linear model. 
These are used in similar ways in a lot of different -modelling libraries. +RNAseq experiments, among others. The authors of **`limma`** made some +assumptions about the distributions that these follow, and pool +information across genes to get a better estimate of the uncertainty in +effect size estimates. It uses the idea that noise levels should be +similar between features to *moderate* the estimates of the test +statistic by shrinking the estimates of standard errors towards a common +value. This results in a *moderated t-statistic*. + +The process of running a model in **`limma`** is somewhat different to what you +may have seen when running linear models. Here, we define a *model +matrix* or *design matrix*, which is a way of representing the +coefficients that should be fit in each linear model. These are used in +similar ways in many different modelling libraries. ```{r design-age} library("limma") @@ -605,59 +583,61 @@ dim(design_age) head(design_age) ``` -As you can see, the design matrix has the same number of rows as our methylation -data has samples. It also has two columns - one for the intercept, similar to -the linear model we fit above, and one for age. This happens "under the hood" -when fitting a linear model with `lm`, but here we have to specify it directly. -The [limma user manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) has more detail on how to make design matrices for different -types of experimental design, but here we're going to stick with this simple -two variable case. - -We then pass our matrix of methylation values into `lmFit`, specifying the -design matrix. Internally, this function runs `lm` on each row of the data -in an efficient way. The function `eBayes`, when applied to the output of -`lmFit`, performs the pooled estimation of standard errors that results -in the moderated t-statistics and resulting p-values. +As you can see, the design matrix has the same number of rows as our +methylation data has samples. It also has two columns - one for the +intercept (similar to the linear model we fit above) and one for age. +This happens "under the hood" when fitting a linear model with `lm()`, but +here we have to specify it directly. The [limma user +manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) +has more detail on how to make design matrices for different types of +experimental design, but here we are going to stick with this simple two-variable case. + +We then pass our matrix of methylation values into `lmFit()`, specifying +the design matrix. Internally, this function runs `lm()` on each row of +the data in an efficient way. The function `eBayes()`, when applied to the +output of `lmFit()`, performs the pooled estimation of standard errors +that results in the moderated t-statistics and resulting p-values. ```{r lmfit-age} fit_age <- lmFit(methyl_mat, design = design_age) fit_age <- eBayes(fit_age) ``` -To obtain the results of the linear models, we can use the `topTable` function. -By default, this returns results for the first coefficient in the model. As -we saw above when using `lm`, and when we defined `design_age` above, -the first coefficient relates to the intercept term, which we are not -particularly interested in here; therefore we specify `coef = 2`. -Further, `topTable` by default only returns the top 10 results. 
To see all
-of the results in the data, we specify `number = nrow(fit_age)` to ensure
-that it returns a row for every row of the input matrix.
+To see all of the results in the data, we specify `number = nrow(fit_age)`
+to ensure that it returns a row for every row of the input matrix.

```{r ebayes-toptab}
toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age))
-head(toptab_age)
+orderEffSize <- rev(order(abs(toptab_age$logFC))) # order by effect size (absolute log-fold change)
+head(toptab_age[orderEffSize, ])
```

The output of `topTable` includes the coefficient, here termed a log
fold change `logFC`, the average level (`AveExpr`), the t-statistic `t`,
the p-value (`P.Value`), and the *adjusted* p-value (`adj.P.Val`). We'll
cover what an adjusted p-value is very shortly. The table also includes
`B`, which represents the log-odds that a feature is significantly
different, which we won't cover here, but which will generally be a 1-1
transformation of the p-value. The coefficient estimates here are termed
`logFC` for legacy reasons relating to how microarray experiments were
traditionally performed. There are more details on this topic in many
places, for example [this tutorial by Kasper D.
Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html).
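Since the result is an ordinary data frame, we can also sort and slice
it ourselves; for example, a quick sketch of inspecting the features
with the smallest adjusted p-values:

```{r toptab-byp}
# Sort by adjusted p-value and view the most strongly associated features
head(toptab_age[order(toptab_age$adj.P.Val), c("logFC", "t", "P.Value", "adj.P.Val")])
```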
Now we have estimates of effect sizes and p-values for the association
between methylation level at each locus and age for our 37 samples. It's
useful to create a plot of effect size estimates (model coefficients)
against p-values for each of these linear models, to visualise the
magnitude of effects and the statistical significance of each. These
plots are often called "volcano plots", because they resemble an
eruption.

```{r limmavolc1, fig.cap="Plotting p-values against effect sizes using limma; the results are similar to a standard linear model.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of age against methylation using limma."}
plot(toptab_age$logFC, -log10(toptab_age$P.Value),
    xlab = "Effect size", ylab = bquote(-log[10](p)),
    pch = 19
)
```

In this figure, every point represents a feature of interest. The x-axis
represents the effect size observed for that feature in a linear model,
-while the y-axis is the $-\log_{10}(\text{p-value})$, where larger values
-indicate increasing statistical evidence of a non-zero effect size.
-A positive effect size represents increasing methylation with increasing age,
-and a negative effect size represents decreasing methylation with increasing
-age. Points higher on the x-axis represent features for which we think the
-results we observed would be very unlikely under the null hypothesis.
-
-Since we want to identify feature that have different methylation levels in
-different age groups,
-in an ideal case there would be clear separation between "null" and "non-null"
-features. However, usually we observe results as we do here: there is a
-continuum of effect sizes and p-values, with no clear separation between these
-two classes of features. While statistical methods exist to
-derive insights from continuous measures like these, it is often convenient to
-obtain a list of features which we are confident have non-zero effect sizes.
-This is made more difficult by the number of tests we perform.
-
+while the y-axis is the $-\log_{10}(\text{p-value})$, where larger
+values indicate increasing statistical evidence of a non-zero effect
+size. A positive effect size represents increasing methylation with
+increasing age, and a negative effect size represents decreasing
+methylation with increasing age. Points higher on the y-axis represent
+features for which we think the results we observed would be very
+unlikely under the null hypothesis.
+
+Since we want to identify features that have different methylation
+levels in different age groups, in an ideal case there would be clear
+separation between "null" and "non-null" features. However, usually we
+observe results as we do here: there is a continuum of effect sizes and
+p-values, with no clear separation between these two classes of
+features. While statistical methods exist to derive insights from
+continuous measures like these, it is often convenient to obtain a list
+of features which we are confident have non-zero effect sizes. This is
+made more difficult by the number of tests we perform.

> ## Exercise
->
-> The effect size estimates are very small, and yet many of the p-values are
-> well below a usual significance level of p < 0.05. Why is this?
->
+>
+> The effect size estimates are very small, and yet many of the p-values
+> are well below a usual significance level of p \< 0.05. Why is this?
+>
> > ## Solution
-> >
-> > Because age has a much larger range than methylation levels, the unit change
-> > in methylation level even for a strong relationship is very small!
-> >
-> > As we mentioned, the p-value is a function of both the effect size estimate
-> > and the uncertainty (standard error) of that estimate. Because the
-> > uncertainty in our estimates is much smaller than the estimates themselves,
-> > the p-values are also small. 
-> >
-> > If we predicted age using methylation level, it's likely we'd see much
-> > larger coefficients, though broadly similar p-values!
-> >
-> {: .solution}
-{: .challenge}
-
+> > Because age has a much larger range than methylation levels, the
+> > unit change in methylation level even for a strong relationship is
+> > very small!
+> >
+> > As we mentioned, the p-value is a function of both the effect size
+> > estimate and the uncertainty (standard error) of that estimate.
+> > Because the uncertainty in our estimates is much smaller than the
+> > estimates themselves, the p-values are also small.
+> >
+> > If we predicted age using methylation level, it's likely we'd see
+> > much larger coefficients, though broadly similar p-values!
+> >
+> {: .solution}
+{: .challenge}

It's worthwhile considering what exactly the effect of the *moderation*
or information sharing that `limma` performs has on our results. To do
this, let's compare the effect size estimates and p-values from the two
approaches.

```{r plot-limma-lm-effect, echo = FALSE}
plot(
    coef_df$estimate,
    toptab_age[coef_df$feature, "logFC"],
    xlab = "Effect size (lm)",
    ylab = "Effect size (limma)",
    pch = 16
)
abline(0:1, lty = "dashed")
```

These are exactly identical! This is because `limma` isn't performing
any sharing of information when estimating effect sizes. This is in
contrast to similar packages that apply shrinkage to the effect size
estimates, like `DESeq2`. These often use information sharing to shrink
or moderate the effect size estimates, in the case of DESeq2 by again
sharing information between features about sample-to-sample variability.
In contrast, let's look at the p-values from `limma` and `lm`:

```{r plot-limma-lm-pval, echo = FALSE}
plot(
    coef_df$p.value,
    toptab_age[coef_df$feature, "P.Value"],
    xlab = "p-value (lm)",
    ylab = "p-value (limma)",
    pch = 16,
    log = "xy"
)
abline(0:1, lty = "dashed")
```

-You can see that for the vast majority of features, the results are broadly
-similar. There seems to be a minor general tendency for `limma` to produce
-smaller p-values, but for several features, the p-values from limma are
-considerably larger than the p-values from `lm`.
-This is because the information sharing tends to shrink large standard error
-estimates downwards and small estimates upwards.
-When the degree of statistical significance is due to an abnormally small
-standard error rather than a large effect, this effect results in this prominent
-reduction in statistical significance, which has been shown to perform well
-in case studies. The degree of shrinkage generally depends on
-the amount of pooled information and the strength of the
-evidence independent of pooling. For example, with very few samples and many
-features, information sharing has a larger effect, because there are a lot
-of genes that can be used to provide pooled estimates, and the evidence from
-the data that this is weighed against is relatively sparse. In contrast,
-when there are many samples and few features, there is not much opportunity
-to generate pooled estimates, and the evidence of the data can easily outweigh
-the pooling.
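We can also look at the shrinkage itself. The fit object returned by
`lmFit()` stores the ordinary residual standard deviations in its
`sigma` component, and `eBayes()` adds the moderated ("posterior")
variances as `s2.post`; a sketch of comparing the two, assuming those
documented components of the limma fit object:

```{r shrinkage-variances}
# Ordinary per-feature residual variances against limma's moderated
# variances -- moderation pulls extreme variances towards a common value
plot(
    fit_age$sigma^2, fit_age$s2.post,
    xlab = "Ordinary variance", ylab = "Moderated variance", pch = 16
)
abline(0:1, lty = "dashed")
```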
-Shrinkage methods like these ones can be complex to implement and understand,
-but it's good to understand why these approaches may be more precise
-and sensitive than the naive approach of fitting a model to each feature
-separately.
+You can see that for the vast majority of features, the results are
+broadly similar. There seems to be a minor general tendency for `limma`
+to produce smaller p-values, but for several features, the p-values from
+`limma` are considerably larger than the p-values from `lm`. This is
+because the information sharing tends to shrink large standard error
+estimates downwards and small estimates upwards. When the degree of
+statistical significance is due to an abnormally small standard error
+rather than a large effect, the moderation results in a marked
+reduction in statistical significance, which has been shown to perform
+well in case studies. The degree of shrinkage generally depends on the
+amount of pooled information and the strength of the evidence
+independent of pooling. For example, with very few samples and many
+features, information sharing has a larger effect, because there are a
+lot of genes that can be used to provide pooled estimates, and the
+evidence from the data that this is weighed against is relatively
+sparse. In contrast, when there are many samples and few features, there
+is not much opportunity to generate pooled estimates, and the evidence
+of the data can easily outweigh the pooling.
+
+Shrinkage methods like these can be complex to implement and
+understand, but it is useful to understand why these approaches may be
+more precise and sensitive than the naive approach of fitting a model
+to each feature separately.

> ## Exercise
->
-> 1. Try to run the same kind of linear model with smoking
-> status as covariate instead of age, and making a volcano
-> plot.
-> *Note: smoking status is stored as `methylation$smoker`.*
-> 2. We saw in the example in the lesson that this information sharing
-> can lead to larger p-values. Why might this be preferable?
->
+>
+> 1.  Try to run the same kind of linear model with smoking status as
+>     covariate instead of age, and make a volcano plot. *Note:
+>     smoking status is stored as `methylation$smoker`.*
+> 2.  We saw in the example in the lesson that this information sharing
+>     can lead to larger p-values. Why might this be preferable?
>
> > ## Solution
> >
+> > 1. 
The following code runs the same type of model with smoking +> > status: +> > +> > ```{r limmavolc2, fig.cap="A plot of significance against effect size for a regression of smoking against methylation.", fig.alt="A plot of -log10(p) against effect size estimates for a regression of smoking status against methylation using limma."} +> > design_smoke <- model.matrix(~methylation$smoker) +> > fit_smoke <- lmFit(methyl_mat, design = design_smoke) +> > fit_smoke <- eBayes(fit_smoke) +> > toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke)) +> > plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value), +> > xlab = "Effect size", ylab = bquote(-log[10](p)), +> > pch = 19 +> > ) +> > ``` +> > +> > 2. Being a bit more conservative when identifying features can help +> > to avoid false discoveries. Furthermore, when rejecting the null +> > hypothesis is based more on a small standard error resulting +> > from abnormally low levels of variability for a given feature, +> > we might want to be a bit more conservative in our expectations. +> > {: .solution} {: .challenge} ```{r limma-app-ex, echo = FALSE, eval = FALSE} > ## Exercise @@ -828,47 +808,42 @@ separately. ``` > ## Shrinkage -> -> Shrinkage is an intuitive term for an effect -> of information sharing, and is something observed -> in a broad range of statistical models. -> Often, shrinkage is induced by a *multilevel* -> modelling approach or by *Bayesian* methods. -> -> The general idea is that these models incorporate -> information about the structure of the -> data into account when fitting the parameters. -> We can share information between features -> because of our knowledge about the data structure; -> this generally requires careful consideration about -> how the data were generated and the relationships within. > -> An example people often use is estimating the effect -> of attendance on grades in several schools. We can -> assume that this effect is similar in different schools -> (but maybe not identical), so we can *share information* -> about the effect size between schools and shink our +> Shrinkage is an intuitive term for an effect of information sharing, +> and is something observed in a broad range of statistical models. +> Often, shrinkage is induced by a *multilevel* modelling approach or by +> *Bayesian* methods. +> +> The general idea is that these models incorporate information about +> the structure of the data into account when fitting the parameters. We +> can share information between features because of our knowledge about +> the data structure; this generally requires careful consideration +> about how the data were generated and the relationships within. +> +> An example people often use is estimating the effect of attendance on +> grades in several schools. We can assume that this effect is similar +> in different schools (but maybe not identical), so we can *share +> information* about the effect size between schools and shink our > estimates towards a common value. -> -> For example in `DESeq2`, the authors used the observation -> that genes with similar expression counts in RNAseq data -> have similar *dispersion*, and a better estimate of -> these dispersion parameters makes estimates of -> fold changes much more stable. -> Similarly, in `limma` the authors made the assumption that -> in the absence of biological effects, we can often expect the -> technical variation in the measurement of the expression of each of the -> genes to be broadly similar. 
->
-> For example in `DESeq2`, the authors used the observation
-> that genes with similar expression counts in RNAseq data
-> have similar *dispersion*, and a better estimate of
-> these dispersion parameters makes estimates of
-> fold changes much more stable.
-> Similarly, in `limma` the authors made the assumption that
-> in the absence of biological effects, we can often expect the
-> technical variation in the measurement of the expression of each of the
-> genes to be broadly similar.
-> Again, better estimates of variability allow us to
-> prioritise genes in a more reliable way.
->
+>
+> For example in `DESeq2`, the authors used the observation that genes
+> with similar expression counts in RNAseq data have similar
+> *dispersion*, and a better estimate of these dispersion parameters
+> makes estimates of fold changes much more stable. Similarly, in
+> `limma` the authors made the assumption that in the absence of
+> biological effects, we can often expect the technical variation in the
+> measurement of the expression of each of the genes to be broadly
+> similar. Again, better estimates of variability allow us to prioritise
+> genes in a more reliable way.
+>
 > There are many good resources to learn about this type of approach,
 > including:
->
-> - [a blog post by TJ Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
-> - [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
-> - [a (relatively technical) book by Gelman and Hill](http://www.stat.columbia.edu/~gelman/arm/)
-{: .callout}
+>
+> -   [a blog post by TJ
+>     Mahr](https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/)
+> -   [a book by David Robinson](https://gumroad.com/l/empirical-bayes)
+> -   [a (relatively technical) book by Gelman and
+>     Hill](http://www.stat.columbia.edu/~gelman/arm/)
+{: .callout}

```{r, eval=FALSE, echo=FALSE}
# todo: callout box explaining DESeq2
```

# The problem of multiple tests

-With such a large number of features, it would be useful
-to decide which features are "interesting" or "significant"
-for further study. However, if we were to apply a normal significance threshold
-of 0.05, it's likely that we'd end up with a lot of false positives. That's
-because a p-value threshold like this represents a $\frac{1}{20}$ chance that
-we'd observe results as extreme or more extreme under the null hypothesis
-(that there is no assocation between age and methylation level). If we do
-many more than 20 such tests, we can expect that for a lot of the tests, the
-null hypothesis is true, but we will still observe extreme results.
-To demonstrate this, it's useful to see what happens if
-we scramble age and run the same test again:
+With such a large number of features, it would be useful to decide which
+features are "interesting" or "significant" for further study. However,
+if we were to apply a normal significance threshold of 0.05, it's likely
+that we'd end up with a lot of false positives. That's because a p-value
+threshold like this represents a $\frac{1}{20}$ chance that we'd observe
+results as extreme or more extreme under the null hypothesis (that there
+is no association between age and methylation level). If we do many more
+than 20 such tests, we can expect that for a lot of the tests, the null
+hypothesis is true, but we will still observe extreme results. To
+demonstrate this, it's useful to see what happens if we scramble age and
+run the same test again:

```{r volcplotfake, fig.cap="Plotting p-values against effect sizes for a randomised outcome shows we still observe 'significant' results.", fig.alt="Plot of -log10(p) against effect size estimates for a regression of a made-up feature against methylation level for each feature in the data. 
A dashed line represents a 0.05 significance level."} @@ -904,94 +879,92 @@ plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value), abline(h = -log10(0.05), lty = "dashed", col = "red") ``` -Since we've generated a random sequence of ages, we have no reason to suspect -that there is a true association between methylation levels and this sequence -of random numbers. However, you can see that the p-value for many features is -still lower than a traditional significance level of $p=0.05$. In fact, here -`r sum(toptab_age_perm$P.Value < 0.05)` features are significant at p < 0.05. -If we were to use this fixed threshold in a real experiment, it's likely that -we'd identify many features as associated with age, when the results we're -observing are simply due to chance. +Since we've generated a random sequence of ages, we have no reason to +suspect that there is a true association between methylation levels and +this sequence of random numbers. However, you can see that the p-value +for many features is still lower than a traditional significance level +of $p=0.05$. In fact, here `r sum(toptab_age_perm$P.Value < 0.05)` +features are significant at p \< 0.05. If we were to use this fixed +threshold in a real experiment, it's likely that we'd identify many +features as associated with age, when the results we're observing are +simply due to chance. > ## Exercise > -> -> 1. If we run `r nrow(methylation)` tests under the null hypothesis, -> how many of them (on average) will be statistically -> significant at a threshold of $p < 0.05$? -> 2. Why would we want to be conservative in labelling features -> as significantly different? -> By conservative, we mean to err towards labelling true -> differences as "not significant" rather than vice versa. -> 3. How could we account for a varying number of tests to -> ensure "significant" changes are truly different? -> +> 1. If we run `r nrow(methylation)` tests under the null hypothesis, +> how many of them (on average) will be statistically significant at +> a threshold of $p < 0.05$? +> 2. Why would we want to be conservative in labelling features as +> significantly different? By conservative, we mean to err towards +> labelling true differences as "not significant" rather than vice +> versa. +> 3. How could we account for a varying number of tests to ensure +> "significant" changes are truly different? +> > > ## Solution -> > 1. By default we expect -> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$ -> > features to be statistically significant under the null hypothesis, -> > because p-values should always be uniformly distributed under -> > the null hypothesis. -> > 2. Features that we label as "significantly different" will often -> > be reported in manuscripts. We may also spend time and money -> > investigating them further, computationally or in the lab. -> > Therefore, spurious results have a real cost for ourselves and -> > for others. -> > 3. One approach to controlling for the number of tests is to -> > divide our significance threshold by the number of tests -> > performed. This is termed "Bonferroni correction" and -> > we'll discuss this further now. -> {: .solution} -{: .challenge} - +> > +> > 1. By default we expect +> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$ +> > features to be statistically significant under the null +> > hypothesis, because p-values should always be uniformly +> > distributed under the null hypothesis. +> > 2. 
Features that we label as "significantly different" will often
+> > be reported in manuscripts. We may also spend time and money
+> > investigating them further, computationally or in the lab.
+> > Therefore, spurious results have a real cost for ourselves and
+> > for others.
+> > 3. One approach to controlling for the number of tests is to divide
+> > our significance threshold by the number of tests performed.
+> > This is termed "Bonferroni correction" and we'll discuss this
+> > further now.
+> {: .solution}
+{: .challenge}

# Adjusting for multiple tests

-When performing many statistical tests to
-categorise features, we're effectively classifying
-features as "significant" - meaning those for which we reject the null
-hypothesis - and "non-significant". We also generally hope that there is a
-subset of features for which the null hypothesis is truly false, as well as many
-for which the null truly does hold. We hope that for all features for which the
-null hypothesis is true, we accept it, and for all features for which the null
-hypothesis is not true, we reject it. As we showed in the example with permuted
-age, with a large number of tests it's inevitable that we'll get some of these
-wrong.
-
-We can think of these features as being
-"truly different" or "not truly different"[^2].
-Using this idea, we can see that each
-categorisation we make falls into four categories:
-
-
-| |Label as different |Label as not different|
-|-------------------:|--------------------:|---------------------:|
-|Truly different |True positive |False negative |
-|Truly not different |False positive |True negative |
-
-
+When performing many statistical tests to categorise features, we're
+effectively classifying features as "significant" - meaning those for
+which we reject the null hypothesis - and "non-significant". We also
+generally hope that there is a subset of features for which the null
+hypothesis is truly false, as well as many for which the null truly does
+hold. We hope that for all features for which the null hypothesis is
+true, we accept it, and for all features for which the null hypothesis
+is not true, we reject it. As we showed in the example with permuted
+age, with a large number of tests it's inevitable that we'll get some of
+these wrong.
+
+We can think of these features as being "truly different" or "not truly
+different"[^1]. 
Using this idea, we can see that each categorisation we
+make falls into four categories:
+
+[^1]: "True difference" is a hard category to rigidly define. As we've
+    seen, with a lot of data, we can detect tiny differences, and with
+    little data, we can't detect large differences. However, both can be
+    argued to be "true".
+
+| | Label as different | Label as not different |
+|--------------------:|-------------------:|-----------------------:|
+| Truly different | True positive | False negative |
+| Truly not different | False positive | True negative |
+
+If the null hypothesis were true for every feature, then as we perform
+more and more tests we'd tend to correctly categorise most results as
+negative. However, since p-values are uniformly distributed under the
+null, at a significance level of 5%, 5% of all results will be
+"significant" even though we would expect to see these results, given
+the null hypothesis is true, simply by chance. These would fall under
+the label "false positives" in the table above, and are also termed
+"false discoveries."
+
+There are two common ways of controlling these false discoveries. The
+first is to say, when we're doing $n$ tests, that we want to have the
+same certainty of making one false discovery with $n$ tests as we have
+if we're only doing one test. This is "Bonferroni" correction,[^2] which
+divides the significance level by the number of tests performed, $n$.
+Equivalently, we can use the non-transformed p-value threshold but
+multiply our p-values by the number of tests. This is often very
+conservative, especially with a lot of features!
+
+[^2]: Bonferroni correction is also termed "family-wise" error rate
+    control.

```{r p-fwer, fig.cap="Bonferroni correction often produces very large p-values, especially with low sample sizes.", fig.alt="Plot of Bonferroni-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
ggplot() +
  aes(p_raw, p_fwer) +
  geom_point() +
  scale_x_log10() + scale_y_log10() +
  geom_abline(slope = 1, linetype = "dashed") +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
  labs(x = "Raw p-value", y = "Bonferroni p-value")
```

-You can see that the p-values are exactly one for the vast majority of tests
-we performed! This is not ideal sometimes, because unfortunately we usually
-don't have very large sample sizes in health sciences.
-
-The second main way of controlling for multiple tests
-is to control the *false discovery rate*.[^4]
-This is the proportion of false positives, or false discoveries,
-we'd expect to get each time if we repeated
-the experiment over and over.
-
-1. Rank the p-values
-2. Assign each a rank (1 is smallest)
-3. Calculate the critical value
-   $$
-   q = \left(\frac{i}{m}\right)Q
-   $$,
-   where $i$ is rank, $m$ is the number of tests, and $Q$ is the
-   false discovery rate we want to target.[^5]
-4. Find the largest p-value less than the critical value.
-   All smaller than this are significant.
-
-
-|FWER|FDR|
-|:-------------|:--------------|
-|+ Controls probability of identifying a false positive|+ Controls rate of false discoveries|
-|+ Strict error rate control |+ Allows error control with less stringency|
-|- Very conservative |- Does not control probability of making errors|
-|- Requires larger statistical power|- May result in false discoveries|
+You can see that the p-values are exactly one for the vast majority of
+tests we performed! This is often not ideal because, unfortunately, we
+usually don't have very large sample sizes in the health sciences.
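+
+As a quick check of the equivalence between these two views of
+Bonferroni correction, we can compare `p.adjust()` with the adjustment
+done by hand. This is a minimal sketch using a few made-up p-values
+(`p_example` is not part of the analysis above):
+
+```{r bonf-by-hand}
+p_example <- c(0.0001, 0.01, 0.05, 0.5)
+p.adjust(p_example, method = "bonferroni")
+# the same adjustment by hand: multiply by the number of tests, cap at 1
+pmin(p_example * length(p_example), 1)
+```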
+
+The second main way of controlling for multiple tests is to control the
+*false discovery rate*.[^3] This is the proportion of false positives,
+or false discoveries, we'd expect to get each time if we repeated the
+experiment over and over.
+
+[^3]: This is often called "Benjamini-Hochberg" adjustment.
+
+The procedure is as follows:
+
+1. Sort the p-values from smallest to largest
+2. Assign each one a rank (1 is smallest)
+3. Calculate the critical value
+    $$
+    q = \left(\frac{i}{m}\right)Q,
+    $$
+    where $i$ is rank, $m$ is the number of tests, and $Q$ is the
+    false discovery rate we want to target.[^4]
+4. Find the largest p-value that is smaller than its critical value.
+    This p-value, and all p-values smaller than it, are significant.
+
+[^4]: People often perform extra controls on FDR-adjusted p-values,
+    ensuring that ranks don't change and the critical value is never
+    smaller than the original p-value.
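+
+As a minimal sketch of this procedure using a handful of made-up
+p-values (none of these numbers come from the methylation analysis),
+with a target false discovery rate of $Q = 0.05$:
+
+```{r fdr-by-hand}
+p_sorted <- sort(c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205))
+m <- length(p_sorted)
+Q <- 0.05
+critical <- (seq_len(m) / m) * Q  # critical value (i/m) * Q for each rank i
+largest <- max(which(p_sorted < critical))  # largest p-value below its critical value
+p_sorted[seq_len(largest)]  # these p-values are deemed significant
+```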
+
+| FWER | FDR |
+|:----------------------------------|:------------------------------------|
+| \+ Controls probability of identifying a false positive | \+ Controls rate of false discoveries |
+| \+ Strict error rate control | \+ Allows error control with less stringency |
+| \- Very conservative | \- Does not control probability of making errors |
+| \- Requires larger statistical power | \- May result in false discoveries |

> ## Exercise
>
-> 1. At a significance level of 0.05, with 100 tests
-> performed, what is the Bonferroni significance
-> threshold?
-> 2. In a gene expression experiment, after FDR
-> correction we observe 500 significant genes.
-> What proportion of these genes are truly
-> different?
-> 3. Try running FDR correction on the `p_raw` vector.
-> *Hint: check `help("p.adjust")` to see what the method
-> is called*.
-> Compare these values to the raw p-values
-> and the Bonferroni p-values.
->
+> 1. At a significance level of 0.05, with 100 tests performed, what is
+> the Bonferroni significance threshold?
+> 2. In a gene expression experiment, after FDR correction we observe
+> 500 significant genes. What proportion of these genes are truly
+> different?
+> 3. Try running FDR correction on the `p_raw` vector. *Hint: check
+> `help("p.adjust")` to see what the method is called*.\
+> Compare these values to the raw p-values and the Bonferroni
+> p-values.
>
> > ## Solution
-> >
-> > 1. The Bonferroni threshold for this significance
-> > threshold is
-> > $$
-> > \frac{0.05}{100} = 0.0005
-> > $$
-> > 2. Trick question! We can't say what proportion
-> > of these genes are truly different. However, if
-> > we repeated this experiment and statistical test
-> > over and over, on average 5% of the results from
-> > each run would be false discoveries.
-> > 3. The following code runs FDR correction and compares it to
-> > non-corrected values and to Bonferroni:
-> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
-> > p_fdr <- p.adjust(p_raw, method = "BH")
-> > ggplot() +
-> >   aes(p_raw, p_fdr) +
-> >   geom_point() +
-> >   scale_x_log10() + scale_y_log10() +
-> >   geom_abline(slope = 1, linetype = "dashed") +
-> >   geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> >   geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> >   labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
-> > ```
-> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
-> > ggplot() +
-> >   aes(p_fdr, p_fwer) +
-> >   geom_point() +
-> >   scale_x_log10() + scale_y_log10() +
-> >   geom_abline(slope = 1, linetype = "dashed") +
-> >   geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
-> >   geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
-> >   labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
-> > ```
-> {: .solution}
-{: .challenge}
+> >
+> > 1. The Bonferroni threshold for this significance level is
+> >     $$
+> >     \frac{0.05}{100} = 0.0005
+> >     $$
+> >
+> > 2. Trick question! We can't say what proportion of these genes are
+> >     truly different. However, if we repeated this experiment and
+> >     statistical test over and over, on average 5% of the results
+> >     from each run would be false discoveries.
+> >
+> > 3. The following code runs FDR correction and compares it to
+> >     non-corrected values and to Bonferroni:
+> >
+> > ```{r p-fdr, fig.cap="Benjamini-Hochberg correction is less conservative than Bonferroni", fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against unadjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
+> > p_fdr <- p.adjust(p_raw, method = "BH")
+> > ggplot() +
+> >     aes(p_raw, p_fdr) +
+> >     geom_point() +
+> >     scale_x_log10() + scale_y_log10() +
+> >     geom_abline(slope = 1, linetype = "dashed") +
+> >     geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
+> >     geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
+> >     labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value")
+> > ```
+> >
+> > ```{r plot-fdr-fwer, fig.alt="Plot of Benjamini-Hochberg-adjusted p-values (y) against Bonferroni-adjusted p-values (x). A dashed black line represents the identity (where x=y), while dashed red lines represent 0.05 significance thresholds."}
+> > ggplot() +
+> >     aes(p_fdr, p_fwer) +
+> >     geom_point() +
+> >     scale_x_log10() + scale_y_log10() +
+> >     geom_abline(slope = 1, linetype = "dashed") +
+> >     geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
+> >     geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
+> >     labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value")
+> > ```
+> >
+> {: .solution}
+{: .challenge}

+> ## Feature selection
+>
+> In this episode, we have focussed on regression in a setting where there are more
+> features than observations. This approach is relevant if we are interested in the
+> association of each feature with some outcome or if we want to screen for features
+> that have a strong association with an outcome. 
If, however, we are interested in +> predicting an outcome or if we want to know which features explain the variation +> in the outcome, we may want to restrict ourselves to a subset of relevant features. +> One way of doing this is called *regularisation*, and this is the topic of the next episode. +> An alternative is called *feature selection*. This is covered in the subsequent (optional) episode. +{: .callout} -- [limma tutorial by Kasper D. Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) -- [limma user manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf). +## Further reading +- [limma tutorial by Kasper D. + Hansen](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html) +- [limma user + manual](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf). ## Footnotes -[^1]: It's not hugely problematic if the assumption of normal residuals is violated. It mainly affects our ability to accurately predict responses for new, unseen observations. - -[^2]: "True difference" is a hard category to rigidly define. As we've seen, with a lot of data, we can detect tiny differences, and with little data, we can't detect large differences. However, both can be argued to be "true". - -[^3]: Bonferroni correction is also termed "family-wise" error rate control. - -[^4]: This is often called "Benjamini-Hochberg" adjustment. - -[^5]: People often perform extra controls on FDR-adjusted p-values, ensuring that ranks don't change and the critical value is never smaller than the original p-value. - - - - {% include links.md %} From 556cd2b1f203fe66f6e0b6b7500ab71c24fc6327 Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:24:16 +0100 Subject: [PATCH 2/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - mention M-values Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 519770b2..2d7435b8 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -40,7 +40,7 @@ For the following few episodes, we will be working with human DNA methylation data from flow-sorted blood samples. DNA methylation assays measure, for each of many sites in the genome, the proportion of DNA that carries a methyl mark. In this case, the methylation data come in -the form of a matrix of normalised methylation levels, where negative +the form of a matrix of normalised methylation levels (M-values), where negative values correspond to unmethylated DNA and positive values correspond to methylated DNA. Along with this, we have a number of sample phenotypes (eg, age in years, BMI). 
From 7090e68836f4514967a83ca8d1de23d4bdcc1a3c Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:27:57 +0100 Subject: [PATCH 3/8] episode 2 - explain methyl mark --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 2d7435b8..fe068dbc 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -39,7 +39,7 @@ knitr_fig_path("02-") For the following few episodes, we will be working with human DNA methylation data from flow-sorted blood samples. DNA methylation assays measure, for each of many sites in the genome, the proportion of DNA -that carries a methyl mark. In this case, the methylation data come in +that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in the form of a matrix of normalised methylation levels (M-values), where negative values correspond to unmethylated DNA and positive values correspond to methylated DNA. Along with this, we have a number of sample phenotypes From 24e99f58032b29d4e3d9de4148f7c99ad77c659b Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:28:36 +0100 Subject: [PATCH 4/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - add "normalised"" Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index fe068dbc..a1dbd79b 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -56,7 +56,7 @@ methylation <- readRDS(here("data/methylation.rds")) This `methylation` object is a `GenomicRatioSet`, a Bioconductor data object derived from the `SummarizedExperiment` class. These `SummarizedExperiment` objects contain `assay`s, in this case -methylation levels, and optional sample-level `colData` and +normalised methylation levels, and optional sample-level `colData` and feature-level `metadata`. These objects are very convenient to contain all of the information about a dataset in a high-throughput context. If you would like more detail on these objects it may be useful to consult From 01f87149e48f2da3b8a8d6156bb2a96aef61a514 Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:29:11 +0100 Subject: [PATCH 5/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - parentheses Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index a1dbd79b..a7d158c3 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -70,7 +70,7 @@ methylation You can see in this output that this object has a `dim()` of $`r nrow(methylation)` \times `r ncol(methylation)`$, meaning it has `r nrow(methylation)` features and `r ncol(methylation)` columns. To -extract the matrix of methylation values (M-values), we can use the +extract the matrix of methylation M-values, we can use the `assay()` function. 
One thing to bear in mind with these objects (and data structures for computational biology in R generally) is that in the matrix of methylation data, samples or observations are stored as From e2920f0f058311357ed44e861f0af9d0e7c450e9 Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:30:28 +0100 Subject: [PATCH 6/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - remove mention of number of features Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index a7d158c3..22e99872 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -138,7 +138,7 @@ Heatmap(methyl_mat_ord, We would like to identify features that are related to our outcome of interest (age). It is clear from the heatmap that there are too many -features to do so manually, even with this reduced number of features - +features to do so manually. the original dataset contained over 800,000! > ## Exercise From bccf673c1c97d29204e96cb5f5d27aebdbf904c7 Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:30:46 +0100 Subject: [PATCH 7/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - remove mention of number of features 2nd Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 1 - 1 file changed, 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 22e99872..f7968beb 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -139,7 +139,6 @@ Heatmap(methyl_mat_ord, We would like to identify features that are related to our outcome of interest (age). It is clear from the heatmap that there are too many features to do so manually. -the original dataset contained over 800,000! > ## Exercise > From 347626a360a9298e80fdc18c261b725854584a4c Mon Sep 17 00:00:00 2001 From: Hannes Becher Date: Mon, 13 Jun 2022 11:31:14 +0100 Subject: [PATCH 8/8] Update _episodes_rmd/02-high-dimensional-regression.Rmd - remove mention of CpG sites Co-authored-by: Catalina Vallejos --- _episodes_rmd/02-high-dimensional-regression.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index f7968beb..24ebccc4 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -74,7 +74,7 @@ extract the matrix of methylation M-values, we can use the `assay()` function. One thing to bear in mind with these objects (and data structures for computational biology in R generally) is that in the matrix of methylation data, samples or observations are stored as -columns, while features (in this case, CpG sites) are stored as rows. +columns, while features (in this case, sites in the genome) are stored as rows. This is in contrast to usual tabular data, where features or variables are stored as columns and observations are stored as rows.