diff --git a/episodes/multi-sample.Rmd b/episodes/multi-sample.Rmd index 1d3385b..f48531c 100644 --- a/episodes/multi-sample.Rmd +++ b/episodes/multi-sample.Rmd @@ -199,16 +199,13 @@ summed The main advantage of using pseudo-bulk samples is the possibility to use well-tested methods for differential analysis like `edgeR` and `DESeq2`, we will -focus on the former for this analysis. `edgeR` uses a Negative Binomial -Generalized Linear Model. +focus on the former for this analysis. `edgeR` and `DESeq2` both use negative binomial models under the hood, but differ in their normalization strategies and other implementation details. First, let's start with a specific cell type, for instance the "Mesenchymal stem -cells", and look into differences between this cell type across conditions. +cells", and look into differences between this cell type across conditions. We put the counts table into a `DGEList` container called `y`, along with the corresponding metadata. ```{r} -label <- "Mesenchyme" - -current <- summed[,label == summed$celltype.mapped] +current <- summed[, summed$celltype.mapped == "Mesenchyme"] y <- DGEList(counts(current), samples = colData(current)) @@ -241,7 +238,7 @@ y <- y[keep,] summary(keep) ``` -We can now proceed to normalize the data There are several approaches for +We can now proceed to normalize the data. There are several approaches for normalizing bulk, and hence pseudo-bulk data. Here, we use the Trimmed Mean of M-values method, implemented in the `edgeR` package within the `calcNormFactors` function. Keep in mind that because we are going to normalize the pseudo-bulk @@ -274,7 +271,7 @@ par(mfrow = c(1,1)) Furthermore, we want to check if the samples cluster together based on their known factors (like the tomato injection in this case). -To do so, we use the MDS plot, which is very close to a PCA representation. +In this case, we'll use the multidimensional scaling (MDS) plot. Multidimensional scaling (which also goes by principal *coordinate* analysis (PCoA)) is a dimensionality reduction technique that's conceptually similar to principal *component* analysis (PCA). ```{r} limma::plotMDS(cpm(y, log = TRUE), @@ -308,7 +305,6 @@ Additionally, the Common and Trend BCV are shown in `red` and `blue`. plotBCV(y) ``` - We then fit a Quasi-Likelihood (QL) negative binomial generalized linear model for each gene. The `robust = TRUE` parameter avoids distortions from highly variable clusters. The QL method includes an additional dispersion parameter, useful to handle the uncertainty and variability of the per-gene variance, which is not well estimated by the NB dispersions, so the two dispersion types complement each other in the final analysis. @@ -419,9 +415,10 @@ data means by the total number of cells in each sample (cell type). On the other hand, this can lead our data to be susceptible to compositional effect. "Compositional" refers to the fact that the cluster abundances in a sample are not independent of one another because each cell type is effectively -competing for space in the sample. They behave like proportions. If cell type A -abundance increases in a new condition, that means we'll observe less of -everything else, even if everything else is unaffected by the new condition. +competing for space in the sample. They behave like proportions in that they +must sum to 1. If cell type A abundance increases in a new condition, that means +we'll observe less of everything else, even if everything else is unaffected by +the new condition. Compositionality means that our conclusions can be biased by the amount of cells present in each cell type. And this amount of cells can be totally unbalanced @@ -466,7 +463,7 @@ topTags(res2, n = 10) ### Testing against a log-fold change threshold -This other approach assumes that the composition bias introduces a spurious +A second approach assumes that the composition bias introduces a spurious log2-fold change of no more than a \tau quantity for a non-DA label. In other words, we interpret this as the maximum log-fold change in the total