diff --git a/inst/pages/beta_diversity.qmd b/inst/pages/beta_diversity.qmd index 2f53eeed..270e5ae5 100644 --- a/inst/pages/beta_diversity.qmd +++ b/inst/pages/beta_diversity.qmd @@ -148,7 +148,10 @@ Sample dissimilarity can be visualized on a lower-dimensional display (typically provides tools to incorporate additional information encoded by color, shape, size and other aesthetics. Can you find any difference between the groups? -```{r plot-mds-bray-curtis, fig.cap = "MDS plot based on the Bray-Curtis distances on the GlobalPattern dataset."} +```{r} +#| label: MDS plot based on the Bray-Curtis distances +#| fig-cap: MDS plot based on the Bray-Curtis distances on the GlobalPattern dataset. + # Create ggplot object p <- plotReducedDim(tse, "MDS_bray", colour_by = "Group") @@ -236,6 +239,177 @@ tse <- runMDS( plotReducedDim(tse, "Unifrac", colour_by = "Group") ``` +### Rarefaction to mitigate impacts of uneven sequencing effort + +The sequencing depth of a sample refers to the number of metagenomic reads +obtained from the sequencing process. A variation in sequencing depth across the +samples of a study may impact the calculation of alpha and beta diversity +metrics [@Schloss2023]. +It is common to find significant variation in sequencing depth between samples. +For instance, the samples of the *TreeSummarizedExperiment* dataset +`GlobalPatterns` show up to a 40-fold difference in the number of metagenomic +reads. + +```{r} +#| label: rarefaction1 + +# Calculate the list of sequencing depths across samples +sequencing_depths <- colSums(assay(tse)) +# Calculate variation between highest and lowest sequencing depth +depth_variation <- max(sequencing_depths)/min(sequencing_depths) +depth_variation +``` + +To address uneven sequencing effort, rarefaction aims to normalize metagenomic +reads counts using subsampling. +First, the user chooses the rarefaction depth and a number of iterations N. All +the samples with metagenomic reads count below the rarefaction depth are removed +and metagenomic reads are randomly drawn from the samples left to get subsamples +fitting the rarefaction depth. Then a beta diversity metric is calculated from +those subsamples and the process is iterated N times. Finally, beta diversity is +estimated with the mean of all beta diversity values calculated on subsampled +data. + +There has been a long-lasting controversy surrounding the use of rarefaction in +microbial ecology. The main concern is that rarefaction would omit data +[@mcmurdie2014waste] [@Schloss2024rarefaction2]. +However, if the subsampling process is repeated a sufficient number of times, +and if the rarefaction depth is set to the lowest metagenomic reads count found +across all samples, no data will be omitted. +Moreover, Patrick Schloss has demonstrated that rarefaction is "the only method +that could control for variation in uneven sequencing effort when measuring +commonly used alpha and beta diversity metrics" [@Schloss2023]. + +Before applying rarefaction, selecting the most variable features can help +minimize variation caused by random subsampling. These features have the highest +read counts, while rare features tend to increase sampling variation. This +approach facilitates comparison of results between non-rarefied and rarefied +distance calculations later on. + +```{r} +#| label: rarefaction2 + +# Let us see what happens when we operate with ntop highest variance features +ntop <- 5 +tse_sub <- tse[head(rev(order(rowSds(assay(tse, "counts")))), ntop), ] +``` + +Let us first convert the count assay to centered log ratio (clr) assay and +calculate Aitchison distance without rarefaction: + +```{r} +#| label: Aitchison distance without rarefaction + +# Centered log-ratio transformation to properly apply Aitchison distance + +tse_sub <- transformAssay( + tse_sub, + assay.type = "counts", + method = "clr", + pseudocount = 1 + ) + +# Run MDS on clr assay with Aitchison distance +tse_sub <- runMDS( + tse_sub, + FUN = vegan::vegdist, + method = "euclidean", + assay.type = "clr", + name = "MDS_aitchison" + ) +``` + +The function `vegan::avgdist` can use rarefaction to compute beta diversity: + + +```{r} +#| label = "MDS plot based on the Bray-Curtis distances using rarefaction." + +# Define custom transformation function.. +clr <- function (x) { + vegan::decostand(x, method="clr", pseudocount=1) +} + +# Run transformation after rarefactions before calculating the beta diversity.. +tse_sub <- runMDS(tse_sub, + assay.type="counts", + ntop = nrow(tse_sub), + FUN = vegan::avgdist, # Custom beta diversity function that includes rarefaction + sample = min(colSums(assay(tse_sub, "counts"))), # rarefaction depth + distfun = vegan::vegdist, # Use vegdist beta diversity function + dmethod = "euclidean", # euclidean beta diversity + iterations = 10, + transf=clr, + replace=TRUE, + name = "MDS_aitchison_rarefied" + ) + +plotReducedDim(tse_sub, "MDS_aitchison_rarefied") + +# Generate plots for non-rarefied and rarefied Bray-Curtis distances scaled by +# MDS +plots <- lapply( + c("MDS_aitchison", "MDS_aitchison_rarefied"), + plotReducedDim, + object = tse_sub) + +# Generate multi-panel plot +wrap_plots(plots) + + plot_layout(guides = "collect") +``` + +```{r} +#| label = rarefaction3 + +# Check correlation between non-rarefied and rarefied Aichitson distances +print(all.equal( + current = cor( + as.vector(reducedDim(tse_sub, "MDS_aitchison")), + as.vector(reducedDim(tse_sub, "MDS_aitchison_rarefied")) + ), + target = 1, + tolerance = 0.05)) +``` + +`ntop` is a `runMDS` option. Here it is set to total number of features in +`GlobalPatterns` dataset so that they are not filtered. If ntop was not set to +the total number of features, only the ntop features with highest variance would +be used for dimensionality reduction and the other features would be filtered. + +The argument `sample` is set to the smallest metagenomic reads count across +all samples. This ensures that no sample will be removed during the rarefaction +process. + +The argument `iterations` is by default set to 100 but 10 iterations is often +sufficient for beta diversity calculations. + +To use transformations after the rarefaction of the samples and before the beta +diversity calculation, `vegan::avgdist` has the argument `transf`. + +We can also plot the correlation between principal coordinates PCx and PCy for +both the rarefied and non-rarefied distance calculations: + +```{r} +#| label: PC_correlation +par(mfrow=c(1, 2)) +for (k in 1:2) { + + # Principal axes are sign invariant; + # align the signs; if the correlation is negative then swap the other axis + pcx <- reducedDim(tse_sub, "MDS_aitchison")[,k] + pcy <- reducedDim(tse_sub, "MDS_aitchison_rarefied")[,k] + pcy <- sign(cor(pcx, pcy)) * pcy + r <- cor(pcx, pcy) + + plot(pcx, pcy, main=paste0("PC", k, "; r=",round(r, 3))); + abline(0, 1) +} +``` + +The mia package includes the rarefaction function `rarefyAssay` that randomly +subsamples a given assay of a `TreeSummarizedExperiment` dataset. However, the +iterations and beta diversity calculations are not included. + ### Other ordination methods {#sec-other-ord-methods} Other dimension reduction methods, such as PCA and UMAP, are inherited from the @@ -292,13 +466,11 @@ would report relative stress, which varies within the unit interval and is better if smaller. This can be calculated as shown below. ```{r relstress} -# Load vegan package -library(vegan) # Quantify dissimilarities in the original feature space # Pick relabunance assay separately x <- assay(tse, "relabundance") -d0 <- as.matrix(vegdist(t(x), "bray")) +d0 <- as.matrix(vegan::vegdist(t(x), "bray")) # PCoA Ordination pcoa <- as.data.frame(cmdscale(d0, k = 2))