Skip to content

Commit

Permalink
Add explanations and examples on rarefaction (#487)
Browse files Browse the repository at this point in the history
Co-authored-by: Leo Lahti <[email protected]>
Co-authored-by: Tuomas Borman <[email protected]>
Co-authored-by: TuomasBorman <[email protected]>
  • Loading branch information
4 people authored Jul 23, 2024
1 parent aa00c4c commit d81f29d
Showing 1 changed file with 176 additions and 4 deletions.
180 changes: 176 additions & 4 deletions inst/pages/beta_diversity.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,10 @@ Sample dissimilarity can be visualized on a lower-dimensional display (typically
provides tools to incorporate additional information encoded by color, shape,
size and other aesthetics. Can you find any difference between the groups?

```{r plot-mds-bray-curtis, fig.cap = "MDS plot based on the Bray-Curtis distances on the GlobalPattern dataset."}
```{r}
#| label: MDS plot based on the Bray-Curtis distances
#| fig-cap: MDS plot based on the Bray-Curtis distances on the GlobalPattern dataset.
# Create ggplot object
p <- plotReducedDim(tse, "MDS_bray", colour_by = "Group")
Expand Down Expand Up @@ -236,6 +239,177 @@ tse <- runMDS(
plotReducedDim(tse, "Unifrac", colour_by = "Group")
```

### Rarefaction to mitigate impacts of uneven sequencing effort

The sequencing depth of a sample refers to the number of metagenomic reads
obtained from the sequencing process. A variation in sequencing depth across the
samples of a study may impact the calculation of alpha and beta diversity
metrics [@Schloss2023].
It is common to find significant variation in sequencing depth between samples.
For instance, the samples of the *TreeSummarizedExperiment* dataset
`GlobalPatterns` show up to a 40-fold difference in the number of metagenomic
reads.

```{r}
#| label: rarefaction1
# Calculate the list of sequencing depths across samples
sequencing_depths <- colSums(assay(tse))
# Calculate variation between highest and lowest sequencing depth
depth_variation <- max(sequencing_depths)/min(sequencing_depths)
depth_variation
```

To address uneven sequencing effort, rarefaction aims to normalize metagenomic
reads counts using subsampling.
First, the user chooses the rarefaction depth and a number of iterations N. All
the samples with metagenomic reads count below the rarefaction depth are removed
and metagenomic reads are randomly drawn from the samples left to get subsamples
fitting the rarefaction depth. Then a beta diversity metric is calculated from
those subsamples and the process is iterated N times. Finally, beta diversity is
estimated with the mean of all beta diversity values calculated on subsampled
data.

There has been a long-lasting controversy surrounding the use of rarefaction in
microbial ecology. The main concern is that rarefaction would omit data
[@mcmurdie2014waste] [@Schloss2024rarefaction2].
However, if the subsampling process is repeated a sufficient number of times,
and if the rarefaction depth is set to the lowest metagenomic reads count found
across all samples, no data will be omitted.
Moreover, Patrick Schloss has demonstrated that rarefaction is "the only method
that could control for variation in uneven sequencing effort when measuring
commonly used alpha and beta diversity metrics" [@Schloss2023].

Before applying rarefaction, selecting the most variable features can help
minimize variation caused by random subsampling. These features have the highest
read counts, while rare features tend to increase sampling variation. This
approach facilitates comparison of results between non-rarefied and rarefied
distance calculations later on.

```{r}
#| label: rarefaction2
# Let us see what happens when we operate with ntop highest variance features
ntop <- 5
tse_sub <- tse[head(rev(order(rowSds(assay(tse, "counts")))), ntop), ]
```

Let us first convert the count assay to centered log ratio (clr) assay and
calculate Aitchison distance without rarefaction:

```{r}
#| label: Aitchison distance without rarefaction
# Centered log-ratio transformation to properly apply Aitchison distance
tse_sub <- transformAssay(
tse_sub,
assay.type = "counts",
method = "clr",
pseudocount = 1
)
# Run MDS on clr assay with Aitchison distance
tse_sub <- runMDS(
tse_sub,
FUN = vegan::vegdist,
method = "euclidean",
assay.type = "clr",
name = "MDS_aitchison"
)
```

The function `vegan::avgdist` can use rarefaction to compute beta diversity:


```{r}
#| label = "MDS plot based on the Bray-Curtis distances using rarefaction."
# Define custom transformation function..
clr <- function (x) {
vegan::decostand(x, method="clr", pseudocount=1)
}
# Run transformation after rarefactions before calculating the beta diversity..
tse_sub <- runMDS(tse_sub,
assay.type="counts",
ntop = nrow(tse_sub),
FUN = vegan::avgdist, # Custom beta diversity function that includes rarefaction
sample = min(colSums(assay(tse_sub, "counts"))), # rarefaction depth
distfun = vegan::vegdist, # Use vegdist beta diversity function
dmethod = "euclidean", # euclidean beta diversity
iterations = 10,
transf=clr,
replace=TRUE,
name = "MDS_aitchison_rarefied"
)
plotReducedDim(tse_sub, "MDS_aitchison_rarefied")
# Generate plots for non-rarefied and rarefied Bray-Curtis distances scaled by
# MDS
plots <- lapply(
c("MDS_aitchison", "MDS_aitchison_rarefied"),
plotReducedDim,
object = tse_sub)
# Generate multi-panel plot
wrap_plots(plots) +
plot_layout(guides = "collect")
```

```{r}
#| label = rarefaction3
# Check correlation between non-rarefied and rarefied Aichitson distances
print(all.equal(
current = cor(
as.vector(reducedDim(tse_sub, "MDS_aitchison")),
as.vector(reducedDim(tse_sub, "MDS_aitchison_rarefied"))
),
target = 1,
tolerance = 0.05))
```

`ntop` is a `runMDS` option. Here it is set to total number of features in
`GlobalPatterns` dataset so that they are not filtered. If ntop was not set to
the total number of features, only the ntop features with highest variance would
be used for dimensionality reduction and the other features would be filtered.

The argument `sample` is set to the smallest metagenomic reads count across
all samples. This ensures that no sample will be removed during the rarefaction
process.

The argument `iterations` is by default set to 100 but 10 iterations is often
sufficient for beta diversity calculations.

To use transformations after the rarefaction of the samples and before the beta
diversity calculation, `vegan::avgdist` has the argument `transf`.

We can also plot the correlation between principal coordinates PCx and PCy for
both the rarefied and non-rarefied distance calculations:

```{r}
#| label: PC_correlation
par(mfrow=c(1, 2))
for (k in 1:2) {
# Principal axes are sign invariant;
# align the signs; if the correlation is negative then swap the other axis
pcx <- reducedDim(tse_sub, "MDS_aitchison")[,k]
pcy <- reducedDim(tse_sub, "MDS_aitchison_rarefied")[,k]
pcy <- sign(cor(pcx, pcy)) * pcy
r <- cor(pcx, pcy)
plot(pcx, pcy, main=paste0("PC", k, "; r=",round(r, 3)));
abline(0, 1)
}
```

The mia package includes the rarefaction function `rarefyAssay` that randomly
subsamples a given assay of a `TreeSummarizedExperiment` dataset. However, the
iterations and beta diversity calculations are not included.

### Other ordination methods {#sec-other-ord-methods}

Other dimension reduction methods, such as PCA and UMAP, are inherited from the
Expand Down Expand Up @@ -292,13 +466,11 @@ would report relative stress, which varies within the unit interval and is
better if smaller. This can be calculated as shown below.

```{r relstress}
# Load vegan package
library(vegan)
# Quantify dissimilarities in the original feature space
# Pick relabunance assay separately
x <- assay(tse, "relabundance")
d0 <- as.matrix(vegdist(t(x), "bray"))
d0 <- as.matrix(vegan::vegdist(t(x), "bray"))
# PCoA Ordination
pcoa <- as.data.frame(cmdscale(d0, k = 2))
Expand Down

0 comments on commit d81f29d

Please sign in to comment.