From d85b18720f1f5cfef26b683e188cc21c88ed6046 Mon Sep 17 00:00:00 2001 From: Artur Sannikov <40318410+artur-sannikov@users.noreply.github.com> Date: Mon, 29 Jul 2024 21:02:27 +0000 Subject: [PATCH] Remove trailing whitespaces (#588) --- DESCRIPTION | 7 +- inst/pages/acknowledgments.qmd | 18 +- inst/pages/agglomeration.qmd | 20 +- inst/pages/alpha_diversity.qmd | 18 +- inst/pages/beta_diversity.qmd | 102 +++---- inst/pages/clustering.qmd | 167 ++++++------ inst/pages/containers.qmd | 38 +-- inst/pages/convert.qmd | 10 +- inst/pages/correlation.qmd | 9 +- inst/pages/cross_correlation.qmd | 22 +- inst/pages/differential_abundance.qmd | 100 +++---- inst/pages/exercises.qmd | 248 +++++++++--------- inst/pages/extra_material.qmd | 56 ++-- inst/pages/import.qmd | 32 +-- inst/pages/integrated_learner.qmd | 129 ++++----- inst/pages/intro.qmd | 4 +- inst/pages/introductory_workflow.qmd | 54 ++-- .../introductory_workflow_dutch_version.qmd | 24 +- .../introductory_workflow_french_version.qmd | 24 +- inst/pages/machine_learning.qmd | 8 +- inst/pages/miaverse.qmd | 8 +- inst/pages/mmuphin_meta_analysis.qmd | 76 +++--- inst/pages/msea.qmd | 112 ++++---- inst/pages/multiassay_ordination.qmd | 12 +- inst/pages/network_comparison.qmd | 64 ++--- inst/pages/network_learning.qmd | 115 ++++---- inst/pages/quality_control.qmd | 76 +++--- inst/pages/resources.qmd | 76 +++--- inst/pages/session_info.qmd | 12 +- inst/pages/taxonomy.qmd | 10 +- inst/pages/training.qmd | 8 +- inst/pages/transformation.qmd | 25 +- inst/pages/visualization.qmd | 62 ++--- inst/pages/wrangling.qmd | 38 +-- 34 files changed, 902 insertions(+), 882 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 7ffe63013..260799172 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -2,7 +2,7 @@ Package: OMA Title: Orchestrating Microbiome Analysis with Bioconductor Version: 0.98.25 Date: 2024-04-26 -Authors@R: +Authors@R: c(person("Leo", "Lahti", role = c("aut"), comment = c(ORCID = "0000-0001-5537-637X")), person(given = "Tuomas", family = "Borman", role = c("aut", "cre"), @@ -11,11 +11,11 @@ Authors@R: person("Felix GM", "Ernst", email = "felix.gm.ernst@outlook.com", role = c("aut"), comment = c(ORCID = "0000-0001-5064-0928")), - person("and others", "(see the full list of contributors)", + person("and others", "(see the full list of contributors)", role = c("ctb")) ) Description: - This is a reference cookbook for **Microbiome Data Science** with + This is a reference cookbook for **Microbiome Data Science** with R and Bioconductor. License: CC BY-NC-SA 4.0 Encoding: UTF-8 @@ -59,6 +59,7 @@ Suggests: glmnet, glue, grid, + gtools, gsEasy, igraph, kableExtra, diff --git a/inst/pages/acknowledgments.qmd b/inst/pages/acknowledgments.qmd index 5860eb504..c85b281d8 100644 --- a/inst/pages/acknowledgments.qmd +++ b/inst/pages/acknowledgments.qmd @@ -31,7 +31,7 @@ coordinated by: and runs regular training workshops in microbiome data science. - *Tuomas Borman*, PhD researcher and the lead developer of OMA/mia at - the Department of Computing, University of Turku. + the Department of Computing, University of Turku. ### Contributors {-} @@ -70,7 +70,7 @@ and the OMA book - *Matti Ruuskanen, PhD*, added machine learning techniques for microbiome analysis - + - *Stefanie Peschel* has contributed chapters on the construction, analysis, and comparison of microbial association networks. @@ -80,15 +80,15 @@ and a Professor for Biomedical Statistics and Data Science at [LMU Munich](https://www.en.statistik.uni-muenchen.de/index.html). He assisted in writing the chapters on network learning and comparison. -- *Shigdel Rajesh, PhD* +- *Shigdel Rajesh, PhD* -- *Artur Sannikov* +- *Artur Sannikov* -- *Akewak Jeba* +- *Akewak Jeba* - *Himmi Lindgren* -- *Lu Yang* +- *Lu Yang* - *Katariina Pärnänen* @@ -114,8 +114,8 @@ in writing the chapters on network learning and comparison. -*Matteo Calgaro* -- *Jacques Serizay* converted the _OMA_ book to the `BiocBook` format. This -allows the _OMA_ book to be built and distributed by Bioconductor. +- *Jacques Serizay* converted the _OMA_ book to the `BiocBook` format. This +allows the _OMA_ book to be built and distributed by Bioconductor. - *Himel Mallick, PhD, FASA*, principal investigator and tenure-track faculty at Cornell University’s Department of Population Health Sciences and an adjunct @@ -124,7 +124,7 @@ Information Science. He contributed to the chapters on meta-analyses, microbe set enrichment analysis (MSEA) and multi-omics prediction and classification. - *Yihan Liu*, assisted Dr. Mallick in writing the chapters on meta-anlayses, -MSEA and multi-omics prediction and classification. +MSEA and multi-omics prediction and classification. ### Acknowledgments {-} diff --git a/inst/pages/agglomeration.qmd b/inst/pages/agglomeration.qmd index 2553b84f7..9ba96a97a 100644 --- a/inst/pages/agglomeration.qmd +++ b/inst/pages/agglomeration.qmd @@ -21,7 +21,7 @@ with counts aggregated from the lower-level taxa associated with them. ## Agglomerate data to certain rank {#sec-data-agglomeration} One of the main applications of taxonomic information in regards to count data -is to agglomerate count data on taxonomic levels and track the influence of +is to agglomerate count data on taxonomic levels and track the influence of changing conditions through these levels. For this `mia` contains the `agglomerateByRank` function. The ideal location to store the agglomerated data is as an alternative experiment. @@ -45,12 +45,12 @@ assayNames(altExp(tse, "Family")) ```{r} assay(altExp(tse, "Family"), "relabundance")[1:5, 1:7] ``` - + ```{r taxinfo_altexp_example} assay(altExp(tse, "Family"), "counts")[1:5, 1:7] ``` -`altExpNames` now consists of `Family` level data. This can be extended to use +`altExpNames` now consists of `Family` level data. This can be extended to use any taxonomic level listed in `taxonomyRanks(tse)`. We can also aggregate the data across all available ranks in one step using @@ -102,18 +102,18 @@ dim(tse_sub) ## Agglomerate based on prevalence -Rare taxa can also be aggregated into a single group "Other" instead of +Rare taxa can also be aggregated into a single group "Other" instead of filtering them out. A suitable function for this is `agglomerateByPrevalence`. -The number of rare taxa is higher on the species level, which causes the need +The number of rare taxa is higher on the species level, which causes the need for data agglomeration by prevalence. ```{r} altExp(tse, "Species_byPrevalence") <- agglomerateByPrevalence( - tse, - rank = "Species", - other.label = "Other", - prevalence = 5 / 100, - detection = 1 / 100, + tse, + rank = "Species", + other.label = "Other", + prevalence = 5 / 100, + detection = 1 / 100, as.relative = TRUE) altExp(tse, "Species_byPrevalence") diff --git a/inst/pages/alpha_diversity.qmd b/inst/pages/alpha_diversity.qmd index 381bddf30..7e61b5f13 100644 --- a/inst/pages/alpha_diversity.qmd +++ b/inst/pages/alpha_diversity.qmd @@ -53,7 +53,7 @@ indices of dominance or rarity: a high share of the total species abundance in the community. Note that dominance indices are generally inversely correlated with other alpha diversity indices. - + **Rarity** indices characterize the concentration of species at low abundance. Prevalence and detection thresholds determine rare species whose total concentration will determine the value of a @@ -112,11 +112,11 @@ type and final barcode). ```{r plot-div-obs, message=FALSE, fig.cap="Shannon diversity estimates plotted grouped by sample type with colour-labeled barcode.", cache=TRUE} library(scater) plotColData( - tse, - "observed", - "SampleType", + tse, + "observed", + "SampleType", colour_by = "Final_Barcode") + - theme(axis.text.x = element_text(angle = 45, hjust = 1)) + + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(expression(Richness[Observed])) ``` @@ -187,7 +187,7 @@ plots <- lapply( # Fine-tune visual appearance plots <- lapply( - plots, "+", + plots, "+", theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank())) @@ -197,7 +197,7 @@ wrap_plots(plots, ncol = 1) + plot_layout(guides = "collect") ``` -## Visualizing significance in group-wise comparisons +## Visualizing significance in group-wise comparisons Next, let's compare the Shannon index between sample groups and visualize the statistical significance. Using the `stat_compare_means` function from the @@ -233,8 +233,8 @@ pvals <- reshape( pvals, direction = "long", varying = colnames(pvals)[ !colnames(pvals) %in% varname ], - times = colnames(pvals)[ !colnames(pvals) %in% varname ], - v.names = "p", + times = colnames(pvals)[ !colnames(pvals) %in% varname ], + v.names = "p", timevar = "group2", idvar = "group1" ) |> diff --git a/inst/pages/beta_diversity.qmd b/inst/pages/beta_diversity.qmd index 270e5ae5d..dbdebb56c 100644 --- a/inst/pages/beta_diversity.qmd +++ b/inst/pages/beta_diversity.qmd @@ -34,7 +34,7 @@ terms dissimilarity and beta diversity are preferred. | Method description | Assay type | Beta diversity metric | |:---------------------------:|:-------------------:|:---------------------:| -| Quantitative profiling | Absolute counts | Bray-Curtis | +| Quantitative profiling | Absolute counts | Bray-Curtis | | Relative profiling | Relative abundances | Bray-Curtis | | Aitchison distance | Absolute counts | Aitchison | | Aitchison distance | clr | Euclidean | @@ -43,9 +43,9 @@ terms dissimilarity and beta diversity are preferred. | Presence/Absence similarity | Absolute counts | Jaccard | | Phylogenetic distance | Rarefied counts | Unifrac | -In practice, beta diversity is usually represented as a `dist` object. Such an +In practice, beta diversity is usually represented as a `dist` object. Such an object is a triangular matrix where the distance between each pair of samples -is encoded by a specific cell. +is encoded by a specific cell. This distance matrix can then undergo ordination, which is an important ecological tool to reduce the dimensionality of data for a more efficient analysis and visualization. Ordination techniques aim to capture as @@ -56,7 +56,7 @@ similarities in an optimal way, which is defined in different ways by different methods. Based on the type of algorithm, ordination methods in microbiome research can -be generally divided in two categories: +be generally divided in two categories: - unsupervised ordination - supervised ordination @@ -75,7 +75,7 @@ library(mia) data("GlobalPatterns", package = "mia") tse <- GlobalPatterns -# Beta diversity metrics like Bray-Curtis are often +# Beta diversity metrics like Bray-Curtis are often # applied to relabundances tse <- transformAssay( tse, assay.type = "counts", method = "relabundance") @@ -242,11 +242,11 @@ plotReducedDim(tse, "Unifrac", colour_by = "Group") ### Rarefaction to mitigate impacts of uneven sequencing effort The sequencing depth of a sample refers to the number of metagenomic reads -obtained from the sequencing process. A variation in sequencing depth across the -samples of a study may impact the calculation of alpha and beta diversity -metrics [@Schloss2023]. +obtained from the sequencing process. A variation in sequencing depth across the +samples of a study may impact the calculation of alpha and beta diversity +metrics [@Schloss2023]. It is common to find significant variation in sequencing depth between samples. -For instance, the samples of the *TreeSummarizedExperiment* dataset +For instance, the samples of the *TreeSummarizedExperiment* dataset `GlobalPatterns` show up to a 40-fold difference in the number of metagenomic reads. @@ -255,36 +255,36 @@ reads. # Calculate the list of sequencing depths across samples sequencing_depths <- colSums(assay(tse)) -# Calculate variation between highest and lowest sequencing depth +# Calculate variation between highest and lowest sequencing depth depth_variation <- max(sequencing_depths)/min(sequencing_depths) depth_variation ``` -To address uneven sequencing effort, rarefaction aims to normalize metagenomic -reads counts using subsampling. -First, the user chooses the rarefaction depth and a number of iterations N. All +To address uneven sequencing effort, rarefaction aims to normalize metagenomic +reads counts using subsampling. +First, the user chooses the rarefaction depth and a number of iterations N. All the samples with metagenomic reads count below the rarefaction depth are removed and metagenomic reads are randomly drawn from the samples left to get subsamples -fitting the rarefaction depth. Then a beta diversity metric is calculated from +fitting the rarefaction depth. Then a beta diversity metric is calculated from those subsamples and the process is iterated N times. Finally, beta diversity is -estimated with the mean of all beta diversity values calculated on subsampled +estimated with the mean of all beta diversity values calculated on subsampled data. There has been a long-lasting controversy surrounding the use of rarefaction in microbial ecology. The main concern is that rarefaction would omit data [@mcmurdie2014waste] [@Schloss2024rarefaction2]. -However, if the subsampling process is repeated a sufficient number of times, -and if the rarefaction depth is set to the lowest metagenomic reads count found -across all samples, no data will be omitted. -Moreover, Patrick Schloss has demonstrated that rarefaction is "the only method -that could control for variation in uneven sequencing effort when measuring +However, if the subsampling process is repeated a sufficient number of times, +and if the rarefaction depth is set to the lowest metagenomic reads count found +across all samples, no data will be omitted. +Moreover, Patrick Schloss has demonstrated that rarefaction is "the only method +that could control for variation in uneven sequencing effort when measuring commonly used alpha and beta diversity metrics" [@Schloss2023]. -Before applying rarefaction, selecting the most variable features can help +Before applying rarefaction, selecting the most variable features can help minimize variation caused by random subsampling. These features have the highest -read counts, while rare features tend to increase sampling variation. This -approach facilitates comparison of results between non-rarefied and rarefied -distance calculations later on. +read counts, while rare features tend to increase sampling variation. This +approach facilitates comparison of results between non-rarefied and rarefied +distance calculations later on. ```{r} #| label: rarefaction2 @@ -294,7 +294,7 @@ ntop <- 5 tse_sub <- tse[head(rev(order(rowSds(assay(tse, "counts")))), ntop), ] ``` -Let us first convert the count assay to centered log ratio (clr) assay and +Let us first convert the count assay to centered log ratio (clr) assay and calculate Aitchison distance without rarefaction: ```{r} @@ -304,7 +304,7 @@ calculate Aitchison distance without rarefaction: tse_sub <- transformAssay( tse_sub, - assay.type = "counts", + assay.type = "counts", method = "clr", pseudocount = 1 ) @@ -331,11 +331,11 @@ clr <- function (x) { } # Run transformation after rarefactions before calculating the beta diversity.. -tse_sub <- runMDS(tse_sub, +tse_sub <- runMDS(tse_sub, assay.type="counts", - ntop = nrow(tse_sub), + ntop = nrow(tse_sub), FUN = vegan::avgdist, # Custom beta diversity function that includes rarefaction - sample = min(colSums(assay(tse_sub, "counts"))), # rarefaction depth + sample = min(colSums(assay(tse_sub, "counts"))), # rarefaction depth distfun = vegan::vegdist, # Use vegdist beta diversity function dmethod = "euclidean", # euclidean beta diversity iterations = 10, @@ -346,7 +346,7 @@ tse_sub <- runMDS(tse_sub, plotReducedDim(tse_sub, "MDS_aitchison_rarefied") -# Generate plots for non-rarefied and rarefied Bray-Curtis distances scaled by +# Generate plots for non-rarefied and rarefied Bray-Curtis distances scaled by # MDS plots <- lapply( c("MDS_aitchison", "MDS_aitchison_rarefied"), @@ -371,22 +371,22 @@ print(all.equal( tolerance = 0.05)) ``` -`ntop` is a `runMDS` option. Here it is set to total number of features in -`GlobalPatterns` dataset so that they are not filtered. If ntop was not set to +`ntop` is a `runMDS` option. Here it is set to total number of features in +`GlobalPatterns` dataset so that they are not filtered. If ntop was not set to the total number of features, only the ntop features with highest variance would be used for dimensionality reduction and the other features would be filtered. -The argument `sample` is set to the smallest metagenomic reads count across -all samples. This ensures that no sample will be removed during the rarefaction -process. +The argument `sample` is set to the smallest metagenomic reads count across +all samples. This ensures that no sample will be removed during the rarefaction +process. -The argument `iterations` is by default set to 100 but 10 iterations is often +The argument `iterations` is by default set to 100 but 10 iterations is often sufficient for beta diversity calculations. -To use transformations after the rarefaction of the samples and before the beta +To use transformations after the rarefaction of the samples and before the beta diversity calculation, `vegan::avgdist` has the argument `transf`. -We can also plot the correlation between principal coordinates PCx and PCy for +We can also plot the correlation between principal coordinates PCx and PCy for both the rarefied and non-rarefied distance calculations: ```{r} @@ -407,7 +407,7 @@ for (k in 1:2) { ``` The mia package includes the rarefaction function `rarefyAssay` that randomly -subsamples a given assay of a `TreeSummarizedExperiment` dataset. However, the +subsamples a given assay of a `TreeSummarizedExperiment` dataset. However, the iterations and beta diversity calculations are not included. ### Other ordination methods {#sec-other-ord-methods} @@ -469,7 +469,7 @@ better if smaller. This can be calculated as shown below. # Quantify dissimilarities in the original feature space # Pick relabunance assay separately -x <- assay(tse, "relabundance") +x <- assay(tse, "relabundance") d0 <- as.matrix(vegan::vegdist(t(x), "bray")) # PCoA Ordination @@ -479,7 +479,7 @@ names(pcoa) <- c("PCoA1", "PCoA2") # Quantify dissimilarities in the ordination space dp <- as.matrix(dist(pcoa)) -# Calculate stress i.e. relative difference +# Calculate stress i.e. relative difference # in the original and projected dissimilarities stress <- sum((dp - d0)^2) / sum(d0^2) ``` @@ -494,10 +494,10 @@ df <- data.frame(d0 = as.vector(d0)[ord], ggplot(df, aes(x = d0, y = dmds)) + geom_smooth() + - geom_point() + + geom_point() + labs(title = "Shepard plot", x = "Original distance", - y = "MDS distance", + y = "MDS distance", subtitle = paste("Stress:", round(stress, 2))) + theme_bw() ``` @@ -513,7 +513,7 @@ between the supervised and unsupervised ordination methods. | | supervised ordination | unsupervised ordination | |:------------------------:|:----------------------:|:------------------------:| | Euclidean distance | RDA | PCA | -| non-Euclidean distance | dbRDA | PCoA/MDS, NMDS, UMAP | +| non-Euclidean distance | dbRDA | PCoA/MDS, NMDS, UMAP | In summary, the dbRDA is the more general method that allows a wider variety dissimilarity, or beta diversity, indices. This method is @@ -631,11 +631,11 @@ ggplot(df, aes(x = x, y = y)) + ``` In the example above, the largest differences between the two groups -can be attributed to _`r names(sort(abs(top_coef), decreasing = TRUE))[[1]]`_ +can be attributed to _`r names(sort(abs(top_coef), decreasing = TRUE))[[1]]`_ and _`r names(sort(abs(top_coef), decreasing = TRUE))[[2]] `_. -### Checking the homogeneity condition +### Checking the homogeneity condition It is important to note that the application of PERMANOVA assumes homogeneous group dispersions (variances). This can be tested with the @@ -757,11 +757,11 @@ percent_FALSE <- round(freq_FALSE / sum(freq_FALSE) * 100, 1) plotReducedDim( tse_genus[ , colData(tse_genus)$Group == TRUE], "PCoA_BC", colour_by = "most_abundant") + - + scale_colour_manual( values = my_colors, labels = paste0(names(percent_TRUE), "(", percent_TRUE, "%)")) + - + labs( x = paste("PC 1 (", round(var_explained[1], 1), "%)"), y = paste("PC 2 (", round(var_explained[2], 1), "%)"), @@ -770,11 +770,11 @@ plotReducedDim( plotReducedDim( tse_genus[ , colData(tse_genus)$Group == FALSE], "PCoA_BC", colour_by = "most_abundant") + - + scale_colour_manual( values = my_colors, labels = paste0(names(percent_FALSE), "(", percent_FALSE, "%)")) + - + labs( x = paste("PC 1 (", round(var_explained[1], 1), "%)"), y = paste("PC 2 (", round(var_explained[2], 1), "%)"), @@ -808,6 +808,6 @@ and `runTSNE` For more information on sample clustering, you can refer to: -* [How to extract information from clusters](http://bioconductor.org/books/release/OSCA/clustering.html) +* [How to extract information from clusters](http://bioconductor.org/books/release/OSCA/clustering.html) * Chapter [@sec-clustering] on community typing ::: diff --git a/inst/pages/clustering.qmd b/inst/pages/clustering.qmd index 152fdfaf7..3a26f7e29 100644 --- a/inst/pages/clustering.qmd +++ b/inst/pages/clustering.qmd @@ -5,36 +5,36 @@ library(rebook) chapterPreamble() ``` -Community typing in microbial ecology involves identifying distinct microbial -communities by recognizing patterns in the data. The community types are +Community typing in microbial ecology involves identifying distinct microbial +communities by recognizing patterns in the data. The community types are described based on taxa characteristics, representing each community. -The samples are described -by community type assignments, which are defined by the ecosystem features in +The samples are described +by community type assignments, which are defined by the ecosystem features in the sample. Community typing techniques can typically be divided -into unsupervised -clustering and dimensionality reduction techniques, where clustering is more +into unsupervised +clustering and dimensionality reduction techniques, where clustering is more commonly in use. In this chapter, we first walk you through some common clustering schemes in use for microbial community ecology, and then focus in more detailed community typing techniques based on data dimensionality reduction. -## Clustering +## Clustering Clustering techniques aim to find groups, called clusters, that share a pattern -in the data. In the microbiome context, clustering techniques are included in -microbiome community typing methods. For example, clustering allow samples to be +in the data. In the microbiome context, clustering techniques are included in +microbiome community typing methods. For example, clustering allow samples to be distinguished from each other based on their microbiome community composition. -Clustering scheme consists of two steps, the first is to compute the sample -dissimilarities with a given distance metrics, and the second is to form the +Clustering scheme consists of two steps, the first is to compute the sample +dissimilarities with a given distance metrics, and the second is to form the clusters based on the dissimilarity matrix. The data can be clustered either -based on features or samples. The examples below +based on features or samples. The examples below are focused on sample clustering. -There are multiple clustering algorithms available. *bluster* is a -Bioconductor package providing tools for clustering data in the -`SummarizedExperiment` container. It offers multiple algorithms such -as hierarchical clustering, DBSCAN, and K-means. +There are multiple clustering algorithms available. *bluster* is a +Bioconductor package providing tools for clustering data in the +`SummarizedExperiment` container. It offers multiple algorithms such +as hierarchical clustering, DBSCAN, and K-means. ```{r load_bluster} # Load dependencies @@ -43,11 +43,11 @@ library(kableExtra) ``` In the examples of this chapter we use *peerj13075* data for -microbiota community -typing. This chapter illustrates how different results can be obtained depending -on the choice of the algorithm. To reduce calculation time we decided to -agglomerate taxa onto 'Class' level and filter out the least prevalent taxa as -well as less commonly detected within a sample resulting in 25 features in the +microbiota community +typing. This chapter illustrates how different results can be obtained depending +on the choice of the algorithm. To reduce calculation time we decided to +agglomerate taxa onto 'Class' level and filter out the least prevalent taxa as +well as less commonly detected within a sample resulting in 25 features in the data. ```{r load-pkg-data1} @@ -64,23 +64,23 @@ altExp(tse, "prevalent") <- agglomerateByPrevalence( The hierarchical clustering aims to find hierarchy between samples/features. There are two approaches: agglomerative ("bottom-up") -and divisive ("top-down"). In agglomerative approach, each observation is first -in a unique cluster, after which the algorithm continues to agglomerate similar -data points into clusters. The divisive approach, instead, starts with one -cluster that contains all observations. Clusters are split recursively into -clusters that differ the most. The clustering can be continued until each +and divisive ("top-down"). In agglomerative approach, each observation is first +in a unique cluster, after which the algorithm continues to agglomerate similar +data points into clusters. The divisive approach, instead, starts with one +cluster that contains all observations. Clusters are split recursively into +clusters that differ the most. The clustering can be continued until each cluster contains only one observation. -In this example we use `addCluster` function from mia to cluster the data. -`addCluster` function allows to choose a clustering algorithm and offers multiple -parameters to shape the result. `HclustParam()` parameter is chosen for hierarchical -clustering. `HclustParam()` parameter itself has parameters on its own +In this example we use `addCluster` function from mia to cluster the data. +`addCluster` function allows to choose a clustering algorithm and offers multiple +parameters to shape the result. `HclustParam()` parameter is chosen for hierarchical +clustering. `HclustParam()` parameter itself has parameters on its own [HclustParam documentation](https://rdrr.io/github/LTLA/bluster/man/HclustParam-class.html). -A parameter, `by` defines whether to cluster features or samples. Here we -cluster counts data, for which we compute the dissimilarities with the -Bray-Curtis distance. Here, again, we use `ward.d2` metohod. Returning -statistical information can chosen by using -`full` parameter. Finally, the `clust.col` parameter allows us to choose the +A parameter, `by` defines whether to cluster features or samples. Here we +cluster counts data, for which we compute the dissimilarities with the +Bray-Curtis distance. Here, again, we use `ward.d2` metohod. Returning +statistical information can chosen by using +`full` parameter. Finally, the `clust.col` parameter allows us to choose the name of the column in the colData (default name is `clusters`). ```{r hclust1} @@ -100,8 +100,8 @@ table(altExp(tse, "prevalent")$hclust) Hierarchical clustering suggests six clusters to describe the data. -Now, we visualize the hierarchical structure of the clusters with a dendrogram -tree. In dendrograms, the tree is split where the branch length is the largest. +Now, we visualize the hierarchical structure of the clusters with a dendrogram +tree. In dendrograms, the tree is split where the branch length is the largest. In each splitting point, the tree is divided into two clusters leading to the hierarchy. In this example, each sample is labelled by their dominant taxon to visualize ecological differences between the clusters. @@ -133,8 +133,8 @@ par(mar=c(9, 3, 0.5, 0.5)) dend %>% set("labels_cex", 0.8) %>% plot() ``` -We can also visualize the clusters by by projecting the data onto -two-dimensional +We can also visualize the clusters by projecting the data onto +two-dimensional surface that captures the most variability in the data. In this example, we use multi-dimensional scaling (MDS). @@ -144,23 +144,22 @@ library (scater) # Add the MDS dimensions for plotting altExp(tse, "prevalent") <- runMDS( altExp(tse, "prevalent"), - assay.type = "counts", - FUN = vegan::vegdist, + assay.type = "counts", + FUN = vegan::vegdist, method = "bray") ``` ```{r hclust4, fig.width=6, fig.height=4} plotReducedDim( altExp(tse, "prevalent"), "MDS", colour_by = "hclust", - theme_size = 13, point_size = 4) + - + theme_size = 13, point_size = 4) + labs(color = "HClust", title = "Hclust") + theme_minimal() ``` ### Agglomeration based on clusters {#sec-taxa-clustering} Another way to agglomerate the data is to cluster the taxa. To do so, -we usually start by doing a compositionality aware transformation such as CLR, +we usually start by doing a compositionality aware transformation such as CLR, followed by the application of a standard clustering method. Here is an example that does a CLR transformation followed by the hierarchical @@ -184,14 +183,14 @@ tse <- addCluster( tse, assay.type = "standardize", clust.col = "hclustKendall", - MARGIN = "rows", + MARGIN = "rows", HclustParam(dist.fun = kendall_dissimilarity, method = "ward.D2")) ``` -Results are stored in the `rowData` column specified +Results are stored in the `rowData` column specified with the `clust.col` parameter. -To better visualize the results and the distribution of the clusters, we can +To better visualize the results and the distribution of the clusters, we can plot the histogram of the clusters. ```{r taxa_clustering_histogram} @@ -220,14 +219,14 @@ You can find more information on agglomeration from [@sec-agglomeration]. ### Dirichlet Multinomial Mixtures (DMM) This section focuses on -[Dirichlet-Multinomial MixturenModel](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126) -analysis. It is a Bayesian clustering technique that allows to search for sample +[Dirichlet-Multinomial MixturenModel](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126) +analysis. It is a Bayesian clustering technique that allows to search for sample patterns that reflect sample similarity in the data using both prior information and the data. -In this example, we cluster the data with DMM clustering. In the example below, +In this example, we cluster the data with DMM clustering. In the example below, we calculate model fit using Laplace approximation ro reduce the -calculation time. +calculation time. The cluster information is added in the metadata with an optional `name`. In this example, we use a prior assumption that the optimal number of clusters to @@ -237,21 +236,21 @@ describe the data would be in the range from 1 to 6. # Run the model and calculates the most likely number of clusters from 1 to 6 set.seed(1495723) altExp(tse, "prevalent") <- addCluster( - altExp(tse, "prevalent"), - assay.type = "counts", - name = "DMM", - DmmParam(k = 1:6, type = "laplace"), - MARGIN = "samples", - full = TRUE, + altExp(tse, "prevalent"), + assay.type = "counts", + name = "DMM", + DmmParam(k = 1:6, type = "laplace"), + MARGIN = "samples", + full = TRUE, clust.col = "dmmclust") ``` -The plot below represents the Laplace approximation to the model evidence for -each of the six models. We can see that the best number of clusters is two, as +The plot below represents the Laplace approximation to the model evidence for +each of the six models. We can see that the best number of clusters is two, as the minimum value of Laplace approximation to the negative log model -evidence for -DMM models as a function of k, determines the optimal k. The optimal k suggests -to fit a model with k mixtures of Dirichlet distributions as a prior. +evidence for +DMM models as a function of k, determines the optimal k. The optimal k suggests +to fit a model with k mixtures of Dirichlet distributions as a prior. ```{r dmm2, fig.width=5, fig.height=3} library(miaViz) @@ -259,7 +258,7 @@ p <- plotDMNFit(altExp(tse, "prevalent"), type = "laplace", name = "DMM") p + theme_minimal(base_size = 11) ``` -The plot above suggests that two clusters describe the data the best. Now +The plot above suggests that two clusters describe the data the best. Now we can plot, which taxa define the key cluster features. ```{r dmm3, fig.width=6, fig.height=2} @@ -273,45 +272,45 @@ drivers$class <- rownames(drivers) for (i in 1:2) { # Transfer to relative to optimize plotting drivers[i] <- drivers[i]/sum(drivers[i]) - + drivers <- drivers[order(drivers[[i]], decreasing = TRUE),] p <- ggplot(head(drivers, 5), aes( x = reorder(head(class, 5), + head(drivers[[i]], 5)), y = head(drivers[[i]], 5))) + - + geom_bar(stat = "identity", fill = scales::hue_pal()(1), alpha = 0.6) + - coord_flip() + + coord_flip() + labs(title = paste("Top phyla in group", i)) + - theme_minimal(base_size = 11) + - labs(x="", y="") + + theme_minimal(base_size = 11) + + labs(x="", y="") + scale_y_continuous(limits=c(0,1)) print(p) } ``` -As well as in hierarchical clustering, we can also visualize the clusters by +As well as in hierarchical clustering, we can also visualize the clusters by projecting the data onto two-dimensional surface calculated using MDS. The plot rather clearly demonstrates the cluster division. ```{r dmm4} # DMM clusters -plotReducedDim(altExp(tse, "prevalent"), "MDS", theme_size = 13) + - geom_point(size = 3, alpha = 0.6, +plotReducedDim(altExp(tse, "prevalent"), "MDS", theme_size = 13) + + geom_point(size = 3, alpha = 0.6, aes(color = altExp(tse, "prevalent")$dmmclust)) + labs(color = "DMMclust", title = "DMM") + theme_minimal() ``` ## Dimensionality reduction -Dimensionality reduction can be considered as a part of community typing, -where the aim is to find underlying structure in the data. Dimensionality +Dimensionality reduction can be considered as a part of community typing, +where the aim is to find underlying structure in the data. Dimensionality reduction relies in the idea that the data can be describe in a -lower rank representation, where latent features with different loads of the -observed ones describe the sample set. This representation stores the -information about sample dissimilarity with significantly reduced amount of -data dimensions. Unlike clustering, where each sample is assigned into one -cluster, dimensionality reduction allows continuous memberships for all samples -into all community types. Dimensionality reduction techniques are explained in +lower rank representation, where latent features with different loads of the +observed ones describe the sample set. This representation stores the +information about sample dissimilarity with significantly reduced amount of +data dimensions. Unlike clustering, where each sample is assigned into one +cluster, dimensionality reduction allows continuous memberships for all samples +into all community types. Dimensionality reduction techniques are explained in chapter [@sec-community-similarity] in more detail. ### Non-negative matrix factorization @@ -320,7 +319,7 @@ In this section, we will particularly focus on Non-negative Matrix Factorization (NMF). NMF decomposes the original data *X* into two lower rank matrices: *H* representing key taxa characteristics of each community type and *W* representing -continuous sample memberships across all community types. NMF algorithm +continuous sample memberships across all community types. NMF algorithm calculates the minimum distance between the original data and it's approximation using Kullback-Leibler divergence. @@ -359,7 +358,7 @@ for each sample onto a two-dimensional surface. plotReducedDim( altExp(tse, "prevalent"), "MDS", colour_by = "NMFcomponent", point_size = 3, theme_size = 13) + - + labs(color = "NMFcomponent", title = "NMF") + theme_minimal() ``` @@ -421,7 +420,7 @@ library(patchwork) # Elbow method p1 <- fviz_nbclust(assay(tse, "standardize"), kmeans, method = "wss") # Silhouette method -p2 <- fviz_nbclust(assay(tse, "standardize"), kmeans, method = "silhouette") +p2 <- fviz_nbclust(assay(tse, "standardize"), kmeans, method = "silhouette") p1 + p2 ``` @@ -440,7 +439,7 @@ colData(tse)[["clusters"]] Now we can create a heatmap with additional annotations. -```{r pheatmap8} +```{r pheatmap8, fig.width=13, fig.height=7} # Order the data based on clusters tse <- tse[ order(rowData(tse)[["clusters"]]), @@ -461,7 +460,7 @@ breaks <- seq( colors <- colorRampPalette(c("darkblue", "blue", "white", "red", "darkred"))(length(breaks)-1) pheatmap( - mat, annotation_row = taxa_clusters, + mat, annotation_row = taxa_clusters, annotation_col = sample_data, cluster_cols = FALSE, cluster_rows = FALSE, breaks = breaks, diff --git a/inst/pages/containers.qmd b/inst/pages/containers.qmd index fdf530667..2b0642b28 100644 --- a/inst/pages/containers.qmd +++ b/inst/pages/containers.qmd @@ -69,8 +69,8 @@ An assay is a way of measuring the presence and abundance of different types of microbes in a sample. For example, if you want to know how many bacteria of a certain type are in your gut, you can use an assay to measure this. When storing assays, the original data is count-based. However, the original -count-based taxonomic abundance tables may undergo different -transformations, such as logarithmic, Centered Log-Ratio (CLR), or relative +count-based taxonomic abundance tables may undergo different +transformations, such as logarithmic, Centered Log-Ratio (CLR), or relative abundance. These are typically stored in _**assays**_. See [@sec-assay-transform] for more information on transformations. @@ -90,7 +90,7 @@ assay(tse, "counts") |> head() So, in summary, in the world of microbiome analysis, an assay is essentially a way to quantify and understand the composition of microbes in a given sample, which is super important for all kinds of research, ranging from human health -to environment studies. +to environment studies. Furthermore, to illustrate the use of multiple assays, we can create an empty matrix and add it to the object. @@ -101,7 +101,7 @@ assay(tse, "empty_table", withDimnames=FALSE) <- mat assays(tse) ``` -Now there are two assays available in the `tse` object, `counts` and +Now there are two assays available in the `tse` object, `counts` and `empty_table`. ```{r} @@ -140,13 +140,13 @@ microbiome composition and various environmental or health factors. rowData(tse) ``` -## rowTree +## rowTree -Phylogenetic trees also play an important role in the microbiome field. The +Phylogenetic trees also play an important role in the microbiome field. The `TreeSE` class can keep track of features and node relations via two functions, `rowTree` and `rowLinks`. -A tree can be accessed via `rowTree` as `phylo` object. +A tree can be accessed via `rowTree` as `phylo` object. ```{r rowtree} rowTree(tse) @@ -158,7 +158,7 @@ The links to the individual features are available through `rowLinks`. rowLinks(tse) ``` -Please note that there can be a 1:1 relationship between tree nodes and +Please note that there can be a 1:1 relationship between tree nodes and features, but this is not a must-have. This means there can be features that are not linked to nodes, and nodes that are not linked to features. To change the links in an existing object, the `changeTree` function is available. @@ -177,9 +177,9 @@ sequencing. Another common use case is including abundance tables for different taxonomic ranks. Such alternative experiments concerning the same set of samples can be stored as -1. Separate _assays_ assuming that the taxonomic information can be mapped -between features directly 1:1; or -2. Data in the _altExp_ slot of the `TreeSE`, if the feature +1. Separate _assays_ assuming that the taxonomic information can be mapped +between features directly 1:1; or +2. Data in the _altExp_ slot of the `TreeSE`, if the feature dimensions differ. Each element of the _altExp_ slot is a `SE` or an object from a derived class with independent feature data. @@ -227,14 +227,14 @@ _**Multiple experiments**_ relate to complementary measurement types, such as transcriptomic or metabolomic profiling of the microbiome or the host. Multiple experiments can be represented using the same options as alternative experiments, or by using the -`MAE` class [@Ramos2017]. Depending on how the +`MAE` class [@Ramos2017]. Depending on how the datasets relate to each other the data can be stored as: 1. _altExp_ if the samples can be matched directly 1:1; or 2. As `MAE` objects, in which the connections between samples are defined through a `sampleMap`. Each element on the `ExperimentList` of an `MAE` is `matrix` or -`matrix`-like objects, including `SE` objects, and +`matrix`-like objects, including `SE` objects, and the number of samples can differ between the elements. The `MAE` object can handle more complex relationships between experiments. @@ -254,7 +254,7 @@ important bookkeeper, maintaining the information about which samples are associated with which experiments. This ensures that data linkages are correctly managed and preserved across different types of experiments. -In fact, we can have +In fact, we can have ```{r} #| label: show_mae2 @@ -287,16 +287,16 @@ samples belong to which patient. For information have a look at the [intro vignette](https://bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html) -of the `MultiAssayExperiment` package. +of the `MultiAssayExperiment` package. ::: {.callout-tip} ## Recommended options for storing multiple data tables in microbiome studies - Option Rows (features) Cols (samples) Recommended + Option Rows (features) Cols (samples) Recommended --------- -------------- --------------- ------------------------ - assays match match Data transformations - altExp free match Alternative experiments -MultiAssay free free (mapping) Multi-omic experiments + assays match match Data transformations + altExp free match Alternative experiments +MultiAssay free free (mapping) Multi-omic experiments : The _assays_ are best suited for data transformations (one-to-one match between diff --git a/inst/pages/convert.qmd b/inst/pages/convert.qmd index bc8fd6551..5686771e6 100644 --- a/inst/pages/convert.qmd +++ b/inst/pages/convert.qmd @@ -19,14 +19,14 @@ Biom, and DADA2). library(mia) # phyloseq example data -data(GlobalPatterns, package = "phyloseq") +data(GlobalPatterns, package = "phyloseq") GlobalPatterns_phyloseq <- GlobalPatterns GlobalPatterns_phyloseq ``` ```{r, message=FALSE} # convert phyloseq to TSE -GlobalPatterns_TSE <- convertFromPhyloseq(GlobalPatterns_phyloseq) +GlobalPatterns_TSE <- convertFromPhyloseq(GlobalPatterns_phyloseq) GlobalPatterns_TSE ``` @@ -38,7 +38,7 @@ instance when additional methods are available for `phyloseq`. ```{r, message=FALSE} # convert TSE to phyloseq -GlobalPatterns_phyloseq2 <- convertToPhyloseq(GlobalPatterns_TSE) +GlobalPatterns_phyloseq2 <- convertToPhyloseq(GlobalPatterns_TSE) GlobalPatterns_phyloseq2 ``` @@ -69,7 +69,7 @@ such as Julia or Python. For information, have a look at the #| label: write_feather #| eval: false -data(GlobalPatterns, package = "mia") +data(GlobalPatterns, package = "mia") tse <- GlobalPatterns molten_tse <- meltSE( @@ -86,7 +86,7 @@ write_feather(molten_tse, path) Another way could be using a CSV file. This works the same as for a feather file, make sure you have converted your `TreeSE` data container as a dataframe -Here note that you can decide whether you want to write the row names or not. +Here note that you can decide whether you want to write the row names or not. ```{r} #| label: write_csv diff --git a/inst/pages/correlation.qmd b/inst/pages/correlation.qmd index 0b894a0a5..0c8157187 100644 --- a/inst/pages/correlation.qmd +++ b/inst/pages/correlation.qmd @@ -49,7 +49,8 @@ library(qgraph) # Create correlation network plot qgraph( res$cor, layout = "spring", labels = colnames(res$cor), - label.cex = 3, theme = "colorblind") + label.cex = 1.2, theme = "colorblind", + node.width = 1.5, node.height = 2) ``` You can find more on networks from [@sec-network-learning]. @@ -80,7 +81,7 @@ Below, we present the results using a heatmap visualization. #| fig-width: 8 #| fig-height: 8 -library(ComplexHeatmap) +library(ComplexHeatmap) library(shadowtext) # Function for marking significant correlations with "X" @@ -133,7 +134,7 @@ p ``` ::: {.callout-tip} -## Cross-correlation +## Cross-association -See [@sec-cross-correlation] for further information on correlation analyses. +See [@sec-cross-correlation] for further information on correlation and association analyses. ::: diff --git a/inst/pages/cross_correlation.qmd b/inst/pages/cross_correlation.qmd index fa448c3aa..6f66fdf79 100644 --- a/inst/pages/cross_correlation.qmd +++ b/inst/pages/cross_correlation.qmd @@ -31,24 +31,24 @@ already in `MAE` format which is tailored for multi-assay analyses (see [@sec-containers]). The dataset includes three different experiments: microbial abundance data, metabolite concentrations, and data about different biomarkers. If you like to construct the same data object from the -original files instead, [here]([@sec-loading-experimental-microbiome-data]) +original files instead, [here]([@sec-loading-experimental-microbiome-data]) you can find help for importing data into an SE object. -## Cross-correlation Analysis {#sec-cross-correlation} +## Cross-association Analysis {#sec-cross-correlation} -Cross-correlation analysis is a straightforward approach that can +Cross-association analysis is a straightforward approach that can reveal strength and type of associations between data sets. For instance, we can analyze if higher presence of a specific taxon relates to higher levels of a biomolecule. Correlation analyses within dataset were already discussed in [@sec-correlation]. -```{r cross-correlation1} +```{r cross-association1} # Load the data data(HintikkaXOData, package = "mia") mae <- HintikkaXOData ``` -```{r cross-correlation2} +```{r cross-association2} library(stringr) # Drop those bacteria that do not include information in Phylum or lower levels mae[[1]] <- mae[[1]][!is.na(rowData(mae[[1]])$Phylum), ] @@ -67,17 +67,17 @@ experiments(mae) getWithColData(mae, "microbiota") ``` -```{r cross-correlation3} +```{r cross-association3} # Metabolite data getWithColData(mae, "metabolites") ``` -```{r cross-correlation4} +```{r cross-association4} # Biomarker data getWithColData(mae, "biomarkers") ``` -Next we can perform a cross-correlation analysis. Let us analyze if +Next we can perform a cross-association analysis. Let us analyze if individual bacteria genera are correlated with concentrations of individual metabolites. This helps to answer the following question: "If bacterium X is present, is the concentration of metabolite Y lower or higher"? @@ -93,10 +93,10 @@ rownames(mae[[1]]) <- getTaxonomyLabels(mae[[1]]) # Cross correlates data sets res <- getCrossAssociation( - mae, + mae, experiment1 = 1, experiment2 = 2, - assay.type1 = "log10", + assay.type1 = "log10", assay.type2 = "nmr", method = "spearman", test.signif = TRUE, @@ -112,7 +112,7 @@ Next, we create a heatmap depicting all cross-correlations between bacterial genera and metabolite concentrations. ```{r cross-correlation6, fig.width=10, fig.height=8} -library(ComplexHeatmap) +library(ComplexHeatmap) library(shadowtext) # Function for marking significant correlations with "X" diff --git a/inst/pages/differential_abundance.qmd b/inst/pages/differential_abundance.qmd index 4e42346fe..cb8053ab0 100755 --- a/inst/pages/differential_abundance.qmd +++ b/inst/pages/differential_abundance.qmd @@ -6,7 +6,7 @@ chapterPreamble() ``` Differential Abundance Analysis (DAA) is a method used to identify -differences in the abundances of individual taxa (at any taxonomic level) +differences in the abundances of individual taxa (at any taxonomic level) between two or more groups, such as treatment versus control groups. Here, we demonstrate its implementation on [Tengeler2020](https://microbiome.github.io/mia/reference/Tengeler2020.html) @@ -17,13 +17,13 @@ and gain understanding of a complex system by looking at its isolated components. For example, the identification of a bacterial taxon that is more or less abundant between healthy patients and diseased patients can lead to important -insights into the underlying mechanisms of the disease. In other words, -differentially abundant taxa can be involved in the dynamics of the -disease, which in turn helps understand the system as a whole. Despite its +insights into the underlying mechanisms of the disease. In other words, +differentially abundant taxa can be involved in the dynamics of the +disease, which in turn helps understand the system as a whole. Despite its relevance in current research, the DAA approach has also been subject to debate [@Quinn2021]. -## Statistical challenges of microbiome data +## Statistical challenges of microbiome data As discussed in [@sec-stat-challenges], data display unique properties that are exclusively addressed by DAA tools developed for microbiome analysis. @@ -32,13 +32,13 @@ We recommend to have a look at @Nearing2022. In this study, multiple DAA methods were applied to 38 different datasets and their results were compared to one another. The study highlighted that different methods might -yield varying results +yield varying results due to differing assumptions and normalization techniques. Interestingly, @pelto2024 concluded that elementary DAA methods achieve the highest consistency. Recently, @Yang2022 comprehensively evaluated DAA methods via a -semi-parametric framework and 106 real +semi-parametric framework and 106 real datasets, concluding that different methods can produce contradictory results, creating the risk of cherry-picking the most favorable options for one’s own hypothesis. Therefore, it is highly recommended to perform DAA with @@ -64,7 +64,7 @@ such as **over-dispersed count models** and **zero-inflated mixture models**. DESeq2, edgeR and corncorb are based on over-dispersed count models, whereas metagenomeSeq, RAIDA, ZIBB and Omnibus implement zero-inflated mixture models to address zero-inflation. Typically, these models assume a negative binomial, -beta-binomial or normal/log-normal distribution. +beta-binomial or normal/log-normal distribution. Another approach to deal with zero-inflated data is **zero imputation** where zeross are replaced with estimated values. ALDEx2 and @@ -73,7 +73,7 @@ data, accounting for sampling variability and sequencing depth variation. Other methods, such as MaAsLin2 and ANCOMBC impute the zeros with a pseudo-count strategy. -#### Approaches to handle Compositional Data : +#### Approaches to handle Compositional Data : To address the compositionality of microbiome data, several approaches have been developed to perform **robust normalization** with methods specifically designed @@ -81,18 +81,18 @@ to reduce the bias found in compositional data. Some examples include : - *Trimmed mean of M-values (TMM)* normalization used by edgeR. -- *Relative log expression (RLE)* normalization used by DESeq2. +- *Relative log expression (RLE)* normalization used by DESeq2. - *Cumulative sum scaling (CSS)* normalization used by metagenomeSeq. - *Centered log-ratio transformation (CLR)* normalization used by ALDEx2. -- *Geometric mean of pairwise ratios (GMPR)* normalization used by Omnibus +- *Geometric mean of pairwise ratios (GMPR)* normalization used by Omnibus and Wrench normalization [@Kumar2018], which corrects the compositional bias -by an empirical Bayes approach. +by an empirical Bayes approach. -Other methods to deal with compositional data entail reference taxa approach -used by DACOMP and RAIDA, analyzing the pattern of pairwise log ratios as done +Other methods to deal with compositional data entail reference taxa approach +used by DACOMP and RAIDA, analyzing the pattern of pairwise log ratios as done by ANCOM and bias-correction applied by ANCOMBC. ## Using the tools @@ -127,7 +127,7 @@ table(tse$patient_status, tse$cohort) %>% knitr::kable() ``` -### Preparing the data for DAA +### Preparing the data for DAA Before starting the analysis, it is recommended to reduce the size and complexity of the data to make the results more reproducible. For this @@ -153,7 +153,7 @@ regarded as the method of choice for its consistency, as it normally identifies features that are also found by complementary methods [@Nearing2022]. A more extensive introduction to its functionality is available in the -[ALDEx2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html). +[ALDEx2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html). ALDEx2 estimates technical variation within each sample per taxon by utilizing the Dirichlet distribution. It furthermore applies the CLR transformation (or @@ -171,7 +171,7 @@ library(ALDEx2) # Convert each instance using the centered log-ratio transform. # This is the input for all further analyses. set.seed(123) -x <- aldex.clr(assay(tse), tse$patient_status) +x <- aldex.clr(assay(tse), tse$patient_status) ``` The t-test: @@ -191,7 +191,7 @@ Effect sizes: # of the between group difference and the larger of the variance within groups x_effect <- aldex.effect(x, CI = TRUE, verbose = FALSE) -# combine all outputs +# combine all outputs aldex_out <- data.frame(x_tt, x_effect) ``` @@ -262,19 +262,19 @@ this method provides p-values and confidence intervals for each taxon. It also controls the FDR and it is computationally simple to implement. -Note that the original method was implemented in the `ancombc()` function (see +Note that the original method was implemented in the `ancombc()` function (see [extended tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html)). The method has since then been updated and new features have been added to enable -multi-group comparisons and repeated measurements among other improvements. -We do not cover the more advanced features of ANCOMBC in this tutorial -as these features are documented in detail in this +multi-group comparisons and repeated measurements among other improvements. +We do not cover the more advanced features of ANCOMBC in this tutorial +as these features are documented in detail in this [tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC2.html). -We now proceed with a simple example. First, we specify a formula. In this -formula, other covariates could potentially be included to adjust for -confounding. We show this further below. Again, please make sure to check the +We now proceed with a simple example. First, we specify a formula. In this +formula, other covariates could potentially be included to adjust for +confounding. We show this further below. Again, please make sure to check the [function documentation](https://rdrr.io/github/FrederickHuangLin/ANCOMBC/man/ancombc.html) -as well as the linked tutorials to learn about the additional arguments +as well as the linked tutorials to learn about the additional arguments that we specify. ```{r run-ancombc, warning=FALSE} @@ -305,7 +305,7 @@ errors, p-values and q-values. Below we show the first entries of this dataframe. ```{r ancombc-res} -# store the FDR adjusted results +# store the FDR adjusted results ancombc2_out$res %>% dplyr::select(taxon, lfc_patient_statusControl, q_patient_statusControl) %>% filter(q_patient_statusControl < 0.05) %>% @@ -314,7 +314,7 @@ ancombc2_out$res %>% knitr::kable() ``` -### MaAsLin2 +### MaAsLin2 Let us next illustrate MaAsLin2 [@Mallick2020]. This method is based on generalized linear models and flexible for different study designs @@ -325,8 +325,8 @@ and covariate structures. For details, check their # Load package library(Maaslin2) -# maaslin expects features as columns and samples as rows -# for both the abundance table as well as metadata +# maaslin expects features as columns and samples as rows +# for both the abundance table as well as metadata # We can specify different GLMs/normalizations/transforms. # specifying a ref is especially important if you have more than 2 levels @@ -362,7 +362,7 @@ significant taxa. PhILR is a tree-based method that tests group-wise associations based on balances. A detailed introduction to this method is available in -[this Bioconductor tutorial](https://www.bioconductor.org/packages/devel/bioc/vignettes/philr/inst/doc/philr-intro.html). +[this Bioconductor tutorial](https://www.bioconductor.org/packages/devel/bioc/vignettes/philr/inst/doc/philr-intro.html). ### Comparison of methods @@ -380,7 +380,7 @@ apparent dynamics between the response and the main independent variable. They are common in experimental studies. Generally, they can be classified into three groups: -- Biological confounders, such as age and sex +- Biological confounders, such as age and sex - Technical confounders produced during sample collection, processing and analysis @@ -423,15 +423,15 @@ abundant taxa are associated with one of the variables when the other two are kept constant. ```{r run-adj-ancombc, warning=FALSE} -# perform the analysis +# perform the analysis ancombc2_out <- ancombc2( tse, assay.type = "counts", fix_formula = "patient_status + cohort + library_size", p_adj_method = "fdr", lib_cut = 0, - group = "patient_status", - struc_zero = TRUE, + group = "patient_status", + struc_zero = TRUE, neg_lb = TRUE, alpha = 0.05, # multi-group comparison is deactivated automatically @@ -455,38 +455,38 @@ ancombc2_out$res %>% knitr::kable() ``` -## Further information on tools for DAA : - +## Further information on tools for DAA : + #### LinDA - + LinDA covers linear models for differential abundance analysis of microbiome -compositional data (@Zhou2022). This is very similar to ANCOMBC with few differences: - +compositional data (@Zhou2022). This is very similar to ANCOMBC with few differences: + 1. LinDA corrects for the compositional bias differently using the mode of - all regression coefficients. - - 2. It is faster (100x-1000x than ANCOMBC and according to the authors); - + all regression coefficients. + + 2. It is faster (100x-1000x than ANCOMBC and according to the authors); + 3. It supports hierarchical models. The latest ANCOMBC versions are also supporting hierarchical models. - + Nevertheless, LinDA seems a promising tool that achieves a very good power/fdr trade-off together with ANCOMBC according to the review. The speed improvements might make it critical especially for datasets that have higher sample or feature set sizes. - + #### ZicoSeq - + Subsequently, we demonstrate DAA with ZicoSeq, a method based on linear models and permutation. Further details can be found in this tutorial. This approach has been assessed to exhibit high power and a low false discovery rate, which has the following components: - + 1. Winsorization to decrease the influence of outliers; - + 2. Posterior sampling based on a beta mixture prior to address sampling variability and zero inflation; - + 3. Reference-based multiple-stage normalization to address compositional effects;Additional resources diff --git a/inst/pages/exercises.qmd b/inst/pages/exercises.qmd index 75929cbd3..361c7e1c8 100644 --- a/inst/pages/exercises.qmd +++ b/inst/pages/exercises.qmd @@ -1,11 +1,11 @@ # Exercises {#sec-exercises} -Here you can find assignments on different topics. +Here you can find assignments on different topics. **Tips for exercises:** - + ::: {.callout-tip} ## Tips for exercises @@ -17,7 +17,7 @@ that results can easily be understood by others without advanced knowledge on data analytics. - Avoid hard-coding. Use variables which get values in the beginning of the pipeline. That way it is easier for you to modify parameters and reuse the code. - + ::: ## Basics of R/Bioconductor @@ -73,17 +73,17 @@ previously Rmarkdown) is to combine text with code in R or other programming languages, so that both the analytical pipeline and verbal description can be put in one place. In this exercise, we learn how to write and run code in Quarto. -1. Open RStudio and create a new Quarto file (click on File tab - New File - Quarto document). +1. Open RStudio and create a new Quarto file (click on File tab - New File - Quarto document). 2. Initialize a code chunk by pressing `alt` + `cmd` + `i` and define the variables - `A <- "my name"` and `B <- 0` in it. -3. Write the text `Below is my first code chunk` just above the code chunk. -4. Initialize one more code chunk and add 100 to the variable `B` in it. -5. Write the text `Below I change variable B` just above the second chunk. + `A <- "my name"` and `B <- 0` in it. +3. Write the text `Below is my first code chunk` just above the code chunk. +4. Initialize one more code chunk and add 100 to the variable `B` in it. +5. Write the text `Below I change variable B` just above the second chunk. 6. **Extra**: Write the following line of text: `my name is A and I am B years old`, where A and B are variables defined in the code chunks upstream and change if those variables are modified. Be aware that inline code can be added as - `> r my_inline_command` (without `>`). - + `> r my_inline_command` (without `>`). + Good job. You are now able to combine text and code in a Quarto document. If you got stuck, you can refer to the Quarto docs on [using code chunks](https://quarto.org/docs/visual-editor/technical.html#code-chunks). @@ -97,11 +97,11 @@ syntax `#| option-name: value`, also described [here](https://quarto.org/docs/computations/r.html#chunk-options). In this exercise, we explore part of the knitr potential. -1. Open RStudio and create a new Quarto file. -2. Initialize three code chunks and label them as `setup`, `fig-box` and - `tbl-coldata`, respectively. Remember that the name of a chunk can be - specified with the `label` option. -3. Write the following code in the corresponding chunk and render the document. +1. Open RStudio and create a new Quarto file. +2. Initialize three code chunks and label them as `setup`, `fig-box` and + `tbl-coldata`, respectively. Remember that the name of a chunk can be + specified with the `label` option. +3. Write the following code in the corresponding chunk and render the document. ```{r} #| label: knitr-options-ex1 @@ -134,7 +134,7 @@ knitr::kable(head(colData(tse))) 4. Set `include: false` in the `setup` chunk, `fig-width: 10` in the `fig-box` chunk and `echo: false` in the `tbl-coldata` chunk. Render the document again - and find the differences from before. + and find the differences from before. The setup code and output are no longer visible, resulting in a cleaner document without initialization details. Figures in the fig-box chunk are now @@ -144,14 +144,14 @@ the results rather than the code implementation. 5. Add the options `fig-cap` and `tab-cap` to the `fig-box` and `tbl-coldata` chunks, respectively. They require some text input, which makes for the - caption of the figures or tables. + caption of the figures or tables. 6. **Extra**: Create a cross-reference to `fig-box` and `tbl-coldata` in the text above the respective code chunk. You can do that with the syntax - `@chunk-name`. + `@chunk-name`. 7. **Extra**: Define a custom folder for storing figures with `fig-path`. Insert it in `knitr::opts_chunk$set`, so that it applies globally to all the figures - generated in the document. - + generated in the document. + Congratulations! You are now familiar with the great potential and flexibility of knitr chunk options. An exhaustive list of available options can be found in the [knitr documentation](https://yihui.org/knitr/options/). @@ -166,15 +166,15 @@ execution and figures can be specified. A comprehensive list of yaml options is available [here](https://quarto.org/docs/reference/formats/html.html). In this exercise, we will get a tiny taste of such functionality. -1. Open RStudio and create a new Quarto file. +1. Open RStudio and create a new Quarto file. 2. In the yaml box at the beginning of the document, change the title from - `Untitled` to `My first Quarto`. + `Untitled` to `My first Quarto`. 4. In the same box, add the two lines `author` and `date` followed by your name - and today's date, respectively. -5. Render the document and check its appearance. + and today's date, respectively. +5. Render the document and check its appearance. 6. **Extra**: Set `toc: true` to produce a table of contents. This line should - follow `format` and `html` at the second level of indentation. - + follow `format` and `html` at the second level of indentation. + Well done! Now you are able to specify yaml options and understand how they affect your Quarto document. If you got stuck, you can check [this section](https://quarto.org/docs/tools/rstudio.html#yaml-intelligence) of @@ -186,14 +186,14 @@ An advanced feature of Quarto consists of execution parameters, which are externally pre-defined variables that are also accessible in the Quarto document. They can be specified in the yaml box as `params`. Here we learn how to use them. -1. Open RStudio and create a new Quarto file. +1. Open RStudio and create a new Quarto file. 2. In the yaml box at the beginning of the document, add a line named `params` - followed by an indented line with `gamma: 10` -3. Initialize a code chunk and type `str(params$gamma)` in it. -4. Render the document and check what happened. + followed by an indented line with `gamma: 10` +3. Initialize a code chunk and type `str(params$gamma)` in it. +4. Render the document and check what happened. 5. Define one more parameter `beta: 3` and multiply `gamma` by `beta` in a code - chunk below. -6. Render the document again and check what happened. + chunk below. +6. Render the document again and check what happened. Well done! You can now use an advanced feature of Quarto such as parameters. If you got stuck, @@ -229,8 +229,8 @@ url_to_rowdata <- url("https://github.com/microbiome/data/raw/main/OKeefeDSData/ ``` 2. Read in the csv files with `read.csv` and store them into the variables -`assays`, `rowdata` and `coldata`, respectively. - +`assays`, `rowdata` and `coldata`, respectively. + ```{r} #| label: Data-containers-ex2 #| eval: FALSE @@ -245,7 +245,7 @@ okeefe_rowdata <- read.csv(url_to_rowdata, row.names = 1) 3. Create a TreeSE from the individual components with `TreeSummarizedExperiment`. Note that the function requires three arguments: assays, rowData and colData, to which you can give the appropriate item. - + ```{r} #| label: Data-containers-ex3 #| eval: FALSE @@ -258,7 +258,7 @@ tse <- TreeSummarizedExperiment(assays = list(counts = okeefe_assay), ``` 4. Check that importing is done correctly. E.g., choose random samples and - features, and check that their values equal between raw files and TreeSE. + features, and check that their values equal between raw files and TreeSE. ```{r} #| label: Data-containers-ex4 @@ -284,11 +284,11 @@ functions explained in chapter [@sec-import-from-file]. You can also check the [microbiome data repository](https://github.com/microbiome/data) and read the instructions in its [README](https://github.com/microbiome/data#training-microbiome-datasets) -to import and construct datasets from there. -2. Import data by using mia. (functions: [importMetaPhlAn](https://microbiome.github.io/mia/reference/loadFromMetaphlan.html) | [importMothur](https://microbiome.github.io/mia/reference/importMothur.html) | -[importQIIME2](https://microbiome.github.io/mia/reference/importQIIME2.html) | [convertFromBIOM](https://microbiome.github.io/mia/reference/convertFromBIOM.html) | +to import and construct datasets from there. +2. Import data by using mia. (functions: [importMetaPhlAn](https://microbiome.github.io/mia/reference/loadFromMetaphlan.html) | [importMothur](https://microbiome.github.io/mia/reference/importMothur.html) | +[importQIIME2](https://microbiome.github.io/mia/reference/importQIIME2.html) | [convertFromBIOM](https://microbiome.github.io/mia/reference/convertFromBIOM.html) | [convertFromDADA2](https://microbiome.github.io/mia/reference/convertFromDADA2.html) ...) -3. Try out conversions between TreeSE and phyloseq data containers ([convertFromPhyloseq](https://microbiome.github.io/mia/reference/convertFromPhyloseq.html); [convertToPhyloseq](https://microbiome.github.io/mia/reference/convertToPhyloseq.html)) +3. Try out conversions between TreeSE and phyloseq data containers ([convertFromPhyloseq](https://microbiome.github.io/mia/reference/convertFromPhyloseq.html); [convertToPhyloseq](https://microbiome.github.io/mia/reference/convertToPhyloseq.html)) ### Preliminary exploration @@ -331,7 +331,7 @@ and `ncol`. How many samples and features are present? dim(tse) ``` -4. List sample and features names with `rownames` and `colnames`. +4. List sample and features names with `rownames` and `colnames`. ```{r} #| label: Preliminary-exploration-ex4 @@ -372,7 +372,7 @@ can use `apply` to count unique elements for each column of rowData. 1. Import the mia package, load any of the example data sets mentioned in [@sec-example-data] with `data` and store it into a variable named `tse`. - + ```{r} #| label: Assay-retrieval-ex1 #| eval: FALSE @@ -395,7 +395,7 @@ tse <- GlobalPatterns assayNames(tse) ``` -3. Fetch the list of assays with `assays`. +3. Fetch the list of assays with `assays`. ```{r} #| label: Assay-retrieval-ex3 @@ -407,7 +407,7 @@ assays(tse) ``` 4. Retrieve the first assay of the TreeSE with `assay`, where the second -argument can be either the name or the index of the desired assay. +argument can be either the name or the index of the desired assay. ```{r} #| label: Assay-retrieval-ex4 @@ -422,7 +422,7 @@ Well done! You can now locate and retrieve individual assays of a TreeSE. If you got stuck, you can refer to chapter [@sec-assay-slot] of this book. -### Sample information +### Sample information 1. Import the mia package, load any of the example data sets mentioned in [@sec-example-data] with `data` and store it into a variable named `tse`. @@ -473,7 +473,7 @@ View(colData(tse)) ``` 5. Get the abundances of all features for a specific sample, such as -`ID34`, for an *assay* of your choice. +`ID34`, for an *assay* of your choice. ### Feature information @@ -527,7 +527,7 @@ View(rowData(tse)) 5. Get the abundances for a specific feature, such as `OTU1810`, in all the samples. You can access feature-specific abundances for an _assay_ of your -choice. +choice. If you got stuck, you can look up chapters [@sec-taxonomic-information] and [@sec-fly-tree] on how to pick specific abundances and generate @@ -540,8 +540,8 @@ Extract some of the other TreeSE elements listed in chapter [@sec-containers]. However, such data are not always included. 1. Import the mia package, load any of the example data sets mentioned in -[@sec-example-data] and store it into a variable named `tse`. - +[@sec-example-data] and store it into a variable named `tse`. + ```{r} #| label: Other-elements-ex1 #| eval: FALSE @@ -551,7 +551,7 @@ However, such data are not always included. library(mia) data("GlobalPatterns", package = "mia") tse <- GlobalPatterns -``` +``` 2. Fetch the metadata of the TreeSE. Is there any information available? @@ -562,23 +562,23 @@ tse <- GlobalPatterns #| code-summary: "Show solution" names(metadata(tse)) -``` +``` 3. Access the phylogenetic tree with `rowTree`. How big is it in terms of tips -and nodes. If you like you can visualize it with `ggtree`. - +and nodes. If you like you can visualize it with `ggtree`. + ```{r} #| label: Other-elements-ex3 #| eval: FALSE #| code-fold: true #| code-summary: "Show solution" -rowTree(tse) +rowTree(tse) ``` 4. Check if a sample tree is available with `colTree`, which is suitable for hierarchical or nested study designs. - + ```{r} #| label: Other-elements-ex4 #| eval: FALSE @@ -598,7 +598,7 @@ sequence slot. #| code-summary: "Show solution" colData(tse)$Barcode_full_length -``` +``` ## Data manipulation @@ -608,9 +608,9 @@ In this exercise, we'll go over the basics of subsetting a TreeSE object. Keep in mind that it's good practice to always keep the original data intact. Therefore, we suggest creating a new TreeSE object whenever you subset. -Please have a try at the following exercises. +Please have a try at the following exercises. -1. Subset the TreeSE object to specific samples and features. +1. Subset the TreeSE object to specific samples and features. 1.1. Find some samples and features to subset with. @@ -645,11 +645,11 @@ cols <- tse$SampleType %in% samples tse_subset_by_feature <- tse[rows, cols] unique(rowData(tse_subset_by_feature)$Phylum) -``` +``` ### Library sizes -1. Calculate library sizes +1. Calculate library sizes 2. Subsample / rarify the counts (see: rarefyAssay) Useful functions: nrow, ncol, dim, summary, table, quantile, unique, @@ -671,7 +671,7 @@ data("GlobalPatterns", package = "mia") tse <- GlobalPatterns # First, merge the features by rank and add it as an alternative experiment. -tse_phylum <- agglomerateByRank(tse, "Phylum") +tse_phylum <- agglomerateByRank(tse, "Phylum") # Calculate relative abundance tse_phylum <- transformAssay(tse_phylum, method = "relabundance") @@ -701,7 +701,7 @@ tse_prevalent2 <- subsetByPrevalent( tse, assay.type = "counts", detection = 1, prevalence = 0.1) identical(tse_prevalent1, tse_prevalent2) -``` +``` 3. How many features are left ? @@ -712,9 +712,9 @@ identical(tse_prevalent1, tse_prevalent2) #| code-summary: "Show solution" nrow(tse_prevalent1) -``` +``` -4. Recalculate the most prevalent features based on relative abundance. +4. Recalculate the most prevalent features based on relative abundance. How many are left? ```{r} @@ -730,7 +730,7 @@ tse_prevalent <- subsetByPrevalent( detection = 0.01/100, prevalence = 50/100) nrow(tse_prevalent) -``` +``` 5. Visualize prevalence of most prevalent Phyla using a violin/bean plot. @@ -754,43 +754,43 @@ rowData(tse_phylum)$prevalence <- res # Then plot the row data plotRowData(tse_phylum, "prevalence", colour_by = "Phylum") -``` +``` Useful functions: getPrevalence, getPrevalent, subsetByPrevalent ### Data exploration 1. Summarize sample metadata variables. (How many age groups, how they are -distributed? 0%, 25%, 50%, 75%, and 100% quantiles of library size?) +distributed? 0%, 25%, 50%, 75%, and 100% quantiles of library size?) 2. Create two histograms. Another shows the distribution of absolute counts, -another shows how CLR transformed values are distributed. -3. Visualize how relative abundances are distributed between taxa in samples. +another shows how CLR transformed values are distributed. +3. Visualize how relative abundances are distributed between taxa in samples. Useful functions: nrow, ncol, dim, summary, table, quantile, unique, transformAssay, ggplot, wilcox.test, agglomerateByRank, miaViz::plotAbundance ### Other functions -1. Merge data objects (merge, mergeSEs) -2. Melt the data for visualization purposes (meltSE) +1. Merge data objects (merge, mergeSEs) +2. Melt the data for visualization purposes (meltSE) ### Assay transformation 1. Import the mia package, load any of the example data sets mentioned in -[@sec-example-data] with `data` and store it into a variable named `tse`. +[@sec-example-data] with `data` and store it into a variable named `tse`. 2. Transform the counts assay into relative abundances with `transformAssay` and store it into the TreeSE as an assay named `relabund` -(see chapter [@sec-assay-transform]). +(see chapter [@sec-assay-transform]). 3. Similarly, perform a clr transformation on the counts assay with a -`pseudocount` of 1 and add it to the TreeSE as a new assay. -4. List the available assays by name with `assays`. +`pseudocount` of 1 and add it to the TreeSE as a new assay. +4. List the available assays by name with `assays`. 5. Access the clr assay and select a subset of its first 100 features and 10 -samples. Remember that assays are subsettable with `assay[row_idx, col_idx]`. +samples. Remember that assays are subsettable with `assay[row_idx, col_idx]`. 5. Take the same subset from the TreeSE, and check how this affects the individual transformed assays. TreeSE can also be subsetted with -`tse[row_idx, col_idx]`. +`tse[row_idx, col_idx]`. 6. **Extra**: If the data has phylogenetic tree, perform the phILR -transformation. +transformation. ## Abundance tables @@ -799,17 +799,17 @@ transformation. 1. Import the mia package, load one of the example data sets mentioned in [@sec-example-data] with `data` (you need one with taxonomic information at -Phylum level) and store it into a variable named `tse`. -2. List the available taxonomic ranks in the data with `taxonomyRanks`. +Phylum level) and store it into a variable named `tse`. +2. List the available taxonomic ranks in the data with `taxonomyRanks`. 3. Agglomerate the data to Phylum level with `agglomerateByRank` and the -appropriate value for `Rank`. +appropriate value for `Rank`. 4. Report the dimensions of the TreeSE before and after agglomerating. You can -use `dim` for that. +use `dim` for that. 5. **Extra**: Perform CLR transformation on the data. Does this affect -agglomeration? +agglomeration? 6. **Extra**: List full taxonomic information for a few selected taxa, such as `OTU1` and `OTU1368`. For that you can use `mapTaxonomy` on a specific subset -of the TreeSE. +of the TreeSE. ### Alternative experiments @@ -817,17 +817,17 @@ of the TreeSE. 1. Import the mia package, load one of the example data sets mentioned in [@sec-example-data] with `data` (you need one with taxonomic information) and store it into a variable named `tse`. Can you find the same taxonomyRanks from -rowData? +rowData? 2. Check the taxonomic ranks of the features with `taxonomyRanks`. What is the -deepest taxonomic rank available? +deepest taxonomic rank available? 3. Agglomerate the TreeSE to each taxonomic rank and store the resulting -experiments as altExps. This can be performed automatically with -`agglomerateByRanks`. +experiments as altExps. This can be performed automatically with +`agglomerateByRanks`. 4. Check the names of the generated altExps with `altExpNames` and retrieve a -complete list with `altExps`. +complete list with `altExps`. 5. Retrieve the data agglomerated by genus from the corresponding `altExp`. -As for assays, you can access the desired altExp by name or index. -6. **Extra**: Split the data based on other features with `splitOn`. +As for assays, you can access the desired altExp by name or index. +6. **Extra**: Split the data based on other features with `splitOn`. ### Beta Diversity @@ -939,7 +939,7 @@ diversity of the original experiment with its agglomerated version. You can use `agglomerateByRank` to perform agglomeration and `mean` to calculate the mean values of the respective columns in colData. -### Visualization +### Visualization 1. Import the mia and scater packages, load any of the example data sets mentioned in [@sec-example-data] with `data` and store it into a variable @@ -954,7 +954,7 @@ differ? 5. **Extra**: Make a scatterplot of Shannon diversity on the y axis and Faith's phylogenetic diversity on the x axis with `plotColData`. Colour the points by sample type with the appropriate optional argument. - + ### Correlation 1. Import the mia and scater packages, load any of the example data sets @@ -972,7 +972,7 @@ relate to one another? 6. **Extra**: Compute the library size of the samples by applying `colSums` to the counts assay of the TreeSE, and test the correlation of library size with Shannon diversity or coverage. Which index is more correlated with library size? - + In this example, we inspected the correlation between two related variables, also known as multicollinearity, and checked the correlation to library size, which is part of quality control. However, the correlation between alpha @@ -1058,7 +1058,7 @@ dissimilarity. The distance metric to use can be defined with the optional argument `method`, choosing from the methods in `?vegan::vegdist`. If you don't want to overwrite the reducedDim object made in point 3, set `name` to a name of your choice. Visualize and compare it to the plot from point 4. - + Good job! You are now able to produce and visualize reduced dimensions of a TreeSE. `runMDS` is actually one of several algorithms for PCoA and dimensionality reduction, which you can find in [@sec-other-ord-methods]. @@ -1086,7 +1086,7 @@ Diet with respect to beta diversity? dissimilarity matrix of `relabund_assay` calculated with `vegdist(relabund_assay, "bray")` and `my_groups` is the vector of Diet values obtained from the colData of the TreeSE. - + Well done! You went through testing the effect and significance of Diet on beta diversity. Keep in mind that the formula fed to `adonis2` can take more than one independent variable, so that you can also (and very often should) include @@ -1110,7 +1110,7 @@ abundance assay (`formula = assay ~ Diet` and `assay.type = relabundance`) in terms of Bray-Curtis dissimilarity (`method = "bray"`). 4. Extract the RDA dimensions from the appropriate reducedDim slot with `attr(reducedDim(tse, "RDA"), "rda)` and store it into a variable named `rda`. -5. Extract the effect and significance of Diet on beta diversity with +5. Extract the effect and significance of Diet on beta diversity with attr(reducedDim(tse, "RDA"), "significance"). What do the results tell you about Diet? 6. **Extra**: Check what other parameters are stored in the colData of peerj13075, add them to the formula (`formula = assay ~ Diet + ...`) of `getRDA` @@ -1131,7 +1131,7 @@ walkthrough, which may be simplified in the future. 1. Import the mia package, load any of the example data sets mentioned in [@sec-example-data] with `data` and store it into a variable named `tse`. - + ```{r} #| label: dbRDA-ex1 #| eval: FALSE @@ -1143,8 +1143,8 @@ library(mia) data("GlobalPatterns", package = "mia") # Feel free to use any dataset tse <- GlobalPatterns ``` - -2. Create a new assay with the relative abundance for your dataset and merge the + +2. Create a new assay with the relative abundance for your dataset and merge the features by a rank of your choice. ```{r} @@ -1157,7 +1157,7 @@ tse <- transformAssay(tse, MARGIN = "cols", method="relabundance") ``` 3. Select a group of samples and perform a permutational analysis on the -calculated relative abundance using Bray-Curtis dissimilarities +calculated relative abundance using Bray-Curtis dissimilarities ```{r} #| label: dbRDA-ex3 @@ -1169,7 +1169,7 @@ tse$Group <- tse$SampleType == "Feces" ``` 4. Implement dbRDA on relative abundances and perform another permutational -analysis on the redundancy analysis. +analysis on the redundancy analysis. ```{r} #| label: dbRDA-ex4 @@ -1189,9 +1189,9 @@ rda_info <- attr(reducedDim(tse, "RDA"), "significance") rda_info ``` -rda_info now holds information on the variation differences between -the chosen groups, in our case, the homogeneity test has a p-value of 0.01, -indicating that the groups have different variances, making it so that the +rda_info now holds information on the variation differences between +the chosen groups, in our case, the homogeneity test has a p-value of 0.01, +indicating that the groups have different variances, making it so that the permanova results are compromised and do not correctly explain whether there are significant differences between groups. @@ -1286,7 +1286,7 @@ aldex_out %>% rownames_to_column(var = "Genus") %>% filter(wi.eBH <= 0.05) %>% dplyr::select(Genus, we.eBH, wi.eBH, effect, overlap) -``` +``` 6. **Extra**: If these results appear boring, repeat steps 1 - 5, but use `Gender` or `Age` as the grouping variable. Do we have any better luck with @@ -1357,7 +1357,7 @@ linda_res <- linda_out$output$AgeElderly 5. Filter `linda_res` for features with `reject == TRUE`. How many differentially abundant taxa were found? What are their names and how significant are they in terms of log-fold change and adjusted p-value? - + ```{r} #| label: confounders-ex5 #| eval: FALSE @@ -1424,7 +1424,7 @@ zicoseq_out <- ZicoSeq( feature.dat = as.matrix(assay(tse)), meta.dat = as.data.frame(colData(tse)), grp.name = "Age", - adj.name = c("Diet", "Gender"), + adj.name = c("Diet", "Gender"), feature.dat.type = "count", return.feature.dat = TRUE, perm.no = 999) @@ -1457,7 +1457,7 @@ zicoseq_res %>% 1. Get the abundances for an individual feature (taxonomic group / row). 2. Visualize the abundances per group with boxplot / jitterplot. 3. Is the difference significant (Wilcoxon test)? -4. Is the difference significant (linear model with covariates)? +4. Is the difference significant (linear model with covariates)? 5. How do transformations affect the outcome (log10, clr..)? 6. Get p-values for all features (taxa), for instance with a for loop. 7. Do multiple testing correction. @@ -1472,7 +1472,7 @@ lm.test, transformAssay, p.adjust 1. Install the latest development version of mia from GitHub. 2. Load experimental dataset from mia. 3. Compare abundances of each taxa between groups. First, use Wilcoxon or -Kruskall-Wallis test. Then use some other method dedicated to microbiome data. +Kruskall-Wallis test. Then use some other method dedicated to microbiome data. 4. Summarize findings by plotting a taxa vs samples heatmap. Add column annotation that tells the group of each sample, and row annotation that tells whether the difference of certain taxa was statistically significant. @@ -1566,7 +1566,7 @@ choice library(mia) library(miaViz) library(shadowtext) -library(ComplexHeatmap) +library(ComplexHeatmap) data("Tengeler2020", package = "mia") tse <- Tengeler2020 @@ -1598,11 +1598,11 @@ and count assays. #| eval: false correlations <- getCrossAssociation( - tse, - test.signif=TRUE, - assay.type1 = "log10", + tse, + test.signif=TRUE, + assay.type1 = "log10", assay.type2 = "counts", - method = "spearman", + method = "spearman", p.adj.threshold = NULL, cor.threshold = NULL, # Remove when mia is fixed @@ -1701,7 +1701,7 @@ pca_aitchison_clusters <- kmeans(reducedDim(pca_aitchison), centers = 3)$cluster plotReducedDim(pca_aitchison, colour_by = pca_aitchison_clusters) ``` -Where: +Where: - `reducedDim` is a function that extracts the reduced dimensionality representation of your data (e.g., from a PCA or PCoA object). - `plotReducedDim` is a function that can plot these reduced dimensions and @@ -1799,7 +1799,7 @@ transformation of assays of individual experiments in the MAE. to be transformed with the argument `assay.type`. 4. Apply a CLR transformation to the counts assay of the microbiota experiment. To ensure non-null values in the assay, set `pseudocount` equal to 1. - + You made it! You learnt how to apply different transformations to the assays of individual experiments in a MAE with `transformAssay`, specifying optional arguments based on the used method. @@ -1848,7 +1848,7 @@ variables `sample_data` and `sample_map`, respectively. is an `ExperimentList` (for now include only the microbiota and metabolites TreeSEs), the second is colData and the third is sampleMap. 5. Make sure that the MAE experiments are identical to the original TreeSEs. You -can do that qualitatively by checking their `head` and quantitatively by +can do that qualitatively by checking their `head` and quantitatively by looking at their `dim`. 6. **Extra**: Add the biomarkers TreeSE as a new experiment to the MAE. Note that new experiments can be added to a MAE through simple concatenation @@ -1857,9 +1857,9 @@ with `c(mae, experiment)`. Good job! Now you are aware of how MAEs are built and we can proceed to some analytical exercises. -### Cross-correlation analysis +### cross-association analysis -Now we will perform a cross-correlation analysis between two of the +Now we will perform a cross-association analysis between two of the experiments in the MAE. 1. Import the mia package, load HintikkaXOData with `data` and store it into a @@ -1871,14 +1871,14 @@ which of their assays with `assay.type1` and `assay.type2`. 3. What does the output look like? By default, correlation is measured in terms of Kendall tau coefficients. Repeat point 2, but this time change `method` to Spearman coefficients. -4. Are you able to infer significance from the output? +4. Are you able to infer significance from the output? 5. Visualize results with a heatmap similarly to the example in section [@sec-cross-correlation]. Do you see any significant correlations? Interpret your results. -6. **Extra**: Perform cross-correlation analysis between the remaining +6. **Extra**: Perform cross-association analysis between the remaining experiments (microbiota vs metabolites and metabolites vs biomarkers) and visualize results with heatmaps. - -Great job! You performed a cross-correlation analysis between two experiments of + +Great job! You performed a cross-association analysis between two experiments of a MAE and visualized the results with a heatmap. You are also able to customise the correlation method and significance testing used for the analysis. diff --git a/inst/pages/extra_material.qmd b/inst/pages/extra_material.qmd index b232f0165..1c28d5348 100644 --- a/inst/pages/extra_material.qmd +++ b/inst/pages/extra_material.qmd @@ -1,4 +1,4 @@ -# Extra material {#sec-extras} +# Extra material {#sec-extras} ```{r} knitr::opts_chunk$set(eval=TRUE, warning=FALSE, message=FALSE) @@ -160,7 +160,7 @@ Let's convert the data frame to a matrix and make the list of assays. counts <- as.matrix(phy_otutable) # Convert to a matrix assays <- SimpleList(counts = counts) tse <- TreeSummarizedExperiment( - assays = assays, + assays = assays, colData = phy_sampledata, rowData = phy_taxtable ) @@ -178,7 +178,7 @@ assayNames(tse) Adding the assays as a list might seem inconvenient if you only have one type of OTU table (counts in our example), but let's see why it is actually very -convenient to be able to hold multiple assays in one data object. +convenient to be able to hold multiple assays in one data object. Here we'll show an example of how to add relative abundances and CLR normalized OTU tables to your tse assays. @@ -253,14 +253,14 @@ tse_nomock_bacteria <- tse[tse$Kingdom == "Bacteria", ] phy_nomock_bacteria # We have 19008 taxa (only bacteria) and before 19216 -tse_nomock_bacteria +tse_nomock_bacteria ``` ### Calculating alpha diversity: estimate_richness = estimateDiversity Now we know how data stored in TreeSE can be accessed and the TreeSE data objects created. Let's look at how we can calculate alpha diversity using -mia compared to phyloseq package. +mia compared to phyloseq package. The mia command estimateDiversity will return a TreeSE and the results are stored in colData, unlike the phyloseq command that outputs a data frame with @@ -273,7 +273,7 @@ sample data to keep it safe with the other sample level data. #| label: "alpha_div" # Alpha diversity with phyloseq -df <- estimate_richness(phy, measures = "Shannon") +df <- estimate_richness(phy, measures = "Shannon") head(df) # Inspect # Add Shannon to the sample_data to keep results safe with other sample data @@ -311,7 +311,7 @@ df <- colData(tse) %>% data.frame %>% dplyr::select(matches("shannon")) We can calculate PCoA with Bray-Curtis distances in phyloseq using the ordinate command. The beta diversity calculation in mia outputs a TreeSE -with a new type of data, reduced dimensions or reducedDim. +with a new type of data, reduced dimensions or reducedDim. Here we will use the scater package that runs the PCoA with runMDS. (PCoA and MDS mean the same thing) @@ -413,7 +413,7 @@ df <- Functionality = c( "Access sample data", # Row 1 "Access tax table", # Row 2 - "Access OTU table", + "Access OTU table", "Build data object", "Calculate alpha diversity", "Calculate beta diversity", @@ -472,16 +472,16 @@ kable(df2) Result of amplicon sequencing is a large number of files that include all the sequences that were read from samples. Those sequences need to be matched with taxa. Additionally, we need to know how many times each taxa were found from -each sample. +each sample. -There are several algorithms to do that, and DADA2 is one of the most common. -You can find DADA2 pipeline tutorial, for example, +There are several algorithms to do that, and DADA2 is one of the most common. +You can find DADA2 pipeline tutorial, for example, [here](https://benjjneb.github.io/dada2/tutorial.html). After the DADA2 portion of the tutorial is completed, the data is stored into _phyloseq_ object (Bonus: Handoff to phyloseq). To store the data to _TreeSE_, -follow the example below. +follow the example below. -You can find full workflow script without further explanations and comments from +You can find full workflow script without further explanations and comments from [Rmd file](extra_material/dada2_workflow.Rmd) ```{r dada2_1, include=FALSE} @@ -557,7 +557,7 @@ Logistic-Normal Linear Regression model; see [vignette](https://jsilve24.github.io/fido/articles/introduction-to-fido.html) of package. -The following presents such an exemplary analysis based on the +The following presents such an exemplary analysis based on the data of @Sprockett2020 available through `microbiomeDataSets` package. @@ -646,7 +646,7 @@ Note: Some computational failures could occur (see the arguments `multDirichletBoot` `calcGradHess` could be passed in such case. ```{r} -priors$Y <- Y +priors$Y <- Y posterior <- refit(priors, optim_method="adam", multDirichletBoot=0.5) #calcGradHess=FALSE ``` @@ -679,8 +679,8 @@ Here, we use following packages: - [cobiclust](https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13582) _cobiclust_ is especially developed for microbiome data whereas _biclust_ is -more general method. In this section, we show two different cases and example -solutions to apply biclustering to them. +more general method. In this section, we show two different cases and example +solutions to apply biclustering to them. 1. Taxa vs samples 2. Taxa vs biomolecule/biomarker @@ -705,7 +705,7 @@ data("HintikkaXOData") mae <- HintikkaXOData ``` -Only the most prevalent taxa are included in analysis. +Only the most prevalent taxa are included in analysis. ```{r cobiclust_1} # Subset data in the first experiment @@ -717,7 +717,7 @@ mae[[1]] <- transformAssay(mae[[1]], method = "rclr") ``` _cobiclust_ takes counts table as an input and gives _cobiclust_ object as an -output. It includes clusters for taxa and samples. +output. It includes clusters for taxa and samples. ```{r cobiclust_2} # Do clustering using counts table @@ -791,7 +791,7 @@ p1 + p2 ### Taxa vs biomolecules -Here, we analyze cross-correlation between taxa and metabolites. This +Here, we analyze cross-association between taxa and metabolites. This is a case, where we use _biclust_ method which is suitable for numeric matrices in general. First we pre-process the data. @@ -825,7 +825,7 @@ corr <- getExperimentCrossCorrelation( mode = "matrix", correlation = "spearman") ``` -_biclust_ takes a matrix as an input and returns a _biclust_ object. +_biclust_ takes a matrix as an input and returns a _biclust_ object. ```{r biclust_3} library(biclust) @@ -842,7 +842,7 @@ The object includes cluster information. However compared to _cobiclust_, _biclust_ object includes only information about clusters that were found, not general cluster. -Meaning that if one cluster size of 5 features was found out of 20 features, +Meaning that if one cluster size of 5 features was found out of 20 features, those 15 features do not belong to any cluster. That is why we have to create an additional cluster for features/samples that are not assigned into any cluster. @@ -860,7 +860,7 @@ additional cluster for features/samples that are not assigned into any cluster. # Get data into right format bc_columns <- .manipulate_bc_data(bc_columns, assay, "col") bc_rows <- .manipulate_bc_data(bc_rows, assay, "row") - + return(list(bc_columns = bc_columns, bc_rows = bc_rows)) } @@ -891,7 +891,7 @@ additional cluster for features/samples that are not assigned into any cluster. bc_clusters <- cbind(bc_clusters, vec) } } - + # Adjust row and column names rownames(bc_clusters) <- names colnames(bc_clusters) <- paste0("cluster_", 1:ncol(bc_clusters)) @@ -910,7 +910,7 @@ bicluster_columns <- bcs$bc_columns head(bicluster_rows) ``` -Let's collect information for the scatter plot. +Let's collect information for the scatter plot. ```{r biclust_6} # Function for obtaining sample-wise sum, mean, median, and mean variance @@ -931,13 +931,13 @@ Let's collect information for the scatter plot. mean1 <- colMeans2(assay1, na.rm = T) median1 <- colMedians(assay1, na.rm = T) var1 <- colVars(assay1, na.rm = T) - + sum2 <- colSums2(assay2, na.rm = T) mean2 <- colMeans2(assay2, na.rm = T) median2 <- colMedians(assay2, na.rm = T) var2 <- colVars(assay2, na.rm = T) - - list[[i]] <- data.frame(sample = colnames(tse1), sum1, sum2, mean1, + + list[[i]] <- data.frame(sample = colnames(tse1), sum1, sum2, mean1, mean2, median1, median2, var1, var2) } return(list) diff --git a/inst/pages/import.qmd b/inst/pages/import.qmd index 07b9f9b21..fdfa29740 100644 --- a/inst/pages/import.qmd +++ b/inst/pages/import.qmd @@ -55,7 +55,7 @@ sample_meta_file_path <- system.file( "extdata", "Mapping_file_ADHD_aggregated.csv", package = "OMA") tree_file_path <- system.file( "extdata", "Data_humanization_phylo_aggregation.tre", package = "OMA") -``` +``` Now we can read in the biom file and convert it into a `TreeSE` object. In addition, we retrieve the rank names from the prefixes of the feature names @@ -73,7 +73,7 @@ tse <- importBIOM( # Check tse -``` +``` The `assays` slot includes a list of abundance tables. The imported abundance table is named as "counts". Let us inspect only the first @@ -98,11 +98,11 @@ then replace the original rowData with its updated version. ```{r} # Genus level has additional '\"', so let's delete that also rowdata_modified <- BiocParallel::bplapply( - rowData(tse), - FUN = stringr::str_remove, + rowData(tse), + FUN = stringr::str_remove, pattern = '\"') -# rowdata_modified is a list, so convert this back to DataFrame format. +# rowdata_modified is a list, so convert this back to DataFrame format. # and assign the cleaned data back to the TSE rowData rowData(tse) <- DataFrame(rowdata_modified) @@ -190,13 +190,13 @@ sample IDs; other values should be numeric (abundances). column provides feature IDs, the first row provides column headers; this file usually contains the taxonomic mapping between different taxonomic levels. Ideally, the feature IDs (row names) match one-to-one with -the abundance table row names. +the abundance table row names. - Column data (`coldata.csv`): data table (samples x info); first column provides sample IDs, the first row provides column headers; this file usually contains the sample metadata/phenodata (such as subject age, health etc). Ideally, the sample IDs match one-to-one with -the abundance table column names. +the abundance table column names. After you have set up the CSV files, you can read them in R: @@ -259,7 +259,7 @@ class(samples) # should be data.frame or DataFrame all(rownames(tax) == rownames(counts)) # our dataset class(tax) # should be data.frame or DataFrame -# Counts +# Counts class(counts) # should be a numeric matrix ``` @@ -300,7 +300,7 @@ downstream analyses. #### Constructing MAE -To construct a _MAE_ object, just combine multiple _TreeSE_ data containers. +To construct a _MAE_ object, just combine multiple _TreeSE_ data containers. Here we import metabolite data from the same study. ```{r importingcsv4, message=FALSE} @@ -308,7 +308,7 @@ count_file <- system.file("extdata", "assay_metabolites.csv", package = "OMA") sample_file <- system.file("extdata", "coldata.csv", package = "OMA") # Load files -counts <- read.csv(count_file, row.names=1) +counts <- read.csv(count_file, row.names=1) samples <- read.csv(sample_file, row.names=1) # Create a TreeSE for the metabolite data @@ -428,8 +428,8 @@ tse <- curatedMetagenomicData("Vatanen*", dryrun = FALSE, counts = TRUE) ### Human microbiome compendium -[MicroBioMap](https://seandavi.github.io/MicroBioMap/) dataset includes -over 170k samples of publicly available 16S rRNA amplicon sequencing data, +[MicroBioMap](https://seandavi.github.io/MicroBioMap/) dataset includes +over 170k samples of publicly available 16S rRNA amplicon sequencing data, all processed using the same pipeline and reference database [@microbiomap]. After installing the MicroBioMap package (see the [original website](https://github.com/seandavi/MicroBioMap#microbiome-compendium) @@ -440,7 +440,7 @@ library(MicroBioMap) cpd <- getCompendium() ``` -This returns a `TreeSE` object. Currently, +This returns a `TreeSE` object. Currently, the `rowTree` slot of the `TreeSE` is not populated. After loading the compendium, you will have immediate access to @@ -455,6 +455,6 @@ The current collections provide access to vast microbiome data resources. The output has to be converted into `TreeSE/MAE` separately. - [MGnifyR](https://bioconductor.org/packages/release/bioc/html/MGnifyR.html) provides access to [EBI/MGnify](https://www.ebi.ac.uk/metagenomics/) -- [HoloFoodR](https://bioconductor.org/packages/release/bioc/html/HoloFoodR.html) provides access to [EBI/HoloFood](https://www.holofooddata.org/) -- [qiitr](https://github.com/cran/qiitr) provides access to [QIITA](https://qiita.com/about) -- [qiime2R](https://github.com/microbiome/qiime2R) provides access to [QIIME2](https://docs.qiime2.org/2024.2/) +- [HoloFoodR](https://bioconductor.org/packages/release/bioc/html/HoloFoodR.html) provides access to [EBI/HoloFood](https://www.holofooddata.org/) +- [qiitr](https://github.com/cran/qiitr) provides access to [QIITA](https://qiita.com/about) +- [qiime2R](https://github.com/microbiome/qiime2R) provides access to [QIIME2](https://docs.qiime2.org/2024.2/) diff --git a/inst/pages/integrated_learner.qmd b/inst/pages/integrated_learner.qmd index 06c72db4e..824a7b8b8 100644 --- a/inst/pages/integrated_learner.qmd +++ b/inst/pages/integrated_learner.qmd @@ -29,9 +29,9 @@ from the multi-omics integrated framework (`IntegratedLearner`) proposed by ## Input data -We use the publicly available Inflammatory Bowel Diseases (IBD) data from the -curatedMetagenomicData package [@Lloyd-Price2019], where we aim to -predict IBD disease status based on both taxonomic (species abundances) and +We use the publicly available Inflammatory Bowel Diseases (IBD) data from the +curatedMetagenomicData package [@Lloyd-Price2019], where we aim to +predict IBD disease status based on both taxonomic (species abundances) and functional (pathway abundances) profiles. ```{r load-pkg-data} @@ -85,10 +85,10 @@ rownames(feature_species) <- sub('.*s__', '', rownames(feature_species)) feature_pwys <- as.data.frame(assay(se_pathway)) feature_pwys <- rownames_to_column(feature_pwys, "ID") -feature_pwys <- feature_pwys %>% - filter(!grepl("\\|", ID)) %>% - filter(!ID %in% c('UNMAPPED', 'UNINTEGRATED')) %>% - column_to_rownames('ID') %>% +feature_pwys <- feature_pwys %>% + filter(!grepl("\\|", ID)) %>% + filter(!ID %in% c('UNMAPPED', 'UNINTEGRATED')) %>% + column_to_rownames('ID') %>% as.data.frame() ################################ @@ -148,9 +148,9 @@ library(caret) feature_table_t <- as.data.frame(t(feature_table)) abd_threshold = 0 prev_threshold = 0.1 -nzv <- nearZeroVar(feature_table_t) -features_var_filtered <- feature_table_t[, -nzv] -features_var_filtered <- as.data.frame(features_var_filtered) +nzv <- nearZeroVar(feature_table_t) +features_var_filtered <- feature_table_t[, -nzv] +features_var_filtered <- as.data.frame(features_var_filtered) features_filtered <- features_var_filtered[ , colSums(features_var_filtered > abd_threshold) > nrow(features_var_filtered) * prev_threshold @@ -199,7 +199,7 @@ subjectCvFoldsIN <- createFolds(1:length(subjectID), k = 5, returnTrain=TRUE) # Curate subject-level samples per fold # ######################################### -obsIndexIn <- vector("list", 5) +obsIndexIn <- vector("list", 5) for(k in 1:length(obsIndexIn)){ x <- which(!sample_metadata$subjectID %in% subjectID[subjectCvFoldsIN[[k]]]) obsIndexIn[[k]] <- x @@ -221,10 +221,10 @@ name_layers <- with( droplevels(feature_metadata), list(levels = levels(featureType)), nlevels = nlevels(featureType))$levels SL_fit_predictions <- vector("list", length(name_layers)) -SL_fit_layers <- vector("list", length(name_layers)) +SL_fit_layers <- vector("list", length(name_layers)) names(SL_fit_layers) <- name_layers names(SL_fit_predictions) <- name_layers -X_train_layers <- vector("list", length(name_layers)) +X_train_layers <- vector("list", length(name_layers)) names(X_train_layers) <- name_layers layer_wise_predictions_train <- vector("list", length(name_layers)) names(layer_wise_predictions_train) <- name_layers @@ -244,39 +244,39 @@ utilizing the SuperLearner R package. ######################################################## for (i in seq_along(name_layers)){ - + # Prepare single-omic input data - include_list <- feature_metadata %>% filter(featureType == name_layers[i]) + include_list <- feature_metadata %>% filter(featureType == name_layers[i]) t_dat_slice <- feature_table[ rownames(feature_table) %in% include_list$featureID, ] dat_slice <- as.data.frame(t(t_dat_slice)) Y = sample_metadata$Y X = dat_slice X_train_layers[[i]] <- X - + # Run user-specified base learner library(SuperLearner) - + SL_fit_layers[[i]] <- SuperLearner( - Y = Y, + Y = Y, X = X, - cvControl = cvControl, + cvControl = cvControl, SL.library = c('SL.randomForest'), family = binomial()) - + # Append the corresponding y and X to the results SL_fit_layers[[i]]$Y <- sample_metadata['Y'] SL_fit_layers[[i]]$X <- X - + # Remove redundant data frames and collect pre-stack predictions rm(t_dat_slice) rm(dat_slice) rm(X) SL_fit_predictions[[i]] <- SL_fit_layers[[i]]$Z - + # Re-fit to entire dataset for final predictions layer_wise_predictions_train[[i]] <- SL_fit_layers[[i]]$SL.predict - + } ``` @@ -308,9 +308,9 @@ names(combo_final) <- name_layers # Run user-specified meta learner SL_fit_stacked <- SuperLearner( - Y = Y, - X = combo, - cvControl = cvControl, + Y = Y, + X = combo, + cvControl = cvControl, verbose = TRUE, SL.library = 'SL.glmnet', family = binomial()) @@ -326,8 +326,8 @@ SL_fit_stacked$X <- combo ``` -In contrast to late fusion, in early fusion, we will simply supply a combined -representation of the data and we will use the `random forest` method for +In contrast to late fusion, in early fusion, we will simply supply a combined +representation of the data and we will use the `random forest` method for building an integrated prediction model. ```{r early-fusion} @@ -340,9 +340,9 @@ fulldat<-as.data.frame(t(feature_table)) # Early Fusion using Random Forest SL_fit_concat <- SuperLearner( - Y = Y, + Y = Y, X = fulldat, - cvControl = cvControl, + cvControl = cvControl, SL.library = 'SL.randomForest', family = binomial()) @@ -355,16 +355,16 @@ SL_fit_concat$X <- fulldat ``` -Finally, we consider the intermediate fusion approach, which combines ideas -from both late fusion and early fusion by integrating the usual squared-error -loss of predictions with an "agreement" (fusion) penalty ($\rho$) so that the +Finally, we consider the intermediate fusion approach, which combines ideas +from both late fusion and early fusion by integrating the usual squared-error +loss of predictions with an "agreement" (fusion) penalty ($\rho$) so that the predictions from different data views agree. The intermediate fusion adjusts the degree of fusion in an adaptive manner, where the test set prediction error is estimated with a cross-validation method. By varying the weight of the fusion penalty - hyperparameter $\rho$, we obtain the early and late fusion approaches as special cases. If $\rho = 0$, we have a -simple form of early fusion; if $\rho = 1$, we obtain a simple form of late +simple form of early fusion; if $\rho = 1$, we obtain a simple form of late fusion. For $0 < \rho < 1$, we obtain the intermediate fusion. Here, as examples, we will prepare the input data, fit the multiview model, and @@ -388,8 +388,14 @@ name_layers <- with( dataList <- vector("list", length = length(name_layers)) names(dataList) <- name_layers table(feature_metadata$featureType) -dataList[[1]] <- t(feature_table[124:nrow(feature_table), ]) -dataList[[2]] <- t(feature_table[1:123, ]) + +# Select indices of Pathways and Species rows +pathways_ind <- which(feature_metadata$featureType == "Pathways") +species_ind <- which(feature_metadata$featureType == "Species") + +# Create two data lists for pathways and species +dataList[[1]] <- t(feature_table[pathways_ind, ]) +dataList[[2]] <- t(feature_table[species_ind, ]) # Extract y and X's dataList <- lapply(dataList, as.matrix) @@ -403,15 +409,15 @@ set.seed(1234) library(multiview) library(glmnet) -cvfit <- cv.multiview(dataList, Y, family = binomial(), alpha = 0.5) +cvfit <- cv.multiview(dataList, Y, family = binomial(), alpha = 0.5) DD <- as.data.frame(as.matrix(coef_ordered(cvfit, s="lambda.min", alpha = 0.5))) DD$standardized_coef <- as.numeric(DD$standardized_coef) DD$coef <- as.numeric(DD$coef) ``` -In this section, we visualize the prediction performance by gathering all -predictions and extracting the data necessary for ROC plotting, corresponding +In this section, we visualize the prediction performance by gathering all +predictions and extracting the data necessary for ROC plotting, corresponding to each integration paradigm. ```{r visualization1} @@ -427,36 +433,36 @@ yhat.train <- cbind( colnames(yhat.train) <- c( colnames(combo), "Late Fusion", "Early Fusion", "Intermediate Fusion") -# Extract ROC plot data +# Extract ROC plot data list.ROC <- vector("list", length = ncol(yhat.train)) names(list.ROC) <- colnames(yhat.train) - -# Loop over layers + +# Loop over layers library(ROCR) for(k in 1:length(list.ROC)){ preds <- yhat.train[ ,k] pred <- prediction(preds, Y) AUC <- round(performance(pred, "auc")@y.values[[1]], 2) - perf <- performance(pred, "sens", "spec") + perf <- performance(pred, "sens", "spec") list.ROC[[k]] <- data.frame( sensitivity = slot(perf, "y.values")[[1]], specificity = 1 - slot(perf, "x.values")[[1]], AUC = AUC, layer = names(list.ROC)[k]) } - + # Combine ROC_table <- do.call('rbind', list.ROC) ``` -Based on the ROC plot described below, we observe that the AUC is 0.65 when +Based on the ROC plot described below, we observe that the AUC is 0.65 when considering only the pathway abundance data in the model, and 0.64 for the model including only the species abundance data. The AUC increases to 0.68 when using -the early fusion model and reaches 1 with the late fusion model. In contrast, +the early fusion model and reaches 1 with the late fusion model. In contrast, the intermediate fusion model achieves an AUC of 0.99. Overall, most integrated -classifiers outperform individual layers in distinguishing between IBD and +classifiers outperform individual layers in distinguishing between IBD and non-IBD controls. ```{r visualization2} @@ -466,24 +472,24 @@ plot_data <- ROC_table plot_data$displayItem <- paste(plot_data$layer, " AUC = ", plot_data$AUC, sep="") plot_data$displayItem <- factor( plot_data$displayItem, levels = unique(plot_data$displayItem)) - + # ROC curves p <- ggplot( plot_data, aes(x = specificity, y = sensitivity, - group = displayItem)) + + group = displayItem)) + geom_line(aes(x = specificity,y = sensitivity,color = displayItem)) + theme( - legend.position = "bottom", + legend.position = "bottom", legend.background=element_blank(), - legend.box.background=element_rect(colour = "black")) + + legend.box.background=element_rect(colour = "black")) + theme_bw() + xlab("False Positive Rate") + ylab("True Positive Rate") + theme(legend.position = "right", legend.direction = "vertical") + - labs(color = '') - + labs(color = '') + # Print print(p) @@ -492,9 +498,9 @@ print(p) Finally, we visualize the results for the top 20 features for each layer based on the intermediate fusion. -```{r visualization3, fig.width = 6, fig.height = 3} -#### Change the pathway -# Only plot tp to 20 per layer +```{r visualization3, fig.width = 6, fig.height = 7} +#### Change the pathway +# Only plot 20 features per layer DD <- DD %>% group_by(view) %>% top_n(n = 20, wt = abs(standardized_coef)) @@ -504,16 +510,18 @@ DD$view_col <- gsub(":.*", "", DD$view_col) # Visualization library(forcats) -p <- DD %>% - mutate(view_col = fct_reorder(view_col, standardized_coef)) %>% - ggplot(aes(x = view_col, y = standardized_coef, fill = view, width = 0.75)) + +p <- DD %>% + mutate(view_col = fct_reorder(view_col, standardized_coef)) %>% + ggplot(aes(x = view_col, y = standardized_coef, fill = view, width = 0.75)) + geom_bar(stat = "identity", show.legend = FALSE, width = 1) + coord_flip() + facet_wrap(~ view, scales = 'free_y', nrow = 2) + ylab('Standardized LFC') + xlab('') + - ggtitle('IBD-associated multi-omics features') + + #labs(title = "IBD-associated\nmulti-omics features") + + ggtitle('IBD-associated\nmulti-omics features') + theme_bw() + theme( + plot.title = element_text(hjust = 0.5, vjust = 0.5), strip.background = element_blank(), panel.grid.major = element_line(colour = "grey80"), panel.border = element_blank(), @@ -521,7 +529,6 @@ p <- DD %>% panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank()) p - ``` Based on the bar plot, we observe that \emph{Proteobacteria_bacterium_CAG.139} diff --git a/inst/pages/intro.qmd b/inst/pages/intro.qmd index 5a9ddf5e1..022ef9088 100644 --- a/inst/pages/intro.qmd +++ b/inst/pages/intro.qmd @@ -49,8 +49,8 @@ on how to handle data containers from the SummarizedExperiment family. The Bioconductor microbiome data science framework consists of: - **Data containers**, designed to organize multi-assay microbiome data -- **R/Bioconductor packages** that provide dedicated methods -- **Community** of users and developers +- **R/Bioconductor packages** that provide dedicated methods +- **Community** of users and developers ![Data containers are central in Bioconductor.](figures/ecosystem.png){width="50%"} diff --git a/inst/pages/introductory_workflow.qmd b/inst/pages/introductory_workflow.qmd index cfdcfa5db..d050c5fe0 100644 --- a/inst/pages/introductory_workflow.qmd +++ b/inst/pages/introductory_workflow.qmd @@ -101,7 +101,7 @@ You can access **assays** [@sec-assay-slot] as such: ```{r} #| label: showAssay -assay(tse)[1:5,1:10] +assay(tse)[1:5,1:10] ``` Whereas **colData** [@sec-add-or-modify-data] is accessible through: @@ -198,9 +198,9 @@ All these variations are referred to as alpha diversity. # Estimate (observed) richness tse_alpha <- addAlpha( - tse, - assay.type = "counts", - index = "observed", + tse, + assay.type = "counts", + index = "observed", name="observed") # Check some of the first values in colData @@ -215,12 +215,12 @@ graph to visualize this. ```{r} #| label: plotColdata plotColData( - tse_alpha, - "observed", - "cohort", + tse_alpha, + "observed", + "cohort", colour_by = "patient_status") + - - theme(axis.text.x = element_text(angle=45,hjust=1)) + + ggtitle("Richness plot of three patient cohorts") + + theme(axis.text.x = element_text(angle=45,hjust=1)) + labs(y=expression(Richness[Observed])) ``` @@ -240,9 +240,9 @@ First, we can easily calculate this measure and add them to our TreeSE. ```{r} #| label: calculateDiversity tse_alpha <- addAlpha( - tse_alpha, + tse_alpha, assay.type = "counts", - index = c("shannon"), + index = c("shannon"), name = c("shannon")) ``` @@ -261,16 +261,26 @@ plots <- lapply( colour_by = "patient_status") # Fine-tune visual appearance -plots <- lapply(plots, "+", +plots <- lapply(plots, "+", theme( axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank())) -# Plot the figures -(plots[[1]] | plots[[2]]) + - plot_layout(guides = "collect") +# Add titles to individual plots +plots[[1]] <- plots[[1]] + + ggtitle("Observed richness") +plots[[2]] <- plots[[2]] + + ggtitle("Shannon diversity") + +# Plot the figures +patchwork <- (plots[[1]] | plots[[2]]) +patchwork + + plot_layout(guides = "collect") + + plot_annotation( + title = "Comparison of observed richness estimate and Shannon diversity", +) ``` It's very important to make all these comparisons in order to quantify @@ -295,7 +305,7 @@ $$ Luckily for us, the [**mia package**](https://microbiome.github.io/mia/) provides an easy way to calculate the relative abundance for our TreeSE -using the `transformAssay` method. +using the `transformAssay()` method. ```{r} #| label: calculateRelabundance @@ -331,8 +341,8 @@ In our case, the assay contains 151 rows and 27 columns. Having so many columns and thus dimensions can be troublesome for visualizing dissimilarity. In order to visualize the dissimilarity between the different samples, we can -conduct a Principal Coordinate analysis on the newly created assay. This -essentially projects the Bray-curtis dimensions onto a lower space whilst +conduct a Principal Coordinate Analysis on the newly created assay. This +essentially projects the Bray-Curtis dimensions onto a lower space whilst maintaining as much of the variation as possible, the projected values are called principal coordinates. You can read more on @Multidimensional-scaling here. @@ -343,8 +353,9 @@ we can use Bioconductor's scater package and the vegan package, made by ```{r} #| label: showPCoA -# Create ggplot object -p <- plotReducedDim(tse, "MDS_bray", colour_by = "cohort") +# Create a ggplot object +p <- plotReducedDim(tse, "MDS_bray", colour_by = "cohort") + + ggtitle("Bray-Curtis dissimilarity using multidimensional scaling") # Convert to an interactive plot with ggplotly ggplotly(p) @@ -363,7 +374,8 @@ rel_eig <- e / sum(e[e > 0]) # Add explained variance for each axis on the plot p <- p + labs( x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""), - y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = "")) + y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = "")) + + ggtitle("Bray-Curtis dissimilarity using multidimensional scaling") # Reonvert to an interactive plot with ggplotly ggplotly(p) diff --git a/inst/pages/introductory_workflow_dutch_version.qmd b/inst/pages/introductory_workflow_dutch_version.qmd index b8b231b95..baca33741 100644 --- a/inst/pages/introductory_workflow_dutch_version.qmd +++ b/inst/pages/introductory_workflow_dutch_version.qmd @@ -113,7 +113,7 @@ zal er meer uitleg worden gegeven over het maken van subsets. ```{r} #| label: assay -assay(tse)[1:5,1:10] +assay(tse)[1:5,1:10] ``` ### colData @@ -242,9 +242,9 @@ worden ook wel *Alfabetische diversiteit (alpha-diversiteit)* genoemd. #| label: rijkdom # Schatting (waargenomen) rijkdom tse_alpha <- addAlpha( - tse, - assay.type = "counts", - index = "observed", + tse, + assay.type = "counts", + index = "observed", name="observed") # Controleer enkele van de eerste waarden in colData @@ -259,12 +259,12 @@ kunnen dan vervolgens een figuur tekenen om dit te visualiseren. ```{r} #| label: rijkdom-grafiek plotColData( - tse_alpha, - "observed", - "cohort", + tse_alpha, + "observed", + "cohort", colour_by = "patient_status") + - - theme(axis.text.x = element_text(angle=45,hjust=1)) + + + theme(axis.text.x = element_text(angle=45,hjust=1)) + labs(y=expression(Richness[Observed])) ``` @@ -288,9 +288,9 @@ Eerst kunnen we deze maatstaf eenvoudig berekenen en toevoegen aan onze TSE. ```{r} #| label: diversiteit tse_alpha <- addAlpha( - tse_alpha, + tse_alpha, assay.type = "counts", - index = c("shannon"), + index = c("shannon"), name = c("shannon")) ``` @@ -309,7 +309,7 @@ plots <- lapply( colour_by = "patient_status") # Weergave verbeteren -plots <- lapply(plots, "+", +plots <- lapply(plots, "+", theme( axis.text.x = element_blank(), axis.title.x = element_blank(), diff --git a/inst/pages/introductory_workflow_french_version.qmd b/inst/pages/introductory_workflow_french_version.qmd index 949b7817b..76e2dbcc3 100644 --- a/inst/pages/introductory_workflow_french_version.qmd +++ b/inst/pages/introductory_workflow_french_version.qmd @@ -106,7 +106,7 @@ manière : ```{r} #| label: showAssay -assay(tse)[1:5,1:10] +assay(tse)[1:5,1:10] ``` Alors que **colData** ([@sec-add-or-modify-data]) est accessible via : @@ -207,9 +207,9 @@ Toutes ces variations sont appelées diversité alpha. # Estimate (observed) richness tse_alpha <- addAlpha( - tse, - assay.type = "counts", - index = "observed", + tse, + assay.type = "counts", + index = "observed", name="observed") # Check some of the first values in colData @@ -225,12 +225,12 @@ visualiser cela. ```{r} #| label: plotColdata plotColData( - tse_alpha, - "observed", - "cohort", + tse_alpha, + "observed", + "cohort", colour_by = "patient_status") + - - theme(axis.text.x = element_text(angle=45,hjust=1)) + + + theme(axis.text.x = element_text(angle=45,hjust=1)) + labs(y=expression(Richness[Observed])) ``` @@ -251,9 +251,9 @@ notre TreeSE. ```{r} #| label: calculateDiversity tse_alpha <- addAlpha( - tse_alpha, + tse_alpha, assay.type = "counts", - index = c("shannon"), + index = c("shannon"), name = c("shannon")) ``` @@ -272,7 +272,7 @@ plots <- lapply( colour_by = "patient_status") # Fine-tune visual appearance -plots <- lapply(plots, "+", +plots <- lapply(plots, "+", theme( axis.text.x = element_blank(), axis.title.x = element_blank(), diff --git a/inst/pages/machine_learning.qmd b/inst/pages/machine_learning.qmd index 6530efb31..3d8ff30ee 100644 --- a/inst/pages/machine_learning.qmd +++ b/inst/pages/machine_learning.qmd @@ -128,7 +128,7 @@ library(mikropml) df <- assay df[["diagnosis"]] <- labels -# Run random forest +# Run random forest results <- run_ml( df, "rf", outcome_colname = "diagnosis", kfold = 2, cv_times = 5, training_frac = 0.8) @@ -146,7 +146,7 @@ In this second model, we try to improve the prediction accuracy. We address the imbalanced data by applying class weights. Moreover, we utilize a more complex model, XGBoost, which is another commonly used algorithm in bioinformatics, usually one of the best performing models -for tabular data [@SHWARTZZIV202284]. We use +for tabular data [@SHWARTZZIV202284]. We use [`caret`](https://cran.r-project.org/web/packages/caret/index.html) package which offers a R framework for machine learning. @@ -166,7 +166,7 @@ class_weights <- weights[ match(labels, names(weights)) ] # Specify train control train_control <- trainControl( method = "cv", number = 5, - classProbs = TRUE, + classProbs = TRUE, savePredictions = "final", allowParallel = TRUE, summaryFunction = twoClassSummary @@ -186,7 +186,7 @@ tune_grid <- expand.grid( # Train the model model <- train( - x = assay, + x = assay, y = labels, preProcess = c("center", "scale"), method = "xgbTree", diff --git a/inst/pages/miaverse.qmd b/inst/pages/miaverse.qmd index 86ab3e927..343d94e67 100644 --- a/inst/pages/miaverse.qmd +++ b/inst/pages/miaverse.qmd @@ -127,13 +127,13 @@ The following DA methods support `(Tree)SummarizedExperiment`. - [philr](http://bioconductor.org/packages/devel/bioc/html/philr.html) (@Silverman2017) phylogeny-aware phILR transformation - [MicrobiotaProcess](https://bioconductor.org/packages/release/bioc/html/MicrobiotaProcess.html) for the "tidy" analysis of microbiome and other ecological data - [Tools for Microbiome Analysis](https://microsud.github.io/Tools-Microbiome-Analysis/) -site listed over 130 R packages for microbiome data science in 2023. +site listed over 130 R packages for microbiome data science in 2023. Many of these are not in Bioconductor, or do not directly support the data containers used in this book but can be often used with minor modifications. The phyloseq-based tools can be used by converting the TreeSE data into phyloseq with `convertToPhyloseq`. -### Open microbiome data +### Open microbiome data Hundreds of published microbiome datasets are readily available in these data containers (see [@sec-example-data]). @@ -157,7 +157,7 @@ script. ```{r} #| label: all_packages #| eval: FALSE -#| +#| # URL of the raw CSV file on GitHub. It includes all packages needed. url <- "https://raw.githubusercontent.com/microbiome/OMA/devel/oma_packages/oma_packages.csv" @@ -241,6 +241,6 @@ and [@sec-support]. - TreeSummarizedExperiment is derived from SummarizedExperiment class. - miaverse is based on TreeSummarizedExperiment data container. - We can borrow methods from packages utilizing SingleCellExperiment and -SummarizedExperiment. +SummarizedExperiment. ::: diff --git a/inst/pages/mmuphin_meta_analysis.qmd b/inst/pages/mmuphin_meta_analysis.qmd index 2c84bb13a..c7642cb62 100644 --- a/inst/pages/mmuphin_meta_analysis.qmd +++ b/inst/pages/mmuphin_meta_analysis.qmd @@ -99,13 +99,13 @@ library(dplyr) se_relative <- sampleMetadata |> filter(study_name %in% c( - "FrankelAE_2017", "GopalakrishnanV_2018", "LeeKA_2022", + "FrankelAE_2017", "GopalakrishnanV_2018", "LeeKA_2022", "MatsonV_2018", "PetersBA_2019", "WindTT_2020")) |> returnSamples("relative_abundance", rownames = "short") se_pathway <- sampleMetadata |> filter(study_name %in% c( - "FrankelAE_2017", "GopalakrishnanV_2018", "LeeKA_2022", + "FrankelAE_2017", "GopalakrishnanV_2018", "LeeKA_2022", "MatsonV_2018", "PetersBA_2019", "WindTT_2020")) |> returnSamples("pathway_abundance", rownames = "short") @@ -232,7 +232,7 @@ pcoa_before <- cmdscale(D_before, eig = TRUE) ord_before <- as.data.frame(pcoa_before$points) percent_var_before <- round(pcoa_before$eig / sum(pcoa_before$eig) * 100, 1)[1:2] before_labels <- c( - paste('Axis 1 (', percent_var_before[1], '%)', sep = ''), + paste('Axis 1 (', percent_var_before[1], '%)', sep = ''), paste('Axis 2 (', percent_var_before[2], '%)', sep = '')) before_phrase <- paste('Before (PERMANOVA R2 = ', R2_before, '%)', sep = '') @@ -244,7 +244,7 @@ R2_after <- round(fit_adonis_after$aov.tab[1, 5]*100, 1) pcoa_after <- cmdscale(D_after, eig = TRUE) ord_after <- as.data.frame(pcoa_after$points) percent_var_after <- round(pcoa_after$eig / sum(pcoa_after$eig) * 100, 1)[1:2] -after_labels <- c(paste('Axis 1 (', percent_var_after[1], '%)', sep = ''), +after_labels <- c(paste('Axis 1 (', percent_var_after[1], '%)', sep = ''), paste('Axis 2 (', percent_var_after[2], '%)', sep = '')) after_phrase <- paste('After (PERMANOVA R2 = ', R2_after, '%)', sep = '') @@ -252,22 +252,22 @@ colnames(ord_after) <- c('PC1', 'PC2') ord_after$Study <- data_meta$study_name # Ordination Plot -p_before <- ggplot(ord_before, aes(x = PC1, y = PC2, color = Study)) + - geom_point(size = 4) + - theme_bw() + - xlab(before_labels[1]) + - ylab(before_labels[2]) + - ggtitle(before_phrase) + - theme(plot.title = element_text(hjust = 0.5)) + +p_before <- ggplot(ord_before, aes(x = PC1, y = PC2, color = Study)) + + geom_point(size = 4) + + theme_bw() + + xlab(before_labels[1]) + + ylab(before_labels[2]) + + ggtitle(before_phrase) + + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position ="none") -p_after <- ggplot(ord_after, aes(x = PC1, y = PC2, color = Study)) + - geom_point(size = 4) + - theme_bw() + - xlab(after_labels[1]) + - ylab(after_labels[2]) + - ggtitle(after_phrase) + - theme(plot.title = element_text(hjust = 0.5)) + +p_after <- ggplot(ord_after, aes(x = PC1, y = PC2, color = Study)) + + geom_point(size = 4) + + theme_bw() + + xlab(after_labels[1]) + + ylab(after_labels[2]) + + ggtitle(after_phrase) + + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position ="none") p_before + p_after @@ -295,7 +295,7 @@ for (i in 1:length(transform)) { exposure = "resvar", data = data_meta, control = list(analysis_method = 'LM', transform = transform[i])) - + num[i] <- sum(fit_lm_meta$meta_fits$qval.fdr < 0.05, na.rm = TRUE) } @@ -320,7 +320,7 @@ meta_fits <- fit_lm_meta$meta_fits meta_fits %>% filter(qval.fdr < 0.05) %>% - arrange(coef) %>% + arrange(coef) %>% mutate(feature = factor(feature, levels = feature)) %>% ggplot(aes(y = coef, x = feature)) + geom_bar(stat = "identity") + @@ -423,7 +423,7 @@ pcoa_before <- cmdscale(D_before, eig = TRUE) ord_before <- as.data.frame(pcoa_before$points) percent_var_before <- round(pcoa_before$eig / sum(pcoa_before$eig) * 100, 1)[1:2] before_labels <- c( - paste('Axis 1 (', percent_var_before[1], '%)', sep = ''), + paste('Axis 1 (', percent_var_before[1], '%)', sep = ''), paste('Axis 2 (', percent_var_before[2], '%)', sep = '')) before_phrase <- paste('Before (PERMANOVA R2 = ', R2_before, '%)', sep = '') @@ -436,7 +436,7 @@ pcoa_after <- cmdscale(D_after, eig = TRUE) ord_after <- as.data.frame(pcoa_after$points) percent_var_after <- round(pcoa_after$eig / sum(pcoa_after$eig) * 100, 1)[1:2] after_labels <- c( - paste('Axis 1 (', percent_var_after[1], '%)', sep = ''), + paste('Axis 1 (', percent_var_after[1], '%)', sep = ''), paste('Axis 2 (', percent_var_after[2], '%)', sep = '')) after_phrase <- paste('After (PERMANOVA R2 = ', R2_after, '%)', sep = '') @@ -444,22 +444,22 @@ colnames(ord_after) <- c('PC1', 'PC2') ord_after$Study <- data_meta$study_name # Ordination Plot -p_before <- ggplot(ord_before, aes(x = PC1, y = PC2, color = Study)) + - geom_point(size = 4) + - theme_bw() + - xlab(before_labels[1]) + - ylab(before_labels[2]) + - ggtitle(before_phrase) + - theme(plot.title = element_text(hjust = 0.5)) + +p_before <- ggplot(ord_before, aes(x = PC1, y = PC2, color = Study)) + + geom_point(size = 4) + + theme_bw() + + xlab(before_labels[1]) + + ylab(before_labels[2]) + + ggtitle(before_phrase) + + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position ="none") -p_after <- ggplot(ord_after, aes(x = PC1, y = PC2, color = Study)) + - geom_point(size = 4) + - theme_bw() + - xlab(after_labels[1]) + - ylab(after_labels[2]) + - ggtitle(after_phrase) + - theme(plot.title = element_text(hjust = 0.5)) + +p_after <- ggplot(ord_after, aes(x = PC1, y = PC2, color = Study)) + + geom_point(size = 4) + + theme_bw() + + xlab(after_labels[1]) + + ylab(after_labels[2]) + + ggtitle(after_phrase) + + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position ="none") p <- p_before + p_after @@ -482,7 +482,7 @@ for (i in 1:length(transform)) { exposure = "resvar", data = data_meta, control = list(analysis_method = 'LM', transform = transform[i])) - + num[i] <- sum(fit_lm_meta$meta_fits$pval < 0.05, na.rm = TRUE) } @@ -535,7 +535,7 @@ meta_fits2 <- fit_lm_meta2$meta_fits # Create the figure meta_fits2 %>% filter(pval < 0.05) %>% - arrange(coef) %>% + arrange(coef) %>% mutate(feature = factor(feature, levels = feature)) %>% ggplot(aes(y = coef, x = feature)) + theme(axis.text.y = element_text(size = 10)) + diff --git a/inst/pages/msea.qmd b/inst/pages/msea.qmd index 11136c00c..75ccc6209 100644 --- a/inst/pages/msea.qmd +++ b/inst/pages/msea.qmd @@ -13,7 +13,7 @@ detect the modest but coordinated changes in pre-specified sets of related microbial features. Such a set might include all the microbes in a specific pathway or microbial genes that have been shown to be co-regulated based on previously published studies. Like GSEA, MSEA aggregates the per-feature -statistics across microbes within a microbe set. This corresponds to the +statistics across microbes within a microbe set. This corresponds to the hypothesis that many relevant phenotype differences are manifested by small but consistent changes in a set of features. @@ -54,9 +54,9 @@ se_relative <- sampleMetadata |> # Create sample metadata # ########################## -sample_metadata <- - colData(se_relative) %>% - as.data.frame() %>% filter(visit_number == 1) %>% +sample_metadata <- + colData(se_relative) %>% + as.data.frame() %>% filter(visit_number == 1) %>% .[, c("age", "disease", "antibiotics_current_use")] ################# @@ -95,17 +95,17 @@ study. library(Maaslin2) fit_data = Maaslin2( - input_data = feature_species, - input_metadata = sample_metadata, + input_data = feature_species, + input_metadata = sample_metadata, normalization = "NONE", - output = "output_species", + output = "output_species", fixed_effects = c("disease", "age", "antibiotics_current_use")) ``` Unlike gene expression studies, we do not have well-defined signatures or modules for microbiome data. Here, we will construct data-driven modules using -weighted gene co-expression network analysis (WGCNA) [@Langfelder2008], +weighted gene co-expression network analysis (WGCNA) [@Langfelder2008], [@Geistlinger2023]. We aim to ensure that the effect of disease and other covariates has been removed by working on the residuals. Following the WGCNA tutorial, our first step will be to check whether there are any outliers in our @@ -151,7 +151,7 @@ connectivity graphic for the selection of this power parameter. powers = c(c(1:20), seq(from = 22, to=30, by=2)) sft = pickSoftThreshold( datExpr, powerVector = powers, verbose = 5, dataIsExpr = TRUE, - RsquaredCut = 0.30) + RsquaredCut = 0.30) ``` @@ -162,16 +162,16 @@ soft threshold parameter selected above. power = sft$powerEstimate net = blockwiseModules( - datExpr, + datExpr, power = power, - corFnc="bicor", - corOptions=list(maxPOutliers=0.1), - networkType ="unsigned", + corFnc="bicor", + corOptions=list(maxPOutliers=0.1), + networkType ="unsigned", maxBlockSize = ncol(datExpr), minModuleSize = 3, - TOMType = "unsigned", - reassignThreshold = 0, - mergeCutHeight = 0, + TOMType = "unsigned", + reassignThreshold = 0, + mergeCutHeight = 0, verbose = 3) #################### @@ -241,26 +241,26 @@ library(gsEasy) run_MSEA <- function( microbeSet, # A list ranked_features, # Ranked list of featured - filter.count = 3, + filter.count = 3, seed = 1234, fdr.correction = 'BH') { - - + + ################### # Filter out sets # ################## - + microbeSet0 <- microbeSet cond <- sapply(microbeSet0, function(x) length(x) > filter.count) microbeSet <- microbeSet0[cond] lengthmicrobeSet <- as.data.frame( reshape2::melt(lapply(microbeSet, function(x) length(x)))) colnames(lengthmicrobeSet) <- c('Freq','Set') - + ################ # Classic MSEA # ################ - + set.seed(seed) enrichment <- as.data.frame( sapply(microbeSet, function(set) gset(S = set, r = ranked_features))) @@ -268,13 +268,13 @@ run_MSEA <- function( enrichment <- rownames_to_column(enrichment, 'Set') enrichment <- merge(enrichment, lengthmicrobeSet, 'Set') enrichment$qval <- p.adjust(enrichment$ES, fdr.correction) - + ########## # Return # ########## - + return(enrichment) - + } ``` @@ -322,14 +322,14 @@ We can plot the enrichment scores to visualize the MSEA results. ```{r visualzation} -p <- MSEA %>% - arrange(-pval) %>% - mutate(ID = factor(ID, levels = ID)) %>% +p <- MSEA %>% + arrange(-pval) %>% + mutate(ID = factor(ID, levels = ID)) %>% ggplot(aes(y = -log10(pval), x = ID)) + - geom_bar(stat = "identity", fill = 'cornflowerblue') + theme_bw() + - coord_flip() + - ggtitle('Statistically significant modules associated with disease') + - xlab('') + + geom_bar(stat = "identity", fill = 'cornflowerblue') + theme_bw() + + coord_flip() + + ggtitle('Statistically significant modules associated with disease') + + xlab('') + ylab('MSEA enrichment score') print(p) @@ -356,7 +356,7 @@ project and follow the same steps as before. ```{r input-data} ########################## -# Load HMP2 pathway data # +# Load HMP2 pathway data # ########################## se_pathway <- sampleMetadata |> @@ -367,8 +367,8 @@ se_pathway <- sampleMetadata |> # Create sample metadata # ########################## -sample_metadata <- colData(se_pathway) %>% - as.data.frame() %>% filter(visit_number == 1) %>% +sample_metadata <- colData(se_pathway) %>% + as.data.frame() %>% filter(visit_number == 1) %>% dplyr::select("age", "disease", "antibiotics_current_use") # Set reference @@ -381,10 +381,10 @@ sample_metadata$disease <- relevel(sample_metadata$disease, 'healthy') feature_pwys_t <- as.data.frame(assay(se_pathway)) feature_pwys_t <- rownames_to_column(feature_pwys_t, "ID") -feature_pwys_t <- feature_pwys_t %>% - filter(!grepl("\\|", ID)) %>% - filter(!ID %in% c('UNMAPPED', 'UNINTEGRATED')) %>% - column_to_rownames('ID') %>% +feature_pwys_t <- feature_pwys_t %>% + filter(!grepl("\\|", ID)) %>% + filter(!ID %in% c('UNMAPPED', 'UNINTEGRATED')) %>% + column_to_rownames('ID') %>% as.data.frame() ############################## @@ -404,10 +404,10 @@ construct the modules using residuals from the `MaAsLin2` models. ```{r MSEA-preparation} fit_data = Maaslin2( - input_data = feature_pwys, - input_metadata = sample_metadata, + input_data = feature_pwys, + input_metadata = sample_metadata, normalization = "NONE", - output = "output_pwys", + output = "output_pwys", fixed_effects = c("disease", "age", "antibiotics_current_use")) ########################## @@ -453,16 +453,16 @@ sft = pickSoftThreshold( power = sft$powerEstimate net = blockwiseModules( - datExpr, + datExpr, power = power, - corFnc = "bicor", - corOptions = list(maxPOutliers = 0.1), - networkType ="unsigned", + corFnc = "bicor", + corOptions = list(maxPOutliers = 0.1), + networkType ="unsigned", maxBlockSize = ncol(datExpr), minModuleSize = 3, - TOMType = "unsigned", - reassignThreshold = 0, - mergeCutHeight = 0, + TOMType = "unsigned", + reassignThreshold = 0, + mergeCutHeight = 0, verbose = 3) #################### @@ -538,14 +538,14 @@ MSEA$ID <- paste(MSEA$ID, ' (', MSEA$Size, ')', sep = '') # Plot # ######## -p <- MSEA %>% - arrange(-pval) %>% - mutate(ID = factor(ID, levels = ID)) %>% +p <- MSEA %>% + arrange(-pval) %>% + mutate(ID = factor(ID, levels = ID)) %>% ggplot(aes(y = -log10(pval), x = ID)) + - geom_bar(stat = "identity", fill = 'cornflowerblue') + theme_bw() + - coord_flip() + - ggtitle('Statistically significant modules associated with disease') + - xlab('') + + geom_bar(stat = "identity", fill = 'cornflowerblue') + theme_bw() + + coord_flip() + + ggtitle('Statistically significant modules associated with disease') + + xlab('') + ylab('MSEA enrichment score') print(p) diff --git a/inst/pages/multiassay_ordination.qmd b/inst/pages/multiassay_ordination.qmd index cd0f231b6..ba24d460d 100644 --- a/inst/pages/multiassay_ordination.qmd +++ b/inst/pages/multiassay_ordination.qmd @@ -44,9 +44,9 @@ library(MOFA2) # reticulate package is needed library(reticulate) # Let us assume that these have been installed already. -#reticulate::install_miniconda(force = TRUE) -#reticulate::use_miniconda(condaenv = "env1", required = FALSE) -#reticulate::py_install(packages = c("mofapy2"), pip = TRUE, python_version=3.6) +reticulate::install_miniconda(force = TRUE) +reticulate::use_miniconda(condaenv = "env1", required = FALSE) +reticulate::py_install(packages = c("mofapy2"), pip = TRUE, python_version=3.6) ``` The `mae` object could be used straight to create the MOFA model. Yet, @@ -66,10 +66,10 @@ remove them. #| label: mofa3 library(MOFA2) -# For simplicity, classify all high-fat diets as high-fat, and all the low-fat +# For simplicity, classify all high-fat diets as high-fat, and all the low-fat # diets as low-fat diets colData(mae)$Diet <- ifelse( - colData(mae)$Diet == "High-fat" | colData(mae)$Diet == "High-fat + XOS", + colData(mae)$Diet == "High-fat" | colData(mae)$Diet == "High-fat + XOS", "High-fat", "Low-fat") # Agglomerate microbiome data @@ -96,7 +96,7 @@ assays(mae[[3]]) <- assays(mae[[3]])["standardize"] # Building our mofa model model <- create_mofa_from_MultiAssayExperiment( mae, - groups = "Diet", + groups = "Diet", extract_metadata = TRUE) model ``` diff --git a/inst/pages/network_comparison.qmd b/inst/pages/network_comparison.qmd index 9de85e5f9..9b52ed607 100644 --- a/inst/pages/network_comparison.qmd +++ b/inst/pages/network_comparison.qmd @@ -6,8 +6,8 @@ chapterPreamble() ``` ```{r, echo=FALSE} -# The networks shown in this chapter take several minutes to generate. -# Therefore, the network objects have been stored in a folder and are loaded +# The networks shown in this chapter take several minutes to generate. +# Therefore, the network objects have been stored in a folder and are loaded # here to reduce the time needed to knit the book. networks_diet_file <- system.file( "extdata", "networks_diet.RData", package = "OMA") @@ -26,10 +26,10 @@ networks side by side, or quantitatively using differential network analysis tools. Two approaches for comparing networks between two conditions are considered -in this chapter: +in this chapter: - **Differential network analysis**, which analyzes differences in network -metrics and network structure. +metrics and network structure. - **Differential association analysis**, which focuses on differences in the strength of individual associations. @@ -40,7 +40,7 @@ Here we use [NetCoMi](https://github.com/stefpeschel/NetCoMi) [@peschel2021netcomi] for network comparison, which includes several **differential network analysis approaches** as well as functionality for *differential association analysis**, i.e., generating a differential network. -How to install NetCoMi from GitHub is explained in [@sec-network-learning]. +How to install NetCoMi from GitHub is explained in [@sec-network-learning]. The **PeerJ data set** [@potbhare2022skin] containing skin microbial profiles for 58 subjects is again used in this chapter. The dataset also includes @@ -64,14 +64,14 @@ tse <- peerj13075 ```{r preprocessing} # Agglomerate to genus level -tse <- agglomerateByRank(tse, rank = "genus") +tse <- agglomerateByRank(tse, rank = "genus") # Add relative abundances tse <- transformAssay( - tse, - assay.type = "counts", + tse, + assay.type = "counts", method = "relabundance", - MARGIN = "cols") + MARGIN = "cols") # Filter by prevalence tse <- subsetByPrevalent( @@ -119,7 +119,7 @@ spring_net_diet <- netConstruct( rep.num = 10, thresh = 0.05, Rmethod = "approx"), - sparsMethod = "none", + sparsMethod = "none", dissFunc = "signed", verbose = 3, seed = 13075) @@ -132,7 +132,7 @@ different from zero. ```{r netAnalyze_diet, fig.width=10, fig.height=10} spring_netprops_diet <- netAnalyze( - spring_net_diet, + spring_net_diet, clustMethod = "cluster_fast_greedy", hubPar = "eigenvector", normDeg = FALSE) @@ -167,12 +167,12 @@ plot(spring_netprops_diet, labelScale = FALSE, nodeSize = "eigenvector", nodeSizeSpread = 2, - nodeColor = "cluster", + nodeColor = "cluster", sameColThresh = 2, hubBorderCol = "darkgray", cexNodes = 2, edgeTranspHigh = 20, - title1 = "Mixed diet", + title1 = "Mixed diet", title2 = "Vegetarian", showTitle = TRUE, cexTitle = 2, @@ -183,7 +183,7 @@ par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend(-0.2, -0.9, cex = 1.5, title = "estimated correlation:", - legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), + legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), bty = "n", horiz = TRUE) ``` @@ -201,12 +201,12 @@ plot(spring_netprops_diet, labelScale = FALSE, nodeSize = "eigenvector", nodeSizeSpread = 2, - nodeColor = "cluster", + nodeColor = "cluster", sameColThresh = 2, hubBorderCol = "darkgray", cexNodes = 2, edgeTranspHigh = 20, - title1 = "Mixed diet", + title1 = "Mixed diet", title2 = "Vegetarian", showTitle = TRUE, cexTitle = 2, @@ -216,7 +216,7 @@ plot(spring_netprops_diet, par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend(-0.2, -0.8, cex = 1.7, title = "estimated correlation:", - legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), + legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), bty = "n", horiz = TRUE) ``` @@ -254,15 +254,15 @@ p_diet <- plot( rmSingles = "inboth", labelScale = FALSE, nodeSize = "clr", - nodeColor = "feature", - featVecCol = phyla, + nodeColor = "feature", + featVecCol = phyla, colorVec = colvec, nodeTransp = 20, sameColThresh = 2, highlightHubs = FALSE, cexNodes = 2, edgeTranspHigh = 20, - title1 = "Mixed diet", + title1 = "Mixed diet", title2 = "Vegetarian", showTitle = TRUE, cexTitle = 2, @@ -275,10 +275,10 @@ col_transp <- colToTransp(colvec, 20) par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE) plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n') legend(-0.15, -0.8, cex = 1.7, title = "estimated correlation:", - legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), + legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), bty = "n", horiz = TRUE) -legend(-0.15, 1.3, cex = 1.7, pt.cex = 2.5, title = "Phylum:", - legend=levels(phyla), col = col_transp, bty = "n", pch = 16) +legend(-0.15, 1.3, cex = 1.7, pt.cex = 2.5, title = "Phylum:", + legend=levels(phyla), col = col_transp, bty = "n", pch = 16) ``` ### Quantitative comparison @@ -300,7 +300,7 @@ before applying it to your data. ```{r netcomp_diet_permute, eval=FALSE} spring_netcomp_diet <- netCompare( - spring_netprops_diet, + spring_netprops_diet, permTest = TRUE, nPerm = 1000, cores = 6, @@ -311,7 +311,7 @@ spring_netcomp_diet <- netCompare( ``` ```{r summary_netcomp_diet} -summary(spring_netcomp_diet, +summary(spring_netcomp_diet, groupNames = c("Mixed diet", "Vegetarian"), numbNodes = 5) ``` @@ -382,12 +382,12 @@ spring_diffnet_unadj <- diffnet( adjust = "none") ``` -The `diffnet` object it now plotted using NetCoMi's plot function. +The `diffnet` object it now plotted using NetCoMi's plot function. ```{r diffnet_plot, fig.width=20, fig.height=14} -plot(spring_diffnet_unadj, - cexLabels = 2, - cexNodes = 0.7, +plot(spring_diffnet_unadj, + cexLabels = 1, + cexNodes = 1.3, cexLegend = 2.5, cexTitle = 3, mar = c(3,2,5,15), @@ -403,7 +403,7 @@ uncorrelated in the vegetarian group (such as Serratia and Citrobacter), the edge color is dark green. ```{r, eval=FALSE, echo=FALSE} -save(spring_diffnet, spring_net_diet, spring_netcomp_diet, +save(spring_diffnet, spring_net_diet, spring_netcomp_diet, file = "general/network_data/networks_diet.RData") ``` @@ -430,7 +430,7 @@ analysis methods, but focus only on changes in edge sets. that incorporate multiple local and global network measures. @jardim2019bionetstat present a tool "BioNetStat" for differential analysis of biological networks, which is able to compare certain network measures -between groups. +between groups. The [NetCoMi](https://github.com/stefpeschel/NetCoMi) package used for network comparison in this chapter includes the following @@ -450,4 +450,4 @@ Two methods (Fisher's z-test [@fisher1970statistical] and the Discordant method [@siska2016discordant]) are available for identifying differential correlations, and **permutation tests** for the more general case of identifying **differential associations**. See [@peschel2021netcomi] for -details. NetCoMi offers also a function for plotting a differential network. +details. NetCoMi offers also a function for plotting a differential network. diff --git a/inst/pages/network_learning.qmd b/inst/pages/network_learning.qmd index a5391fb29..cb00dc247 100644 --- a/inst/pages/network_learning.qmd +++ b/inst/pages/network_learning.qmd @@ -74,7 +74,7 @@ The following list gives a selection of compositionality aware approaches: + **Proportionality measures (proportionality aware by definition):** - [propr](https://github.com/tpq/propr) - [Shrinkage proportionality estimator](https://github.com/MichelleBadri/NormCorr-manuscript/blob/master/code/helpers/norm_functions.R); applied in [@sec-shrinkage-prop] - + 4. **Sparsification:** Transforming the estimated associations directly into adjacencies would lead to a dense network where all nodes are connected and only weighted network measures are meaningful. Therefore, the association @@ -132,8 +132,8 @@ dim(tse) ``` ```{r, echo=FALSE} -# The networks shown in this chapter take several minutes to generate. -# Therefore, the network objects have been stored in a folder and are loaded +# The networks shown in this chapter take several minutes to generate. +# Therefore, the network objects have been stored in a folder and are loaded # here to reduce the time needed to knit the book. spring_networks_file <- system.file( "extdata", "spring_networks.RData", package = "OMA") @@ -176,14 +176,14 @@ steps: ```{r} # Agglomerate to genus level -tse <- agglomerateByRank(tse, rank = "genus") +tse <- agglomerateByRank(tse, rank = "genus") # Add relative abundances tse <- transformAssay( - tse, - assay.type = "counts", + tse, + assay.type = "counts", method = "relabundance", - MARGIN = "cols") + MARGIN = "cols") # Filter by prevalence tse <- subsetByPrevalent( @@ -206,7 +206,7 @@ dim(tse) As explained in [@sec-network-learning-workflow], we use [SPRING](https://github.com/GraceYoon/SPRING) ("Semi-Parametric Rank-based approach for INference in Graphical model") as association measure. We first -use the `SPRING` function directly to construct a conditional dependency graph. +use the `SPRING` function directly to construct a conditional dependency graph. Neither zero replacement nor normalization (steps 1 and 2 in our workflow) are required because SPRING uses a modified clr (mclr) transformation that @@ -236,8 +236,8 @@ library(SPRING) ```{r, eval=FALSE} set.seed(13075) spring_est <- SPRING( - t(assay(tse, "counts")), - Rmethod = "approx", + t(assay(tse, "counts")), + Rmethod = "approx", thresh = 0.05, lambdaseq = "data-specific") ``` @@ -283,24 +283,24 @@ transform_asso <- function(assoMat, thresh = NULL, dissTrans = "signed") { if (!is.null(thresh)) { assoMat[abs(assoMat) < thresh] <- 0 } - + # Compute dissimilarity matrix if (dissTrans == "signed") { dissMat <- sqrt(0.5 * (1 - assoMat)) } else { dissMat <- sqrt(1 - assoMat^2) } - + # Dissimilarity between nodes with zero correlation is set to 1 # (these nodes are unconnected and thus should have maximum dissimilarity) dissMat[assoMat == 0] <- 1 - + # Compute similarity matrix simMat <- 1 - dissMat - + # Turn into igraph object graphObj <- SpiecEasi::adj2igraph(simMat) - + return(list(graph = graphObj, adja = simMat, asso = assoMat, diss = dissMat)) } ``` @@ -324,7 +324,7 @@ preprocessing, association estimation, sparsification, and transformation. The returned `microNet` object can then be passed to `netAnalyze()` (the network analysis function) so that all necessary information is available for the network analysis workflow. - + ```{r, message=FALSE, warning=FALSE} library(NetCoMi) ``` @@ -344,7 +344,7 @@ netcomi_net <- netConstruct( filtTaxPar = list(numbSamp = 0.2), measure = "spring", measurePar = list(thresh = 0.05, Rmethod = "approx"), - sparsMethod = "none", + sparsMethod = "none", dissFunc = "signed", seed = 13075) ``` @@ -418,9 +418,9 @@ set.seed(13075) lay_fr <- layout_with_fr(spring_graph) par(mfrow = c(1,2)) -plot(spring_graph, layout = lay_fr, vertex.size = vsize, +plot(spring_graph, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "SPRING network") -plot(netcomi_graph, layout = lay_fr, vertex.size = vsize, +plot(netcomi_graph, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "NetCoMi network\n(with SPRING associations)") ``` @@ -480,7 +480,7 @@ head(vsize_df) ```{r plots_centrality, fig.height=10, fig.width=10} par(mfrow = c(2,2)) for (i in seq_along(centr_df)) { - plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i], + plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i], vertex.label = NA, main = colnames(centr_df)[i]) } ``` @@ -515,7 +515,7 @@ vsize_df <- get_vsizes(centr_df) ```{r plots_centrality_lcc, fig.height=10, fig.width=10} par(mfrow = c(2,2)) for (i in seq_along(centr_df)) { - plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i], + plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i], vertex.label = NA, main = colnames(centr_df)[i]) } ``` @@ -550,6 +550,7 @@ df <- data.frame( ggplot(data = df, aes(x = Degree, y = Fraction, group = 1)) + geom_line() + geom_point() + + labs(title = "Degree distribution") + theme_bw() ``` @@ -579,14 +580,14 @@ adja_sel <- spring_cor[sel, sel] ```{r clustered_heatmap, fig.width=5.8, fig.height=5} # Color vector col <- colorRamp2( - c(-1, -0.5, 0, 0.5, 1), + c(-1, -0.5, 0, 0.5, 1), c("royalblue4", "lightblue", "white", "orange", "firebrick3")) Heatmap( - adja_sel, - col = col, + adja_sel, + col = col, rect_gp = gpar(col = "gray", lwd = 1), - show_row_names = FALSE, + show_row_names = FALSE, show_column_names = FALSE, name = "Association") ``` @@ -644,7 +645,7 @@ resulting from Student's t-test in the lower triangle). See `?calcGCM` and ```{r netAnalyze_single, fig.width=7, fig.height=7, out.width='75%'} netcomi_netprops <- netAnalyze( - netcomi_net, + netcomi_net, clustMethod = "cluster_fast_greedy", hubPar = "eigenvector", normDeg = FALSE) @@ -707,17 +708,17 @@ plot(netcomi_netprops, labelScale = FALSE, nodeSize = "eigenvector", nodeSizeSpread = 3, - nodeColor = "cluster", + nodeColor = "cluster", hubBorderCol = "gray40", cexNodes = 1.8, edgeTranspHigh = 20, - title1 = "Network properties highlighted", + title1 = "Network properties highlighted", showTitle = TRUE, cexTitle = 2.3, mar = c(1, 3, 4, 8)) legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:", - legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), + legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), bty = "n", horiz = TRUE) ``` @@ -747,28 +748,28 @@ plot(netcomi_netprops, shortenLabels = "intelligent", labelScale = FALSE, nodeSize = "mclr", - nodeColor = "feature", - featVecCol = phyla, + nodeColor = "feature", + featVecCol = phyla, colorVec = colvec, nodeTransp = 20, highlightHubs = FALSE, cexNodes = 1.8, edgeTranspHigh = 20, - title1 = "Data features highlighted", + title1 = "Data features highlighted", showTitle = TRUE, cexTitle = 2.3, mar = c(1, 10, 4, 6)) # Add legends legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:", - legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), + legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"), bty = "n", horiz = TRUE) # Colors used in the legend should be equally transparent as in the plot col_transp <- colToTransp(colvec, 20) -legend(-1.8, 1.1, cex = 1.7, pt.cex = 2.5, title = "Phylum:", - legend=levels(phyla), col = col_transp, bty = "n", pch = 16) +legend(-1.8, 1.1, cex = 1.7, pt.cex = 2.5, title = "Phylum:", + legend=levels(phyla), col = col_transp, bty = "n", pch = 16) ``` **A few things to observe:** @@ -849,7 +850,7 @@ conditional dependence measures address the high dimensionality of microbiome da as an alternative measure of pairwise association for compositional data. The idea is that if the relative abundances between two taxa $i$ and $j$ are proportional, then their corresponding absolute abundances are also -proportional: +proportional: $\frac{\omega_i}{m} \propto \frac{\omega_j}{m} \Rightarrow \omega_i \propto \omega_j$, where $m$ is the sum of counts in the sample. It follows that proportionality is identical for the observed (relative) read counts and the true unobserved @@ -866,15 +867,15 @@ even with small sample sizes. ## Comparison of association measures In this section, we provide three additional examples for constructing a -network using each of the three types of association: +network using each of the three types of association: * Correlation using `SparCC` * Partial correlation using `SpiecEasi` * Proportionality using the shrinkage proportionality measure ```{r, echo=FALSE} -# The networks shown in this chapter take several minutes to generate. -# Therefore, the network objects have been stored in a folder and are loaded +# The networks shown in this chapter take several minutes to generate. +# Therefore, the network objects have been stored in a folder and are loaded # here to reduce the time needed to knit the book. networks_file <- system.file("extdata", "networks.RData", package = "OMA") load(networks_file) @@ -961,7 +962,7 @@ for Ecological Association and Statistical Inference") approach proposed by @Kurtz2015 to estimate a sparse conditional dependency graph. The neighborhood selection method ("MB") introduced by @meinshausen2006high is used for network learning. The approach is implemented in the R package -[SpiecEasi](https://github.com/zdk123/SpiecEasi). +[SpiecEasi](https://github.com/zdk123/SpiecEasi). ```{r, message=FALSE, warning=FALSE} library(SpiecEasi) @@ -970,8 +971,8 @@ library(SpiecEasi) ```{r, eval=FALSE} set.seed(13075) se_mb_est <- spiec.easi( - t(assay(tse, "counts")), - method = 'mb', nlambda = 20, + t(assay(tse, "counts")), + method = 'mb', nlambda = 20, pulsar.params = list(rep.num = 20)) ``` @@ -1015,17 +1016,17 @@ set.seed(13075) lay_fr <- layout_with_fr(se_mb_graph) par(mfrow = c(2,2)) -plot(sparcc_graph03, layout = lay_fr, vertex.size = vsize, +plot(sparcc_graph03, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "SparCC (thresh 0.3)") -plot(sparcc_graph04, layout = lay_fr, vertex.size = vsize, +plot(sparcc_graph04, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "SparCC (thresh 0.4)") -plot(prop_graph, layout = lay_fr, vertex.size = vsize, +plot(prop_graph, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "Shrinkage proportionality\n(thresh 0.4)") -plot(se_mb_graph, layout = lay_fr, vertex.size = vsize, +plot(se_mb_graph, layout = lay_fr, vertex.size = vsize, vertex.label = NA, main = "SpiecEasi (MB)") ``` -**A few observations:** +**A few observations:** The density of SparCC (threshold 0.4), proportionality and SpiecEasi is comparable, while the SparCC correlation network with threshold 0.3 is much denser. However, there are edges in the proportionality and SpiecEasi networks @@ -1046,7 +1047,7 @@ graph objects we need for the analyses. ```{r graphlist} graphlist <- list( - SparCC = sparcc_graph04, + SparCC = sparcc_graph04, Proportionality = prop_graph, SpiecEasi = se_mb_graph, SPRING = spring_graph) @@ -1075,8 +1076,8 @@ for(i in seq_along(graphlist)) { # Data frame needed for ggplot2 df <- data.frame( - Degree = rep(seq_len(maxdeg), length(graphlist)), - Fraction = unlist(ddlist), + Degree = rep(seq_len(maxdeg), length(graphlist)), + Fraction = unlist(ddlist), Method = rep(names(graphlist), each = maxdeg)) ggplot(df, aes(x = Degree, y = Fraction, group = Method)) + @@ -1136,18 +1137,18 @@ for(i in seq_along(assolist)) { } else { showlegend <- FALSE } - + hm_list[[i]] <- Heatmap( - assolist[[i]], - col = col, + assolist[[i]], + col = col, rect_gp = gpar(col = "gray", lwd = 1), - show_row_names = FALSE, + show_row_names = FALSE, show_column_names = FALSE, - column_title = names(assolist)[i], + column_title = names(assolist)[i], name = "Association", show_heatmap_legend = showlegend) %>% - - draw() %>% + + draw() %>% grid.grabExpr() } diff --git a/inst/pages/quality_control.qmd b/inst/pages/quality_control.qmd index 229703dc3..f0e582c01 100644 --- a/inst/pages/quality_control.qmd +++ b/inst/pages/quality_control.qmd @@ -41,10 +41,10 @@ tse <- transformAssay(tse, MARGIN = "cols", method = "relabundance") # assay.type / assay.type / assay.type # depending on the mia package version plotAbundanceDensity( - tse, layout = "jitter", + tse, layout = "jitter", assay.type = "relabundance", - n = 40, point_size=1, point.shape=19, - point.alpha=0.1) + + n = 40, point_size=1, point.shape=19, + point.alpha=0.1) + scale_x_log10(label=scales::percent) ``` @@ -54,9 +54,9 @@ visualized as a density plot over a log-scaled axis, with ```{r, warning=FALSE, message=FALSE} plotAbundanceDensity( - tse, layout = "density", + tse, layout = "density", assay.type = "relabundance", - n = 5, colour.by="nationality", + n = 5, colour.by="nationality", point.alpha=1/10) + scale_x_log10() ``` @@ -73,7 +73,7 @@ which may be _conditionally abundant_ in a small number of samples. The population prevalence (frequency) at a 1% relative abundance threshold (`detection = 1/100` and `as.relative = TRUE`) can look -like this. +like this. ```{r exploration-prevalence} getPrevalence(tse, detection = 1/100, sort = TRUE, as.relative = TRUE) |> @@ -98,7 +98,7 @@ If the output is to be used for subsetting or storing the data in the ### Prevalence analysis -To investigate microbiome prevalence at a selected taxonomic level, two +To investigate microbiome prevalence at a selected taxonomic level, two approaches are available. First, the data can be agglomerated to the taxonomic level and then @@ -108,7 +108,7 @@ First, the data can be agglomerated to the taxonomic level and then # Agglomerate taxa abundances to Phylum level #and add the new table to the altExp slot altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum") -# Check prevalence for the Phylum abundance table +# Check prevalence for the Phylum abundance table #from the altExp slot getPrevalence( altExp(tse,"Phylum"), detection = 1/100, sort = TRUE, @@ -149,7 +149,7 @@ prev ## Note The `detection` and `prevalence` thresholds are not the same, as -`detection` can be applied to relative counts or absolute counts depending on +`detection` can be applied to relative counts or absolute counts depending on whether `as.relative` is set `TRUE` or `FALSE` ::: @@ -170,14 +170,14 @@ are stored in the `altExp` slot. ```{r} rowData(altExp(tse,"Phylum"))$prevalence <- getPrevalence( - altExp(tse,"Phylum"), detection = 1/100, + altExp(tse,"Phylum"), detection = 1/100, sort = FALSE, assay.type = "counts", as.relative = TRUE) ``` The prevalences can then be plotted using the plotting functions from the `scater` package. - + ```{r, message=FALSE, warning=FALSE} library(scater) plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum") @@ -191,9 +191,9 @@ tse <- agglomerateByRanks(tse) altExps(tse) <- lapply( altExps(tse), function(y){ rowData(y)$prevalence <- getPrevalence( - y, detection = 1/100, + y, detection = 1/100, sort = FALSE, - assay.type = "counts", + assay.type = "counts", as.relative = TRUE) return(y) }) @@ -211,7 +211,7 @@ top_phyla_mean <- getTop( x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6]) x <- addHierarchyTree(x) ``` - + After some preparation, the data are assembled and can be plotted with `plotRowTree`. @@ -241,22 +241,22 @@ microbiome data summaries. ```{r, message=FALSE} library(mia) data("GlobalPatterns", package="mia") -tse <- GlobalPatterns +tse <- GlobalPatterns ``` -### Top taxa +### Top taxa -The function `getTop` identifies top taxa in the data. +The function `getTop` identifies top taxa in the data. ```{r top-feature-taxo} # Pick the top taxa top_features <- getTop(tse, method="median", top=10) -# Check the information for these +# Check the information for these rowData(tse)[top_features, taxonomyRanks(tse)][1:5, 1:3] ``` -### Library size / read count +### Library size / read count The total counts per sample can be calculated using `perCellQCMetrics`/`addPerCellQC` from the `scater` package. The former one @@ -279,36 +279,36 @@ library(ggplot2) p1 <- ggplot(colData(tse)) + geom_histogram( aes(x = sum), color = "black", fill = "gray", bins = 30) + - labs(x = "Library size", y = "Frequency (n)") + + labs(x = "Library size", y = "Frequency (n)") + # scale_x_log10( - # breaks = scales::trans_breaks("log10", function(x) 10^x), + # breaks = scales::trans_breaks("log10", function(x) 10^x), # labels = scales::trans_format("log10", scales::math_format(10^.x))) + theme_bw() + # Removes the grid - theme(panel.grid.major = element_blank(), + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), # Adds y-axis - axis.line = element_line(colour = "black")) + axis.line = element_line(colour = "black")) library(dplyr) df <- as.data.frame(colData(tse)) %>% arrange(sum) %>% mutate(index = 1:n()) p2 <- ggplot(df, aes(y = index, x = sum/1e6)) + - geom_point() + + geom_point() + labs( - x = "Library size (million reads)", - y = "Sample index") + + x = "Library size (million reads)", + y = "Sample index") + theme_bw() + # Removes the grid - theme(panel.grid.major = element_blank(), + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), # Adds y-axis - axis.line = element_line(colour = "black")) + axis.line = element_line(colour = "black")) library(patchwork) p1 + p2 @@ -318,28 +318,28 @@ Library sizes and other variables from `colData` can be visualized using a specified function called `plotColData`. ```{r plot-viz-lib-size-2, fig.width=8, fig.height=4, fig.cap="Library sizes per sample."} -# Sort samples by read count, -# order the factor levels, +# Sort samples by read count, +# order the factor levels, # and store back to tse as DataFrame -# TODO: plotColData could include an option +# TODO: plotColData could include an option # for sorting samples based on colData variables colData(tse) <- as.data.frame(colData(tse)) %>% arrange(X.SampleID) %>% mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>% DataFrame -plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") + +plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") + theme(axis.text.x = element_text(angle = 45, hjust=1)) + - labs(y = "Library size (N)", x = "Sample ID") + labs(y = "Library size (N)", x = "Sample ID") ``` ```{r plot-viz-lib-size-3, fig.width=8, fig.height=4, fig.cap="Library sizes per sample type."} -plotColData(tse,"sum","SampleType", colour_by = "SampleType") + +plotColData(tse,"sum","SampleType", colour_by = "SampleType") + theme(axis.text.x = element_text(angle = 45, hjust=1)) ``` In addition, data can be rarefied with [rarefyAssay](https://microbiome.github.io/mia/reference/rarefyAssay.html), -which normalizes the samples to an equal number of reads. +which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed [@Schloss2024rarefaction1] [@Schloss2024rarefaction2]. @@ -352,9 +352,9 @@ Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on its frequencies and concentrations across the samples. -The following +The following [decontam functions](https://microbiome.github.io/mia/reference/isContaminant.html) are based on [@davis2018simple] and support such functionality: -* `isContaminant`, `isNotContaminant` -* `addContaminantQC`, `addNotContaminantQC` +* `isContaminant`, `isNotContaminant` +* `addContaminantQC`, `addNotContaminantQC` diff --git a/inst/pages/resources.qmd b/inst/pages/resources.qmd index 735e8d142..30e573c7f 100644 --- a/inst/pages/resources.qmd +++ b/inst/pages/resources.qmd @@ -4,34 +4,34 @@ ### Data container documentation -* SingleCellExperiment [@R_SingleCellExperiment] +* SingleCellExperiment [@R_SingleCellExperiment] - + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) - + [Project page](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) - + [Publication](https://doi.org/10.1038/s41592-019-0654-x) + + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) + + [Project page](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) + + [Publication](https://doi.org/10.1038/s41592-019-0654-x) -* SummarizedExperiment [@R_SummarizedExperiment] +* SummarizedExperiment [@R_SummarizedExperiment] - + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) - + [Project page](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) + + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) + + [Project page](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) -* TreeSummarizedExperiment [@R_TreeSummarizedExperiment] +* TreeSummarizedExperiment [@R_TreeSummarizedExperiment] - + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/TreeSummarizedExperiment/inst/doc/Introduction_to_treeSummarizedExperiment.html) - + [Project page](https://www.bioconductor.org/packages/release/bioc/html/TreeSummarizedExperiment.html) - + [Publication](https://doi.org/10.12688/f1000research.26669.2) + + [Online tutorial](https://bioconductor.org/packages/release/bioc/vignettes/TreeSummarizedExperiment/inst/doc/Introduction_to_treeSummarizedExperiment.html) + + [Project page](https://www.bioconductor.org/packages/release/bioc/html/TreeSummarizedExperiment.html) + + [Publication](https://doi.org/10.12688/f1000research.26669.2) -* MultiAssayExperiment [@Ramos2017] +* MultiAssayExperiment [@Ramos2017] - + [Online tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html) - + [Project page](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) - + [Publication](https://doi.org/10.1158/0008-5472.CAN-17-0344) + + [Online tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html) + + [Project page](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) + + [Publication](https://doi.org/10.1158/0008-5472.CAN-17-0344) ### Other relevant containers -* [DataFrame](https://rdrr.io/bioc/S4Vectors/man/DataFrame-class.html) which behaves similarly to `data.frame`, yet efficient and fast when used with large datasets. -* [DNAString](https://rdrr.io/bioc/Biostrings/man/DNAString-class.html) along with `DNAStringSet`,`RNAString` and `RNAStringSet` efficient storage and handling of long biological sequences are offered within the Biostrings package [@R_Biostrings]. -* GenomicRanges ([@GenomicRanges2013]) offers an efficient representation and manipulation of genomic annotations and alignments, see e.g. `GRanges` and `GRangesList` at [An Introduction to the GenomicRangesPackage](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html). +* [DataFrame](https://rdrr.io/bioc/S4Vectors/man/DataFrame-class.html) which behaves similarly to `data.frame`, yet efficient and fast when used with large datasets. +* [DNAString](https://rdrr.io/bioc/Biostrings/man/DNAString-class.html) along with `DNAStringSet`,`RNAString` and `RNAStringSet` efficient storage and handling of long biological sequences are offered within the Biostrings package [@R_Biostrings]. +* GenomicRanges ([@GenomicRanges2013]) offers an efficient representation and manipulation of genomic annotations and alignments, see e.g. `GRanges` and `GRangesList` at [An Introduction to the GenomicRangesPackage](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html). [NGS Analysis Basics](http://girke.bioinformatics.ucr.edu/GEN242/tutorials/rsequences/rsequences/) provides a walk-through of the above-mentioned features with detailed examples. @@ -88,9 +88,9 @@ of the datasets increased. Further details on such results can be found in the The phyloseq container provides analogous methods to TreeSE. The following material can be used to familiarize with such alternative methods: - * [List of R tools for microbiome analysis](https://microsud.github.io/Tools-Microbiome-Analysis/) - * phyloseq [@McMurdie2013] - * [microbiome tutorial](http://microbiome.github.io/tutorials/) + * [List of R tools for microbiome analysis](https://microsud.github.io/Tools-Microbiome-Analysis/) + * phyloseq [@McMurdie2013] + * [microbiome tutorial](http://microbiome.github.io/tutorials/) * [microbiomeutilities](https://microsud.github.io/microbiomeutilities/) * [phyloseq/TreeSE cheatsheet](https://microbiome.github.io/OMA/docs/devel/pages/97_extra_materials.html#cheatsheet) * Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses [@Callahan2016]. @@ -104,10 +104,10 @@ for a kickstart to R programming. Further support resources are available through the Bioconductor project [@Huber2015]. - * [Base R and RStudio cheatsheets](https://geomoer.github.io/moer-base-r/cheatsheet.html) - * [Package-specific cheatsheets](https://www.rstudio.com/resources/cheatsheets/) - * [Visualization with ggplot2](https://ggplot2.tidyverse.org/) - * [R graphics cookbook](http://www.cookbook-r.com/Graphs/) + * [Base R and RStudio cheatsheets](https://geomoer.github.io/moer-base-r/cheatsheet.html) + * [Package-specific cheatsheets](https://www.rstudio.com/resources/cheatsheets/) + * [Visualization with ggplot2](https://ggplot2.tidyverse.org/) + * [R graphics cookbook](http://www.cookbook-r.com/Graphs/) ### Bioconductor Classes {#sec-bioc_intro} @@ -119,17 +119,17 @@ project [@Huber2015]. S4 class system has brought several useful features to the object-oriented programming paradigm within R, and it is constantly deployed in R/Bioconductor packages [@Huber2015]. - + | Online Document: -* Hervé Pagès, [A quick overview of the S4 class system](https://bioconductor.org/packages/release/bioc/vignettes/S4Vectors/inst/doc/S4QuickOverview.pdf). -* Laurent Gatto, [A practical tutorial on S4 programming](https://bioconductor.org/help/course-materials/2013/CSAMA2013/friday/afternoon/S4-tutorial.pdf) -* How S4 Methods Work [@Chambers2006] +* Hervé Pagès, [A quick overview of the S4 class system](https://bioconductor.org/packages/release/bioc/vignettes/S4Vectors/inst/doc/S4QuickOverview.pdf). +* Laurent Gatto, [A practical tutorial on S4 programming](https://bioconductor.org/help/course-materials/2013/CSAMA2013/friday/afternoon/S4-tutorial.pdf) +* How S4 Methods Work [@Chambers2006] | Books: -* John M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN-13 978-0387759357 [@Chambers2008] -* I Robert Gentleman. R Programming for Bioinformatics. Chapman & Hall/CRC, New York, 2008. ISBN-13 978-1420063677 [@gentleman2008r] +* John M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN-13 978-0387759357 [@Chambers2008] +* I Robert Gentleman. R Programming for Bioinformatics. Chapman & Hall/CRC, New York, 2008. ISBN-13 978-1420063677 [@gentleman2008r] ## Reproducible reporting with Quarto {#sec-quarto} @@ -159,21 +159,21 @@ to interactive reproducible reporting. Being able to use Quarto in R partly relies on your previous knowledge of Rmarkdown. The following resources can help you get familiar with Rmarkdown: - * [Online tutorial](https://www.markdowntutorial.com/) - * [Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) - * [Documentation](https://rmarkdown.rstudio.com/lesson-1.html) - * [Dr. C Titus Brown's tutorial](https://rpubs.com/marschmi/RMarkdown) + * [Online tutorial](https://www.markdowntutorial.com/) + * [Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) + * [Documentation](https://rmarkdown.rstudio.com/lesson-1.html) + * [Dr. C Titus Brown's tutorial](https://rpubs.com/marschmi/RMarkdown) -**Figure sources:** +**Figure sources:** **Original article** -- Huang R _et al_. (2021) TreeSummarizedExperiment: a S4 class +- Huang R _et al_. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research 9:1246. [@Huang2021] **Reference Sequence slot extension** -- Lahti L _et al_. (2020) [Upgrading the R/Bioconductor ecosystem for microbiome +- Lahti L _et al_. (2020) [Upgrading the R/Bioconductor ecosystem for microbiome research](https://doi.org/10.7490/ f1000research.1118447.1) F1000Research 9:1464 (slides). diff --git a/inst/pages/session_info.qmd b/inst/pages/session_info.qmd index 8479dfc5d..342946188 100644 --- a/inst/pages/session_info.qmd +++ b/inst/pages/session_info.qmd @@ -2,15 +2,15 @@ ## Docker image {#sec-docker-image} -A `Docker` image built from this repository is available here: +A `Docker` image built from this repository is available here: 👉 [ghcr.io/microbiome/oma](https://ghcr.io/microbiome/oma) 🐳 ::: {.callout-tip icon='true'} ## Get started now 🎉 -You can get access to all the packages used in this book in < 1 minute, -using this command in a terminal: +You can get access to all the packages used in this book in < 1 minute, +using this command in a terminal: ```{sh "docker", filename="bash"} #| eval: false @@ -21,7 +21,7 @@ docker run -it ghcr.io/microbiome/oma:devel R ## RStudio Server {-} -An RStudio Server instance can be initiated from the `Docker` image as follows: +An RStudio Server instance can be initiated from the `Docker` image as follows: ```{sh "rstudio", filename="bash"} #| eval: false @@ -32,7 +32,7 @@ docker run \ ghcr.io/microbiome/oma:devel ``` -The initiated RStudio Server instance will be available at +The initiated RStudio Server instance will be available at [https://localhost:8787](https://localhost:8787). ## Session info {#sec-session-info} @@ -44,7 +44,7 @@ The initiated RStudio Server instance will be available at ```{r "session info"} #| cache: false sessioninfo::session_info( - installed.packages()[,"Package"], + installed.packages()[,"Package"], include_base = TRUE ) ``` diff --git a/inst/pages/taxonomy.qmd b/inst/pages/taxonomy.qmd index 54c8903ed..010fd4830 100644 --- a/inst/pages/taxonomy.qmd +++ b/inst/pages/taxonomy.qmd @@ -19,7 +19,7 @@ and annotation data used. Therefore, the mia package expects a loose assembly of taxonomic information and assumes certain key aspects: -* Taxonomic information is given as character vectors or factors in the +* Taxonomic information is given as character vectors or factors in the `rowData` of a `SummarizedExperiment` object. * The columns containing the taxonomic information must be named `domain`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species` or with @@ -33,8 +33,8 @@ clusters of taxa that co-vary across samples. ## Assigning taxonomic information There are a number of methods to assign taxonomic information. We like to give -a short introduction about the methods available without ranking one over the -other. This has to be your choice based on the result for the individual +a short introduction about the methods available without ranking one over the +other. This has to be your choice based on the result for the individual dataset. ### DADA2 @@ -61,7 +61,7 @@ found on the [DECIPHER website](http://www2.decipher.codes/Classification.html). checkTaxonomy(tse) ``` -Since the `rowData` can contain other data, `taxonomyRanks` will return the +Since the `rowData` can contain other data, `taxonomyRanks` will return the columns `mia` assumes to contain the taxonomic information. ```{r} @@ -74,7 +74,7 @@ This can then be used to subset the `rowData` to columns needed. rowData(tse)[, taxonomyRanks(tse)] ``` -`taxonomyRankEmpty` checks for empty values in the given `rank` and returns a +`taxonomyRankEmpty` checks for empty values in the given `rank` and returns a logical vector of `length(x)`. ```{r} diff --git a/inst/pages/training.qmd b/inst/pages/training.qmd index fd85f2e00..702fdfb47 100644 --- a/inst/pages/training.qmd +++ b/inst/pages/training.qmd @@ -22,7 +22,7 @@ We recommend installing and setting up the relevant software packages on your own computer as this will support later use. The essential components to install include: -- [R (the latest official release)](https://www.r-project.org/) +- [R (the latest official release)](https://www.r-project.org/) - [RStudio](https://www.rstudio.com/products/rstudio/download/); choose "Rstudio Desktop" to download the latest version. Check the @@ -39,7 +39,7 @@ further examples from this tutorial, modifying and applying these techniques to your own data. Plain source code for the individual chapters of this book are available via [Github](https://github.com/microbiome/OMA/tree/master/R) - + - If you have access to CSC notebook you can find instructions from [here](https://microbiome.github.io/outreach/). @@ -49,8 +49,8 @@ We encourage you to familiarize yourself with the material and test examples in advance but this is optional: - [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/) (for beginners with R) - -- [Short online videos](https://www.youtube.com/playlist?list=PLjiXAZO27elAJEptP59BN3whVJ61XIkST) on microbiome data science with R/Bioconductor + +- [Short online videos](https://www.youtube.com/playlist?list=PLjiXAZO27elAJEptP59BN3whVJ61XIkST) on microbiome data science with R/Bioconductor - [Quarto presentations](https://microbiome.github.io/outreach/index.html) diff --git a/inst/pages/transformation.qmd b/inst/pages/transformation.qmd index c56232980..bd0b83cd1 100644 --- a/inst/pages/transformation.qmd +++ b/inst/pages/transformation.qmd @@ -44,7 +44,7 @@ references. high-throughput assays (16S, metagenomic sequencing) is compositional by nature, even if the data is provided as counts [@Gloor2017]. - + * 'clr' Centered log ratio transformation [@Aitchison1986] is used to reduce data skewness and compositionality bias in relative abundances, while bringing the data to the logarithmic scale. This @@ -69,14 +69,14 @@ references. Karwowska2024]. * 'standardize' Standardize(or 'z-score') transformation scales data to zero - mean and unit variance. This is used to bring features (or samples) to more + mean and unit variance. This is used to bring features (or samples) to more comparable levels in terms of mean and scale of the values. This can enhance visualization and interpretation of the data * 'log', 'log2', 'log10' Logarithmic transformations, used e.g. to reduce data skewness. With compositional data, the `clr` (or `rclr`) transformation is often preferred. - + * 'hellinger' Hellinger transformation is equal to the square root of relative abundances. This ecological transformation can be useful if we are interested in changes in relative abundances. @@ -102,16 +102,16 @@ available in the function ::: {.callout-important} -Pseudocount is a small non-negative value added to the normalized data to avoid -taking the logarithm of zero. It's value can have a significant impact on the results when applying -a logarithm transformation to normalized data, as the logarithm transformation +Pseudocount is a small non-negative value added to the normalized data to avoid +taking the logarithm of zero. It's value can have a significant impact on the results when applying +a logarithm transformation to normalized data, as the logarithm transformation is a nonlinear operation that can fundamentally change the data distribution [@Costea2014]. -Pseudocount should be chosen consistently across all normalization methods being -compared, for example, by setting it to a value smaller than the minimum abundance -value before transformation. Some tools, like ancombc2, take into account the effect -of the pseudocount by performing sensitivity tests using multiple pseudocount +Pseudocount should be chosen consistently across all normalization methods being +compared, for example, by setting it to a value smaller than the minimum abundance +value before transformation. Some tools, like ancombc2, take into account the effect +of the pseudocount by performing sensitivity tests using multiple pseudocount values. See [@sec-differential-abundance]. ::: @@ -125,12 +125,12 @@ data("GlobalPatterns", package = "mia") tse <- GlobalPatterns # Transform "counts" assay to relative abundances ("relabundance"), with -# pseudocount 1 +# pseudocount 1 tse <- transformAssay( tse, assay.type = "counts", method = "relabundance", pseudocount = 1) # Transform relative abundance assay ("relabundance") to "clr", using -# pseudocount if necessary; name the resulting assay to "clr" +# pseudocount if necessary; name the resulting assay to "clr" tse <- transformAssay( x = tse, assay.type = "relabundance", method = "clr", pseudocount = TRUE, name = "clr") @@ -177,4 +177,3 @@ You can find more information on normalization from [OSCA book](https://bioconductor.org/books/3.18/OSCA.basic/normalization.html#normalization-transformation). ::: - diff --git a/inst/pages/visualization.qmd b/inst/pages/visualization.qmd index 25abe54f9..a7a057f9d 100644 --- a/inst/pages/visualization.qmd +++ b/inst/pages/visualization.qmd @@ -306,14 +306,14 @@ plots <- plotAbundance( order.col.by = "Clostridiales", features = "SampleType") -# Modify the legend of the first plot to be smaller +# Modify the legend of the first plot to be smaller plots[[1]] <- plots[[1]] + theme( legend.key.size = unit(0.3, 'cm'), legend.text = element_text(size = 6), legend.title = element_text(size = 8)) -# Modify the legend of the second plot to be smaller +# Modify the legend of the second plot to be smaller plots[[2]] <- plots[[2]] + theme( legend.key.height = unit(0.3, 'cm'), @@ -324,17 +324,17 @@ plots[[2]] <- plots[[2]] + # Load required packages library(ggpubr) -library(patchwork) +library(patchwork) # Combine legends legend <- wrap_plots( as_ggplot(get_legend(plots[[1]])), as_ggplot(get_legend(plots[[2]])), - ncol = 1) + ncol = 1) # Removes legends from the plots plots[[1]] <- plots[[1]] + theme(legend.position = "none") plots[[2]] <- plots[[2]] + - theme(legend.position = "none", axis.title.x=element_blank()) + theme(legend.position = "none", axis.title.x=element_blank()) # Combine plots plot <- wrap_plots(plots[[2]], plots[[1]], ncol = 1, heights = c(2, 10)) @@ -371,7 +371,7 @@ tse_phylum_subset <- transformAssay( assay.type = "counts", pseudocount = 1) # Does standardize-transformation tse_phylum_subset <- transformAssay( - tse_phylum_subset, assay.type = "clr", MARGIN = "rows", + tse_phylum_subset, assay.type = "clr", MARGIN = "rows", method = "standardize", name = "clr_z") # Get n most abundant taxa, and subsets the data by them @@ -397,7 +397,7 @@ taxa_hclust <- hclust(dist(mat), method = "complete") taxa_tree <- as.phylo(taxa_hclust) # Plot taxa tree -taxa_tree <- ggtree(taxa_tree) + +taxa_tree <- ggtree(taxa_tree) + theme(plot.margin=margin(0,0,0,0)) # removes margins # Get order of taxa in plot @@ -407,7 +407,7 @@ taxa_ordered <- get_taxa_name(taxa_tree) # taxa_tree ``` -Based on phylo tree, we decide to create three clusters. +Based on phylo tree, we decide to create three clusters. ```{r} #| label = "pheatmap3" @@ -419,7 +419,7 @@ taxa_clusters <- data.frame(clusters = taxa_clusters) taxa_clusters$clusters <- factor(taxa_clusters$clusters) # Order data so that it's same as in phylo tree -taxa_clusters <- taxa_clusters[taxa_ordered, , drop = FALSE] +taxa_clusters <- taxa_clusters[taxa_ordered, , drop = FALSE] # Prints taxa and their clusters taxa_clusters @@ -449,7 +449,7 @@ sample_hclust <- hclust(dist(t(mat)), method = "complete") sample_tree <- as.phylo(sample_hclust) # Plot sample tree -sample_tree <- ggtree(sample_tree) + layout_dendrogram() + +sample_tree <- ggtree(sample_tree) + layout_dendrogram() + theme(plot.margin=margin(0,0,0,0)) # removes margins # Get order of samples in plot @@ -465,9 +465,9 @@ sample_clusters <- factor(cutree(tree = sample_hclust, k = 3)) sample_data <- data.frame(clusters = sample_clusters) # Order data so that it's same as in phylo tree -sample_data <- sample_data[samples_ordered, , drop = FALSE] +sample_data <- sample_data[samples_ordered, , drop = FALSE] -# Order data based on +# Order data based on tse_phylum_subset <- tse_phylum_subset[ , rownames(sample_data)] # Add sample type data @@ -482,12 +482,12 @@ Now we can create heatmap with additional annotations. #| label = "pheatmap6" # Determines the scaling of colors # Scale colors -breaks <- seq(-ceiling(max(abs(mat))), ceiling(max(abs(mat))), +breaks <- seq(-ceiling(max(abs(mat))), ceiling(max(abs(mat))), length.out = ifelse( max(abs(mat))>5, 2*ceiling(max(abs(mat))), 10 ) ) colors <- colorRampPalette(c("darkblue", "blue", "white", "red", "darkred"))(length(breaks)-1) pheatmap( - mat, annotation_row = taxa_clusters, + mat, annotation_row = taxa_clusters, annotation_col = sample_data, breaks = breaks, color = colors) @@ -505,11 +505,11 @@ metadata(tse_phylum_subset)$anno_colors$SampleType <- c( # Create a plot sechm( - tse_phylum_subset, - features = rownames(tse_phylum_subset), - assayName = "clr", - do.scale = TRUE, - top_annotation = c("SampleType"), + tse_phylum_subset, + features = rownames(tse_phylum_subset), + assayName = "clr", + do.scale = TRUE, + top_annotation = c("SampleType"), gaps_at = "SampleType", cluster_cols = TRUE, cluster_rows = TRUE) ``` @@ -525,7 +525,7 @@ taxa_clusters$Feature <- rownames(taxa_clusters) taxa_clusters$Feature <- factor(taxa_clusters$Feature, levels = taxa_clusters$Feature) # Create annotation plot -row_annotation <- ggplot(taxa_clusters) + +row_annotation <- ggplot(taxa_clusters) + geom_tile(aes(x = NA, y = Feature, fill = clusters)) + coord_equal(ratio = 1) + theme( @@ -584,7 +584,7 @@ limits <- c(-maxval, maxval) breaks <- seq(from = min(limits), to = max(limits), by = 0.5) colours <- c("darkblue", "blue", "white", "red", "darkred") -heatmap <- ggplot(melted_mat) + +heatmap <- ggplot(melted_mat) + geom_tile(aes(x = Sample, y = Taxa, fill = clr_z)) + theme( axis.title.y = element_blank(), @@ -595,10 +595,10 @@ heatmap <- ggplot(melted_mat) + legend.key.height = unit(1, 'cm') ) + scale_fill_gradientn( - name = "CLR + Z transform", - breaks = breaks, - limits = limits, - colours = colours) + + name = "CLR + Z transform", + breaks = breaks, + limits = limits, + colours = colours) + scale_y_discrete(position = "right") heatmap @@ -629,16 +629,16 @@ plot <- row_annotation + plot_layout( design = design, guides = "collect", # Specify layout, collect legends - + # Adjust widths and heights to align plots. # When annotation plot is larger, it might not fit into # its column/row. # Then you need to make column/row larger. - + # Relative widths and heights of each column and row: # Currently, the width of the first column is 15 % and the height of # first two rows are 30 % the size of others - + # To get this work most of the times, you can adjust all sizes to be # 1, i.e. equal, but then the gaps between plots are larger. widths = c(0.15, 1, 1), @@ -666,9 +666,9 @@ design <- c( # plot(design) # Combine plots -plot <- taxa_tree + +plot <- taxa_tree + row_annotation + - sample_tree + + sample_tree + sample_clusters_annotation + sample_types_annotation + heatmap + @@ -723,7 +723,7 @@ tse_Genus <- runMDS( method = "bray", ncomponents = 3, # calculates three principal coordinates assay.type = "relative_abundance", # calculates Bray-Curtis from relative abundace - name = "MDS_bray") #name of the PCoA within the tse + name = "MDS_bray") #name of the PCoA within the tse e <- attr(reducedDim(tse_Genus, "MDS_bray"), "eig") rel_eig <- e / sum(e[e > 0]) diff --git a/inst/pages/wrangling.qmd b/inst/pages/wrangling.qmd index 1f5fdd560..d92f27878 100644 --- a/inst/pages/wrangling.qmd +++ b/inst/pages/wrangling.qmd @@ -13,7 +13,7 @@ SummarizedExperiment objects. See [@sec-containers] for basics of `TreeSE`. ## Subsetting {#sec-treese_subsetting} **Subsetting** data helps to draw the focus of an analysis on particular -sets of samples and / or features. When dealing with large datasets, +sets of samples and / or features. When dealing with large datasets, the subset of interest can be extracted and investigated separately. This might improve performance and reduce the computational load. @@ -42,7 +42,7 @@ the number of rows to decrease. data("GlobalPatterns", package="mia") tse <- GlobalPatterns # Show dimensions (features x samples) -dim(tse) +dim(tse) ``` ### Subset by sample (column-wise) @@ -52,7 +52,7 @@ samples of human origin (feces, skin or tongue), stored as `SampleType` within `colData(tse)` as well as in `tse`. First, we would like to see all the possible values that `SampleType` can have -and how frequent those are: +and how frequent those are: ```{r} # Inspect possible values for SampleType @@ -67,7 +67,7 @@ tse$SampleType %>% table() ```{r echo = FALSE} # Show the frequency of each value tse$SampleType %>% table() %>% kable() %>% - kableExtra::kable_styling("striped", latex_options="scale_down") %>% + kableExtra::kable_styling("striped", latex_options="scale_down") %>% kableExtra::scroll_box(width = "100%") ``` @@ -116,7 +116,7 @@ possible need for agglomeration. As previously, we would first like to see all the possible values that `Phylum` can have and how frequent those are: - + ```{r} # Inspect possible values for phylum # and show the first five values @@ -126,9 +126,9 @@ cat(paste(unique(rowData(tse)$Phylum)[1:5], collapse = "\n")) ```{r} # Show the frequency of each value -rowData(tse)$Phylum %>% table() %>% head() %>% +rowData(tse)$Phylum %>% table() %>% head() %>% kable() %>% - kableExtra::kable_styling("striped", latex_options="scale_down") %>% + kableExtra::kable_styling("striped", latex_options="scale_down") %>% kableExtra::scroll_box(width = "100%") ``` @@ -184,7 +184,7 @@ features of phyla Actinobacteria or Chlamydiae. # Subset by sample and feature and remove NAs tse_sub <- tse[ rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & - !is.na(rowData(tse)$Phylum), + !is.na(rowData(tse)$Phylum), tse$SampleType %in% c("Feces", "Skin", "Tongue")] # Show dimensions @@ -211,12 +211,12 @@ the samples. This can occur, for example, after data subsetting. In certain analyses, we might want to remove those instances. ```{r} -# Agglomerate data at Genus level +# Agglomerate data at Genus level tse_genus <- agglomerateByRank(tse, rank = "Genus", na.rm = FALSE) # List bacteria that we want to include genera <- c( - "Class:Thermoprotei", - "Genus:Sulfolobus", + "Class:Thermoprotei", + "Genus:Sulfolobus", "Genus:Sediminicola") # Subset data tse_sub <- tse_genus[genera, ] @@ -255,7 +255,7 @@ sum(rowSums(assay(tse_sub, "counts")) == 0) ``` We can see that there are bacteria that are not present in the samples we chose. -We can remove those bacteria from the data. +We can remove those bacteria from the data. ```{r} # Take only those bacteria that are present @@ -267,10 +267,10 @@ tse_sub ## Splitting -You can split the data based on variables by using the functions +You can split the data based on variables by using the functions `agglomerateByRanks` and `splitOn`. -`agglomerateByRanks` splits the data based on taxonomic ranks. Since the +`agglomerateByRanks` splits the data based on taxonomic ranks. Since the elements of the output list share columns, they can be stored into `altExp` by setting `as.list = FALSE`. @@ -278,7 +278,7 @@ setting `as.list = FALSE`. agglomerateByRanks(tse, as.list = TRUE) ``` -If you want to split the data based on another variable than taxonomic rank, use +If you want to split the data based on another variable than taxonomic rank, use `splitOn`. It works for row-wise and column-wise splitting. ```{r splitOn} @@ -315,8 +315,8 @@ head(tse$NewVariable) ## Melting data For several custom analysis and visualization packages, such as those from -`tidyverse`, the `SE` data can be converted to a long data.frame format with -`meltSE`. +`tidyverse`, the `SE` data can be converted to a long data.frame format with +`meltSE`. ```{r} library(mia) @@ -337,9 +337,9 @@ molten_tse[, 1:5] `mia` package has `mergeSEs` function that merges multiple `SummarizedExperiment` objects. For example, it is possible to combine -multiple `TreeSE` objects which each includes one sample. +multiple `TreeSE` objects which each includes one sample. -`mergeSEs` works much like standard joining operations. It combines rows and +`mergeSEs` works much like standard joining operations. It combines rows and columns and allows you to specify the merging method. ```{r merge1}