Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix formatting and remove trailing spaces #1289

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 31 additions & 31 deletions vignettes/phyloseq-analysis.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ theme_set(theme_bw())
See the [microbio_me_qiime tutorial](http://joey711.github.io/phyloseq/download-microbio.me.html) for more details and examples downloading and importing into phyloseq/R directly from this public database.

## Included Data
To facilitate testing and exploration of tools in phyloseq, this package includes example data from published studies. Many of the examples in this vignette use either the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) or `enterotype` datasets as source data. The [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data was described in a [2011 article in PNAS](http://www.pnas.org/content/early/2010/06/02/1000080107)([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), and compares the microbial communities of 25 environmental samples and three known "mock communities" --- a total of 9 sample types --- at a depth averaging 3.1 million reads per sample. The [human enterotype dataset](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) was described in a [2011 article in Nature](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)), which compares the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. The authors further compare these microbial communities with the faecal communities of subjects from other studies, for a total of 280 faecal samples / subjects, and 553 genera. Sourcing data from different studies invariable leads to gaps in the data for certain variables, and this is easily handled by `R's core `NA features.
To facilitate testing and exploration of tools in phyloseq, this package includes example data from published studies. Many of the examples in this vignette use either the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) or `enterotype` datasets as source data. The [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data was described in a [2011 article in PNAS](http://www.pnas.org/content/early/2010/06/02/1000080107)([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), and compares the microbial communities of 25 environmental samples and three known "mock communities" --- a total of 9 sample types --- at a depth averaging 3.1 million reads per sample. The [human enterotype dataset](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) was described in a [2011 article in Nature](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html) ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)), which compares the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. The authors further compare these microbial communities with the faecal communities of subjects from other studies, for a total of 280 faecal samples / subjects, and 553 genera. Sourcing data from different studies invariable leads to gaps in the data for certain variables, and this is easily handled by R's core `NA` features.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub's diff here is difficult to see. The change here is in the last sentence in the last four to five words.


Because this data is included in the package, the examples can easily be run on your own computer using the code shown in this vignette. The data is loaded into memory using the `data` command. Let's start by loading the [Global Patterns](http://www.pnas.org/content/early/2010/06/02/1000080107) data.

Expand Down Expand Up @@ -119,7 +119,7 @@ Alpha diversity estimators for samples in the *Global Patterns* dataset. Each pa

For further details, see the [plot_tree tutorial](http://joey711.github.io/phyloseq/plot_tree-examples.html)

phyloseq also contains a method for easily plotting an annotated phylogenetic tree with information regarding the sample in which a particular taxa was observed, and optionally the number of individuals that were observed.
phyloseq also contains a method for easily plotting an annotated phylogenetic tree with information regarding the sample in which a particular taxa was observed, and optionally the number of individuals that were observed.

For the sake of creating a readable tree, let's subset the data to just the [Chlamydiae](http://en.wikipedia.org/wiki/Chlamydiae) phylum, which consists of obligate intracellular pathogens and is present in only a subset of environments in this dataset.

Expand All @@ -139,8 +139,8 @@ Phylogenetic tree representation of the Chlamydiae species in the microbiome sam

For further details, see the [plot_bar tutorial](http://joey711.github.io/phyloseq/plot_bar-examples.html)

In the following example we use the included "enterotype" dataset ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)).
In the following example we use the included "enterotype" dataset ([Arumugam 2011](http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html)).

```{r}
data(enterotype)
```
Expand All @@ -151,20 +151,20 @@ par(mar = c(10, 4, 4, 2) + 0.1) # make more room on bottom margin
N <- 30
barplot(sort(taxa_sums(enterotype), TRUE)[1:N]/nsamples(enterotype), las=2)
```
An example exploratory barplot using base `R graphics and the `taxa_sums and `nsamples functions.
An example exploratory barplot using base R graphics and the `taxa_sums` and `nsamples` functions.

Note that this first barplot is clipped at the `r N`th OTU. This was chosen because `ntaxa(enterotype) =``r ntaxa(enterotype)` OTUs would not be legible on the plot. As you can see, the relative abundances have decreased dramatically by the 10th-ranked OTU.
Note that this first barplot is clipped at the `r N`th OTU. This was chosen because `ntaxa(enterotype) =``r ntaxa(enterotype)` OTUs would not be legible on the plot. As you can see, the relative abundances have decreased dramatically by the 10th-ranked OTU.

So what are these OTUs? In the `enterotype` dataset, only a single taxonomic rank type is present:
```{r}
rank_names(enterotype)
```
This means the OTUs in this dataset have been grouped at the level of genera, and no other taxonomic grouping/transformation is possible without additional information (like might be present in a phylogenetic tree, or with further taxonomic classification analysis).

We need to know which taxonomic rank classifiers, if any, we have available to specify in the second barplot function in this example, `plot_bar(). We have already observed how quickly the abundance decreases with rank, so wo we will subset the enterotype dataset to the most abundant `N taxa in order to make the barplot legible on this page.
We need to know which taxonomic rank classifiers, if any, we have available to specify in the second barplot function in this example, `plot_bar()`. We have already observed how quickly the abundance decreases with rank, so we will subset the `enterotype` dataset to the ten most abundant taxa in order to make the barplot legible on this page.

```{r}
TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10])
TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10])
ent10 <- prune_taxa(TopNOTUs, enterotype)
print(ent10)
```
Expand Down Expand Up @@ -297,7 +297,7 @@ Scree plot of the PCoA used to create Figure 5 from the "Global Patterns" articl
Next, we will reproduce Figure 5 from the "Global Patterns" article([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)), but separating the three axes into 2 plots using `plot_ordination()`.

```{r GPfig5ax1213}
(p12 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", color="SampleType") +
(p12 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", color="SampleType") +
geom_point(size=5) + geom_path() + scale_colour_hue(guide = FALSE) )
(p13 <- plot_ordination(GlobalPatterns, GloPa.pcoa, "samples", axes=c(1, 3),
color="SampleType") + geom_line() + geom_point(size=5) )
Expand Down Expand Up @@ -354,20 +354,20 @@ The correspondence analysis (CA) scree plot of the "Global Patterns" dataset.
Now let's investigate how the samples behave on the first few CA axes.

```{r GPCA1234}
(p12 <- plot_ordination(GP, gpca, "samples", color="SampleType") +
(p12 <- plot_ordination(GP, gpca, "samples", color="SampleType") +
geom_line() + geom_point(size=5) )
(p34 <- plot_ordination(GP, gpca, "samples", axes=c(3, 4), color="SampleType") +
(p34 <- plot_ordination(GP, gpca, "samples", axes=c(3, 4), color="SampleType") +
geom_line() + geom_point(size=5) )
```
First 4 axes of Correspondence Analysis (CA) of the "Global Patterns" dataset ([Caporaso 2011](http://www.pnas.org/content/early/2010/06/02/1000080107)).

A clear feature of these plots is that the feces and mock communities cluster tightly together, far away from all other samples on the first axis (CA1). The skin and tongue samples separate similarly, but on the second axis. Taken together, it appears that the first two axes are best explained by the separation of human-associated "environments" from the other non-human environments in the dataset, with a secondary separation of tongue and skin samples from feces.
A clear feature of these plots is that the feces and mock communities cluster tightly together, far away from all other samples on the first axis (CA1). The skin and tongue samples separate similarly, but on the second axis. Taken together, it appears that the first two axes are best explained by the separation of human-associated "environments" from the other non-human environments in the dataset, with a secondary separation of tongue and skin samples from feces.

We will now investigate further this top-level structure of the data, using an additional feature of correspondence analysis that allows us to compare the relative contributions of individual taxa on the same graphical space: the "biplot". However, because we just displayed the position of samples in the ordination and there are often many thousands of OTUs, we will focus on creating an interpretable plot of the OTUs. For creating graphics that combine the two plots, try the `"biplot"` or `"split"` option for `type` in `plot_ordination()`.

```{r GPCAspecplot0}
p1 <- plot_ordination(GP, gpca, "species", color="Phylum")
(p1 <- ggplot(p1$data, p1$mapping) + geom_point(size=5, alpha=0.5) +
(p1 <- ggplot(p1$data, p1$mapping) + geom_point(size=5, alpha=0.5) +
facet_wrap(~Phylum) + scale_colour_hue(guide = FALSE) )
```

Expand All @@ -386,20 +386,20 @@ These figures reveal some useful patterns and interesting outliers, but what if
```{r GPCAjitter0}
library("reshape2")
# Melt the species-data.frame, DF, to facet each CA axis separately
mdf <- melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")],
mdf <- melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")],
id=c("Phylum", "Family", "Genus") )
# Select some special outliers for labelling
LF <- subset(mdf, variable=="CA2" & value < -1.0)
# build plot: boxplot summaries of each CA-axis, with labels
p <- ggplot(mdf, aes(Phylum, value, color=Phylum)) +
geom_boxplot() +
facet_wrap(~variable, 2) +
p <- ggplot(mdf, aes(Phylum, value, color=Phylum)) +
geom_boxplot() +
facet_wrap(~variable, 2) +
scale_colour_hue(guide = FALSE) +
theme_bw() +
theme_bw() +
theme( axis.text.x = element_text(angle = -90, vjust = 0.5) )
# Add the text label layer, and render ggplot graphic
(p <- p + geom_text(data=subset(LF, !is.na(Family)),
mapping = aes(Phylum, value+0.1, color=Phylum, label=Family),
mapping = aes(Phylum, value+0.1, color=Phylum, label=Family),
vjust=0,
size=2))
```
Expand All @@ -417,20 +417,20 @@ In this figure we've used the `threshold` parameter to omit all but phyla accoun

### Double Principle Coordinate Analysis (DPCoA)

Here is a quick example illustrating the use of Double Principal Coordinate Analysis (DPCoA~\cite{Pavoine2004523), using the using the `ordinate()` function in phyloseq, as well as the "biplot" option for `plot_ordination(). For a description that includes an applied example using the "enterotype" dataset and comparison with UniFrac/PCoA, see Fukuyama et al~\cite{fukuyama2012com.
Here is a quick example illustrating the use of Double Principal Coordinate Analysis (DPCoA~\cite{Pavoine2004523), using the using the `ordinate()` function in phyloseq, as well as the "biplot" option for `plot_ordination(). For a description that includes an applied example using the "enterotype" dataset and comparison with UniFrac/PCoA, see Fukuyama et al~\cite{fukuyama2012com.

```{r GPdpcoa01}
# Perform ordination
GP.dpcoa <- ordinate(GP, "DPCoA")
GP.dpcoa <- ordinate(GP, "DPCoA")
# Generate default ordination bi-plot
pdpcoa <-
pdpcoa <-
plot_ordination(
physeq = GP,
ordination = GP.dpcoa,
physeq = GP,
ordination = GP.dpcoa,
type="biplot",
color="SampleType",
color="SampleType",
shape="Phylum")
# Adjust the shape scale manually
# Adjust the shape scale manually
# to make taxa hollow and samples filled (advanced)
shape.fac <- pdpcoa$data$Phylum
man.shapes <- c(19, 21:25)
Expand All @@ -456,23 +456,23 @@ plot_ordination(GP, GP.dpcoa, type="split",

### distance(): Central Distance Function

Many comparisons of microbiome samples, including the graphical model and the PCoA analysis, require a calculation for the relative dissimilarity/distance between one microbial community and another. The phyloseq-package provides a general "wrapper" function for calculating ecological distance matrices between the samples in an experiment.
Many comparisons of microbiome samples, including the graphical model and the PCoA analysis, require a calculation for the relative dissimilarity/distance between one microbial community and another. The phyloseq-package provides a general "wrapper" function for calculating ecological distance matrices between the samples in an experiment.

`distance()` currently supports 43 method options, as well as user-provided arbitrary methods via an interface to vegan's `designdist()` function. Currrently only sample-wise distances are supported (the `type` argument), but eventually species-wise (OTU-wise) distances will be supported as well. In addition to supporting any of the method options to the three main distance functions of the vegan-package~\cite{veganpkg --- including the 14 distances of the `vegdist()` function and all 24 ecological distances reviewed in Koleff et al. 2003~\cite{Koleff2003JAE coded by `betadiver` --- `distance()` will eventually support many of the distance methods in the ade4-package~\cite{Dray:2007vs, and others. It also supports the Fast UniFrac distance function (Section~\ref{sec:unifrac) included in phyloseq as native R code, and a wrapper for retreiving the sample-distances from Double Principal Coordinate Analysis (DPCoA).

The function takes a `phyloseq-class` object and an argument indicating the distance type; and it returns a `dist-class distance matrix.

```{r distancefun}
data(esophagus)
distance(esophagus, "bray")
distance(esophagus, "bray")
distance(esophagus, "wunifrac") # weighted UniFrac
distance(esophagus, "jaccard") # vegdist jaccard
distance(esophagus, "g") # betadiver method option "g"
```


### UniFrac and weighted UniFrac
UniFrac is a recently-defined~\cite{Lozupone:2005gn and popular distance metric to summarize the difference between pairs of ecological communities. All UniFrac variants use a phylogenetic tree of the relationship among taxa as central information to calculating the distance between two samples/communities. An unweighted UniFrac distance matrix only considers the presence/absence of taxa, while weighted UniFrac accounts for the relative abundance of taxa as well as their phylogenetic distance. Prior to phyloseq, a non-parallelized, non-Fast implementation of the unweighted UniFrac was available in \R{ packages (`picante::unifrac`~\cite{Kembel:2010ft). In the phyloseq package we provide optionally-parallelized implementations of Fast UniFrac~\cite{Hamady:2009fk (both weighted and unweighted, with plans for additional UniFrac variants), all of which return a sample-wise distance matrix from any `phyloseq-class object that contains a phylogenetic tree component.
UniFrac is a recently-defined~\cite{Lozupone:2005gn and popular distance metric to summarize the difference between pairs of ecological communities. All UniFrac variants use a phylogenetic tree of the relationship among taxa as central information to calculating the distance between two samples/communities. An unweighted UniFrac distance matrix only considers the presence/absence of taxa, while weighted UniFrac accounts for the relative abundance of taxa as well as their phylogenetic distance. Prior to phyloseq, a non-parallelized, non-Fast implementation of the unweighted UniFrac was available in \R{ packages (`picante::unifrac`~\cite{Kembel:2010ft). In the phyloseq package we provide optionally-parallelized implementations of Fast UniFrac~\cite{Hamady:2009fk (both weighted and unweighted, with plans for additional UniFrac variants), all of which return a sample-wise distance matrix from any `phyloseq-class object that contains a phylogenetic tree component.

The following is an example calculating the UniFrac distance (both weighted and unweighted) matrix using the "esophagus" example dataset:

Expand All @@ -494,7 +494,7 @@ data(GlobalPatterns)
load(system.file("doc", "Unweighted_UniFrac.RData", package="phyloseq"))
# Manually define color-shading vector based on sample type.
colorScale <- rainbow(length(levels(get_variable(GlobalPatterns, "SampleType"))))
cols <- colorScale[get_variable(GlobalPatterns, "SampleType")]
cols <- colorScale[get_variable(GlobalPatterns, "SampleType")]
GP.tip.labels <- as(get_variable(GlobalPatterns, "SampleType"), "character")
# This is the actual hierarchical clustering call, specifying average-link clustering
GP.hclust <- hclust(GPUF, method="average")
Expand All @@ -510,7 +510,7 @@ One of our recommended approaches to this problem was described in
McMurdie and Holmes (2014) [Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible](http://dx.plos.org/10.1371/journal.pcbi.1003531).
PLoS Computational Biology. 10(4):e1003531

Some reproducible demonstrations of this approach are included in
Some reproducible demonstrations of this approach are included in
[the phyloseq extensions repository](http://joey711.github.io/phyloseq-extensions/extensions-index.html), the `phyloseq_to_deseq2` function,
as well as a separate vignetted dedicated to this topic
(phyloseq and DESeq2 on Colorectal Cancer Data).
Expand Down