-
Notifications
You must be signed in to change notification settings - Fork 33
/
PartIIphyloseq.tex
555 lines (472 loc) · 24.5 KB
/
PartIIphyloseq.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
\subsection*{phyloseq}
\BioCpkg{phyloseq}\cite{phyloseqplosone}
is an R package to import, store, analyze, and graphically display complex
phylogenetic sequencing data that has already been
clustered into Operational Taxonomic Units (OTUs)
or more appropriately denoised,
and it is most useful when there is also associated
sample data,
phylogeny,
and/or taxonomic assignment of each taxa.
\BioCpkg{phyloseq} leverages and builds upon
many of the tools available in R
for ecology and phylogenetic analysis (\citeCRANpkg{vegan}, \citeCRANpkg{ade4}, \citeCRANpkg{ape}),
while also using advanced/flexible graphic systems (\citeCRANpkg{ggplot2})
to easily produce publication-quality graphics of complex phylogenetic data.
The {\tt phyloseq} package uses a specialized system of S4 data classes
to store all related phylogenetic sequencing data
as a single, self-consistent, self-describing experiment-level object,
making it easier to share data and reproduce analyses.
In general, phyloseq seeks to facilitate the use of R
for efficient interactive and reproducible analysis
of amplicon count data jointly with important sample covariates.
\subsection*{Further documentation}
This tutorial shows a useful example workflow,
but many more analyses are available to you in phyloseq, and R in general,
than can fit in a single workflow.
The \href{http://joey711.github.io/phyloseq/}{phyloseq home page}
is a good place to begin browsing additional phyloseq documentation,
as are the three vignettes included within the package,
and linked directly at
\href{http://bioconductor.org/packages/release/bioc/html/phyloseq.html}{the phyloseq release page on Bioconductor}.
%\subsection*{Load Packages}
\subsection*{Loading Data}
Many use cases result in the need to import and combine different
data into a phyloseq class object,
this can be done using th \Rfunction{import\_biom}
function to read recent QIIME format files, older files
can still be imported with \Rfunction{import\_qiime}.
More complete details can be found on the
\href{https://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-FAQ.html}{phyloseq FAQ page}.
In the previous section the results of \BioCpkg{dada2} sequence processing
were organized into a phyloseq object.
This object was also saved in R-native serialized RDS format.
We will re-load this here for completeness as the initial object \Robject{p0}.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(}\hlstr{"phyloseq"}\hlstd{)}
\hlkwd{library}\hlstd{(}\hlstr{"gridExtra"}\hlstd{)}
\hlstd{ps} \hlkwb{=} \hlkwd{readRDS}\hlstd{(}\hlstr{"data/ps.rds"}\hlstd{)}
\hlstd{ps}
\end{alltt}
\begin{verbatim}
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 389 taxa and 360 samples ]
## sample_data() Sample Data: [ 360 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 389 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 389 tips and 387 internal nodes ]
\end{verbatim}
\end{kframe}
\end{knitrout}
\subsubsection*{Shiny-phyloseq}
It can be beneficial to start the data exploration
process interactively, this often saves time
in detecting outliers and specific features of the data.
\href{http://joey711.github.io/shiny-phyloseq/}{Shiny-phyloseq}
\cite{mcmurdie2015}
is an interactive web application that provides a graphical user interface to
the phyloseq package.
The object just loaded into the R session in this workflow
is suitable for this graphical interaction with Shiny-phyloseq.
\subsection*{Filtering}
\BioCpkg{phyloseq}
provides useful tools for filtering, subsetting, and agglomerating taxa --
a task that is often appropriate or even necessary
for effective analysis of microbiome count data.
In this subsection, we graphically explore the prevalence of taxa
in the example dataset,
and demonstrate how this can be used as a filtering criteria.
One of the reasons to filter in this way
is to avoid spending much time analyzing taxa
that were seen only rarely among samples.
This also turns out to be a useful filter of noise
(taxa that are actually just artifacts of the data collection process),
a step that should probably be considered essential for datasets constructed
via heuristic OTU-clustering methods,
which are notoriously prone to generating spurious taxa.
\subsubsection*{Taxonomic Filtering}
In many biological settings, the set of all organisms from all samples are well-represented in the available taxonomic reference database.
When (and only when) this is the case, it is reasonable or even advisable to filter taxonomic features
for which a high-rank taxonomy could not be assigned.
Such ambiguous features in this setting are almost always sequence artifacts that don't exist in nature.
It should be obvious that such a filter is not appropriate for samples from poorly characterized or novel specimens, at least until the possibility of taxonomic novelty can be satisfactorily rejected.
Phylum is a useful taxonomic rank to consider using for this purpose, but others may work effectively for your data.
To begin, create a table of read counts for each Phylum present in the dataset.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Show available ranks in the dataset}
\hlkwd{rank_names}\hlstd{(ps)}
\end{alltt}
\begin{verbatim}
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus"
\end{verbatim}
\begin{alltt}
\hlcom{# Create table, number of features for each phyla}
\hlkwd{table}\hlstd{(}\hlkwd{tax_table}\hlstd{(ps)[,} \hlstr{"Phylum"}\hlstd{],} \hlkwc{exclude} \hlstd{=} \hlkwa{NULL}\hlstd{)}
\end{alltt}
\begin{verbatim}
##
## Actinobacteria Bacteroidetes
## 13 23
## Candidatus_Saccharibacteria Cyanobacteria/Chloroplast
## 1 4
## Deinococcus-Thermus Firmicutes
## 1 327
## Fusobacteria Proteobacteria
## 1 11
## Tenericutes Verrucomicrobia
## 1 1
## <NA>
## 6
\end{verbatim}
\end{kframe}
\end{knitrout}
This shows a few phyla for which only one feature was observed.
Those may be worth filtering, and we'll check that next.
First, notice that in this case, six features were annotated with a Phylum of NA.
These features are probably artifacts in a dataset like this,
and should be removed.
The following ensures that features with ambiguous phylum annotation are also removed.
Note the flexibility in defining strings that should be considered ambiguous annotation.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{ps0} \hlkwb{<-} \hlkwd{subset_taxa}\hlstd{(ps,} \hlopt{!}\hlkwd{is.na}\hlstd{(Phylum)} \hlopt{& !}\hlstd{Phylum} \hlopt{%in%} \hlkwd{c}\hlstd{(}\hlstr{""}\hlstd{,} \hlstr{"uncharacterized"}\hlstd{))}
\end{alltt}
\end{kframe}
\end{knitrout}
A useful next step is to explore feature \emph{prevalence} in the dataset,
which we will define here as
the number of samples in which a taxon appears at least once.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Compute prevalence of each feature, store as data.frame}
\hlstd{prevdf} \hlkwb{=} \hlkwd{apply}\hlstd{(}\hlkwc{X} \hlstd{=} \hlkwd{otu_table}\hlstd{(ps0),}
\hlkwc{MARGIN} \hlstd{=} \hlkwd{ifelse}\hlstd{(}\hlkwd{taxa_are_rows}\hlstd{(ps0),} \hlkwc{yes} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{no} \hlstd{=} \hlnum{2}\hlstd{),}
\hlkwc{FUN} \hlstd{=} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{}\hlkwd{sum}\hlstd{(x} \hlopt{>} \hlnum{0}\hlstd{)\})}
\hlcom{# Add taxonomy and total read counts to this data.frame}
\hlstd{prevdf} \hlkwb{=} \hlkwd{data.frame}\hlstd{(}\hlkwc{Prevalence} \hlstd{= prevdf,}
\hlkwc{TotalAbundance} \hlstd{=} \hlkwd{taxa_sums}\hlstd{(ps0),}
\hlkwd{tax_table}\hlstd{(ps0))}
\end{alltt}
\end{kframe}
\end{knitrout}
Are there phyla that are comprised of mostly low-prevalence features?
Compute the total and average prevalences of the features in each phylum.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{plyr}\hlopt{::}\hlkwd{ddply}\hlstd{(prevdf,} \hlstr{"Phylum"}\hlstd{,} \hlkwa{function}\hlstd{(}\hlkwc{df1}\hlstd{)\{}\hlkwd{cbind}\hlstd{(}\hlkwd{mean}\hlstd{(df1}\hlopt{$}\hlstd{Prevalence),}\hlkwd{sum}\hlstd{(df1}\hlopt{$}\hlstd{Prevalence))\})}
\end{alltt}
\begin{verbatim}
## Phylum 1 2
## 1 Actinobacteria 120.2 1562
## 2 Bacteroidetes 265.5 6107
## 3 Candidatus_Saccharibacteria 280.0 280
## 4 Cyanobacteria/Chloroplast 64.2 257
## 5 Deinococcus-Thermus 52.0 52
## 6 Firmicutes 179.2 58614
## 7 Fusobacteria 2.0 2
## 8 Proteobacteria 59.1 650
## 9 Tenericutes 234.0 234
## 10 Verrucomicrobia 104.0 104
\end{verbatim}
\end{kframe}
\end{knitrout}
Deinococcus-Thermus appeared in just over one percent of samples,
and Fusobacteria appeared in just 2 samples total.
In some cases it might be worthwhile to explore
these two phyla in more detail despite this
(though probably not Fusobacteria's two samples).
For the purposes of this example, though,
they will be filtered from the dataset.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Define phyla to filter}
\hlstd{filterPhyla} \hlkwb{=} \hlkwd{c}\hlstd{(}\hlstr{"Fusobacteria"}\hlstd{,} \hlstr{"Deinococcus-Thermus"}\hlstd{)}
\hlcom{# Filter entries with unidentified Phylum.}
\hlstd{ps1} \hlkwb{=} \hlkwd{subset_taxa}\hlstd{(ps0,} \hlopt{!}\hlstd{Phylum} \hlopt{%in%} \hlstd{filterPhyla)}
\hlstd{ps1}
\end{alltt}
\begin{verbatim}
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 381 taxa and 360 samples ]
## sample_data() Sample Data: [ 360 samples by 14 sample variables ]
## tax_table() Taxonomy Table: [ 381 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 381 tips and 379 internal nodes ]
\end{verbatim}
\end{kframe}
\end{knitrout}
\subsubsection*{Prevalence Filtering}
The previous filtering steps are considered \emph{supervised},
because they relied on prior information
that is external to this experiment
(a taxonomic reference database).
This next filtering step is completely \emph{unsupervised},
relying only on the data in this experiment,
and a parameter that we will choose after exploring the data.
Thus, this filtering step can be applied even in settings
where taxonomic annotation is unavailable or unreliable.
First, explore the relationship of
prevalence and total read count
for each feature.
Sometimes this reveals outliers that should probably be removed,
and also provides insight into the ranges of either feature
that might be useful.
This aspect depends quite a lot on the experimental design
and goals of the downstream inference,
so keep these in mind.
It may even be the case that different types of downstream inference
require different choices here.
There is no reason to expect ahead of time that
a single filtering workflow is appropriate for all analysis.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Subset to the remaining phyla}
\hlstd{prevdf1} \hlkwb{=} \hlkwd{subset}\hlstd{(prevdf, Phylum} \hlopt{%in%} \hlkwd{get_taxa_unique}\hlstd{(ps1,} \hlstr{"Phylum"}\hlstd{))}
\hlkwd{ggplot}\hlstd{(prevdf1,} \hlkwd{aes}\hlstd{(TotalAbundance, Prevalence} \hlopt{/} \hlkwd{nsamples}\hlstd{(ps0),}\hlkwc{color}\hlstd{=Phylum))} \hlopt{+}
\hlcom{# Include a guess for parameter}
\hlkwd{geom_hline}\hlstd{(}\hlkwc{yintercept} \hlstd{=} \hlnum{0.05}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.5}\hlstd{,} \hlkwc{linetype} \hlstd{=} \hlnum{2}\hlstd{)} \hlopt{+} \hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{2}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.7}\hlstd{)} \hlopt{+}
\hlkwd{scale_x_log10}\hlstd{()} \hlopt{+} \hlkwd{xlab}\hlstd{(}\hlstr{"Total Abundance"}\hlstd{)} \hlopt{+} \hlkwd{ylab}\hlstd{(}\hlstr{"Prevalence [Frac. Samples]"}\hlstd{)} \hlopt{+}
\hlkwd{facet_wrap}\hlstd{(}\hlopt{~}\hlstd{Phylum)} \hlopt{+} \hlkwd{theme}\hlstd{(}\hlkwc{legend.position}\hlstd{=}\hlstr{"none"}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
\begin{figure}
\includegraphics[width=\linewidth]{phyloseqfigure/plotprevalence-1}
\caption{Taxa prevalence versus total counts. Each point is a different taxa. Exploration of the data in this way is often useful for selecting filtering parameters, like the minimum prevalence criteria we will used to filter the data above.}
\end{figure}
Sometimes a natural separation in the dataset reveals itself,
or at least, a conservative choice that is in a stable region
for which small changes to the choice would have
minor or no effect on the biological interpreation (stability).
Here no natural separation is immediately evident,
but it looks like we might reasonably define a prevalence threshold
in a range of zero to ten percent or so.
Take care that this choice does not introduce bias
into a downstream analysis of association of differential abundance.
The following uses five percent of all samples
as the prevalence threshold.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Define prevalence threshold as 5% of total samples}
\hlstd{prevalenceThreshold} \hlkwb{=} \hlnum{0.05} \hlopt{*} \hlkwd{nsamples}\hlstd{(ps0)}
\hlstd{prevalenceThreshold}
\end{alltt}
\begin{verbatim}
## [1] 18
\end{verbatim}
\begin{alltt}
\hlcom{# Execute prevalence filter, using `prune_taxa()` function}
\hlstd{keepTaxa} \hlkwb{=} \hlkwd{rownames}\hlstd{(prevdf1)[(prevdf1}\hlopt{$}\hlstd{Prevalence} \hlopt{>=} \hlstd{prevalenceThreshold)]}
\hlstd{ps2} \hlkwb{=} \hlkwd{prune_taxa}\hlstd{(keepTaxa, ps0)}
\end{alltt}
\end{kframe}
\end{knitrout}
\subsection*{Agglomerate taxa}
When there is known to be a lot of
species or sub-species functional redundancy
in a microbial community,
it might be useful to agglomerate the data features
corresponding to closely related taxa.
Ideally we would know the functional redundancies perfectly ahead of time,
in which case we would agglomerate taxa using those defined relationships
and the \Rfunction{merge\_taxa()} function in phyloseq.
That kind of exquisite functional data is usually not available,
and different pairs of microbes will have different sets of overlapping functions,
complicating the matter of defining appropriate grouping criteria.
While not necessarily the most useful or functionally-accurate
criteria for grouping microbial features
(sometimes far from accurate),
taxonomic agglomeration has the advantage of being much easier to define ahead of time.
This is because taxonomies are usually defined
with a comparatively simple tree-like graph structure
that has a fixed number of internal nodes, called ``ranks''.
This structure is simple enough for the phyloseq package
to represent taxonomies as table of taxonomy labels.
Taxonomic agglomeration groups all the ``leaves'' in the hierarchy
that descend from the user-prescribed agglomerating rank, this is sometimes called
`glomming'.
The following example code shows how one
would combine all features that descend from the same genus.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# How many genera would be present after filtering?}
\hlkwd{length}\hlstd{(}\hlkwd{get_taxa_unique}\hlstd{(ps2,} \hlkwc{taxonomic.rank} \hlstd{=} \hlstr{"Genus"}\hlstd{))}
\end{alltt}
\begin{verbatim}
## [1] 49
\end{verbatim}
\begin{alltt}
\hlstd{ps3} \hlkwb{=} \hlkwd{tax_glom}\hlstd{(ps2,} \hlstr{"Genus"}\hlstd{,} \hlkwc{NArm} \hlstd{=} \hlnum{TRUE}\hlstd{)}
\end{alltt}
\end{kframe}
\end{knitrout}
If taxonomy is not available or not reliable,
tree-based agglomeration is a "taxonomy-free"
alternative to combine data features corresponding to closely-related taxa.
In this case, rather than taxonomic rank, the user specifies
a tree height corresponding to the phylogenetic distance between features
that should define their grouping.
This is very similar to ``OTU Clustering'',
except that in many OTU Clustering algorithms the sequence distance being used
does not have the same (or any) evolutionary definition.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{h1} \hlkwb{=} \hlnum{0.4}
\hlstd{ps4} \hlkwb{=} \hlkwd{tip_glom}\hlstd{(ps2,} \hlkwc{h} \hlstd{= h1)}
\end{alltt}
\end{kframe}
\end{knitrout}
Here phyloseq's \Robject{plot\_tree()} function
compare the original unfiltered data,
the tree after taxonoic agglomeration,
and the tree after phylogenetic agglomeration.
These are stored as separate plot objects,
then rendered together in one combined graphic
using \Robject{gridExtra::grid.arrange}.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{multiPlotTitleTextSize} \hlkwb{=} \hlnum{8}
\hlstd{p2tree} \hlkwb{=} \hlkwd{plot_tree}\hlstd{(ps2,} \hlkwc{method} \hlstd{=} \hlstr{"treeonly"}\hlstd{,}
\hlkwc{ladderize} \hlstd{=} \hlstr{"left"}\hlstd{,}
\hlkwc{title} \hlstd{=} \hlstr{"Before Agglomeration"}\hlstd{)} \hlopt{+}
\hlkwd{theme}\hlstd{(}\hlkwc{plot.title} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{size} \hlstd{= multiPlotTitleTextSize))}
\hlstd{p3tree} \hlkwb{=} \hlkwd{plot_tree}\hlstd{(ps3,} \hlkwc{method} \hlstd{=} \hlstr{"treeonly"}\hlstd{,}
\hlkwc{ladderize} \hlstd{=} \hlstr{"left"}\hlstd{,} \hlkwc{title} \hlstd{=} \hlstr{"By Genus"}\hlstd{)} \hlopt{+}
\hlkwd{theme}\hlstd{(}\hlkwc{plot.title} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{size} \hlstd{= multiPlotTitleTextSize))}
\hlstd{p4tree} \hlkwb{=} \hlkwd{plot_tree}\hlstd{(ps4,} \hlkwc{method} \hlstd{=} \hlstr{"treeonly"}\hlstd{,}
\hlkwc{ladderize} \hlstd{=} \hlstr{"left"}\hlstd{,} \hlkwc{title} \hlstd{=} \hlstr{"By Height"}\hlstd{)} \hlopt{+}
\hlkwd{theme}\hlstd{(}\hlkwc{plot.title} \hlstd{=} \hlkwd{element_text}\hlstd{(}\hlkwc{size} \hlstd{= multiPlotTitleTextSize))}
\end{alltt}
\end{kframe}
\end{knitrout}
\begin{figure}[H]
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# group plots together}
\hlkwd{grid.arrange}\hlstd{(}\hlkwc{nrow} \hlstd{=} \hlnum{1}\hlstd{, p2tree, p3tree, p4tree)}
\end{alltt}
\end{kframe}
{\centering \includegraphics[width=\maxwidth]{phyloseqfigure/plotglomtree-1}
}
\end{knitrout}
\caption{The original tree (left), taxonomic agglomeration at Genus rank (middle), phylogeentic agglomeration at a fixed distance of 0.4 (right).}
\label{fig:glomsummary}
\end{figure}
\subsection*{Abundance value transformation}
It is usually necessary to transform microbiome count data
to account for differences in library size, variance, scale, etc.
The phyloseq package provides a flexible interface for defining
new functions to accomplish these transformations of the abundance values
via the \Robject{transform\_sample\_counts()} function.
The first argument to this function is the phyloseq object you want to transform,
and the second argument is an R function that defines the transformation.
The R function is applied sample-wise, expecting that the first unnamed argument
is a vector of taxa counts in the same order as the phyloseq object.
Additional arguments are passed on to the function
specified in the second argument,
providing an explicit means to include
pre-computed values, previously defined parameters/thresholds,
or any other object that might be appropriate
for computing the transformed values of interest.
This example begins by defining a custom plot function,
\Robject{plot\_abundance()},
that uses phyloseq's \Robject{psmelt()} function
to define a relative abundance graphic.
We will use this to compare more easily differences
in scale and distribution of the abundance values
in our phyloseq object before and after transformation.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{plot_abundance} \hlkwb{=} \hlkwa{function}\hlstd{(}\hlkwc{physeq}\hlstd{,}\hlkwc{title} \hlstd{=} \hlstr{""}\hlstd{,}
\hlkwc{Facet} \hlstd{=} \hlstr{"Order"}\hlstd{,} \hlkwc{Color} \hlstd{=} \hlstr{"Phylum"}\hlstd{)\{}
\hlcom{# Arbitrary subset, based on Phylum, for plotting}
\hlstd{p1f} \hlkwb{=} \hlkwd{subset_taxa}\hlstd{(physeq, Phylum} \hlopt{%in%} \hlkwd{c}\hlstd{(}\hlstr{"Firmicutes"}\hlstd{))}
\hlstd{mphyseq} \hlkwb{=} \hlkwd{psmelt}\hlstd{(p1f)}
\hlstd{mphyseq} \hlkwb{<-} \hlkwd{subset}\hlstd{(mphyseq, Abundance} \hlopt{>} \hlnum{0}\hlstd{)}
\hlkwd{ggplot}\hlstd{(}\hlkwc{data} \hlstd{= mphyseq,} \hlkwc{mapping} \hlstd{=} \hlkwd{aes_string}\hlstd{(}\hlkwc{x} \hlstd{=} \hlstr{"sex"}\hlstd{,}\hlkwc{y} \hlstd{=} \hlstr{"Abundance"}\hlstd{,}
\hlkwc{color} \hlstd{= Color,} \hlkwc{fill} \hlstd{= Color))} \hlopt{+}
\hlkwd{geom_violin}\hlstd{(}\hlkwc{fill} \hlstd{=} \hlnum{NA}\hlstd{)} \hlopt{+}
\hlkwd{geom_point}\hlstd{(}\hlkwc{size} \hlstd{=} \hlnum{1}\hlstd{,} \hlkwc{alpha} \hlstd{=} \hlnum{0.3}\hlstd{,}
\hlkwc{position} \hlstd{=} \hlkwd{position_jitter}\hlstd{(}\hlkwc{width} \hlstd{=} \hlnum{0.3}\hlstd{))} \hlopt{+}
\hlkwd{facet_wrap}\hlstd{(}\hlkwc{facets} \hlstd{= Facet)} \hlopt{+} \hlkwd{scale_y_log10}\hlstd{()}\hlopt{+}
\hlkwd{theme}\hlstd{(}\hlkwc{legend.position}\hlstd{=}\hlstr{"none"}\hlstd{)}
\hlstd{\}}
\end{alltt}
\end{kframe}
\end{knitrout}
The transformation in this case converts the counts
from each sample into their frequencies,
often referred to as \emph{proportions}
or \emph{relative abundances}.
This function is so simple that it is easiest
to define it within the function call to
\Robject{transform\_sample\_counts()}.
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlcom{# Transform to relative abundance. Save as new object.}
\hlstd{ps3ra} \hlkwb{=} \hlkwd{transform_sample_counts}\hlstd{(ps3,} \hlkwa{function}\hlstd{(}\hlkwc{x}\hlstd{)\{x} \hlopt{/} \hlkwd{sum}\hlstd{(x)\})}
\end{alltt}
\end{kframe}
\end{knitrout}
Now plot the abundance values before and after transformation.
\begin{figure}[H]
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{plotBefore} \hlkwb{=} \hlkwd{plot_abundance}\hlstd{(ps3,}\hlstr{""}\hlstd{)}
\hlstd{plotAfter} \hlkwb{=} \hlkwd{plot_abundance}\hlstd{(ps3ra,}\hlstr{""}\hlstd{)}
\hlcom{# Combine each plot into one graphic.}
\hlkwd{grid.arrange}\hlstd{(}\hlkwc{nrow} \hlstd{=} \hlnum{2}\hlstd{, plotBefore, plotAfter)}
\end{alltt}
\end{kframe}
{\centering \includegraphics[width=\maxwidth]{phyloseqfigure/abundancetransformation3-1}
}
\end{knitrout}
\caption{
Comparison of original abundances (top panel)
and relative abundances (lower).
}
\end{figure}
\subsection*{Subset by taxonomy}
Notice on the previous plot that
\emph{Lactobacillales} appears to be a taxonomic Order
with bimodal abundance profile in the data.
We can check for a taxonomic explanation of this pattern
by plotting just that taxonomic subset of the data.
For this, we subset with the \Rfunction{subset\_taxa()} function,
and then specify a more precise taxonomic rank
to the \Robject{Facet} argument of the \Robject{plot\_abundance} function
that we defined above.
\begin{figure}[H]
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{psOrd} \hlkwb{=} \hlkwd{subset_taxa}\hlstd{(ps3ra, Order} \hlopt{==} \hlstr{"Lactobacillales"}\hlstd{)}
\hlkwd{plot_abundance}\hlstd{(psOrd,} \hlkwc{Facet} \hlstd{=} \hlstr{"Genus"}\hlstd{,} \hlkwc{Color} \hlstd{=} \hlkwa{NULL}\hlstd{)}
\end{alltt}
\end{kframe}
{\centering \includegraphics[width=\maxwidth]{phyloseqfigure/subsettaxa-1}
}
\end{knitrout}
\caption{Violin plot of the relative abundances of Lactobacillales taxonomic Order,
grouped by host sex and genera.
Here it is clear that the apparent biomodal distribution
of Lactobacillales on the previous plot
was the result of a mixture of two different genera,
with the typical \emph{Lactobacillus} relative abundance
much larger than \emph{Streptococcus}.}
\end{figure}