---
title: "Data importation in PLNmodels"
author: "PLN team"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 4
df_print: paged
bibliography: article/PLNreferences.bib
link-citations: yes
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{import}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
screenshot.force = FALSE,
echo = TRUE,
rows.print = 5,
message = FALSE,
warning = FALSE)
```
## Preliminaries
This vignette documents the data format used in **PLNmodels** by `PLN` and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects.
## Format description
We illustrate the format using the trichoptera data set, a full description of which can be found in [the corresponding vignette](Trichoptera.html).
```{r data_load}
library(PLNmodels)
data(trichoptera)
```
The trichoptera data set is a list made of two data frames: `Abundance` (hereafter referred to as the _counts_) and `Covariate` (hereafter the _covariates_).
```{r trichoptera_structure}
str(trichoptera, max.level = 1)
```
The covariates include, among others, the wind, pressure and humidity.
```{r covariates_overview}
names(trichoptera$Covariate)
```
In the PLN framework, we model the counts from the covariates, let's say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called _formula interface_ and it would thus be natural to write something like
```{r first_try, eval = FALSE}
PLN(Abundance ~ Wind + Pressure, data = trichoptera)
```
Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).
We must therefore prepare a data structure where `Abundance` refers to a count *matrix* whereas `Wind` and `Pressure` refer to *vectors* before feeding it to `PLN`. That's the purpose of `prepare_data`.
```{r prepare_data_first_look}
trichoptera2 <- prepare_data(counts = trichoptera$Abundance,
covariates = trichoptera$Covariate)
str(trichoptera2)
```
If you look carefully, you can notice a few differences between `trichoptera` and `trichoptera2`:
- the first is a `list` whereas the second is a `data.frame`[^1];
- `Abundance` is a matrix stored as a single column of `trichoptera2`, which you can extract using the usual functions `[` and `[[` to retrieve the count matrix (see the short sketch after this list);
- `trichoptera2` has an additional `Offset` column (more on that later).
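For instance, the count matrix can be pulled back out of the prepared data frame with standard indexing (a quick check, nothing specific to PLNmodels):
```{r extract_abundance}
## Abundance is a matrix stored as a single column of trichoptera2
counts <- trichoptera2$Abundance
dim(counts)   ## 49 samples x 17 species
```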
## Computing offsets
It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets is still debated and may vary depending on the field. There are nevertheless a few popular methods:
- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample
- Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample is the cumulative sum of counts in that sample, up to a quantile determined in a data-driven way.
- Relative Log-Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using ratios, and the offset is the median of those ratios.
- Geometric Mean of Pairwise Ratios (GMPR), introduced in [@GMPR], where each sample is compared to every other sample to compute a median ratio and the offset of a sample is the geometric mean of those pairwise median ratios.
- Wrench, introduced in [@Kumar2018] and fully implemented in the [Wrench package](https://bioconductor.org/packages/release/bioc/html/Wrench.html), where all samples are used to compute reference proportions and each sample is compared to the reference using ratios (and **not log-ratios**) of proportions to compute compositional correction factors. In that case, the offset is the product of (geometrically centered) compositional factors and (geometrically centered) depths.
Each of these offsets can be computed from a count matrix using the `compute_offset` function by changing its `offset` argument:
```{r compute_offset}
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
```
In this particular example, the counts are too sparse and the more sophisticated offset methods all fail (numeric output hidden):
```{r other_offsets, warning=TRUE, error = TRUE, results='hide'}
compute_offset(trichoptera$Abundance, "CSS")
compute_offset(trichoptera$Abundance, "RLE")
compute_offset(trichoptera$Abundance, "GMPR")
```
We can mitigate this problem for the RLE offset by adding pseudocounts to the counts, although doing so has its own drawbacks.
```{r pseudocounts}
compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)
```
A better solution consists in using only positive counts to compute the offsets:
```{r poscounts}
compute_offset(trichoptera$Abundance, "RLE", type = "poscounts")
```
Finally, we can use Wrench to compute the offsets:
```{r wrench}
compute_offset(trichoptera$Abundance, "Wrench")
```
**Note:** TSS is the only method that produces offsets on the same scale as the counts; all others produce offsets that are (hopefully) *proportional* to the library sizes but on a different scale. To force the offsets to be on the same scale as the counts for all methods, you can use the option `scale = "count"`.
```{r wrench_counts}
compute_offset(trichoptera$Abundance, "Wrench", scale = "count")
```
## Building data frame using `prepare_data`
We have already seen that `prepare_data` joins counts and covariates into a single data frame. It also computes offsets through `compute_offset`, and does so by default with `offset = "TSS"`, hence the `Offset` column in `trichoptera2`. You can change the offset method and provide additional arguments that will be passed on to `compute_offset`.
```{r prepare_data_other_offset}
str(prepare_data(trichoptera$Abundance,
trichoptera$Covariate,
offset = "RLE", pseudocounts = 1))
```
Different communities use different standards for count data, where samples are stored as either rows or columns of the count matrix. `prepare_data` uses heuristics to guess the orientation of the count matrix (or fails with an informative message when it cannot) and automatically transposes it if needed.
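As a quick sanity check (assuming the heuristic correctly detects the orientation on this data set), feeding the transposed count table should yield the same prepared data frame:
```{r transposed_counts, eval = FALSE}
## counts with species in rows and samples in columns
str(prepare_data(t(trichoptera$Abundance), trichoptera$Covariate))
```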
Finally, `prepare_data` enforces sample-consistency between the counts and the covariates and automatically trims away:
- samples for which only covariates or only counts are available;
- samples with no positive counts.
For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.
```{r trim_down_samples}
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
trichoptera$Covariate[-49,] ## remove last sample
))
```
## Importing data from biom and phyloseq objects using `prepare_data_from_[phyloseq|biom]`
Community composition data are quite popular in microbial ecology and usually stored in flat files using the [biom format](http://biom-format.org/) and/or imported in R as phyloseq-class objects [@phyloseq] using the Bioconductor [phyloseq](https://joey711.github.io/phyloseq/) package.
<!-- We provide helper functions to directly import data from a biom file (or biom-class object) and a phyloseq-class object. -->
We show here how to import data from a biom file (or biom-class object) and from a phyloseq-class object.
### Reading from a biom file
Reading from a biom file requires the Bioconductor package [biomformat](https://www.bioconductor.org/packages/release/bioc/html/biomformat.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a biom file using the following steps:
- read your biom file with `biomformat::read_biom()`
- extract the count table with `biomformat::biom_data()`
- extract the covariates with `biomformat::sample_metadata()` (or build your own)
- feed them to `prepare_data`
as illustrated below:
<!-- Note that the covariates **must** be stored in the biom object as they are automatically extracted, reading a biom without covariates results in an error. -->
```{r import_biom, eval = FALSE}
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)
biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biom <- biomformat::read_biom(biomfile)
## extract counts
counts <- as(biomformat::biom_data(biom), "matrix")
## extract covariates (or prepare your own)
covariates <- biomformat::sample_metadata(biom)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```
### Reading from a phyloseq-class object
Likewise, preparing data from a phyloseq-class object requires the Bioconductor package [phyloseq](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a phyloseq object using the following steps:
- extract the count table with `phyloseq::otu_table()`
- extract the covariates with `phyloseq::sample_data()` (or build your own)
- feed them to `prepare_data`
as illustrated below:
<!-- Note that the covariates **must** be stored in the phyloseq-class object as they are automatically extracted, importing a phyloseq object with no `sample_data` component results in an error. -->
```{r import_phyloseq, eval = FALSE}
## If phyloseq is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)
data("enterotype")
## extract counts
counts <- as(phyloseq::otu_table(enterotype), "matrix")
## extract covariates (or prepare your own)
covariates <- phyloseq::sample_data(enterotype)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```
## Mathematical details about the offsets
We detail here the mathematical background behind the various offsets and the way they are computed. Denote by $\mathbf{Y} = (Y_{ij})$ the count matrix, where $Y_{ij}$ is the count of species $j$ in sample $i$. Assume that there are $p$ species and $n$ samples in total. The offset of sample $i$ is denoted $O_i$ and computed in the following way.
### Total Sum Scaling
Offsets are simply the total counts of a sample (frequently called depths in the metabarcoding literature):
$$
O_i = \sum_{j=1}^p Y_{ij}
$$
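On the trichoptera data, this is simply the row sums of the count table and should match the default output of `compute_offset()` shown earlier (a manual check, not the package code):
```{r tss_by_hand, eval = FALSE}
## TSS offsets by hand: total count of each sample
rowSums(trichoptera$Abundance)
```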
### Cumulative Sum Scaling
Positive counts are used to compute sample-specific quantiles $q_i^l$ and cumulative sums $s_i^l$ defined as
$$
q_i^l = \min \{q \text{ such that } \sum_j 1_{Y_{ij} \leq q} \geq l \sum_j 1_{Y_{ij} > 0} \} \qquad s_i^l = \sum_{j: Y_{ij} \leq q_i^l} Y_{ij}
$$
The sample-specific quantiles are then used to compute reference quantiles, defined as $q^l = \text{median}_i \{q_i^l\}$, and the median absolute deviation around the quantile $q^l$, defined as $d^l = \text{median}_i |q_i^l - q^l|$. The method then searches for the smallest quantile $l$ at which it detects instability, defined as a large relative increase in $d^l$. Formally, $\hat{l}$ is the smallest $l$ satisfying $\frac{d^{l+1} - d^l}{d^l} \geq 0.1$. The sample-specific scaling offsets are then defined as:
$$
O_i = s_i^{\hat{l}} / \text{median}_i \{ s_i^{\hat{l}} \}
$$
Dividing by the median of the $s_i^{\hat{l}}$ ensures that the offsets are centered around $1$ and capture size differences relative to the reference sample. Note also that the reference quantiles $q^l$ can be computed using either the median (the default, as in the original @CSS paper) or the mean, by specifying `reference = mean`, as implemented in `metagenomeSeq`.
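For completeness, the alternative reference can be requested directly from `compute_offset` (shown with `eval = FALSE` since, as seen above, CSS fails on counts as sparse as the trichoptera data):
```{r css_mean_reference, eval = FALSE}
## use the mean instead of the median to build the reference quantiles
compute_offset(trichoptera$Abundance, "CSS", reference = mean)
```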
### Relative Log Expression
A reference sample $(q_j)_j$ is first built by computing the geometric mean of each species' counts:
$$
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(Y_{ij})\right)
$$
Each sample is then compared to the reference sample to compute one ratio per species and the final offset $O_i$ is the median of those ratios:
$$
O_i = \text{median}_j \frac{Y_{ij}}{q_j}
$$
The method fails when no species is shared across all samples (as all the $q_j$ are then $0$) or when a sample shares fewer than 50% of its species with the reference (in which case the median of the ratios may be zero or infinite). The problem can be alleviated by adding pseudocounts to the $Y_{ij}$ with `pseudocounts = 1` or by using only positive counts in the computations (`type = "poscounts"`).
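As a minimal sketch of the two formulas above (not the package implementation), the RLE offsets can be computed in a few lines of base R; on the trichoptera counts all the $q_j$ are zero, which is exactly the failure mode just described:
```{r rle_by_hand, eval = FALSE}
Y <- as.matrix(trichoptera$Abundance)
## reference sample: per-species geometric mean (0 as soon as one count is 0)
q <- exp(colMeans(log(Y)))
## offset: median ratio of each sample to the reference
O <- apply(sweep(Y, 2, q, "/"), 1, median)
```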
### Geometric Mean of Pairwise Ratio
This method is similar to RLE but does not create a reference sample. Instead, each sample is compared to every other sample to compute a median ratio over their shared species (similar to RLE):
$$
r_{ii'} = {\text{median}}_{j: Y_{ij}.Y_{i'j} > 0} \frac{Y_{ij}}{Y_{i'j}}
$$
The offset of sample $i$ is then the geometric mean of all the $r_{ii'}$:
$$
O_i = \Big( \prod_{i' \neq i} r_{ii'} \Big)^{1/(n-1)}
$$
The method fails when there is only one sample in the data set or when a sample shares no species with any other.
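A naive sketch of this computation (quadratic in the number of samples, and again not the package implementation) takes, for each pair of samples, the median ratio over their shared species and then the geometric mean over all partners:
```{r gmpr_by_hand, eval = FALSE}
Y <- as.matrix(trichoptera$Abundance)
n <- nrow(Y)
r <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (k in setdiff(seq_len(n), i)) {
    shared <- Y[i, ] > 0 & Y[k, ] > 0          ## species present in both samples
    if (any(shared)) r[i, k] <- median(Y[i, shared] / Y[k, shared])
  }
}
O <- exp(rowMeans(log(r), na.rm = TRUE))       ## geometric mean over the other samples
```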
### Wrench normalisation
This method is fully detailed in @Kumar2018 and we only provide a barebones implementation corresponding to the default parameters of `Wrench::wrench()`. Assume that samples belong to $K$ discrete groups and denote by $g_i$ the group of sample $i$. Wrench is based on the following (simplified) log-normal model for counts:
$$
Y_{ij} \sim \pi_{ij} \delta_0 + (1 - \pi_{ij})\log\mathcal{N}(\mu_{ij}, \sigma^2_j)
$$
where the $Y_{ij}$ are independent and the mean $\mu_{ij}$ is decomposed as:
$$
\mu_{ij} = \underbrace{\log{p_{0j}}}_{\text{log-ref. prop.}} + \underbrace{\log{d_i}}_{\text{log-depth}} + \underbrace{\log{\zeta_{0g_i}}}_{\text{log effect of group } g_i} + \underbrace{a_{i}}_{\text{(f|m)ixed effect}} + \underbrace{b_{ij}}_{\text{mixed effects}}
$$
where the random effects are independent centered Gaussians and the depth $d_i$ is the total count of sample $i$:
$$
\begin{align*}
d_i & = \sum_{j=1}^p Y_{ij} \\
b_{ij} & \sim \mathcal{N}(0, \eta^2_{g_i})
\end{align*}
$$
The **net** fold change $\theta_{ij}$ of the **proportion ratio** $r_{ij} = Y_{ij} / (d_i p_{0j})$ of species $j$ relative to the reference is defined through $\log(\theta_{ij}) \overset{\Delta}{=} \mathbb{E}[\log(r_{ij}) | a_i, b_{ij}] = \log{\zeta_{0g_i}} + a_i + b_{ij}$. We can decompose it as $\theta_{ij} = \Lambda_i^{-1} v_{ij}$, where $\Lambda_i^{-1}$ is the *compositional correction factor* and $v_{ij}$ is the fold change of **true abundances**.
With the above notations, the net fold change compounds both the fold change of true abundances and the compositional correction factors. With the assumption that the $b_{ij}$ are centered, $\log(\hat{\Lambda}_i)$ can be estimated through a robust average of the $\hat{\theta}_{ij}$, which can themselves be computed from the log-ratio of proportions.
We detail here how the different parameters and/or effects are estimated.
- The reference proportions $p_{0j}$ are constructed as averages of the sample proportions $p_{ij}$, and the ratios are derived from both quantities:
$$
p_{ij} = \frac{Y_{ij}}{\sum_{j=1}^p Y_{ij}} \qquad p_{0j} = \frac{1}{n} \sum_{i=1}^n p_{ij} \qquad r_{ij} = \frac{p_{ij}}{p_{0j}}
$$
- The probabilities of absence $\pi_{ij}$ are estimated by fitting the following Bernoulli models:
$$
1_{\{Y_{ij} = 0\}} \sim \mathcal{B}(\pi_{j}^{d_i})
$$
and setting $\hat{\pi}_{ij} = \hat{\pi}_j^{d_i}$.
- The species variances $\sigma^2_j$ are estimated by fitting the following linear model (with no zero-inflation component)
$$
\log Y_{ij} \sim \log(d_i) + \mu_{g_i} + \mathcal{N}(0, \sigma^2_j)
$$
Note that in the original `Wrench::wrench()`, the log-depth $\log(d_i)$ is used as a predictor, but I believe it makes more sense to use it as an offset.
- Set the group proportions $p_{gj}$ and group ratios $r_{gj}$ to:
$$
p_{gj} = \frac{\sum_{i : g_i = g} Y_{ij}}{\sum_{j, i : g_i = g} Y_{ij}} \qquad r_{gj} = \frac{p_{gj}}{p_{0j}}
$$
- Estimate the location and dispersion parameters as:
$$
\hat{\zeta}_{0g} = \frac{\sum_{j=1}^p r_{gj}}{p}
\qquad
\log{r_{g.}} = \frac{\sum_{j: r_{gj} \neq 0} \log{r_{gj}}}{\sum_{j: r_{gj} \neq 0} 1}
\qquad
\hat{\eta}_{g}^2 = \frac{\sum_{j: r_{gj} \neq 0} (\log{r_{gj}} - \log{r_{g.}})^2}{\sum_{j: r_{gj} \neq 0} 1}
$$
- Estimate the mixed effects as shrunken (and scaled) averages of the ratios
$$
\hat{a}_i = \frac{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} (\log{r_{ij}} - \log{\hat{\zeta}_{0g_i}})}{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j}}
\qquad
\hat{b}_{ij} = \frac{\hat{\eta}^2_{g_i}}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} \left( \log{r_{ij}} - \log\hat{\zeta}_{0g_i} - \hat{a}_i\right)
$$
- Estimate the regularized ratios as:
$$
\hat{\theta}_{ij} = \exp\left( \log\hat{\zeta}_{0g_i} + \hat{a}_i + \hat{b}_{ij} \right)
$$
- Estimate the compositional correction factors as (weighted) means of the regularized (and possibly corrected) ratios:
$$
\hat{\Lambda}_i =
\begin{cases}
\sum_{j = 1}^p \hat{\theta}_{ij} \bigg/ p& \text{ if type = "simple"} \\
\sum_{j = 1}^p \hat{\theta}_{ij} e^{-\hat{\sigma}_j^2 / 2} / w_{ij} \bigg/ \sum_{j=1}^p 1/w_{ij} & \text{ if type = "wrench"} \\
\end{cases}
$$
where $w_{ij} = (1 - \hat{\pi}_{ij})(\hat{\pi}_{ij} + e^{\hat{\sigma}_j^2 + \hat{\eta}_{g_i}^2} - 1)$. The correction term $e^{\hat{\sigma}_j^2 / 2}$ arises from the relation $\mathbb{E}[r_{ij} | r_{ij} > 0] = \theta_{ij} e^{\sigma_j^2/2}$ and the weights $w_{ij}$ are marginal variances: $\mathbb{V}[r_{ij}] = w_{ij}$.
The offsets are then the product of the (geometrically centered) compositional correction factors and the (geometrically centered) depths:
$$
O_i = \frac{\hat{\Lambda}_i}{(\prod_{i = 1}^n \hat{\Lambda}_i)^{1/n}} \times \frac{d_i}{(\prod_{i = 1}^n d_i)^{1/n}}
$$
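As a small sketch of this very last step only (assuming hypothetical vectors `Lambda_hat` and `d` holding the estimated correction factors and depths, which are not computed here), the geometric centering amounts to:
```{r wrench_centering, eval = FALSE}
## hypothetical inputs: Lambda_hat and d, one value per sample
geom_center <- function(x) x / exp(mean(log(x)))
offsets <- geom_center(Lambda_hat) * geom_center(d)
```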
[^1]: although a `data.frame` is technically a `list`
## References