---
title: "Data importation in PLNmodels"
author: "PLN team"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 4
df_print: paged
bibliography: article/PLNreferences.bib
link-citations: yes
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{import}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
screenshot.force = FALSE,
echo = TRUE,
rows.print = 5,
message = FALSE,
warning = FALSE)
```
## Preliminaries
This vignette documents the data format used in **PLNmodels** by `PLN` and its variants. It also shows how to create an object in the proper format for further analyses from (i) tabular data, (ii) biom-class objects and (iii) phyloseq-class objects.
## Format description
We illustrate the format using the trichoptera data set, a full description of which can be found in [the corresponding vignette](Trichoptera.html).
```{r data_load}
library(PLNmodels)
data(trichoptera)
```
The trichoptera data set is a list made of two data frames: `Abundance` (hereafter referred to as the _counts_) and `Covariate` (hereafter the _covariates_).
```{r trichoptera_structure}
str(trichoptera, max.level = 1)
```
The covariates include, among others, the wind, pressure and humidity.
```{r covariates_overview}
names(trichoptera$Covariate)
```
In the PLN framework, we model the counts from the covariates, let's say wind and pressure, using a Poisson Log-Normal model. Most models in R use the so-called _formula interface_ and it would thus be natural to write something like
```{r first_try, eval = FALSE}
PLN(Abundance ~ Wind + Pressure, data = trichoptera)
```
Unfortunately and unlike many generalized linear models, the response in PLN is intrinsically **multivariate**: it has 17 dimensions in our example. The left hand side (LHS) must encode a multivariate response across multiple samples, using a 2D-array (e.g. a matrix or a data frame).
We must therefore prepare a data structure where `Abundance` refers to a count *matrix* whereas `Wind` and `Pressure` refer to *vectors* before feeding it to `PLN`. That's the purpose of `prepare_data`.
```{r prepare_data_first_look}
trichoptera2 <- prepare_data(counts = trichoptera$Abundance,
covariates = trichoptera$Covariate)
str(trichoptera2)
```
If you look carefully, you can notice a few differences between `trichoptera` and `trichoptera2`:
- the first is a `list` whereas the second is a `data.frame`[^1];
- `Abundance` is a matrix stored as a single column of `trichoptera2`, which you can extract using the usual functions `[` and `[[` to retrieve the count matrix (see the short sketch after this list);
- `trichoptera2` has an additional `Offset` column (more on that later).
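For instance, the count matrix can be pulled back out of the prepared data frame with standard indexing (a quick check, nothing specific to PLNmodels):
```{r extract_abundance}
## Abundance is a matrix stored as a single column of trichoptera2
counts <- trichoptera2$Abundance
dim(counts)   ## 49 samples x 17 species
```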
## Computing offsets
It is common practice when modeling count data to introduce an offset term to control for different sampling efforts, exposures, baselines, etc. The *proper way* to compute sample-specific offsets is still debated and may vary depending on the field. There are nevertheless a few popular methods:
- Total Sum Scaling (TSS), where the offset of a sample is the total count in that sample
- Cumulative Sum Scaling (CSS), introduced in [@CSS], where the offset of a sample is the cumulative sum of counts in that sample, up to a quantile determined in a data-driven way.
- Relative Log-Expression (RLE), implemented in [@DESeq2], where all samples are used to compute a reference sample, each sample is compared to the reference sample using ratios, and the offset is the median of those ratios.
- Geometric Mean of Pairwise Ratios (GMPR), introduced in [@GMPR], where each sample is compared to every other sample to compute a median ratio and the offset of a sample is the geometric mean of those pairwise median ratios.
- Wrench, introduced in [@Kumar2018] and fully implemented in the [Wrench package](https://bioconductor.org/packages/release/bioc/html/Wrench.html), where all samples are used to compute reference proportions and each sample is compared to the reference using ratios (and **not log-ratios**) of proportions to compute compositional correction factors. In that case, the offset is the product of (geometrically centered) compositional factors and (geometrically centered) depths.
Each of these offsets can be computed from a count matrix using the `compute_offset` function by changing its `offset` argument:
```{r compute_offset}
## same as compute_offset(trichoptera$Abundance, offset = "TSS")
compute_offset(trichoptera$Abundance)
```
In this particular example, the counts are too sparse and the more sophisticated offset methods all fail (numeric output hidden):
```{r other_offsets, warning=TRUE, error = TRUE, results='hide'}
compute_offset(trichoptera$Abundance, "CSS")
compute_offset(trichoptera$Abundance, "RLE")
compute_offset(trichoptera$Abundance, "GMPR")
```
We can mitigate this problem for the RLE offset by adding pseudocounts to the counts, although doing so has its own drawbacks.
```{r pseudocounts}
compute_offset(trichoptera$Abundance, "RLE", pseudocounts = 1)
```
A better solution consists in using only positive counts to compute the offsets:
```{r poscounts}
compute_offset(trichoptera$Abundance, "RLE", type = "poscounts")
```
Finally, we can use Wrench to compute the offsets:
```{r wrench}
compute_offset(trichoptera$Abundance, "Wrench")
```
**Note:** TSS is the only method that produces offsets on the same scale as the counts; all others produce offsets that are (hopefully) *proportional* to the library sizes but on a different scale. To force the offsets to be on the same scale as the counts for all methods, you can use the option `scale = "count"`.
```{r wrench_counts}
compute_offset(trichoptera$Abundance, "Wrench", scale = "count")
```
## Building data frame using `prepare_data`
We have already seen that `prepare_data` joins counts and covariates into a single data frame. It also computes offsets through `compute_offset`, and does so by default with `offset = "TSS"`, hence the `Offset` column in `trichoptera2`. You can change the offset method and provide additional arguments that will be passed on to `compute_offset`.
```{r prepare_data_other_offset}
str(prepare_data(trichoptera$Abundance,
trichoptera$Covariate,
offset = "RLE", pseudocounts = 1))
```
Different communities use different standards for count data, where samples are stored as either rows or columns of the count matrix. `prepare_data` uses heuristics to guess the orientation of the count matrix (or fails with an informative message when it cannot) and automatically transposes it if needed.
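As a quick sanity check (assuming the heuristic correctly detects the orientation on this data set), feeding the transposed count table should yield the same prepared data frame:
```{r transposed_counts, eval = FALSE}
## counts with species in rows and samples in columns
str(prepare_data(t(trichoptera$Abundance), trichoptera$Covariate))
```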
Finally, `prepare_data` enforces sample-consistency between the counts and the covariates and automatically trims away:
- samples for which only covariates or only counts are available;
- samples with no positive counts.
For example, if we remove the first sample from the counts and the last one from the covariates, we end up with 49 - 2 = 47 samples left, as expected.
```{r trim_down_samples}
nrow(prepare_data(trichoptera$Abundance[-1, ], ## remove first sample
trichoptera$Covariate[-49,] ## remove last sample
))
```
## Importing data from biom and phyloseq objects using `prepare_data_from_[phyloseq|biom]`
Community composition data are quite popular in microbial ecology and usually stored in flat files using the [biom format](http://biom-format.org/) and/or imported in R as phyloseq-class objects [@phyloseq] using the Bioconductor [phyloseq](https://joey711.github.io/phyloseq/) package.
<!-- We provide helper functions to directly import data from a biom file (or biom-class object) and a phyloseq-class object. -->
We show here how to import data from a biom file (or biom-class object) and from a phyloseq-class object.
### Reading from a biom file
Reading from a biom file requires the Bioconductor package [biomformat](https://www.bioconductor.org/packages/release/bioc/html/biomformat.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a biom file using the following steps:
- read your biom file with `biomformat::read_biom()`
- extract the count table with `biomformat::biom_data()`
- extract the covariates with `biomformat::sample_metadata()` (or build your own)
- feed them to `prepare_data`
as illustrated below:
<!-- Note that the covariates **must** be stored in the biom object as they are automatically extracted, reading a biom without covariates results in an error. -->
```{r import_biom, eval = FALSE}
## If biomformat is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("biomformat")
library(biomformat)
biomfile <- system.file("extdata", "rich_dense_otu_table.biom", package = "biomformat")
biom <- biomformat::read_biom(biomfile)
## extract counts
counts <- as(biomformat::biom_data(biom), "matrix")
## extract covariates (or prepare your own)
covariates <- biomformat::sample_metadata(biom)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```
### Reading from a phyloseq-class object
Likewise, preparing data from a phyloseq-class object requires the Bioconductor package [phyloseq](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html). This package is **not** a standard dependency of PLNmodels and needs to be installed separately.
You can easily prepare your data from a phyloseq object using the following steps:
- extract the count table with `phyloseq::otu_table()`
- extract the covariates with `phyloseq::sample_data()` (or build your own)
- feed them to `prepare_data`
as illustrated below:
<!-- Note that the covariates **must** be stored in the phyloseq-class object as they are automatically extracted, importing a phyloseq object with no `sample_data` component results in an error. -->
```{r import_phyloseq, eval = FALSE}
## If phyloseq is not installed, uncomment the following lines
# if (!requireNamespace("BiocManager", quietly = TRUE)) {
# install.packages("BiocManager")
# }
# BiocManager::install("phyloseq")
library(phyloseq)
data("enterotype")
## extract counts
counts <- as(phyloseq::otu_table(enterotype), "matrix")
## extract covariates (or prepare your own)
covariates <- phyloseq::sample_data(enterotype)
## prepare data
my_data <- prepare_data(counts = counts, covariates = covariates)
str(my_data)
```
## Mathematical details about the offsets
We detail here the mathematical background behind the various offsets and the way they are computed. Denote by $\mathbf{Y} = (Y_{ij})$ the count matrix, where $Y_{ij}$ is the count of species $j$ in sample $i$. Assume that there are $p$ species and $n$ samples in total. The offset of sample $i$ is denoted $O_i$ and computed in the following way.
### Total Sum Scaling
Offsets are simply the total counts of a sample (frequently called depths in the metabarcoding literature):
$$
O_i = \sum_{j=1}^p Y_{ij}
$$
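On the trichoptera data, this is simply the row sums of the count table and should match the default output of `compute_offset()` shown earlier (a manual check, not the package code):
```{r tss_by_hand, eval = FALSE}
## TSS offsets by hand: total count of each sample
rowSums(trichoptera$Abundance)
```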
### Cumulative Sum Scaling
Positive counts are used to compute sample-specific quantiles $q_i^l$ and cumulative sums $s_i^l$ defined as
$$
q_i^l = \min \{q \text{ such that } \sum_j 1_{Y_{ij} \leq q} \geq l \sum_j 1_{Y_{ij} > 0} \} \qquad s_i^l = \sum_{j: Y_{ij} \leq q_i^l} Y_{ij}
$$
The sample-specific quantiles are then used to compute reference quantiles, defined as $q^l = \text{median}_i \{q_i^l\}$, and the median absolute deviation around the quantile $q^l$, defined as $d^l = \text{median}_i |q_i^l - q^l|$. The method then searches for the smallest quantile $l$ at which it detects instability, defined as a large relative increase in $d^l$. Formally, $\hat{l}$ is the smallest $l$ satisfying $\frac{d^{l+1} - d^l}{d^l} \geq 0.1$. The sample-specific scaling offsets are then defined as:
$$
O_i = s_i^{\hat{l}} / \text{median}_i \{ s_i^{\hat{l}} \}
$$
Dividing by the median of the $s_i^{\hat{l}}$ ensures that the offsets are centered around $1$ and capture size differences relative to the reference sample. Note also that the reference quantiles $q^l$ can be computed using either the median (the default, as in the original @CSS paper) or the mean, by specifying `reference = mean`, as implemented in `metagenomeSeq`.
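For completeness, the alternative reference can be requested directly from `compute_offset` (shown with `eval = FALSE` since, as seen above, CSS fails on counts as sparse as the trichoptera data):
```{r css_mean_reference, eval = FALSE}
## use the mean instead of the median to build the reference quantiles
compute_offset(trichoptera$Abundance, "CSS", reference = mean)
```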
### Relative Log Expression
A reference sample $(q_j)_j$ is first built by computing the geometric mean of each species' counts:
$$
q_j = \exp \left( \frac{1}{n} \sum_{i} \log(Y_{ij})\right)
$$
Each sample is then compared to the reference sample to compute one ratio per species and the final offset $O_i$ is the median of those ratios:
$$
O_i = \text{median}_j \frac{Y_{ij}}{q_j}
$$
The method fails when no species is shared across all samples (as all the $q_j$ are then $0$) or when a sample shares fewer than 50% of its species with the reference (in which case the median of the ratios may be zero or infinite). The problem can be alleviated by adding pseudocounts to the $Y_{ij}$ with `pseudocounts = 1` or by using only positive counts in the computations (`type = "poscounts"`).
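As a minimal sketch of the two formulas above (not the package implementation), the RLE offsets can be computed in a few lines of base R; on the trichoptera counts all the $q_j$ are zero, which is exactly the failure mode just described:
```{r rle_by_hand, eval = FALSE}
Y <- as.matrix(trichoptera$Abundance)
## reference sample: per-species geometric mean (0 as soon as one count is 0)
q <- exp(colMeans(log(Y)))
## offset: median ratio of each sample to the reference
O <- apply(sweep(Y, 2, q, "/"), 1, median)
```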
### Geometric Mean of Pairwise Ratio
This method is similar to RLE but does not create a reference sample. Instead, each sample is compared to every other sample to compute a median ratio over their shared species (similar to RLE):
$$
r_{ii'} = {\text{median}}_{j: Y_{ij}.Y_{i'j} > 0} \frac{Y_{ij}}{Y_{i'j}}
$$
The offset of sample $i$ is then the geometric mean of all the $r_{ii'}$:
$$
O_i = \Big( \prod_{i' \neq i} r_{ii'} \Big)^{1/(n-1)}
$$
The method fails when there is only one sample in the data set or when a sample shares no species with any other.
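A naive sketch of this computation (quadratic in the number of samples, and again not the package implementation) takes, for each pair of samples, the median ratio over their shared species and then the geometric mean over all partners:
```{r gmpr_by_hand, eval = FALSE}
Y <- as.matrix(trichoptera$Abundance)
n <- nrow(Y)
r <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (k in setdiff(seq_len(n), i)) {
    shared <- Y[i, ] > 0 & Y[k, ] > 0          ## species present in both samples
    if (any(shared)) r[i, k] <- median(Y[i, shared] / Y[k, shared])
  }
}
O <- exp(rowMeans(log(r), na.rm = TRUE))       ## geometric mean over the other samples
```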
### Wrench normalisation
This method is fully detailed in @Kumar2018 and we only provide a barebones implementation corresponding to the default parameters of `Wrench::wrench()`. Assume that samples belong to $K$ discrete groups and denote by $g_i$ the group of sample $i$. Wrench is based on the following (simplified) log-normal model for counts:
$$
Y_{ij} \sim \pi_{ij} \delta_0 + (1 - \pi_{ij})\log\mathcal{N}(\mu_{ij}, \sigma^2_j)
$$
where the $Y_{ij}$ are independent and the mean $\mu_{ij}$ is decomposed as:
$$
\mu_{ij} = \underbrace{\log{p_{0j}}}_{\text{log-ref. prop.}} + \underbrace{\log{d_i}}_{\text{log-depth}} + \underbrace{\log{\zeta_{0g_i}}}_{\text{log effect of group } g_i} + \underbrace{a_{i}}_{\text{(f|m)ixed effect}} + \underbrace{b_{ij}}_{\text{mixed effects}}
$$
where the random effects are independent centered Gaussians and the depth $d_i$ is the total count of sample $i$:
$$
\begin{align*}
d_i & = \sum_{j=1}^p Y_{ij} \\
b_{ij} & \sim \mathcal{N}(0, \eta^2_{g_i})
\end{align*}
$$
The **net** fold change $\theta_{ij}$ of the **proportion ratio** $r_{ij} = Y_{ij} / (d_i p_{0j})$ of species $j$ relative to the reference is defined through $\log(\theta_{ij}) \overset{\Delta}{=} \mathbb{E}[\log(r_{ij}) | a_i, b_{ij}] = \log{\zeta_{0g_i}} + a_i + b_{ij}$. We can decompose it as $\theta_{ij} = \Lambda_i^{-1} v_{ij}$, where $\Lambda_i^{-1}$ is the *compositional correction factor* and $v_{ij}$ is the fold change of **true abundances**.
With the above notations, the net fold change compounds both the fold change of true abundances and the compositional correction factors. With the assumption that the $b_{ij}$ are centered, $\log(\hat{\Lambda}_i)$ can be estimated through a robust average of the $\hat{\theta}_{ij}$, which can themselves be computed from the log-ratio of proportions.
We detail here how the different parameters and/or effects are estimated.
- The reference proportions $p_{0j}$ are constructed as averages of the sample proportions $p_{ij}$, and the ratios are derived from both quantities:
$$
p_{ij} = \frac{Y_{ij}}{\sum_{j=1}^p Y_{ij}} \qquad p_{0j} = \frac{1}{n} \sum_{i=1}^n p_{ij} \qquad r_{ij} = \frac{p_{ij}}{p_{0j}}
$$
- The probabilities of absence $\pi_{ij}$ are estimated by fitting the following Bernoulli models:
$$
1_{\{Y_{ij} = 0\}} \sim \mathcal{B}(\pi_{j}^{d_i})
$$
and setting $\hat{\pi}_{ij} = \hat{\pi}_j^{d_i}$.
- The species variances $\sigma^2_j$ are estimated by fitting the following linear model (with no zero-inflation component)
$$
\log Y_{ij} \sim \log(d_i) + \mu_{g_i} + \mathcal{N}(0, \sigma^2_j)
$$
Note that in the original `Wrench::wrench()`, the log-depth $\log(d_i)$ is used as a predictor, but I believe it makes more sense to use it as an offset.
- Set the group proportions $p_{gj}$ and group ratios $r_{gj}$ to:
$$
p_{gj} = \frac{\sum_{i : g_i = g} Y_{ij}}{\sum_{j, i : g_i = g} Y_{ij}} \qquad r_{gj} = \frac{p_{gj}}{p_{0j}}
$$
- Estimate the location and dispersion parameters as:
$$
\hat{\zeta}_{0g} = \frac{\sum_{j=1}^p r_{gj}}{p}
\qquad
\log{r_{g.}} = \frac{\sum_{j: r_{gj} \neq 0} \log{r_{gj}}}{\sum_{j: r_{gj} \neq 0} 1}
\qquad
\hat{\eta}_{g}^2 = \frac{\sum_{j: r_{gj} \neq 0} (\log{r_{gj}} - \log{r_{g.}})^2}{\sum_{j: r_{gj} \neq 0} 1}
$$
- Estimate the mixed effects as shrunken (and scaled) averages of the ratios
$$
\hat{a}_i = \frac{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} (\log{r_{ij}} - \log{\hat{\zeta}_{0g_i}})}{\sum_{j = 1}^p \frac{1}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j}}
\qquad
\hat{b}_{ij} = \frac{\hat{\eta}^2_{g_i}}{\hat{\eta}^2_{g_i} + \hat{\sigma}^2_j} \left( \log{r_{ij}} - \log\hat{\zeta}_{0g_i} - \hat{a}_i\right)
$$
- Estimate the regularized ratios as:
$$
\hat{\theta}_{ij} = \exp\left( \log\hat{\zeta}_{0g_i} + \hat{a}_i + \hat{b}_{ij} \right)
$$
- Estimate the compositional correction factors as (weighted) means of the regularized (and possibly corrected) ratios:
$$
\hat{\Lambda}_i =
\begin{cases}
\sum_{j = 1}^p \hat{\theta}_{ij} \bigg/ p& \text{ if type = "simple"} \\
\sum_{j = 1}^p \hat{\theta}_{ij} e^{-\hat{\sigma}_j^2 / 2} / w_{ij} \bigg/ \sum_{j=1}^p 1/w_{ij} & \text{ if type = "wrench"} \\
\end{cases}
$$
where $w_{ij} = (1 - \hat{\pi}_{ij})(\hat{\pi}_{ij} + e^{\hat{\sigma}_j^2 + \hat{\eta}_{g_i}^2} - 1)$. The correction term $e^{\hat{\sigma}_j^2 / 2}$ arises from the relation $\mathbb{E}[r_{ij} | r_{ij} > 0] = \theta_{ij} e^{\sigma_j^2/2}$ and the weights $w_{ij}$ are marginal variances: $\mathbb{V}[r_{ij}] = w_{ij}$.
The offsets are then the product of the (geometrically centered) compositional correction factors and the (geometrically centered) depths:
$$
O_i = \frac{\hat{\Lambda}_i}{(\prod_{i = 1}^n \hat{\Lambda}_i)^{1/n}} \times \frac{d_i}{(\prod_{i = 1}^n d_i)^{1/n}}
$$
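As a small sketch of this very last step only (assuming hypothetical vectors `Lambda_hat` and `d` holding the estimated correction factors and depths, which are not computed here), the geometric centering amounts to:
```{r wrench_centering, eval = FALSE}
## hypothetical inputs: Lambda_hat and d, one value per sample
geom_center <- function(x) x / exp(mean(log(x)))
offsets <- geom_center(Lambda_hat) * geom_center(d)
```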
[^1]: although a `data.frame` is technically a `list`
## References