tutorial/tutorial.Rmd

---
title: 'SCRuB Tutorial'
output: 
    html_document:
       theme: darkly
---
<style>
    body { background-color: #162138;
            text-color: whitesmoke}
</style>
<img src='../Images/SCRuB_logo.png' align="right" height="139" />


In this tutorial,  we demonstrate how SCRuB should be used to remove contamination from microbial samples. All you need to run SCRuB is a count matrix of control and non-control samples. To run the SCRuB with its full spatial component, which is strongly recommended, the well locations of each sample must also be provided.

-----------------------

# {.tabset}

## Loading the data

The starting point for SCRuB is a matrix, in which each row represents a sample, and each column represents each sample's read counts that correspond to a certain taxa (e.g ASV, OTU, species). In general, we recommend to run SCRuB on the most granular version of the data possible, and so data should only be grouped to higher orders of phylogeny (e.g. genus, family...) *after* completing the SCRuB pipeline. 

We start by loading the necessary packages, such as `SCRUB`. We also include the use of `tidyverse` within our tutorial
```{r, message=FALSE}
set.seed(1)
library(tidyverse)
library(SCRuB)
```

As an example dataset, we use processed samples publicly available from Qiita (**link**), from a dataset of plasma samples, made public by Poore et al. In the ensuing pipeline, we demonstrate how SCRuB can be used to remove the contamination from the plasma using the control samples, and pub liscly available metadata published by Poore et al.

```{r}
data <- read.csv('plasma_data.csv', row.names=1) %>% as.matrix()
dim(data)
```

Next, we load a metadta file, which is in the format required for `SCRuB`. Before running SCRuB, we recommend that users make sure their metadata file formats are aligned with the one shown here.
```{r}
metadata <- read.csv('plasma_metadata.csv', row.names=1)
metadata %>% head()
```


## SCRuB

### Running SCRuB
```{r}
scr_out <- SCRuB(data, metadata, c("control blank DNA extraction", "control blank library prep") )
```

### Evaluating the outputs

Let's take a look at the results. The estimated level of contamination was very low, as the fitted `p` parameters indicate our samples are close to `3%` contamination.
```{r}
scr_out$p %>% boxplot()
```

To take a look at the cleaned samples, refer to the `decontaminated_samples` entry. 
```{r}
decontaminated_samples <- scr_out$decontaminated_samples
decontaminated_samples[1:10, 25:40]
```


An output from ever inner iteration of the SCRUB function is the `gamma` parameter, which is a vector representing the estimated relative abundance of a contamination community.
```{r}
scr_out$inner_iterations$`control blank library prep`$gamma %>% plot()
```


It is strongly recommended to incorporate the well metadata of each sample into SCRuB, as this makes it possible ot directly account for potential well leakage into negative controls, reducing the risk of wrongly removing certain species during decontamination.

One additional output when incoprorating the well metadata is the `alpha` parameters, which is a `n_control` by `n_samples + 1` matrix that represents the estimated level of leakage into each control. The last column of the `alpha` matrix represents the estimated proportion of each control that originates from the contamination source. In this case, `SCRuB` estimated a very high level of leakage for many DNA extraction samples, with a median estimated well-to-well leakage of `60%`. 

```{r}
boxplot( scr_out$inner_iterations$`control blank DNA extraction`$alpha[, ncol(scr_out$inner_iterations$`control blank DNA extraction`$alpha)] )
```