fastq-quality-control.Rmd

---
title: "FASTQ Quality Control"
author: "Sam Buckberry"
date: "23/10/2021"
output: github_document
---
Load R libraries 
```{r, message=FALSE}
library(stringr)
library(magrittr)
library(Rfastp)
library(DT)
library(ggplot2)
library(cowplot)
```

### FASTQ quality control and filtering

The first step in analysing high-throughput short-read sequencing is to check the quality of the reads before starting any analysis. The quality of the data are fundamentally important for maintaining confidence in the conclusions drawn from the experiment. Here we will perform quality checks and pre-processing of RNA-seq data before we begin to quantify gene expression.  

Here is some example code for generating a quality control report and trimming of sequencing adapters for one paired-end RNA-seq sample using the `Rfastp` package that is a wrapper for the [Fastp](https://github.com/OpenGene/fastp) suite of tools.  

*No need to run the below code chunk during the workshop*
```{r, eval=FALSE}
read1 <- "fastq/1001-Effector_CD4pos_T-S_chr22_R1.fastq.gz"
read2 <- "fastq/1001-Effector_CD4pos_T-S_chr22_R2.fastq.gz"

out_prefix <- str_replace(string = read1,
                          pattern = "_R1.fastq.gz",
                          replacement = "_filt_fastp")

fastp_json_report <- rfastp(read1 = read1, read2 = read2, 
                            outputFastq = out_prefix,
                            adapterTrimming = TRUE,
                            cutLowQualTail = TRUE,
                            cutLowQualFront = TRUE,
                            thread = 1)
```

As we have more than one RNA-seq libray to process, we can write an R function to process all the samples at once.  
```{r, cache=TRUE, message=FALSE, error=FALSE}
r1_list <- list.files(path = "fastq/",
                      pattern = "chr22_R1.fastq.gz",
                      full.names = TRUE)

r2_list <- list.files(path = "fastq/",
                      pattern = "chr22_R2.fastq.gz",
                      full.names = TRUE)

fq_df <- data.frame(r1 = r1_list, r2 = r2_list,
                    prefix = str_replace(r1_list,
                                        pattern = "_chr22_R1.fastq.gz",
                                        replacement = "_filt_fastp"))

# Function to run fastp for all libraries 
multi_fastp <- function(x){
    
    fastp_json_report <- rfastp(read1 = fq_df$r1[x],
                                read2 = fq_df$r2[x], 
                            outputFastq = fq_df$prefix[x],
                            adapterTrimming = TRUE,
                            cutLowQualTail = TRUE,
                            cutLowQualFront = TRUE,
                            thread = 1)
    return(fastp_json_report)
}

report_list <- lapply(1:nrow(fq_df), multi_fastp)

```

If you've successfully run the above code block, you will see there are `.html` reports for each pair of fastq files in the `fastq/` folder.  

Now let's generate a table of read quality statistics generated by fastp.  
```{r, cache=TRUE}

# Short function to extract summary data from fastp
get_report <- function(x){unlist(report_list[[x]]$summary)}

# Make a table of the report data.
# Some re-formatting of the numbers is required. 
report_table <- lapply(1:length(report_list), get_report) %>%
    do.call(cbind, .) %>% 
    format(scientific=FALSE,
           digits = 2,
           drop0trailing = TRUE)

report_table <- data.frame(report_table)
colnames(report_table) <- fq_df$prefix

datatable(data = report_table)
```

Make some plots of the statistics.  
```{r, cache=TRUE}
cp_list <- lapply(X = 1:length(report_list),
                  FUN = function(x){Rfastp::curvePlot(report_list[[x]]) + ggtitle(fq_df$prefix[x])})

cp_list
```

If you have [MultiQC](https://multiqc.info/) installed in your PATH, you can generate an aggreate report using the following system call from R and then inspect the html report titled `rna-seq-multiqc-report.html`
```{r, cache=TRUE}
system("multiqc --filename rna-seq-multiqc-report.html fastq/")
```

---

```{r}
sessionInfo()
```