03-QC.Rmd


# Sequence metrics and quality Control

## The Study Design

**3 replicates** of control mice and **3 replicates** of water avoidance stress induced mice.  
We will run our own analysis on these publically-available data and compare the Differential Gene Expression and Pathway Enrichment results to those published here:   
https://www.mdpi.com/2218-1989/13/3/453

Why have a control?

Why have replicates?


## Data Access

Start a new screen 
```{bash, eval=FALSE}
screen -S DGE
```

Make a new directory branch (~/DGE_workshop/**reads**)  

Enter your new directory  


From your home directory:
```{bash, eval=FALSE}
cd DGE_workshop

mkdir reads

cd reads
```


We have a list of 6 Sequencing Run Result (**SRR**) unique identifiers for sample sequences stored in the Sequence Read Archive (SRA) database managed by NCBI. We can use this simple list (in .txt format) to download the sequences that correspond to each of the 6 SRR identifiers in the list.  

Though we only have 6 samples, the sequencing **depth** is very high, making the download time ~ 1hr. 

<span style="color:red">Time will not allow for us to complete this command, so we will cancel the command after checking it works.</span>. 


Download fastq sequence files from NCBI's SRA:
```{bash, eval=FALSE}
conda activate sra-tools

for i in `cat SRRlist.txt`; do
  fasterq-dump --outdir . ${line}
done < ~/DGE_workshop/reads/SRRlist.txt
```

Softlink the data (after you cancel the command above):
```{bash, eval=FALSE}
ln -s ~/DGE_workshop/reads/*.gz .
```

## fastp Quality Control  


It will take ~ 15 minutes to run **fastp** on the .fastq sequencing files for all 6 samples using 10 threads.  

Activate the fastp environment:
```{bash, eval=FALSE}
conda deactivate

conda activate fastp
```


```{bash, eval=FALSE}
for file in SRR*; do
 bn=`basename ${file} .fastq.gz`
 echo $bn
done
```


Check the quality of our 6 sample sequences using **fastp**:
```{bash, eval=FALSE}
for file in SRR*; do
 bn=`basename ${file} .fastq.gz`
 fastp -w 10 -h "${file}.html" -i "${file}" -o "${bn}.fastp.fastq.gz"
done
```


Download the .html report to your computer and open it in an internet browser (right-click on the file and select browser)

It is essential that you know this basic command, and **you should not need this prompt at this point**:

From <span style="color:red">**YOUR LOCAL TERMINAL**</span>
```{bash, eval=FALSE}
scp -P 2403 agomez@inbre.ncgr.org:~/DGE_workshop/reads/SRR23869771.fastq.html
```


## Checking Sequence Metrics

awk "NR" variable:  
There are several built-in awk variables.  
One of them is "NR" (Number of Rows/Records).  
It contains line number and can be used to determine the total number of lines in a file.  


The following example illustrates one use of this built-in variable.

Prints the line number of the first 10 lines
```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{print NR}' | head
```

awk modulo arithmatic:  
The modulo operator, "%", returns the remainder of division.  
The expression "5 % 2", for example, would evaluate to 1, while "9 % 3" would evaluate to 0 because the dividend (9) is a multiple of the divisor (3).

```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print length ($1)}' | head
```


```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print $0}' | wc -l
```


New Section with some advanced commands

```{bash, eval=FALSE}
mkdir -p DGE_workshop/reads

cd DGE_workshop/reads
```