-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-QC.Rmd
136 lines (84 loc) · 3.42 KB
/
03-QC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# Sequence metrics and quality Control
## The Study Design
**3 replicates** of control mice and **3 replicates** of water avoidance stress induced mice.
We will run our own analysis on these publically-available data and compare the Differential Gene Expression and Pathway Enrichment results to those published here:
https://www.mdpi.com/2218-1989/13/3/453
Why have a control?
Why have replicates?
## Data Access
Start a new screen
```{bash, eval=FALSE}
screen -S DGE
```
Make a new directory branch (~/DGE_workshop/**reads**)
Enter your new directory
From your home directory:
```{bash, eval=FALSE}
cd DGE_workshop
mkdir reads
cd reads
```
We have a list of 6 Sequencing Run Result (**SRR**) unique identifiers for sample sequences stored in the Sequence Read Archive (SRA) database managed by NCBI. We can use this simple list (in .txt format) to download the sequences that correspond to each of the 6 SRR identifiers in the list.
Though we only have 6 samples, the sequencing **depth** is very high, making the download time ~ 1hr.
<span style="color:red">Time will not allow for us to complete this command, so we will cancel the command after checking it works.</span>.
Download fastq sequence files from NCBI's SRA:
```{bash, eval=FALSE}
conda activate sra-tools
for i in `cat SRRlist.txt`; do
fasterq-dump --outdir . ${line}
done < ~/DGE_workshop/reads/SRRlist.txt
```
Softlink the data (after you cancel the command above):
```{bash, eval=FALSE}
ln -s ~/DGE_workshop/reads/*.gz .
```
## fastp Quality Control
It will take ~ 15 minutes to run **fastp** on the .fastq sequencing files for all 6 samples using 10 threads.
Activate the fastp environment:
```{bash, eval=FALSE}
conda deactivate
conda activate fastp
```
```{bash, eval=FALSE}
for file in SRR*; do
bn=`basename ${file} .fastq.gz`
echo $bn
done
```
Check the quality of our 6 sample sequences using **fastp**:
```{bash, eval=FALSE}
for file in SRR*; do
bn=`basename ${file} .fastq.gz`
fastp -w 10 -h "${file}.html" -i "${file}" -o "${bn}.fastp.fastq.gz"
done
```
Download the .html report to your computer and open it in an internet browser (right-click on the file and select browser)
It is essential that you know this basic command, and **you should not need this prompt at this point**:
From <span style="color:red">**YOUR LOCAL TERMINAL**</span>
```{bash, eval=FALSE}
scp -P 2403 [email protected]:~/DGE_workshop/reads/SRR23869771.fastq.html
```
## Checking Sequence Metrics
awk "NR" variable:
There are several built-in awk variables.
One of them is "NR" (Number of Rows/Records).
It contains line number and can be used to determine the total number of lines in a file.
The following example illustrates one use of this built-in variable.
Prints the line number of the first 10 lines
```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{print NR}' | head
```
awk modulo arithmatic:
The modulo operator, "%", returns the remainder of division.
The expression "5 % 2", for example, would evaluate to 1, while "9 % 3" would evaluate to 0 because the dividend (9) is a multiple of the divisor (3).
```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print length ($1)}' | head
```
```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print $0}' | wc -l
```
New Section with some advanced commands
```{bash, eval=FALSE}
mkdir -p DGE_workshop/reads
cd DGE_workshop/reads
```