bookdown-demo.Rmd

---
title: "Differential Expression Workshop NCGR"
site: bookdown::bookdown_site
documentclass: book
bibliography: [book.bib, packages.bib]
# url: your book url like https://bookdown.org/yihui/bookdown
# cover-image: path to the social sharing image like images/cover.jpg
description: |
  This is the material for a Differential Gene Expression workshop offered by NCGR in collaboration
  with NM-INBRE.
link-citations: yes
github-repo: ncgr/DE-Bookdown
---

# url: your book url like https://bookdown.org/yihui/bookdown

Placeholder


## Getting Started{-}
## Prerequisites
## Installs
## Agenda
## Connecting to the linux server

<!--chapter:end:index.Rmd-->


# Linux Basics

Placeholder


## Connecting to the linux server
## What are Linux, bash, and the NCGR Server?
### A little shell... aka the $ prompt is the command line interface{-}
### Directory Structure{-}
### Find the shell in system you’ll use to log into the NCGR’s server
### Log on to NCGR's server 
### Now that I logged on, where am I?
## Linux basics: Part I:
### Understanding Directories
### Listing options
### Navigation
### Files: creating with touch command
### History command
### Files: creating by redirecting standard out
### File name completion with tab
### Files: moving files from one filename to another
#### Syntax: mv sourcefilename destinationfilename{-}
### Files: copying files from one filename to another
#### Syntax: cp sourcefilename destinationfilename{-}
### Files: securely copying files between your laptop and NCGRs server 
#### Syntax: scp [options] sourcepath destinationpath{-}
### Files and directories: removing files is deleting files
#### Syntax: rm [options] filename{-}
### Tool box: How to abort a command/process
## Linux basics: Part II
### Files: Symbolic links and the soft link (-s)
#### Syntax: ln -s FileYouWantToLink/PointTo NameYouWantToGiveIt{-}
### Understanding a fasta file format
#### The pipe operator will redirect output of a command to another command.{-}
### Understanding fastq (fq) file format
##### What is the difference between this and a fasta file?{-}
### Using grep (global regular expression print) to extract metrics
#### Syntax: grep [options] "expression" filename{-}
### Working with compressed files
### Start ^ and end $ symbols
### Files: parsing and creating data-subsets
#### Exercise:{-}
### Files: parsing and creating data-subsets 
### Files: parsing and creating data-subsets 
### Files: parsing and creating data-subsets 
### Revisiting table1 and *previous* awk command 
### Files: **S**tream **ED**itor (sed)
#### Syntax: sed s/pattern/replacement{-}
### The Bash "for" Loop
#### Syntax: for variablename in filenameexpression; do command ${variablename}; done {-}
### Help with command syntax
### Exercises
## Linux basics: Part III
### Using the Screen Command{-}
### Exercise{-}
### More Exercises

<!--chapter:end:02-Linux.Rmd-->

---
output:
  pdf_document: default
  html_document: default
---
# Sequence metrics and quality Control

## The Study Design

**3 replicates** of control mice and **3 replicates** of water avoidance stress induced mice.  
We will run our own analysis on these publically-available data and compare the Differential Gene Expression and Pathway Enrichment results to those published here:   
https://www.mdpi.com/2218-1989/13/3/453

Why have a control?

Why have replicates?


## Data Access

Start a screen 
```{bash, eval=FALSE}
screen -S DGE
```

Make a new directory branch (~/DGE_workshop/**reads**)  

Enter your new directory  


Option 1 - From your home directory:
```{bash, eval=FALSE}
cd DGE_workshop

mkdir reads

cd reads
```


We have a list of 6 Sequencing Run Result (**SRR**) unique identifiers for samples stored in the Sequence Read Archive (SRA) database managed by NCBI. We can use this simple list (in .txt format) to download the sequences that correspond to each of the 6 SRR identifiers in the list. 
Though we only have 6 samples, the sequencing **depth** is very high, making the download time ~ 1hr. 

<span style="color:red">Time will not allow for us to complete this command, so we will cancel the command after checking it works.</span>  


Download fastq sequence files from NCBI's SRA:
```{bash, eval=FALSE}
conda activate sra-tools

for i in `cat SRRlist.txt`; do
  fasterq-dump --outdir . ${line}
done < ~/DGE_workshop/reads/SRRlist.txt

```

Softlink the data (after you cancel the command above):
```{bash, eval=FALSE}
ln -s ~/DGE_workshop/reads/*.gz .
```

## fastp Quality Control  


It will take ~ 15 minutes to run **fastp** on the .fastq sequencing files for all 6 samples using 10 threads.  

Activate the fastp environment:
```{bash, eval=FALSE}
conda deactivate

conda activate fastp
```


```{bash}
for file in SRR*; do
 bn=`basename ${file} .fastq.gz`
 echo $bn
done
```


Check the quality of our 6 sample sequences using **fastp**:
```{bash, eval=FALSE}

for file in SRR*; do
 bn=`basename ${file} .fastq.gz`
 fastp -w 10 -h "${file}.html" -i "${file}" -o "${bn}.fastp.fastq.gz"
done

```

Download and view the .html report.

## Checking Sequence Metrics

awk "NR" variable:  
There are several built-in awk variables. One of them is "NR" (Number of Rows/Records). It contains line number and can be used to determine the total number of lines in a file.  
The following example illustrates one use of this built-in variable.

Prints the line number of the first 10 lines
```{bash, eval=FALSE}
zcat SRR23869771.fastp.fastq.gz | awk '{print NR}' | head

```
awk modulo arithmatic:  
The modulo operator, "%", returns the remainder of division.  
The expression "5 % 2", for example, would evaluate to 1, while "9 % 3" would evaluate to 0 because the dividend (9) is a multiple of the divisor (3).
```{bash, eval=FALSE}

zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print length ($1)}' | head

zcat SRR23869771.fastp.fastq.gz | awk '{ if ( NR % 4 == 2) print $0}' | wc -l

```


 Option 2 - <span style="color:green">Advanced.</span>
```{bash, eval=FALSE}
mkdir -p DGE_workshop/reads

cd DGE_workshop/reads


<!--chapter:end:03-QC.Rmd-->


# Read Alignment

Placeholder


## Reference acquisition
## HISAT2
## .sam format
## Cigar string
## Alignment metrics
## .sam/.bam conversion and alignment sorting

<!--chapter:end:04-Alignment.Rmd-->


# Abundance Estimation

Placeholder


## Format check
## FeatureCounts
### Generate counts from the alignments directory {-}

<!--chapter:end:05-Abundancy_Estimation.Rmd-->


# Differential Gene Expression Analysis

Placeholder


## R data import
## Format metadata
## Normalization
## DESeq2
## Heatmap
## Gene count plots

<!--chapter:end:06-DGE.Rmd-->


# Additional visualization

Placeholder


## Graphing with ggplot2
## Data and setup
## Scatter plots
## Volcano plot
## Boxplots
## Violin plots
## Interactive Plots
## Homework (optional)

<!--chapter:end:07-Additional_Visualization.Rmd-->


# Pathway Enrichment Analysis

Placeholder


## Parsing
## Cytoscape and ClueGO
## Additional options
## CluePedia

<!--chapter:end:08-PE.Rmd-->


# Coexpression Networks

Placeholder


## Preparing the environment
## Read in data
## Check Samples and Genes
## Sample outliers
## Get the phenotype data
## Filter and normalize genes        
## Soft-threshold
## Create Network
## Get the module eigengenes
## Compare modules to phenotypes
## Driver genes

<!--chapter:end:09-CoexpressionNetworks.Rmd-->


# Sequencing Technologies and Lab Protocols

Placeholder


## Sequencing technologies
#### Short Reads{-}
#### Long Reads{-}
### Sanger Sequencing
## Illumina Sequencing
## PacBio Sequencing
## Oxford Nanopore Sequencing

<!--chapter:end:10-SeqTechLabProtocols.Rmd-->


# DESeq2 Model Design

Placeholder


## One Factor
## Two Factors
## A little bit on statistical tests
## Time series
## More resources

<!--chapter:end:11-ModelsStatistics.Rmd-->


# Wrap Up

Placeholder


## Acknowledgement
## Survey
## Questions
## Server access and acknowledgements
## Bookdown document
## Zoom recordings

<!--chapter:end:12-WrapUp.Rmd-->