CHIPseq Manual

Version 1.0.6

May 31, 2019

BICF ChIP-seq Pipeline

Introduction

BICF ChIPseq is a best-practice bioinformatics pipeline for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis, used at the BICF in the UT Southwestern Department of Bioinformatics.

The pipeline uses Nextflow, a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results.

This pipeline is primarily run under SLURM on the BioHPC cluster. However, it should be able to run on any system that supports Nextflow.

Additionally, the pipeline is designed to work with the Astrocyte Workflow System through a simple web interface.

The current version of the software and the issue tracker are at https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis

To download the current version of the software:

$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/chipseq_analysis.git

Input files

1) Fastq Files

You will need the full path to the FastQ files for the Bash script.

2) Design File

The design file is a tab-delimited file with 8 columns for single-end data and 9 columns for paired-end data. Letters, numbers, and underscores can be used in the names; however, names must begin with a letter. Columns must be as follows:
1. sample_id: a short, unique, and concise name used to label output files; it will be used as a control_id if it is the control sample
2. experiment_id: biosample_treatment_factor; the same name is given to all replicates of a treatment. It will be used for the consensus header.
3. biosample: symbol for tissue type or cell line
4. factor: symbol for antibody target
5. treatment: symbol of treatment applied
6. replicate: a number, usually from 1 to 3 (e.g. 1)
7. control_id: sample_id name that is the control for this sample
8. fastq_read1: name of fastq file 1 for SE or PE data
9. fastq_read2: name of fastq file 2 for PE data
See HERE for an example design file, paired-end
See HERE for an example design file, single-end
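
The column rules above can be checked before a run. The following is a minimal Python sketch, not part of the pipeline. It assumes the design file begins with a header row holding the column names listed above, and it applies the naming rule only to sample_id, experiment_id, and control_id. That choice, and the names check_design and NAME_COLUMNS, are illustrative.

```python
import csv
import io
import re

# Column names as listed above; fastq_read2 is present only for paired-end data.
SINGLE_END_COLUMNS = [
    "sample_id", "experiment_id", "biosample", "factor",
    "treatment", "replicate", "control_id", "fastq_read1",
]
PAIRED_END_COLUMNS = SINGLE_END_COLUMNS + ["fastq_read2"]

# Names may contain letters, numbers, and underscores, and must begin with a letter.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

# Assumption: these are the columns the naming rule applies to.
NAME_COLUMNS = ["sample_id", "experiment_id", "control_id"]


def check_design(text, paired_end=False):
    """Return a list of problems found in tab-delimited design file text."""
    expected = PAIRED_END_COLUMNS if paired_end else SINGLE_END_COLUMNS
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Assumption: the first line of the file is a header with the column names.
    if reader.fieldnames != expected:
        return ["header should be: " + ", ".join(expected)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column in NAME_COLUMNS:
            value = row[column] or ""  # a short row yields None for missing fields
            if not NAME_PATTERN.match(value):
                problems.append(f"line {line_number}: invalid {column} {value!r}")
    return problems
```

Calling check_design(open(path).read(), paired_end=True) on a paired-end design file returns an empty list when the header and names pass these checks. It does not verify that the fastq files exist or that each control_id matches a sample_id.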

3) Bash Script

You will need to create a bash script to run the CHIPseq pipeline on BioHPC.
The pipeline has been optimized to run on the correct partition.
See HERE for an example bash script.
The parameters that must be specified are:
- --reads '/path/to/files/name.fastq.gz'
- --designFile '/path/to/file/design.txt'
- --genome 'GRCm38', 'GRCh38', or 'GRCh37' (if you need to use another genome, contact the BICF)
- --pairedEnd 'true' or 'false' (where 'true' is PE and 'false' is SE; default 'false')
- --outDir (optional) path and folder name for the output data, for example /home2/s000000/Desktop/Chipseq_output (if not specified, output will be under workflow/output/)
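
To show how the parameters above fit together, here is a hedged Python sketch that assembles the command line a bash script would contain. The flags come from the list above. The entry point workflow/main.nf and the helper names build_command and to_shell are assumptions for illustration; check the example bash script for the exact invocation and any SLURM or module lines.

```python
import shlex

# Genomes named in the parameter list above.
SUPPORTED_GENOMES = ("GRCm38", "GRCh38", "GRCh37")


def build_command(reads, design_file, genome, paired_end=False, out_dir=None,
                  script="workflow/main.nf"):
    """Assemble the argument list for one pipeline run.

    `script` is an assumed location of the Nextflow entry point; adjust it
    to match your checkout.
    """
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome {genome!r}; contact the BICF")
    command = [
        "nextflow", "run", script,
        "--reads", reads,
        "--designFile", design_file,
        "--genome", genome,
        "--pairedEnd", "true" if paired_end else "false",
    ]
    # --outDir is optional; when omitted, output goes under workflow/output/.
    if out_dir is not None:
        command += ["--outDir", out_dir]
    return command


def to_shell(command):
    """Quote each argument so the line can be pasted into a bash script."""
    return " ".join(shlex.quote(part) for part in command)
```

Quoting matters when --reads contains a wildcard, because the pattern should reach the pipeline unexpanded by the shell.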

Pipeline

There are 11 steps in the pipeline:
1. Check input files
2. Trim adaptors with TrimGalore!
3. Align trimmed reads with bwa, then sort and convert to bam with samtools
4. Mark duplicates with Sambamba, and filter reads with samtools
5. Calculate quality metrics with deepTools
6. Calculate cross-correlation using PhantomPeakQualTools
7. Call peaks with MACS
8. Calculate consensus peaks
9. Annotate all peaks using ChIPseeker
10. Calculate differential binding activity with DiffBind (if there is more than 1 replicate in more than 1 experiment)
11. Use MEME-ChIP to find motifs in original peaks

See FLOWCHART

Output Files

Folder	File	Description
design	N/A	Inputs used for analysis; can ignore
trimReads	*_trimming_report.txt	report detailing how many reads were trimmed
trimReads	*_trimmed.fq.gz	trimmed fastq files used for analysis
alignReads	*.srt.bam.flagstat.qc	QC metrics from the mapping process
alignReads	*.srt.bam	sorted bam file
filterReads	*.dup.qc	QC metrics from marking duplicate reads (sambamba)
filterReads	*.filt.nodup.bam	filtered bam file with duplicate reads removed
filterReads	*.filt.nodup.bam.bai	indexed filtered bam file
filterReads	*.filt.nodup.flagstat.qc	QC metrics of filtered bam file (mapping stats, samtools)
filterReads	*.filt.nodup.pbc.qc	QC metrics of library complexity
convertReads	*.filt.nodup.bedse.gz	bed alignment in BEDPE format
convertReads	*.filt.nodup.tagAlign.gz	bed alignment in BEDPE format, same as bedse unless samples are paired-end
multiqcReport	multiqc_report.html	Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
experimentQC	coverage.pdf	plot to assess the sequencing depth of a given sample
experimentQC	*_fingerprint.pdf	plot to determine if the antibody-treatment enriched sufficiently
experimentQC	heatmeap_SpearmanCorr.pdf	plot of Spearman correlation between samples
experimentQC	heatmeap_PearsonCorr.pdf	plot of Pearson correlation between samples
experimentQC	sample_mbs.npz	array of multiple BAM summaries
crossReads	*.cc.plot.pdf	Plot of cross-correlation to assess signal-to-noise ratios
crossReads	*.cc.qc	cross-correlation metrics. File HEADER
callPeaksMACS	pooled/*pooled.fc_signal.bw	bigwig data file; raw fold enrichment of sample/control
callPeaksMACS	pooled/*pooled_peaks.xls	Excel file of peaks
callPeaksMACS	pooled/*.pvalue_signal.bw	bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS	pooled/*_pooled.narrowPeak	peaks file; see HERE for ENCODE narrowPeak header format
consensusPeaks	*.rejected.narrowPeak	peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks	*.replicated.narrowPeak	peaks supported by multiple testing (replicates and pseudo-replicates)
peakAnnotation	*.chipseeker_annotation.tsv	annotated narrowPeaks file
peakAnnotation	*.chipseeker_pie.pdf	pie graph of where annotated narrow peaks occur
peakAnnotation	*.chipseeker_upsetplot.pdf	upset plot showing the count of overlaps of the genes with different annotated locations
motifSearch	*_memechip/index.html	interactive HTML link of MEME output
motifSearch	sorted-*.replicated.narrowPeak	Top 600 peaks sorted by p-value; input for motifSearch
motifSearch	*_memechip/combined.meme	MEME identified motifs
diffPeaks	heatmap.pdf	Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks	normcount_peaksets.txt	Use only for replicated samples; peak set values of each sample
diffPeaks	pca.pdf	Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks	*_diffbind.bed	Use only for replicated samples; bed file of peak locations between replicates
diffPeaks	*_diffbind.csv	Use only for replicated samples; CSV file of peaks between replicates
plotProfile	plotProfile.png	Plot profile of the TSS region
plotProfile	computeMatrix.gz	Compute Matrix from deeptools to create custom plots other than plotProfile
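
The motifSearch input above is described as the top 600 peaks sorted by p-value. As an illustration only, the sketch below selects such a subset from narrowPeak lines. It relies on the ENCODE narrowPeak layout, where column 8 holds -log10(p-value), so larger values are more significant. The function name top_peaks is hypothetical, and the pipeline's own sorting may differ in detail (for example, in how ties are ordered).

```python
def top_peaks(lines, limit=600):
    """Return the `limit` most significant narrowPeak lines.

    Column 8 of the ENCODE narrowPeak format holds -log10(p-value), so the
    records are sorted in descending order of that column.
    """
    # Split each non-empty line into its tab-separated fields.
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # list.sort is stable, so peaks with equal p-values keep their input order.
    records.sort(key=lambda fields: float(fields[7]), reverse=True)
    return ["\t".join(fields) for fields in records[:limit]]
```

For a file on disk, something like top_peaks(open(path)) would return the selected lines without trailing newlines.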

Common Quality Control Metrics

These are the files that should be reviewed before continuing with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards HERE.
2. experimentQC/*_fingerprint.pdf: make sure the plot's information is correct for your antibody/input. See HERE for more details.
3. crossReads/*.cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See HERE for more details.
4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See HERE for more details.
5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See HERE for more details.
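
The NSC and RSC thresholds in item 4 can be checked with a few lines of code. The sketch below is a minimal example, assuming each *.cc.qc file holds one tab-delimited line per sample, with NSC in column 9 and RSC in column 10 as stated above. The function name check_cross_correlation is hypothetical.

```python
NSC_THRESHOLD = 1.1
RSC_THRESHOLD = 0.8


def check_cross_correlation(qc_line, is_input=False):
    """Check NSC (column 9) and RSC (column 10) of one *.cc.qc line.

    Experiment samples should exceed both thresholds; input (control)
    samples should fall below them. Returns a pass/fail flag per metric.
    """
    # Assumption: fields are tab-delimited; columns are 1-indexed in the text,
    # so column 9 is index 8 and column 10 is index 9.
    fields = qc_line.rstrip("\n").split("\t")
    nsc, rsc = float(fields[8]), float(fields[9])
    if is_input:
        return {"NSC": nsc < NSC_THRESHOLD, "RSC": rsc < RSC_THRESHOLD}
    return {"NSC": nsc > NSC_THRESHOLD, "RSC": rsc > RSC_THRESHOLD}
```

A False value for either metric is a cue to inspect the matching cross-correlation plot rather than an automatic reason to discard the sample.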

Common Errors

If you find an error, please let the BICF know and we will add it here.

Citation

Please cite the individual programs and versions used HERE, and the pipeline (doi:10.5281/zenodo.2648844). Please include the following in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).

Programs and Versions

python/3.6.1-2-anaconda
trimgalore/0.4.1
cutadapt/1.9.1
bwa/intel/0.7.12
samtools/1.6
sambamba/0.6.6
bedtools/2.26.0
deeptools/2.5.0.1
phantompeakqualtools/1.2
macs/2.1.0-20151222
UCSC_userApps/v317
R/3.4.1
SPP/1.14
meme/4.11.1-gcc-openmpi
ChIPseeker
DiffBind

Credits

This example workflow is derived from original scripts kindly contributed by the Bioinformatics Core Facility (BICF), in the Department of Bioinformatics.
