Cut&Tag Data Processing Guide

Intro:

Cleavage Under Targets & Tagmentation (CUT&Tag)is an ChIP-like profiling strategy in which primary antibodies are bound to chromatin proteins in situ after nuclei have been permeabilized. Secondary antibodies covalently bound to the cut-and-paste transposase Tn5 are then bound to the primary antibody. Activation of the transposase simultaneously digests DNA and adds high throughput sequencing adapters (referred to as ‘tagmentation’) for paired-end DNA sequencing. Cut&Tag libraries have a much higher signal-to-noise ratio than traditional ChIP libraries and need far fewer cells as well.

Below is documentation for set-up and a single script for parsing Cut&Tag data based off of the data analysis protocol presented here from the Cut&Tag protocol and paper. Additionally, there is an nf-core pipeline for Cut&Tag. Because I want more control over the processing of the data, I will write my own wrapper based on the Henikoff Lab's data analysis protocol.

Here are related papers and resources for Cut&Tag:

A 2023 review on Cut&Tag
Integrative Analysis of CUT&Tag and RNA-Seq Data Through Bioinformatics: A Unified Workflow for Enhanced Insights
CUT&RUNTools 2.0: a pipeline for single-cell and bulk-level CUT&RUN and CUT&Tag data analysis
Spatial-CUT&Tag: Spatially resolved chromatin modification profiling at the cellular level
Active Motif's Spike in Strategy
GoPeaks: histone modification peak calling for CUT&Tag
CUT&Tag recovers up to half of ENCODE ChIP-seq peaks in modifications of H3K27
ChIPseqSpikeInFree: a package for Spike-in Free analysis
Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling
Another Henikoff tutorial on Cut&Tag Analysis
Really good resource on Bash operators

I submitted three groups of Cut&Tag biological replicates on 6.03.2024, 07.01.2024, and 7.19.2024. The later two submissions have technical replicates in them that should be combined. The sample identifiers for each are below:

6.03.2024 - Submission 1:

Cells	AB	Index	NuSeq ID
WT	Pol1AB	i71-i51	BM101
Pol1 degraded	Pol1AB	i71-i52	BM102
Pol1 degraded	K4Me3AB	i71-i53	BM104
WT	K4Me3AB	i71-i54	BM103
WT	Pol2AB	i72-i51	BM105
Pol 1 degraded	Pol2AB	i72-i52	BM106
WT	K9Me3AB	i72-i53	BM107
Pol 1 degraded	K9Me3AB	i72-i54	BM108

07.01.2024 - Submission 2:

Cells Rep1	AB	Index	NuSeq ID
WT	Pol1AB	i71-i51	BM109
Pol 1 degraded	Pol1AB	i71-i52	BM110
Pol 2 degraded	Pol1AB	i71-i53	BM111
WT	Pol2AB	i71-i54	BM112
Pol 1 degraded	Pol2AB	i72-i51	BM113
Pol 2 degraded	Pol2AB	i72-i52	BM114
Pol 1 degraded	K27AcAB	i72-i53	BM115
Pol 2 degraded	K27AcAB	i72-i54	BM116

Cells Rep2	AB	Index	NuSeq ID
WT	Pol1AB	i73-i51	BM117
Pol 1 degraded	Pol1AB	i73-i52	BM118
Pol 2 degraded	Pol1AB	i73-i53	BM119
WT	Pol2AB	i73-i54	BM120
Pol 1 degraded	Pol2AB	i74-i51	BM121
Pol 2 degraded	Pol2AB	i74-i52	BM122
Pol 1 degraded	K27AcAB	i74-i53	BM123
Pol 2 degraded	K27AcAB	i74-i54	BM124

07.19.2024 - Submission 3:

Cells Rep1	AB	Index	NuSeq ID
WT	Pol1AB	i71-i51	BM125
Pol 1 degraded	Pol1AB	i71-i52	BM126
Pol 2 degraded	Pol1AB	i71-i53	BM127
WT	CTCFAB	i71-i54	BM128
WT	CTCFAB	i72-i51	BM129
WT	CTCFAB	i72-i52	BM130
Pol 1 degraded	CTCFAB	i72-i53	BM131
Pol 2 degraded	CTCFAB	i72-i54	BM132

Cells Rep2	AB	Index	NuSeq ID
WT	K9AB	i73-i51	BM133
Pol 1 degraded	K9AB	i73-i52	BM134
Pol 2 degraded	K9AB	i73-i53	BM135
WT	K9AB	i73-i54	BM136
Pol 1 degraded	K9AB	i74-i51	BM137
Pol 2 degraded	K9AB	i74-i52	BM138
Pol 1 degraded	CTCFAB	i74-i53	BM139
Pol 2 degraded	CTCFAB	i74-i54	BM140

Cut&Tag uses Bowtie2 for alignment.

Build Bowtie indices for reference genome prior to alignment with Tophat using the following command bowtie2-build path_to_reference_genome.fa prefix_to_name_indexes using default bowtie2 on quest, bowtie2/2.4.5

For mapping to rDNA repeats + human genome, use the following reference. For general mapping, use Ensembl

VERSION=108
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz

OR UCSC: UCSC Genome and UCSC GTF (preferred). There are 4 different GTF references to use.

wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz
wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz

example SLURM header for Quest HPC

#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p genhimem
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 24:00:00
#SBATCH --job-name=build_genome
#SBATCH --output=outlog
#SBATCH --error=errlog

##-- header here --##

cd /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref

module load bowtie2/2.4.5

bowtie2-build /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38.fa hg38

GTF file can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf and genome index can be found at /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref

Spike in normalization

I am using ChIPseqSpikeInFree to normalize the data in lieu of a spike-in. I installed the command line version at: /home/lmc0633/executables/ChIPseqSpikeInFree using the command version of the installation detailed at the github and R/4.3.0

Need a metadata file with the following columns:

ANTIBODY GROUPANTIBODY GROUP

ID	ANTIBODY	GROUP
WT.bam	Pol1AB	control
Pol1.bam	Pol1AB	pol1ko
Pol2.bam	Pol1AB	pol2ko

Code is written in such a way that I can run all BAMs for a particular group at once. I deposited metadata at /projects/b1042/BackmanLab/Lucas/090124_CnT/metadata as individual .txt files for each submission. each submission is labelled S1, S2, or S3, with corresponding fastqs and alignment files in each respective directory. This can be run on SLURM or it can be run on an interaction session. Script should look something like below:

Shell submission script (if submitting via SLURM):

module load R/4.3.0

Rscript --vanilla --verbose spike.R

R script for Spike In normalization factor. Must have more than 1 core enabled for this to work:

## HPC Script for normalization factor generation
library("ChIPseqSpikeInFree")

## Must have more than 1 core enabled for this to work
library(BiocParallel)
numCores <- parallel::detectCores()
register(MulticoreParam(workers = numCores-1), default = TRUE)

n=2 ## 1, 2, or 3
## Get metadata path (Must be tab delimited)
dir.meta <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/metadata/"
dir.s <- c("06032024.txt", "07012024.txt", "07192024.txt")
meta <- paste0(dir.meta,dir.s[n])

## Get bam paths
dir.bam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/"
subdirs <- c("S1", "S2", "S3")
bam.path <- paste0(dir.bam, subdirs[n],"/BAM/")

## Get bam list
tbl <- read.table(meta, sep= "\t", header= T)
bams <- tbl$ID

## Write out txt files
#out <- c("06032024.txt", "07012024.txt", "07192024.txt")
#write.table(tbl, file = paste0(dir.meta, out[n]), quote=F, sep = "\t",row.names = F, col.names = TRUE)

library(stringr)
input <- str_trim(paste0(bam.path,bams), side = c("both", "left", "right"))

## Call function
ChIPseqSpikeInFree(bamFiles = input, chromFile = "hg38", metaFile = meta, prefix = paste0ls("/projects/b1042/BackmanLab/Lucas/090124_CnT/spikein/",subdirs[n]), ncores=4)

We need the size of each chromosome as a text file. this can be retrieved at UCSC:

wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.chrom.sizes

Once we have the scaling factors from , we want to integrate them into generation of our coverage files like so:

libSize=`cat sample1.bed|wc -l`
scale=15000000/($libSize*$SF)
genomeCoverageBed -bg -scale $scale -i sample1.bed  -g hg38.chromSizes > sample1.bedGraph
bedGraphToBigWig sample1.bedGraph hg38.chromSizes sample1.bw

OR

scale_factor=`echo "10000 / $seqDepth" | bc -l`
    echo "Scaling factor for $histName is: $scale_factor!"
    bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph

Peak calling

There are a few options for calling peaks on Cut&Tag data. The more leniant and traditional option is MACS2. Using MACS2 will increase the false positive rate but will capture more peaks overall. SEACR was designed with Cut&Run/TAG data (high SN) in mind. More recent and less used, GoPeaks is also specifically designed for Cut&Run/TAG data. There are several other peak callers, but for my purposes, I will use MACS2 and SEACR. Here is a short explanation on the difference between SEACR and MACS2. This HBC workshop covers calling peaks on Cut&Run/TAG data.

For SEACR, we need to retrieve two scripts and put them in a directory for use later:

mkdir SEACR ## in bin or scripts
wget https://github.com/FredHutch/SEACR/raw/master/SEACR_1.3.sh
wget https://github.com/FredHutch/SEACR/raw/master/SEACR_1.3.R

SEACR usage and example:

## usage
bash SEACR_1.3.sh experimental bedgraph [control bedgraph | numeric threshold] ["norm" | "non"] ["relaxed" | "stringent"] output prefix 
## example
bash SEACR_1.3.sh target.bedgraph 0.01 non stringent output

For the script, we call the following:

seacr=/home/lmc0633/executables/SEACR/SEACR_1.3.sh
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks

macs2 callpeak -t $projPath/alignment/bam/${org}/${time}/${hist}/bowtie2.mapped.bam \
      -c $projPath/alignment/bam/${org}/${timeControl}/${hist}/bowtie2.mapped.bam \
      -g ${genome} -f BAMPE -n macs2_peak_q0.1 --outdir $projPath/peakCalling/MACS2/${org}/${timeControl}_control/${time}/${hist}/
       -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/${org}/${timeControl}_control/${time}/${hist}/macs2Peak_summary.txt

mkdir -p $projPath/peakCalling
macs2 callpeak -t ${projPath}/alignment/bam/${histName}_rep1_bowtie2.mapped.bam \
      -c ${projPath}/alignment/bam/${controlName}_rep1_bowtie2.mapped.bam \
      -g hs -f BAMPE -n macs2_peak_q0.1 --outdir $projPath/peakCalling/MACS2 -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/macs2Peak_summary.txt

double array in linux

#!/bin/bash

array=( "Vietnam" "Germany" "Argentina" )
array2=( "Asia" "Europe" "America" )

for i in "${!array[@]}"; do
    printf "%s is in %s\n" "${array[i]}" "${array2[i]}"
done

Script to call the EU-seq alignment script (Tophat) is below

#!/bin/bash
#SBATCH -A p32171 ##--## SLURM alt #SBATCH -A b1042
#SBATCH -p genhimem

#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=13G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 48:00:00
#SBATCH --job-name=07.19.2024
#SBATCH --output=outlog
#SBATCH --error=errlog
##-- header here --##

echo -e "Usage: sh $0 -f <Forward Read> -w <Path to Working Directory> -g <Path to Bowtie Indices> -r <Path to Annotation File> -c <chromosome size file name > -t <threads> -m <min MAPQ score> -s <Spike in score (optional)> -p < path to SEACR (optional)> -b < sets MAC2 to --broad (optional)> -d <remove duplicates with Picard (optional)> \n"

## with MAC2
echo "Processing data"
for file in *R1_001.fastq.gz;
do echo "${file}";
   bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/Backman26_7.19.2024/ -g /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -c hg38.chrom.sizes -t 16 -m 2 -d ;
done

## with SEACR
echo "Processing data"
for file in *R1_001.fastq.gz;
do echo "${file}";
   bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/test/ -g /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -p /home/lmc0633/executables/SEACR/SEACR_1.3.sh -c hg38.chrom.sizes -t 16 -m 2;
done

For the correlation plot generation script, i adapted this code for use as follows:

##== R command ==##

library(corrplot)

list.files('/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S1/BED/', pattern = ".bin.bed")

n=1 # c(1,2,3)
path <- '/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/'
subdir <- c("S1", "S2", "S3")
dir <- paste0(path, subdir[n], "/BED/")
hists <- list.files(dir, pattern = ".bin.bed")

reprod = c()
fragCount = NULL
for(hist in hists){
  
  if(is.null(fragCount)){
    
    fragCount = read.table(paste0(dir, hist), header = FALSE) 
    colnames(fragCount) = c("chrom", "bin", hist)
    
  }else{
    
    fragCountTmp = read.table(paste0(dir, hist), header = FALSE)
    colnames(fragCountTmp) = c("chrom", "bin", hist)
    fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
    
  }
}

M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs") 

corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 0.5, cl.cex = 0.5, addCoef.col = "black", number.digits = 2, number.cex = 0.5, col = colorRampPalette(c("midnightblue","white","darkred"))(100))

Script for calling just SEACR peaks. Can't use NORM setting unless I have an input control

#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p genhimem
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=200G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 24:00:00
#SBATCH --job-name=build_genome
#SBATCH --output=outlog
#SBATCH --error=errlog
##-- header here --##

seacr=/home/lmc0633/executables/SEACR/SEACR_1.3.sh
bgfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/BEDGRAPH
pkfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks

module load R/4.3.0
module load bedtools

bash /home/lmc0633/executables/SEACR/SEACR_1.3.sh /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/BEDGRAPH/BM125_WT_S3.bedgraph 0.01 non stringent /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks/test.peaks
cd ${bgfold}

echo "Processing data"
for file in *.bedgraph;
do echo "${file}";


P1=$(echo $file | sed 's/.bedgraph/.seacr.peaks/g')

echo "8. Calling SEACR Peaks on ${file}"
echo -e "\t bash $seacr ${bgfold}/${file} 0.01 norm relaxed ${pkfold}/${P1}"

bash $seacr ${bgfold}/${file} 0.01 non stringent ${pkfold}/${P1}

done

echo -e "Peak calling complete \n"

Later modification

I looked into merging replicates using Coverage() in R and unfortunately, you lose all the metadate (qValue, pValue, signal) when doing so. I decided to merge my technical replicates at the FASTQ level and then call the peaks again on those. Following calling peaks, I plan to use Bedtools intersect to find the conensus peaks since it preserves the metadata and other information. Merging all technical replicates in submission 2 and in submission 3

out=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/merged/fastq/ 
in=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/Backman26_7.19.2024/

read1=BM135_Pol_2_degraded_S13_R2_001.fastq.gz  ; read2=BM138_Pol_2_degraded_S16_R2_001.fastq.gz ;  file=Merged.K9AB.P2KO.S2_R2_001.fastq.gz

#Merge input BAMS
cat ${in}${read1} ${in}${read2} > ${out}${file}

#!/bin/bash
#SBATCH -A b1042 ##--## SLURM start
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 6:00:00
#SBATCH --job-name=compute matrix
#SBATCH --output=outlog
#SBATCH --error=errlog

module load deeptools

##== linux command ==##

inpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/test/
bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM109_S1.rpgc.bigWig
pkfile=BM109_S1.seacr.peaks.relaxed.bed
otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g')
matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_WT
rep=S2
cores=10

awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \
              -R $outpath$otfile \
              --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}_SEACR_heatmap.png --sortUsing sum --startLabel "Peak Start" -\
-endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"


inpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/test/
bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM110_S2.rpgc.bigWig
pkfile=BM110_S2.seacr.peaks.relaxed.bed
otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g')
matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_P1KO
rep=S2

awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \
              -R $outpath$otfile \
              --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}_SEACR_heatmap.png --sortUsing sum --startLabel "Peak Start" -\
-endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"

downsampling reads

I'm downsampling my reads because some are over-sequenced and some are undersequenced. I worry this will contribute to variability between conditions, since some samples did experience overamplification. For this task, I am using seqtk

usage: seqtk sample -s <path/in/fastq.gz> <#ofsamples> > <path/out/fastq.gz>

First thing to do is count the total number of reads per file

#!/bin/bash
#SBATCH -A b1042 ##--## SLURM start
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=50G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH -t 2:00:00
#SBATCH --job-name=count_reads

for file in *R1_001.fastq.gz;
do echo "${file}";
READS=$(expr $(zcat ${file} | wc -l) / 4)
echo -e "\t file: $file has $READS reads  "
echo -e "\t file: $file reads: $READS " >> reads.txt
done

S1:

File	Reads	Name
file: K4AB-Pol1_S4_R1_001.fastq.gz	reads: 50,054,872	K4AB_P1KO
file: K4AB-WT_S3_R1_001.fastq.gz	reads: 68,741,393	K4AB_WT
file: K9AB-Pol1_S8_R1_001.fastq.gz	reads: 5,858,924	K9AB_P1KO
file: K9AB-WT_S7_R1_001.fastq.gz	reads: 2,249,616	K9AB_WT
file: P1AB-Pol1_S2_R1_001.fastq.gz	reads: 21,542,758	P1AB_P1KO
file: P1AB-WT_S1_R1_001.fastq.gz	reads: 4,246,499	P1AB_WT
file: P2AB-Pol1_S6_R1_001.fastq.gz	reads: 56,878,563	P2AB_P1KO
file: P2AB-WT_S5_R1_001.fastq.gz	reads: 21,233,193	P2AB_WT

S2:

File	Reads	Name
file: BM109_S1_R1_001.fastq.gz	reads: 2,831,267	P1AB_WT
file: BM110_S2_R1_001.fastq.gz	reads: 1,997,534	P1AB_P1KO
file: BM111_S3_R1_001.fastq.gz	reads: 2,761,436	P1AB_P2KO
file: BM112_S4_R1_001.fastq.gz	reads: 2,904,866	P2AB_WT
file: BM113_S5_R1_001.fastq.gz	reads: 8,911,631	P2AB_P1KO
file: BM114_S6_R1_001.fastq.gz	reads: 3,923,244	P2AB_P2KO
file: BM115_S7_R1_001.fastq.gz	reads: 11,335,882	K27AB_P1KO
file: BM116_S8_R1_001.fastq.gz	reads: 4,809,199	K27AB_P2KO
file: BM117_S9_R1_001.fastq.gz	reads: 408,680	P1AB_WT
file: BM118_S10_R1_001.fastq.gz	reads: 3,241,433	P1AB_P1KO
file: BM119_S11_R1_001.fastq.gz	reads: 3,001,093	P1AB_P2KO
file: BM120_S12_R1_001.fastq.gz	reads: 1,265,939	P2AB_WT
file: BM121_S13_R1_001.fastq.gz	reads: 4,573,860	P2AB_P1KO
file: BM122_S14_R1_001.fastq.gz	reads: 6,784,824	P2AB_P2KO
file: BM123_S15_R1_001.fastq.gz	reads: 2,668,701	K27AB_P1KO
file: BM124_S16_R1_001.fastq.gz	reads: 2,413,773	K27AB_P2KO

Changing names to:

BM109_S1_R1_001.fastq.gz P1AB_WT_S1_R1_001.fastq.gz BM110_S2_R1_001.fastq.gz P1AB_P1KO_S2_R1_001.fastq.gz BM111_S3_R1_001.fastq.gz P1AB_P2KO_S3_R1_001.fastq.gz BM112_S4_R1_001.fastq.gz P2AB_WT_S4_R1_001.fastq.gz BM113_S5_R1_001.fastq.gz P2AB_P1KO_S5_R1_001.fastq.gz BM114_S6_R1_001.fastq.gz P2AB_P2KO_S6_R1_001.fastq.gz BM115_S7_R1_001.fastq.gz K27AB_P1KO_S7_R1_001.fastq.gz BM116_S8_R1_001.fastq.gz K27AB_P2KO_S8_R1_001.fastq.gz BM117_S9_R1_001.fastq.gz P1AB_WT_S9_R1_001.fastq.gz BM118_S10_R1_001.fastq.gz P1AB_P1KO_S10_R1_001.fastq.gz BM119_S11_R1_001.fastq.gz P1AB_P2KO_S11_R1_001.fastq.gz BM120_S12_R1_001.fastq.gz P2AB_WT_S12_R1_001.fastq.gz BM121_S13_R1_001.fastq.gz P2AB_P1KO_S13_R1_001.fastq.gz BM122_S14_R1_001.fastq.gz P2AB_P2KO_S14_R1_001.fastq.gz BM123_S15_R1_001.fastq.gz K27AB_P1KO_S15_R1_001.fastq.gz BM124_S16_R1_001.fastq.gz K27AB_P2KO_S16_R1_001.fastq.gz

S3:

File	Reads	Name
file: BM125_WT_S3_R1_001.fastq.gz	reads: 45,823,405	P1AB_WT
file: BM126_Pol_1_degraded_S4_R1_001.fastq.gz	reads: 34,609,056	P1AB_P1KO
file: BM127_Pol_2_degraded_S5_R1_001.fastq.gz	reads: 28,356,451	P1AB_P2KO
file: BM128_WT_S6_R1_001.fastq.gz	reads: 34,639,433	CTCFAB_WT
file: BM129_WT_S7_R1_001.fastq.gz	reads: 49,503,419	CTCFAB_WT
file: BM130_WT_S8_R1_001.fastq.gz	reads: 47,559,479	CTCFAB_WT
file: BM131_Pol_1_degraded_S9_R1_001.fastq.gz	reads: 38,030,748	CTCFAB_P1KO
file: BM132_Pol_2_degraded_S10_R1_001.fastq.gz	reads: 28,916,144	CTCFAB_P2KO
file: BM133_WT_S11_R1_001.fastq.gz	reads: 63,835,066	K9AB_WT
file: BM134_Pol_1_degraded_S12_R1_001.fastq.gz	reads: 54,217,202	K9AB_P1KO
file: BM135_Pol_2_degraded_S13_R1_001.fastq.gz	reads: 53,024,658	K9AB_P2KO
file: BM136_WT_S14_R1_001.fastq.gz	reads: 49,808,186	K9AB_WT
file: BM137_Pol_1_degraded_S15_R1_001.fastq.gz	reads: 53,189,723	K9AB_P1KO
file: BM138_Pol_2_degraded_S16_R1_001.fastq.gz	reads: 72,659,514	K9AB_P2KO
file: BM139_Pol_1_degraded_S17_R1_001.fastq.gz	reads: 21,882,985	CTCFAB_P1KO
file: BM140_Pol_2_degraded_S18_R1_001.fastq.gz	reads: 23,661,905	CTCFAB_P2KO

Changing names to:

mv BM125_WT_S3_R2_001.fastq.gz P1AB_WT_S3_R2_001.fastq.gz mv BM126_Pol_1_degraded_S4_R2_001.fastq.gz P1AB_P1KO_S4_R2_001.fastq.gz mv BM127_Pol_2_degraded_S5_R2_001.fastq.gz P1AB_P2KO_S5_R2_001.fastq.gz mv BM128_WT_S6_R2_001.fastq.gz CTCFAB_WT_S6_R2_001.fastq.gz mv BM129_WT_S7_R2_001.fastq.gz CTCFAB_WT_S7_R2_001.fastq.gz mv BM130_WT_S8_R2_001.fastq.gz CTCFAB_WT_S8_R2_001.fastq.gz mv BM131_Pol_1_degraded_S9_R2_001.fastq.gz CTCFAB_P1KO_S9_R2_001.fastq.gz mv BM132_Pol_2_degraded_S10_R2_001.fastq.gz CTCFAB_P2KO_S10_R2_001.fastq.gz mv BM133_WT_S11_R2_001.fastq.gz K9AB_WT_S11_R2_001.fastq.gz mv BM134_Pol_1_degraded_S12_R2_001.fastq.gz K9AB_P1KO_S12_R2_001.fastq.gz mv BM135_Pol_2_degraded_S13_R2_001.fastq.gz K9AB_P2KO_S13_R2_001.fastq.gz mv BM136_WT_S14_R2_001.fastq.gz K9AB_WT_S14_R2_001.fastq.gz mv BM137_Pol_1_degraded_S15_R2_001.fastq.gz K9AB_P1KO_S15_R2_001.fastq.gz mv BM138_Pol_2_degraded_S16_R2_001.fastq.gz K9AB_P2KO_S16_R2_001.fastq.gz mv BM139_Pol_1_degraded_S17_R2_001.fastq.gz CTCFAB_P1KO_S17_R2_001.fastq.gz mv BM140_Pol_2_degraded_S18_R2_001.fastq.gz CTCFAB_P2KO_S18_R2_001.fastq.gz

Then we can normalize based on read #

#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p short
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=150G
#SBATCH --mail-user=lucascarter2025@u.northwestern.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 3:00:00
#SBATCH --job-name=downsample
#SBATCH --output=outlog
#SBATCH --error=errlog

cutoff=1000000
norm=2000000
highnorm=10000000
seed=100

## Paths
inpath=$1 ## usage sbatch <sample.sh> </inpath/>
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/sampled/fastq/

module load seqtk/Dec17

for file in *R1_001.fastq.gz;
do echo "${file}";

## File names
reads=$(expr $(zcat ${file} | wc -l) / 4)
R2=$(echo $file | sed 's/R1_001.fastq.gz/R2_001.fastq.gz/g')
R1out=$(echo $file | sed 's/R1_001.fastq.gz/sampled_R1_001.fastq/g')
R2out=$(echo $R2 | sed 's/R2_001.fastq.gz/sampled_R2_001.fastq/g')

if [[ $reads -lt $cutoff ]] ## If there is no path to seacr software, use MACS2
then

    echo -e "\t Sample ${file} has too few reads: ${reads} \n"

elif [[ $reads -lt $norm ]] && [[ $reads -gt $cutoff ]]
then

echo -e "\t Sample ${file} with ${reads} reads: sampled to 1,500,000 "
echo -e "\t seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} 1500000 > ${outpath}${R2out}

elif [[ $reads -gt $highnorm ]] 
then

echo -e "\t Sample ${file} with ${reads} reads: sampled to $highnorm "
echo -e "\t seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} $highnorm > ${outpath}${R2out}

else

echo -e "\t Sample ${file} with ${reads} reads: sampled to $norm "
echo -e "\t seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} $norm > ${outpath}${R2out}

fi
done

gzip *.fastq

I checked the duplication rate using the following code and found that the rates for each sample were quite high (consistent with overamplication by PCR (my fault) and oversequencing by the core (NuSeq's fault)). This publication addresses these issues and found similar duplication issues. Here is the code I used:

## Summarize the duplication information from the picard summary outputs.

n=3
dir.sam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/"
subdirs <- c("S1", "S2", "S3")
sam.path <- paste0(dir.sam, subdirs[n],"/SAM/")
hists <- list.files(sam.path, pattern = ".dupMark.txt")

dupResult = c()
for(hist in hists){
  dupRes = read.table(paste0(sam.path,hist), header = TRUE, fill = TRUE)
  
  histInfo = strsplit(hist, "_")[[1]]
  dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNum_hg38 = dupRes$READ_PAIRS_EXAMINED[1] %>% as.character %>% as.numeric, DuplicationRate = dupRes$PERCENT_DUPLICATION[1] %>% as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATED_LIBRARY_SIZE[1] %>% as.character %>% as.numeric) %>% mutate(UniqueFragNum = MappedFragNum_hg38 * (1-DuplicationRate/100))  %>% rbind(dupResult, .)
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Cut&Tag Data Processing Guide

Intro:

Spike in normalization

Peak calling

Later modification

downsampling reads

Files

README.md

Latest commit

History

README.md

File metadata and controls

Cut&Tag Data Processing Guide

Intro:

Spike in normalization

Peak calling

Later modification

downsampling reads