
# Cut&Tag Data Processing Guide

## Intro

Cleavage Under Targets & Tagmentation (CUT&Tag) is a ChIP-like profiling strategy in which primary antibodies are bound to chromatin proteins in situ after nuclei have been permeabilized. Secondary antibodies covalently bound to the cut-and-paste transposase Tn5 then bind the primary antibody. Activation of the transposase simultaneously digests DNA and adds high-throughput sequencing adapters (a step referred to as 'tagmentation') for paired-end DNA sequencing. Cut&Tag libraries have a much higher signal-to-noise ratio than traditional ChIP libraries and require far fewer cells as well.

Below is documentation for set-up and a single script for parsing Cut&Tag data, based on the data analysis protocol presented here from the Cut&Tag protocol and paper. Additionally, there is an nf-core pipeline for Cut&Tag. Because I want more control over the processing of the data, I will write my own wrapper based on the Henikoff Lab's data analysis protocol.

Here are related papers and resources for Cut&Tag:

I submitted three groups of Cut&Tag biological replicates on 6.03.2024, 07.01.2024, and 7.19.2024. The latter two submissions contain technical replicates that should be combined. The sample identifiers for each are below:

6.03.2024 - Submission 1:

| Cells | AB | Index | NuSeq ID |
|---|---|---|---|
| WT | Pol1AB | i71-i51 | BM101 |
| Pol 1 degraded | Pol1AB | i71-i52 | BM102 |
| Pol 1 degraded | K4Me3AB | i71-i53 | BM104 |
| WT | K4Me3AB | i71-i54 | BM103 |
| WT | Pol2AB | i72-i51 | BM105 |
| Pol 1 degraded | Pol2AB | i72-i52 | BM106 |
| WT | K9Me3AB | i72-i53 | BM107 |
| Pol 1 degraded | K9Me3AB | i72-i54 | BM108 |

07.01.2024 - Submission 2:

Rep 1:

| Cells | AB | Index | NuSeq ID |
|---|---|---|---|
| WT | Pol1AB | i71-i51 | BM109 |
| Pol 1 degraded | Pol1AB | i71-i52 | BM110 |
| Pol 2 degraded | Pol1AB | i71-i53 | BM111 |
| WT | Pol2AB | i71-i54 | BM112 |
| Pol 1 degraded | Pol2AB | i72-i51 | BM113 |
| Pol 2 degraded | Pol2AB | i72-i52 | BM114 |
| Pol 1 degraded | K27AcAB | i72-i53 | BM115 |
| Pol 2 degraded | K27AcAB | i72-i54 | BM116 |

Rep 2:

| Cells | AB | Index | NuSeq ID |
|---|---|---|---|
| WT | Pol1AB | i73-i51 | BM117 |
| Pol 1 degraded | Pol1AB | i73-i52 | BM118 |
| Pol 2 degraded | Pol1AB | i73-i53 | BM119 |
| WT | Pol2AB | i73-i54 | BM120 |
| Pol 1 degraded | Pol2AB | i74-i51 | BM121 |
| Pol 2 degraded | Pol2AB | i74-i52 | BM122 |
| Pol 1 degraded | K27AcAB | i74-i53 | BM123 |
| Pol 2 degraded | K27AcAB | i74-i54 | BM124 |

07.19.2024 - Submission 3:

Rep 1:

| Cells | AB | Index | NuSeq ID |
|---|---|---|---|
| WT | Pol1AB | i71-i51 | BM125 |
| Pol 1 degraded | Pol1AB | i71-i52 | BM126 |
| Pol 2 degraded | Pol1AB | i71-i53 | BM127 |
| WT | CTCFAB | i71-i54 | BM128 |
| WT | CTCFAB | i72-i51 | BM129 |
| WT | CTCFAB | i72-i52 | BM130 |
| Pol 1 degraded | CTCFAB | i72-i53 | BM131 |
| Pol 2 degraded | CTCFAB | i72-i54 | BM132 |

Rep 2:

| Cells | AB | Index | NuSeq ID |
|---|---|---|---|
| WT | K9AB | i73-i51 | BM133 |
| Pol 1 degraded | K9AB | i73-i52 | BM134 |
| Pol 2 degraded | K9AB | i73-i53 | BM135 |
| WT | K9AB | i73-i54 | BM136 |
| Pol 1 degraded | K9AB | i74-i51 | BM137 |
| Pol 2 degraded | K9AB | i74-i52 | BM138 |
| Pol 1 degraded | CTCFAB | i74-i53 | BM139 |
| Pol 2 degraded | CTCFAB | i74-i54 | BM140 |

Cut&Tag uses Bowtie2 for alignment.

Build Bowtie2 indices for the reference genome prior to alignment with the command `bowtie2-build path_to_reference_genome.fa prefix_to_name_indexes`, using the default Bowtie2 module on Quest, bowtie2/2.4.5.

For mapping to rDNA repeats + the human genome, use the following reference. For general mapping, use Ensembl:

```bash
VERSION=108
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
```

Or use UCSC (preferred): the UCSC genome FASTA and a UCSC GTF. There are four different GTF references to choose from; I use ncbiRefSeq:

```bash
wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz
wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
```

Example SLURM header for Quest HPC:

```bash
#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p genhimem
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=10G
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 24:00:00
#SBATCH --job-name=build_genome
#SBATCH --output=outlog
#SBATCH --error=errlog
##-- header here --##

cd /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref

module load bowtie2/2.4.5

bowtie2-build /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38.fa hg38
```

The GTF file can be found at `/projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf`, and the genome index can be found at `/projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref`.
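The alignment itself happens inside the CutTag.sh wrapper (called further below). For reference, here is a minimal sketch of the paired-end Bowtie2 call, assuming the parameters recommended in the Henikoff lab tutorial (the wrapper's exact flags may differ, and the sample names are placeholders):

```bash
## Sketch of the paired-end alignment step with Henikoff-tutorial parameters:
## end-to-end, very sensitive, no mixed/discordant pairs, fragments 10-700 bp
module load bowtie2/2.4.5

ref=/projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38

bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 \
        -I 10 -X 700 -p 12 -x $ref \
        -1 sample_R1_001.fastq.gz -2 sample_R2_001.fastq.gz \
        -S sample_bowtie2.sam
```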

## Spike-in normalization

I am using ChIPseqSpikeInFree to normalize the data in lieu of a spike-in. I installed the command-line version at /home/lmc0633/executables/ChIPseqSpikeInFree, following the command-line installation instructions on the GitHub page, with R/4.3.0.

We need a tab-delimited metadata file with the columns ID, ANTIBODY, and GROUP. For example:

| ID | ANTIBODY | GROUP |
|---|---|---|
| WT.bam | Pol1AB | control |
| Pol1.bam | Pol1AB | pol1ko |
| Pol2.bam | Pol1AB | pol2ko |
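A minimal sketch of building one of these tab-delimited files from the shell (the rows are the illustrative ones from the table above; the real files list the BAMs for each submission):

```bash
## Illustrative only: the real metadata files live in
## /projects/b1042/BackmanLab/Lucas/090124_CnT/metadata. Columns are tab-delimited.
printf "ID\tANTIBODY\tGROUP\n"        > 06032024.txt
printf "WT.bam\tPol1AB\tcontrol\n"   >> 06032024.txt
printf "Pol1.bam\tPol1AB\tpol1ko\n"  >> 06032024.txt
printf "Pol2.bam\tPol1AB\tpol2ko\n"  >> 06032024.txt
```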

The code is written in such a way that I can run all BAMs for a particular group at once. I deposited metadata at /projects/b1042/BackmanLab/Lucas/090124_CnT/metadata as individual .txt files for each submission. Each submission is labelled S1, S2, or S3, with corresponding FASTQs and alignment files in each respective directory. This can be run via SLURM or in an interactive session. The script should look something like below:

Shell submission script (if submitting via SLURM):

```bash
module load R/4.3.0

Rscript --vanilla --verbose spike.R
```

R script for generating the spike-in normalization factor (spike.R). More than one core must be enabled for this to work:

```r
## HPC Script for normalization factor generation
library("ChIPseqSpikeInFree")

## Must have more than 1 core enabled for this to work
library(BiocParallel)
numCores <- parallel::detectCores()
register(MulticoreParam(workers = numCores-1), default = TRUE)

n=2 ## 1, 2, or 3
## Get metadata path (must be tab delimited)
dir.meta <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/metadata/"
dir.s <- c("06032024.txt", "07012024.txt", "07192024.txt")
meta <- paste0(dir.meta,dir.s[n])

## Get bam paths
dir.bam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/"
subdirs <- c("S1", "S2", "S3")
bam.path <- paste0(dir.bam, subdirs[n],"/BAM/")

## Get bam list
tbl <- read.table(meta, sep= "\t", header= T)
bams <- tbl$ID

## Write out txt files
#out <- c("06032024.txt", "07012024.txt", "07192024.txt")
#write.table(tbl, file = paste0(dir.meta, out[n]), quote=F, sep = "\t",row.names = F, col.names = TRUE)

## Build full bam paths
library(stringr)
input <- str_trim(paste0(bam.path,bams), side = "both")

## Call function
ChIPseqSpikeInFree(bamFiles = input, chromFile = "hg38", metaFile = meta, prefix = paste0("/projects/b1042/BackmanLab/Lucas/090124_CnT/spikein/",subdirs[n]), ncores=4)
```

We need the size of each chromosome as a text file. This can be retrieved from UCSC:

```bash
wget -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.chrom.sizes
```

Once we have the scaling factors from ChIPseqSpikeInFree, we want to integrate them into the generation of our coverage files like so:

```bash
## Scale to 15M reads, adjusted by the ChIPseqSpikeInFree scaling factor ($SF)
libSize=`cat sample1.bed | wc -l`
scale=$(echo "15000000/($libSize*$SF)" | bc -l)
genomeCoverageBed -bg -scale $scale -i sample1.bed -g hg38.chromSizes > sample1.bedGraph
## bedGraphToBigWig expects input sorted by chromosome and start position
sort -k1,1 -k2,2n sample1.bedGraph > sample1.sorted.bedGraph
bedGraphToBigWig sample1.sorted.bedGraph hg38.chromSizes sample1.bw
```

Or, scaling by sequencing depth alone:

```bash
scale_factor=`echo "10000 / $seqDepth" | bc -l`
echo "Scaling factor for $histName is: $scale_factor!"
bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph
```

## Peak calling

There are a few options for calling peaks on Cut&Tag data. The more lenient and traditional option is MACS2: it will increase the false-positive rate but capture more peaks overall. SEACR was designed with Cut&Run/TAG data (high signal-to-noise) in mind. More recent and less widely used, GoPeaks is also specifically designed for Cut&Run/TAG data. There are several other peak callers, but for my purposes I will use MACS2 and SEACR. Here is a short explanation of the difference between SEACR and MACS2. This HBC workshop covers calling peaks on Cut&Run/TAG data.

For SEACR, we need to retrieve two scripts and put them in a directory for use later:

```bash
mkdir SEACR ## in bin or scripts
wget https://github.com/FredHutch/SEACR/raw/master/SEACR_1.3.sh
wget https://github.com/FredHutch/SEACR/raw/master/SEACR_1.3.R
```

SEACR usage and example:

```bash
## usage
bash SEACR_1.3.sh <experimental bedgraph> [<control bedgraph> | <numeric threshold>] ["norm" | "non"] ["relaxed" | "stringent"] <output prefix>
## example
bash SEACR_1.3.sh target.bedgraph 0.01 non stringent output
```

For the script, we call the following:

```bash
seacr=/home/lmc0633/executables/SEACR/SEACR_1.3.sh
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks
```

For MACS2, with a control sample:

```bash
macs2 callpeak -t $projPath/alignment/bam/${org}/${time}/${hist}/bowtie2.mapped.bam \
      -c $projPath/alignment/bam/${org}/${timeControl}/${hist}/bowtie2.mapped.bam \
      -g ${genome} -f BAMPE -n macs2_peak_q0.1 --outdir $projPath/peakCalling/MACS2/${org}/${timeControl}_control/${time}/${hist}/ \
      -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/${org}/${timeControl}_control/${time}/${hist}/macs2Peak_summary.txt

mkdir -p $projPath/peakCalling
macs2 callpeak -t ${projPath}/alignment/bam/${histName}_rep1_bowtie2.mapped.bam \
      -c ${projPath}/alignment/bam/${controlName}_rep1_bowtie2.mapped.bam \
      -g hs -f BAMPE -n macs2_peak_q0.1 --outdir $projPath/peakCalling/MACS2 -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/macs2Peak_summary.txt
```
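Both calls above use a control track (`-c`). Since my samples have no input control (see the note on SEACR's "norm" setting below), MACS2 can also be run without one; a minimal sketch under that assumption, with the same variables as above:

```bash
## Sketch: MACS2 without a control track (-c omitted); variables as above
macs2 callpeak -t ${projPath}/alignment/bam/${histName}_rep1_bowtie2.mapped.bam \
      -g hs -f BAMPE -n macs2_peak_q0.1 --outdir $projPath/peakCalling/MACS2 \
      -q 0.1 --keep-dup all 2>${projPath}/peakCalling/MACS2/macs2Peak_summary.txt
```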

A note on iterating over two parallel arrays (a "double array") in bash:

```bash
#!/bin/bash

array=( "Vietnam" "Germany" "Argentina" )
array2=( "Asia" "Europe" "America" )

## "${!array[@]}" expands to the indices of array, so each index can be
## used to look up the matching element of array2
for i in "${!array[@]}"; do
    printf "%s is in %s\n" "${array[i]}" "${array2[i]}"
done
```

The script that calls the Cut&Tag alignment wrapper (CutTag.sh) is below:

```bash
#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p genhimem
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --mem-per-cpu=17G
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 48:00:00
#SBATCH --job-name=cnt_align
#SBATCH --output=outlog
#SBATCH --error=errlog
##-- header here --##

echo -e "Usage: sh $0 -f <Forward Read> -w <Path to Working Directory> -g <Path to Bowtie Indices> -r <Path to Annotation File> -c <chromosome size file name> -t <threads> -m <min MAPQ score> -s <Spike in score (optional)> -p <path to SEACR (optional)> -b <sets MACS2 to --broad (optional)> -d <remove duplicates with Picard (optional)> \n"

## with MACS2
echo "Processing data"
for file in *R1_001.fastq.gz;
do echo "${file}";
   bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S1/Backman23_6.03.2024/ -g /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -c hg38.chrom.sizes -t 24 -m 2 -d ;
done

## with SEACR
echo "Processing data"
for file in *R1_001.fastq.gz;
do echo "${file}";
   bash CutTag.sh -f ${file} -w /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/test/ -g /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/ref/hg38 -r /projects/b1042/BackmanLab/Lucas/090124_CnT/genome/gtf/hg38.ncbiRefSeq.gtf -p /home/lmc0633/executables/SEACR/SEACR_1.3.sh -c hg38.chrom.sizes -t 16 -m 2;
done
```

For the correlation plot generation script, I adapted this code for use as follows:

```r
##== R command ==##

library(corrplot)
library(dplyr) ## for full_join, select, and %>%

list.files('/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S1/BED/', pattern = ".bin.bed")

n=1 # c(1,2,3)
path <- '/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/'
subdir <- c("S1", "S2", "S3")
dir <- paste0(path, subdir[n], "/BED/")
hists <- list.files(dir, pattern = ".bin.bed")

reprod = c()
fragCount = NULL
for(hist in hists){

  if(is.null(fragCount)){

    fragCount = read.table(paste0(dir, hist), header = FALSE)
    colnames(fragCount) = c("chrom", "bin", hist)

  }else{

    fragCountTmp = read.table(paste0(dir, hist), header = FALSE)
    colnames(fragCountTmp) = c("chrom", "bin", hist)
    fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))

  }
}

## Pairwise correlation of log2 fragment counts across genomic bins
M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")

corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3, cl.pos = "b", tl.col = "indianred4", tl.cex = 0.5, cl.cex = 0.5, addCoef.col = "black", number.digits = 2, number.cex = 0.5, col = colorRampPalette(c("midnightblue","white","darkred"))(100))
```

Script for calling just SEACR peaks. I can't use the "norm" setting unless I have an input control:

```bash
#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p genhimem
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=200G
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 24:00:00
#SBATCH --job-name=seacr_peaks
#SBATCH --output=outlog
#SBATCH --error=errlog
##-- header here --##

seacr=/home/lmc0633/executables/SEACR/SEACR_1.3.sh
bgfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/BEDGRAPH
pkfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks

module load R/4.3.0
module load bedtools

## One-off test call on a single bedgraph
#bash /home/lmc0633/executables/SEACR/SEACR_1.3.sh /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/BEDGRAPH/BM125_WT_S3.bedgraph 0.01 non stringent /projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/peaks/test.peaks

cd ${bgfold}

echo "Processing data"
for file in *.bedgraph;
do echo "${file}";

P1=$(echo $file | sed 's/.bedgraph/.seacr.peaks/g')

echo "8. Calling SEACR Peaks on ${file}"
echo -e "\t bash $seacr ${bgfold}/${file} 0.01 non stringent ${pkfold}/${P1}"

bash $seacr ${bgfold}/${file} 0.01 non stringent ${pkfold}/${P1}

done

echo -e "Peak calling complete \n"
```

## Later modification

I looked into merging replicates using Coverage() in R and, unfortunately, you lose all the metadata (qValue, pValue, signal) when doing so. I decided to merge my technical replicates at the FASTQ level and then call peaks again on those. Following peak calling, I plan to use Bedtools intersect to find the consensus peaks, since it preserves the metadata and other information (see the sketch after the merge step below). I am merging all technical replicates in submissions 2 and 3.

```bash
out=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/merged/fastq/
in=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S3/Backman26_7.19.2024/

read1=BM135_Pol_2_degraded_S13_R2_001.fastq.gz ; read2=BM138_Pol_2_degraded_S16_R2_001.fastq.gz ; file=Merged.K9AB.P2KO.S2_R2_001.fastq.gz

## Merge technical-replicate FASTQs (concatenating gzipped FASTQs is valid)
cat ${in}${read1} ${in}${read2} > ${out}${file}
```
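Once peaks have been re-called on the merged FASTQs, the planned Bedtools intersect step could look like the sketch below (the peak file names are hypothetical, and the 50% overlap fraction is my choice, not a protocol value). `-u` reports each `-a` peak once with all of its original columns intact, which is what preserves the peak metadata:

```bash
## Sketch of the consensus-peak step, assuming two replicate SEACR peak files.
## -f 0.5 requires >=50% of each -a peak to overlap a -b peak; -u keeps each
## qualifying -a record once, columns untouched
module load bedtools

pkfold=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks

bedtools intersect -a ${pkfold}/rep1.seacr.peaks.relaxed.bed \
                   -b ${pkfold}/rep2.seacr.peaks.relaxed.bed \
                   -f 0.5 -u > ${pkfold}/consensus.seacr.peaks.bed
```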


Computing heatmaps over the SEACR peak summit regions with deepTools:

```bash
#!/bin/bash
#SBATCH -A b1042 ##--## SLURM start
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH --ntasks-per-node=10
#SBATCH --mem-per-cpu=10G
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 6:00:00
#SBATCH --job-name=compute_matrix
#SBATCH --output=outlog
#SBATCH --error=errlog

module load deeptools

##== linux command ==##

inpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/test/
bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM109_S1.rpgc.bigWig
pkfile=BM109_S1.seacr.peaks.relaxed.bed
otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g')
matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_WT
rep=S2
cores=10

## Column 6 of SEACR output is the max-signal region as "chr:start-end";
## split it into BED fields
awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \
              -R $outpath$otfile \
              --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}_SEACR_heatmap.png --sortUsing sum \
            --startLabel "Peak Start" --endLabel "Peak End" --xAxisLabel "" \
            --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"
```


The same steps for the next sample (run in the same shell session, so $cores is still set):

```bash
inpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/peaks/test/
bgpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/S2/coverage/

bgfile=BM110_S2.rpgc.bigWig
pkfile=BM110_S2.seacr.peaks.relaxed.bed
otfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/.seacr.peaks.summitRegion.bed/g')
matfile=$(echo $pkfile | sed 's/.seacr.peaks.relaxed.bed/_SEACR.mat.gz/g')

label=P1AB_P1KO
rep=S2

awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $inpath$pkfile > $outpath$otfile

computeMatrix reference-point -S $bgpath$bgfile \
              -R $outpath$otfile \
              --skipZeros -o $outpath$matfile -p $cores -a 3000 -b 3000 --referencePoint center

plotHeatmap -m $outpath$matfile -out ${outpath}${matfile}_SEACR_heatmap.png --sortUsing sum \
            --startLabel "Peak Start" --endLabel "Peak End" --xAxisLabel "" \
            --regionsLabel "Peaks" --samplesLabel "${label} ${rep}"
```

## Downsampling reads

I'm downsampling my reads because some samples are over-sequenced and some are under-sequenced. I worry this will contribute to variability between conditions, since some samples did experience overamplification. For this task, I am using seqtk.

Usage: `seqtk sample -s <seed> <in.fastq.gz> <number of reads or fraction> > <out.fastq>`
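For example (placeholder file names), downsampling a pair of FASTQs to 2,000,000 read pairs; using the same seed for R1 and R2 is what keeps mates paired:

```bash
## Same -s seed for both files of a pair keeps mates in sync after sampling
seqtk sample -s100 sample_R1_001.fastq.gz 2000000 > sample_sampled_R1_001.fastq
seqtk sample -s100 sample_R2_001.fastq.gz 2000000 > sample_sampled_R2_001.fastq
gzip sample_sampled_R1_001.fastq sample_sampled_R2_001.fastq
```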

The first thing to do is count the total number of reads per file:

```bash
#!/bin/bash
#SBATCH -A b1042 ##--## SLURM start
#SBATCH -p genomics
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=50G
#SBATCH [email protected]
#SBATCH -t 2:00:00
#SBATCH --job-name=count_reads

for file in *R1_001.fastq.gz;
do echo "${file}";
READS=$(expr $(zcat ${file} | wc -l) / 4)
echo -e "\t file: $file has $READS reads  "
echo -e "\t file: $file reads: $READS " >> reads.txt
done
```

S1:

| File | Reads | Name |
|---|---|---|
| K4AB-Pol1_S4_R1_001.fastq.gz | 50,054,872 | K4AB_P1KO |
| K4AB-WT_S3_R1_001.fastq.gz | 68,741,393 | K4AB_WT |
| K9AB-Pol1_S8_R1_001.fastq.gz | 5,858,924 | K9AB_P1KO |
| K9AB-WT_S7_R1_001.fastq.gz | 2,249,616 | K9AB_WT |
| P1AB-Pol1_S2_R1_001.fastq.gz | 21,542,758 | P1AB_P1KO |
| P1AB-WT_S1_R1_001.fastq.gz | 4,246,499 | P1AB_WT |
| P2AB-Pol1_S6_R1_001.fastq.gz | 56,878,563 | P2AB_P1KO |
| P2AB-WT_S5_R1_001.fastq.gz | 21,233,193 | P2AB_WT |

Renaming the FASTQs to match the sample names:

```bash
mv K4AB-Pol1_S4_R2_001.fastq.gz K4AB_P1KO_S4_R2_001.fastq.gz
mv K4AB-WT_S3_R2_001.fastq.gz K4AB_WT_S3_R2_001.fastq.gz
mv K9AB-Pol1_S8_R2_001.fastq.gz K9AB_P1KO_S8_R2_001.fastq.gz
mv K9AB-WT_S7_R2_001.fastq.gz K9AB_WT_S7_R2_001.fastq.gz
mv P1AB-Pol1_S2_R2_001.fastq.gz P1AB_P1KO_S2_R2_001.fastq.gz
mv P1AB-WT_S1_R2_001.fastq.gz P1AB_WT_S1_R2_001.fastq.gz
mv P2AB-Pol1_S6_R2_001.fastq.gz P2AB_P1KO_S6_R2_001.fastq.gz
mv P2AB-WT_S5_R2_001.fastq.gz P2AB_WT_S5_R2_001.fastq.gz
```

S2:

| File | Reads | Name |
|---|---|---|
| BM109_S1_R1_001.fastq.gz | 2,831,267 | P1AB_WT |
| BM110_S2_R1_001.fastq.gz | 1,997,534 | P1AB_P1KO |
| BM111_S3_R1_001.fastq.gz | 2,761,436 | P1AB_P2KO |
| BM112_S4_R1_001.fastq.gz | 2,904,866 | P2AB_WT |
| BM113_S5_R1_001.fastq.gz | 8,911,631 | P2AB_P1KO |
| BM114_S6_R1_001.fastq.gz | 3,923,244 | P2AB_P2KO |
| BM115_S7_R1_001.fastq.gz | 11,335,882 | K27AB_P1KO |
| BM116_S8_R1_001.fastq.gz | 4,809,199 | K27AB_P2KO |
| BM117_S9_R1_001.fastq.gz | 408,680 | P1AB_WT |
| BM118_S10_R1_001.fastq.gz | 3,241,433 | P1AB_P1KO |
| BM119_S11_R1_001.fastq.gz | 3,001,093 | P1AB_P2KO |
| BM120_S12_R1_001.fastq.gz | 1,265,939 | P2AB_WT |
| BM121_S13_R1_001.fastq.gz | 4,573,860 | P2AB_P1KO |
| BM122_S14_R1_001.fastq.gz | 6,784,824 | P2AB_P2KO |
| BM123_S15_R1_001.fastq.gz | 2,668,701 | K27AB_P1KO |
| BM124_S16_R1_001.fastq.gz | 2,413,773 | K27AB_P2KO |

S3:

| File | Reads | Name |
|---|---|---|
| BM125_WT_S3_R1_001.fastq.gz | 45,823,405 | P1AB_WT |
| BM126_Pol_1_degraded_S4_R1_001.fastq.gz | 34,609,056 | P1AB_P1KO |
| BM127_Pol_2_degraded_S5_R1_001.fastq.gz | 28,356,451 | P1AB_P2KO |
| BM128_WT_S6_R1_001.fastq.gz | 34,639,433 | CTCFAB_WT |
| BM129_WT_S7_R1_001.fastq.gz | 49,503,419 | CTCFAB_WT |
| BM130_WT_S8_R1_001.fastq.gz | 47,559,479 | CTCFAB_WT |
| BM131_Pol_1_degraded_S9_R1_001.fastq.gz | 38,030,748 | CTCFAB_P1KO |
| BM132_Pol_2_degraded_S10_R1_001.fastq.gz | 28,916,144 | CTCFAB_P2KO |
| BM133_WT_S11_R1_001.fastq.gz | 63,835,066 | K9AB_WT |
| BM134_Pol_1_degraded_S12_R1_001.fastq.gz | 54,217,202 | K9AB_P1KO |
| BM135_Pol_2_degraded_S13_R1_001.fastq.gz | 53,024,658 | K9AB_P2KO |
| BM136_WT_S14_R1_001.fastq.gz | 49,808,186 | K9AB_WT |
| BM137_Pol_1_degraded_S15_R1_001.fastq.gz | 53,189,723 | K9AB_P1KO |
| BM138_Pol_2_degraded_S16_R1_001.fastq.gz | 72,659,514 | K9AB_P2KO |
| BM139_Pol_1_degraded_S17_R1_001.fastq.gz | 21,882,985 | CTCFAB_P1KO |
| BM140_Pol_2_degraded_S18_R1_001.fastq.gz | 23,661,905 | CTCFAB_P2KO |

Then we can downsample each sample based on its read count:

```bash
#!/bin/bash
#SBATCH -A p32171 ##--## SLURM start
#SBATCH -p short
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=150G
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH -t 3:00:00
#SBATCH --job-name=downsample
#SBATCH --output=outlog
#SBATCH --error=errlog

cutoff=1000000
norm=2000000
highnorm=10000000
seed=100

## Paths
inpath=$1 ## usage: sbatch <sample.sh> </inpath/>; submit from inside the input directory so the glob below matches
outpath=/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/sampled/fastq/

module load seqtk/Dec17

for file in *R1_001.fastq.gz;
do echo "${file}";

## File names
reads=$(expr $(zcat ${file} | wc -l) / 4)
R2=$(echo $file | sed 's/R1_001.fastq.gz/R2_001.fastq.gz/g')
R1out=$(echo $file | sed 's/R1_001.fastq.gz/sampled_R1_001.fastq/g')
R2out=$(echo $R2 | sed 's/R2_001.fastq.gz/sampled_R2_001.fastq/g')

if [[ $reads -lt $cutoff ]] ## Skip samples with fewer reads than the cutoff
then

    echo -e "\t Sample ${file} has too few reads: ${reads} \n"

elif [[ $reads -lt $norm ]] && [[ $reads -gt $cutoff ]]
then

echo -e "\t Sample ${file} with ${reads} reads: sampled to 1,500,000 "
echo -e "\t seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} 1500000 > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} 1500000 > ${outpath}${R2out}

elif [[ $reads -gt $highnorm ]]
then

echo -e "\t Sample ${file} with ${reads} reads: sampled to $highnorm "
echo -e "\t seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $highnorm > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} $highnorm > ${outpath}${R2out}

else

echo -e "\t Sample ${file} with ${reads} reads: sampled to $norm "
echo -e "\t seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out} \n"

seqtk sample -s $seed ${inpath}${file} $norm > ${outpath}${R1out}
seqtk sample -s $seed ${inpath}${R2} $norm > ${outpath}${R2out}

fi
done

## Compress the sampled FASTQs (they are written to $outpath)
gzip ${outpath}*.fastq
```

I checked the duplication rate using the following code and found that the rates for each sample were quite high, consistent with overamplification by PCR (my fault) and oversequencing by the core (NuSeq's fault). This publication addresses these issues and found similar duplication issues. Here is the code I used:

```r
## Summarize the duplication information from the Picard summary outputs
library(dplyr) ## for %>% and mutate

n=3
dir.sam <- "/projects/b1042/BackmanLab/Lucas/090124_CnT/fastq/"
subdirs <- c("S1", "S2", "S3")
sam.path <- paste0(dir.sam, subdirs[n],"/SAM/")
hists <- list.files(sam.path, pattern = ".dupMark.txt")

dupResult = c()
for(hist in hists){
  dupRes = read.table(paste0(sam.path,hist), header = TRUE, fill = TRUE)

  histInfo = strsplit(hist, "_")[[1]]
  dupResult = data.frame(Histone = histInfo[1],
                         Replicate = histInfo[2],
                         MappedFragNum_hg38 = dupRes$READ_PAIRS_EXAMINED[1] %>% as.character %>% as.numeric,
                         DuplicationRate = dupRes$PERCENT_DUPLICATION[1] %>% as.character %>% as.numeric * 100,
                         EstimatedLibrarySize = dupRes$ESTIMATED_LIBRARY_SIZE[1] %>% as.character %>% as.numeric) %>%
    mutate(UniqueFragNum = MappedFragNum_hg38 * (1 - DuplicationRate/100)) %>%
    rbind(dupResult, .)
}
```