How to get allelic counts per SNP or per gene starting from FASTQ files

Step 1: Getting RNA-seq data (download test RNA-seq data or use your own)

The commands in this workflow run the analysis for two technical replicates. Feel free to write a script that runs these steps in parallel for all replicates / libraries that you have.

We will use RNA-seq data from Gendrel et al. 2014, which includes two technical replicates for one of the NPC clones (paired-end data):

SRR1106781_1.fastq.gz and SRR1106781_2.fastq.gz for replicate 1 and SRR1106786_1.fastq.gz and SRR1106786_2.fastq.gz for replicate 2.

Download the data:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/001/SRR1106781/SRR1106781_1.fastq.gz 
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/001/SRR1106781/SRR1106781_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/006/SRR1106786/SRR1106786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR110/006/SRR1106786/SRR1106786_2.fastq.gz
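
If you have more runs to fetch, the four URLs above can be generated instead of typed. The sketch below is only an illustration: the helper name ena_fastq_url is hypothetical, and the directory layout (first six characters of the accession, then 00 plus its last digit) is inferred from the two example URLs above, so it is assumed to hold only for 10-character run accessions like these.

```shell
ena_fastq_url() {
  # $1 = run accession (assumed 10 characters, e.g. SRR1106781), $2 = mate number (1 or 2)
  acc=$1
  prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR110
  last=${acc#"${acc%?}"}                     # last character of the accession
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$prefix/00$last/$acc/${acc}_$2.fastq.gz"
}

# Print one URL per line for both mates of both replicates.
for acc in SRR1106781 SRR1106786; do
  for mate in 1 2; do
    ena_fastq_url "$acc" "$mate"
  done
done
```

The printed list can be saved to a file and passed to wget -i, or piped directly into wget -i - .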

Step 2: Aligning reads to maternal and paternal genomes

You need to align your reads to so-called pseudogenomes, i.e. the reference genome with SNPs from the corresponding maternal and paternal genomes inserted. If you don't have pseudogenomes ready yet, please refer to the instructions for generating them.

pseudoRefDirs=/full/path/to/dir/for/pseudo/ref/out/
# The FASTQ files are gzip-compressed, so STAR needs --readFilesCommand to decompress them.
# replicate 1
## align to 129 pseudogenome
STAR --readFilesIn /full/path/to/SRR1106781_1.fastq.gz /full/path/to/SRR1106781_2.fastq.gz \
     --readFilesCommand gunzip -c \
     --outFileNamePrefix /full/path/to/alignment/output/SRR1106781_on129S1. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:mat \
     --genomeDir $pseudoRefDirs/129S1_SvImJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile /full/path/to/Mus_musculus.GRCm38.68.gtf
## align to CAST pseudogenome
STAR --readFilesIn /full/path/to/SRR1106781_1.fastq.gz /full/path/to/SRR1106781_2.fastq.gz \
     --readFilesCommand gunzip -c \
     --outFileNamePrefix /full/path/to/alignment/output/SRR1106781_onCAST. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:pat \
     --genomeDir $pseudoRefDirs/CAST_EiJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile /full/path/to/Mus_musculus.GRCm38.68.gtf
# replicate 2
## align to 129 pseudogenome
STAR --readFilesIn /full/path/to/SRR1106786_1.fastq.gz /full/path/to/SRR1106786_2.fastq.gz \
     --readFilesCommand gunzip -c \
     --outFileNamePrefix /full/path/to/alignment/output/SRR1106786_on129S1. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:mat \
     --genomeDir $pseudoRefDirs/129S1_SvImJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile /full/path/to/Mus_musculus.GRCm38.68.gtf
## align to CAST pseudogenome
STAR --readFilesIn /full/path/to/SRR1106786_1.fastq.gz /full/path/to/SRR1106786_2.fastq.gz \
     --readFilesCommand gunzip -c \
     --outFileNamePrefix /full/path/to/alignment/output/SRR1106786_onCAST. \
     --runThreadN 4 --outSAMtype SAM \
     --outSAMattrRGline ID:pat \
     --genomeDir $pseudoRefDirs/CAST_EiJ/ \
     --outFilterMultimapNmax 1 --sjdbGTFfile /full/path/to/Mus_musculus.GRCm38.68.gtf

Output: .sam files with read alignments.
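
The four STAR calls differ only in the run accession, the pseudogenome directory, the output tag and the read group ID, so they can be generated in a loop. The following is a dry-run sketch, not part of the original pipeline: the helper star_cmd and the variables fastqDir, outDir and gtf are hypothetical names, and --readFilesCommand gunzip -c is included on the assumption that the input stays gzip-compressed. It only prints the commands so that they can be inspected before running.

```shell
pseudoRefDirs=/full/path/to/dir/for/pseudo/ref/out
fastqDir=/full/path/to
outDir=/full/path/to/alignment/output
gtf=/full/path/to/Mus_musculus.GRCm38.68.gtf

star_cmd() {
  # $1 = run accession, $2 = pseudogenome directory, $3 = output tag, $4 = read group ID
  acc=$1; genome=$2; tag=$3; rg=$4
  echo "STAR --readFilesIn $fastqDir/${acc}_1.fastq.gz $fastqDir/${acc}_2.fastq.gz" \
       "--readFilesCommand gunzip -c" \
       "--outFileNamePrefix $outDir/${acc}_${tag}." \
       "--runThreadN 4 --outSAMtype SAM --outSAMattrRGline ID:$rg" \
       "--genomeDir $pseudoRefDirs/$genome/" \
       "--outFilterMultimapNmax 1 --sjdbGTFfile $gtf"
}

# One command per replicate and parental genome (maternal = 129S1, paternal = CAST).
for acc in SRR1106781 SRR1106786; do
  star_cmd "$acc" 129S1_SvImJ on129S1 mat
  star_cmd "$acc" CAST_EiJ onCAST pat
done
```

Once the printed commands look right, the output of the loop can be piped to sh or submitted to a job scheduler, one line per job.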

Step 3: Merging two files with reads aligned to maternal and to paternal genomes into one file

First, sort the files by read name (samtools sort -n):

samtools sort -n -O sam -o /full/path/to/SRR1106781_on129S1.Nsorted.sam -@ 4 /full/path/to/SRR1106781_on129S1.Aligned.out.sam
samtools sort -n -O sam -o /full/path/to/SRR1106781_onCAST.Nsorted.sam -@ 4 /full/path/to/SRR1106781_onCAST.Aligned.out.sam
samtools sort -n -O sam -o /full/path/to/SRR1106786_on129S1.Nsorted.sam -@ 4 /full/path/to/SRR1106786_on129S1.Aligned.out.sam
samtools sort -n -O sam -o /full/path/to/SRR1106786_onCAST.Nsorted.sam -@ 4 /full/path/to/SRR1106786_onCAST.Aligned.out.sam

Then merge:

python /full/path/to/ASE/python/alleleseq_merge_stream_v2.py \
       --mat_sam /full/path/to/SRR1106781_on129S1.Nsorted.sam \
       --pat_sam /full/path/to/SRR1106781_onCAST.Nsorted.sam \
       --o /full/path/to/SRR1106781_merged.sam \
       --paired 1
python /full/path/to/ASE/python/alleleseq_merge_stream_v2.py \
       --mat_sam /full/path/to/SRR1106786_on129S1.Nsorted.sam \
       --pat_sam /full/path/to/SRR1106786_onCAST.Nsorted.sam \
       --o /full/path/to/SRR1106786_merged.sam \
       --paired 1

Output: one SAM file per replicate, containing both the mat and pat read groups.
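
A quick sanity check on the merged file is to count records per read group. The sketch below assumes that each record carries its read group as an RG:Z: tag (which is how STAR writes it when --outSAMattrRGline is set) and that the merge script keeps that tag; the helper name count_by_rg and the toy records are made up for illustration.

```shell
count_by_rg() {
  # $1 = SAM file; prints "<read group><TAB><record count>", sorted by read group
  grep -v '^@' "$1" | awk -F'\t' '
    {
      rg = "none"                        # used for records without an RG:Z: tag
      for (i = 12; i <= NF; i++) {       # optional tags start at column 12
        if ($i ~ /^RG:Z:/) rg = substr($i, 6)
      }
      n[rg]++
    }
    END { for (rg in n) printf "%s\t%d\n", rg, n[rg] }' | sort
}

# Toy input (made-up records) to show the shape of the result.
toy_rg=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:queryname\n'
  printf 'r1\t0\tchr1\t1\t255\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:1\tRG:Z:mat\n'
  printf 'r2\t0\tchr1\t9\t255\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:1\tRG:Z:pat\n'
  printf 'r3\t0\tchr1\t20\t255\t5M\t*\t0\t0\tACGTA\tIIIII\tNH:i:1\tRG:Z:mat\n'
} > "$toy_rg"
count_by_rg "$toy_rg"
```

On a real merged file, call it as count_by_rg /full/path/to/SRR1106781_merged.sam; a file in which one of the two groups is missing usually points to a problem in the alignment or merge step.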

Step 4: Reads sampling

All SAM files in the analysis should be sampled to the same library size (for example, min(sizes)); please see our paper for the reasoning.

Sort the merged files by read name (samtools sort -n):

samtools sort -n -O sam -o /full/path/to/SRR1106781_merged.Nsorted.sam -@ 4 /full/path/to/SRR1106781_merged.sam
samtools sort -n -O sam -o /full/path/to/SRR1106786_merged.Nsorted.sam -@ 4 /full/path/to/SRR1106786_merged.sam

Then sample (repeat as many times as you need, and process each sample separately). For paired-end data:

first calculate the sizes (samtools view -c):

for sam in /full/path/to/SRR1106781_merged.Nsorted.sam /full/path/to/SRR1106786_merged.Nsorted.sam
do
  # one line per file: path, tab, number of alignment records
  printf '%s\t%s\n' "$sam" "$(samtools view -c "$sam")" >> /path/to/samsizes.tsv
done

take the minimum:

minsize=$(cut -f2 /path/to/samsizes.tsv | sort -n | head -n 1)
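
To make the arithmetic concrete, here is the same calculation on a toy size table. The file names and counts are made up for illustration. Since samtools view -c counts individual records, the number of read pairs to sample is taken as half of the minimum, as in the sampling command of this step.

```shell
# Toy size table (made-up values): path, tab, number of alignment records.
sizes=$(mktemp)
printf '%s\t%s\n' /path/to/rep1.sam 120 /path/to/rep2.sam 98 > "$sizes"

# Smallest library size, compared numerically.
minsize=$(cut -f2 "$sizes" | sort -n | head -n 1)

# Paired-end data: two records per pair, so sample minsize/2 pairs.
npairs=$(( minsize / 2 ))

echo "minsize=$minsize npairs=$npairs"
```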

and sample all files to that number of reads; in the paired-end case, for example:

for sam in /full/path/to/SRR1106781_merged.Nsorted.sam /full/path/to/SRR1106786_merged.Nsorted.sam
do
  out=$sam".sample"$(( minsize/2 ))"Preads.sam"
  # write the header first (">" overwrites any previous output)
  grep "^@" "$sam" > "$out"
  # join each mate pair onto one line, sample whole pairs, then split them again
  grep -v "^@" "$sam" | sed '$!N;s/\n/ IHOPETHATNEVERWOULDAPPERINSAMFILE /' | shuf -n $(( minsize/2 )) | \
       sed 's/ IHOPETHATNEVERWOULDAPPERINSAMFILE /\n/' >> "$out"
done

(for single-end data it is even simpler: a pipe of grep -v "^@" and shuf -n $minsize)

Output: one sampled sam file per replicate.
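
The pair-preserving sampling can be tried on a toy file before running it on real data. The sketch below is a variation on the command above, not the original method: it joins mates with paste and the ASCII unit separator (an assumption that this control character never occurs in SAM text) instead of the sed placeholder string. All records are made up, and it relies on GNU shuf.

```shell
# Toy name-sorted paired-end SAM: one header line and three read pairs, mates adjacent.
toy=$(mktemp)
printf '@HD\tVN:1.0\tSO:queryname\n' > "$toy"
for r in r1 r2 r3; do
  printf '%s\t99\tchr1\t100\n%s\t147\tchr1\t200\n' "$r" "$r" >> "$toy"
done

npairs=2
out="$toy.sample${npairs}Preads.sam"

# Header first, then a random subset of whole pairs.
grep '^@' "$toy" > "$out"
sep=$(printf '\037')   # ASCII unit separator, assumed absent from the SAM file
grep -v '^@' "$toy" | paste -d "$sep" - - | shuf -n "$npairs" | tr "$sep" '\n' >> "$out"

cat "$out"
```

Each sampled pair stays on two consecutive lines, so the result is still grouped by read name, although the order of the pairs is random.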

Step 5: Extracting SNP coverage information from the alignments

Convert SAM to coordinate-sorted BAM (samtools sort):

samtools sort -o /full/path/to/SRR1106781_merged_sample26302221Preads.sorted.bam /full/path/to/SRR1106781_merged.Nsorted.sam.sample26302221Preads.sam
samtools sort -o /full/path/to/SRR1106786_merged_sample26302221Preads.sorted.bam /full/path/to/SRR1106786_merged.Nsorted.sam.sample26302221Preads.sam

Obtain table with SNP allele counts:

# $pseudoRefDirs is the pseudogenome directory defined in Step 2
python /home/am717/scripts/allelecounter.py --vcf /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
       --bam /full/path/to/SRR1106781_merged_sample26302221Preads.sorted.bam \
       --ref $pseudoRefDirs/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
       --sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
       --o /full/path/to/SRR1106781_merged_sample26302221Preads.stat_0.txt
python /home/am717/scripts/allelecounter.py --vcf /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf.gz \
       --bam /full/path/to/SRR1106786_merged_sample26302221Preads.sorted.bam \
       --ref $pseudoRefDirs/129S1_SvImJ/129S1_SvImJ_pseudo.fa \
       --sample F1 --min_cov 0 --min_baseq 2 --min_mapq 10 \
       --o /full/path/to/SRR1106786_merged_sample26302221Preads.stat_0.txt

Output: one table per replicate.

Step 6: Getting allelic counts per SNP and per gene

Rscript --vanilla /home/am717/scripts/counts_to_snp_genes.R \
        -d /full/path/to/dir/with/stat/files/ \
        -n SRR1106781_merged_sample26302221Preads,SRR1106786_merged_sample26302221Preads \
        -r Gendrel_81_85 \
        -o /full/path/to/dir/with/stat/files/ \
        -v /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt \
        -b /full/path/to/output/Mus_musculus.GRCm38.68.EXONS.bed

Output: SNP table and Grouped SNP table (for example, genes) per set of replicates.

As a result of the last step, you will get several files in your output folder, including Gendrel_81_85_processed_gene_extended2.txt. This file will be used for the downstream analysis, see description here.
