Bioinformatic analysis of Nano3P-seq nanopore libraries (direct cDNA first strand sequencing with template switching)
- General command line steps used to analyze Nano3P-seq datasets
- Detailed mapping steps
- Software versions used
- Citation
Basecalling is done using Guppy basecaller without adapter trimming. We need the adapter sequence for the tailfindR software.
Base-calling with Guppy v6 without trimming the adapter :
guppy_basecalling --device cuda:0 -c dna_r9.4.1_450bps_hac.cfg --barcode_kits EXP-NBD104 --fast5_out --trim_strategy none -ri fast5_files -s output_folder
Base-calling and demultiplexing with Guppy v6 without trimming the adapter :
guppy_basecalling --device cuda:0 -c dna_r9.4.1_450bps_hac.cfg --fast5_out --trim_strategy none -ri fast5_files -s output_folder
Demultiplexing the unclassified.fastq file using porechop (Python 3 is required)
porechop -i unclassified.fastq -b output_folder -t 10 --barcode_threshold 50 --untrimmed
You can download nano3p-seq version of tailfindr here
#Prerequisite : tailfindR tool nano3p-seq version
tails <- find_tails(fast5_dir ='fast5_location',
save_dir = './',
csv_filename = 'Tails.csv' ,
num_cores = 10)
Note: For each read, Tailfindr will report as output numerical integers corresponding to the tail length in nt (e.g. 120), 0 or NaN.
Length = 0 ---> when the status of the read is: contains_no_polyT_tail
Length = NaN ---> when the status of the read is: no_adaptor_found
Therefore, tail length equal to 0 means that the software could find the adapter, but next to it, it did not find any polyA tail stretch (so length is 0 nt), whereas NaN means that the read could not be analyzed because the adapter was not found.
We need to trim the adapter sequence before analysing the tail content In order to do so, we need to create an alternative adapters.py file that ONLY contains Nano3p-seq adapter. This way we can make the search with less stringency and get a cleaner trimming You can find the adapters.py file in the porechop_libraries folder
porechop --extra_end_trim 0 --end_threshold 40 --adapter_threshold 40 -i input.fastq -t 10 > output.fastq
General mapping options used in our pipeline
# Mapping to transcriptome
minimap2 -ax map-ont --MD reference.fasta input.fastq | samtools view -hSb -F 3844 - > output.sam
samtools view -f 0x10 -bq 59 output.sam | samtools sort - output.sorted && samtools index output.sorted.bam
# Mapping to genome
minimap2 -ax splice -k14 --MD $ref input.fastq | samtools view -hSb -F 3844 - > output.sam
samtools sort output.sam output.sorted && rm output.sam && samtools index output.sorted.bam
Furthermore, we followed a pipelime comprised of customised scripts in order to extract the following information:
- Filter out the reads that might belong to degraded RNAs
- Correctly assign the reads to distinct biotypes
You can check out this section for the detailed pipeline
We extract this information for the tail content analysis. It should contain the unmapped tail region of the reads
python soft_clipped_content.py trimmed.bam > tail_content.tsv
#Prerequisite : isoquant.py tool
python isoquant.py --genedb gtf_file --complete_genedb --bam data.bam --data_type nanopore -o OUTPUT_FOLDER
Filtering mapped reads based on annotations and assigning reads to gene biotype At this step, using the annotation, we aim to remove the reads coming from degraded RNAs We will use a mouse sample run as an example
Refer to this link for creating bed files from gtf file
#Minimap with default options
minimap2 -ax map-ont --MD -t 2 reference_fasta input_fastq > cytrRNA.bam
#Convert BAM into SAM, including only high-quality allignment
samtools view -Sb -F 3844 cytrRNA.bam > cytrRNA.sam
#Extract high quality and reverse strand only reads and sort/index
samtools view -hb -f 0x10 -bq 59 cytrRNA.sam | samtools sort - cytrRNA.sorted && samtools index cytrRNA.sorted.bam
#Remove intermediate files
rm cytrRNA.bam cytrRNA.sam
#Intersect the BAM file reads with rRNA annotation
bedtools intersect -abam cytrRNA.sorted.bam -b Zebrafish_rRNA_Transcript_Ends.bed -wa -wb -bed > cytrRNA_complete.bed
#Extract Read IDs
awk '!seen[$4]++' cytrRNA_complete.bed | cut -f4 > cytrRNA_complete.reads
#Extract BAM for complete mapping to rRNA
java -jar picard.jar FilterSamReads \
I=cytrRNA.sorted.bam \
O=cytrRNA_complete.sorted.bam\
READ_LIST_FILE=cytrRNA_complete.reads \
FILTER=includeReadList
#Index the BAM file
samtools index cytrRNA_complete.sorted.bam
#Intersect the BAM file with rRNA Annotation to label the reads as rRNA
bedtools intersect -abam cytrRNA_complete.sorted.bam -b Zebrafish_rRNA_Annotation.bed -wa -wb -bed -S | awk '!seen[$4]++'> cytrRNA.overlapping.FINAL.bed
# Exclude these reads from fastq
samtools view cytrRNA.sorted.bam | cut -f1 > cytrRNA.reads
#Excluded fastq
seqkit grep --pattern-file cytrRNA.reads --invert-match input_fastq > nonrRNA.fastq
#Minimap with default options
minimap2 -ax splice -k14 --MD -t 2 $ref nonrRNA.fastq > nonrRNA.bam
#Convert BAM into SAM, including only high-quality allignment
samtools view -Sb -F 3844 nonrRNA.bam > nonrRNA.sam
#Sort BAM file
samtools view -hb nonrRNA.sam | samtools sort - nonrRNA.sorted && samtools index nonrRNA.sorted.bam
#Convert BAM into BED
bedtools bamtobed -i nonrRNA.sorted.bam > nonrRNA.sorted.bed
#Remove intermediate files
rm nonrRNA.bam nonrRNA.sam
#Assigning complete reads to Biotypes
#Extract read start coordinates from the BED file
Rscript --vanilla readstarts.R nonrRNA.sorted.bed nonrRNA
#Intersect reads with miRNA Gene annotation to first take miRNA reads apart from the rest
bedtools intersect -abam nonrRNA.sorted.bam -b miRNA_Gene.bed -wa -wb -bed -split -S > miRNAs.bed
#Extract read IDs
awk '!seen[$4]++' miRNAs.bed | cut -f4 > miRNAs.reads
#Intersect read start coordinates with small RNA transcript end coordinates
bedtools intersect -a nonrRNA_readstarts.bed -b SmallRNA_TranscriptEnds.bed -wa -wb > smallRNAs.bed
#Exrtract read IDs
awk '!seen[$4]++' smallRNAs.bed | cut -f4 > smallRNAs.reads
#Make a BAM file for reads mapping to small RNAs
java -jar picard.jar FilterSamReads \
I=nonrRNA.sorted.bam \
O=smallRNAs.bam\
READ_LIST_FILE=smallRNAs.reads \
FILTER=includeReadList
#Index the BAM file
samtools index smallRNAs.bam
#Make a BAM file for reads not mapping to small RNAs
java -jar picard.jar FilterSamReads \
I=nonrRNA.sorted.bam \
O=longRNAs.bam\
READ_LIST_FILE=smallRNAs.reads\
FILTER=excludeReadList
#Index the BAM file
samtools index longRNAs.bam
#Convert BAM to BED
bedtools bamtobed -i longRNAs.bam > longRNAs.bed
#Extract read starts from BAM file
Rscript --vanilla readstarts.R longRNAs.bed longRNAs
#Intersect read start coordinates with long RNA transcript end coordinates
bedtools intersect -a longRNAs.readstarts.bed -b Transcript_Ends.bed -wa -wb > longRNAs.overlapping.bed
#Extract read IDs
awk '!seen[$4]++' longRNAs.overlapping.bed | cut -f4 > longRNAs.overlapping.reads
#Make a BAM file for reads mapping to long RNAs
java -jar picard.jar FilterSamReads \
I=nonrRNA.sorted.bam \
O=longRNAs.overlapping.bam\
READ_LIST_FILE=longRNAs.overlapping.reads \
FILTER=includeReadList
#Index BAM file
samtools index longRNAs.overlapping.bam
# We need to remove the reads from miRNAs that are overlapping with smallRNAs or restRNAs
diff longRNAs.overlapping.reads miRNAs.reads |grep ">"|cut -c 3- > miRNAs.longexcluded.reads
diff smallRNAs.reads miRNAs.longexcluded.reads |grep ">"|cut -c 3- > miRNAs.longsmallexcluded.reads
#Extract filtered miRNA BAM file
java -jar picard.jar FilterSamReads \
I=sorted.bam \
O=miRNAFINAL.bam\
READ_LIST_FILE=miRNAs.longsmallexcluded.reads\
FILTER=includeReadList
#Index BAM file
samtools index miRNAFINAL.bam
#Assigning Read IDs to Biotypes
bedtools intersect -abam longRNAs.overlapping.bam -b Rest_EXON.bed -wa -wb -bed -split -S | awk '!seen[$4]++'> longRNAs.overlapping.FINAL.bed
#Assigning Read IDs to Biotypes
bedtools intersect -abam smallRNAs.bam -b SmallRNA_Gene.bed -wa -wb -bed -split -S | awk '!seen[$4]++' > smallRNAs.overlapping.FINAL.bed
#Assigning Read IDs to Biotypes
bedtools intersect -abam miRNAFINAL.bam -b miRNA_Gene.bed -wa -wb -bed -split -S | awk '!seen[$4]++' > miRNAs.FINAL.bed
#Merge all the files
cat cytrRNA_complete.bed longRNAs.overlapping.FINAL.bed smallRNAs.overlapping.FINAL.bed miRNAs.FINAL.bed > allRNAs.bed
#Exract Read IDs
awk '!seen[$4]++' allRNAs.bed | cut -f4 > allRNAs.reads
#Create BAM file for nonrRNA FINAL version
java -jar /users/enovoa/boguzhan/Software/picard/build/libs/picard.jar FilterSamReads \
I=nonrRNA.sorted.bam \
O=nonrRNA.FINAL.bam\
READ_LIST_FILE=allRNAs.reads\
FILTER=includeReadList
#Merge nonrRNA and cyt rRNA reads
samtools merge allRNAs.bam nonrRNA.FINAL.bam cytrRNA_complete.sorted.bam
#Index BAM file
samtools index allRNAs.bam
R script for timepoint dot plots
Rscript dotplot_timepoints.R <tailfindR.file> <bedfile1> <bedfile2> <bedfile3> label
#Example
Rscript dotplot_timepoints.R zebrafish_tailfindr.csv 2hpf.bed 4hpf.bed 6hpf.bed zebrafish
You can find them in the reference folder
- Curlcake sequences :
- curlcake_1_2.fasta
- cDNA Standards :
- cdna_std.fasta
- Sequins :
- Genome : chrIS.v2.fa
- Genome annotation : RNAsequins.v2.2.gtf
- Yeast :
- Genome : SacCer3.fa
- Ribosomal RNA : yeast_rRNA.fa
- Genome annotation : Saccer64.gtf
- Mouse :
- Genome
- Ribosomal RNA : mus_musculus_rRNA.fasta
- Genome annotation
- Zebrafish :
- Genome
- Ribosomal RNA : Zebrafish_rRNA_Maternal_Zygotic.fa
- Genome annotation
- Human :
- Genome
- Ribosomal RNA : human_rRNAs.fa
- Genome annotation
- Guppy version 6.0.2
- minimap2 version 2.17
- samtools version 0.1.19
- R version 3.6.0
- TailfindR (Nano3P-seq version)
- picard.jar v2.25.0
- bedtools v2.29.1
- Isoquant v1.3
- porechop v0.2.4
- seqkit
- Python version 3
If you find this work useful, please cite:
Begik O, Diensthuber G, Liu H, Delgado-Tejedor A, Kontur C, Niazi AM, Valen E, Giraldez AJ, Beaudoin JD, Mattick JS and Novoa EM. Nano3P-seq: transcriptome-wide analysis of gene expression and tail dynamics using end-capture nanopore cDNA sequencing. bioRxiv 2022. doi:https://www.biorxiv.org/content/10.1101/2021.09.22.461331v3.