This script was written for long-read Oxford Nanopore Technologies direct RNA/cDNA sequencing data. It uses BAM files produced after mapping with minimap2 to the reference transcriptome. It will output a summary file and plots from your aligned reads. This script was used in: https://doi.org/10.1093/nar/gkab1129.
To obtain a BAM file align your FASTQ/FASTA files to the transcriptome with minimap2.
minimap2 -ax map-ont --sam-hit-only transcriptome.fasta reads.fastq | samtools view -bh > aligned_reads.bam
If minimap2 is not run with --sam-hit-only you should remove unmapped reads prior to running BamSlam to avoid slowing it down. You can also input a BAM file output from NanoCount: https://github.com/a-slide/NanoCount.
R (tested with v4.2.2) R packages:
- GenomicAlignments (Bioconductor)
- dplyr
- tidyr
- tibble
- data.table
- ggplot2
- viridis
- hexbin
Download/copy the Rscript from this repository and run it from the terminal as follows:
Rscript BamSlam.R [DATA_TYPE] [BAM_FILE] [OUT_PREFIX]
Rscript BamSlam.R rna undiff1_5Y.bam undiff1
DATA_TYPE, enter either: cdna, rna
BAM_FILE, a BAM file of alignments to the transcriptome
OUT_PREF, output file prefix
The script takes approximately 5 minutes per million reads.
- A summary CSV file of your alignments.
- A CSV file of all input alignments (primary and secondary) for downstream analysis.
- A CSV file of each unique transcript identified in the data with its corresponding median read coverage.
-
Histogram of read coverages (full-length reads cutoff/dashed line = 0.95):
-
Histogram of known transcript lengths (per distinct transcript in the data):
-
Histogram of known transcript lengths (per read):
-
Histogram density plot of known transcript length vs coverage fractions:
-
Bar chart of the secondary alignments:
-
Density plot of the read accuracies:
- Total number of reads
- Number of reads representing full-length transcripts (reads with coverage fractions > 0.95)
- Percentage of reads representing full-length transcripts
- Median read coverage fraction (primary alignments)
- Median alignment length (primary alignments)
- Median accuracy (primary alignments) (calculated from CIGAR strings as: (nbrM+nbrI+nbrD-NM)/(nbrM+nbrI+nbrD))
- Number of reads with no secondary alignments
- Percentage of reads with no secondary alignments
- Total number of distinct transcripts identified in the data
- Median coverage fraction of all unique transcripts (median value of the median read coverages per transcript)
- Median length of all unique transcripts identified (median length in nt of the annotated length of transcripts identified)