Summarise and plot data from long-read ONT (direct RNA/cDNA) BAM files

josiegleeson/BamSlam

BamSlam

This script was written for long-read Oxford Nanopore Technologies direct RNA/cDNA sequencing data. It uses BAM files produced after mapping with minimap2 to the reference transcriptome. It will output a summary file and plots from your aligned reads. This script was used in: https://doi.org/10.1093/nar/gkab1129.

Inputs:

To obtain a BAM file, align your FASTQ/FASTA files to the reference transcriptome with minimap2:

minimap2 -ax map-ont --sam-hit-only transcriptome.fasta reads.fastq | samtools view -bh > aligned_reads.bam

If minimap2 is not run with --sam-hit-only, remove unmapped reads before running BamSlam, as they slow it down. You can also use a BAM file output by NanoCount: https://github.com/a-slide/NanoCount.
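If a BAM file already contains unmapped reads, one way to drop them is samtools' flag filter. The following is a minimal sketch, assuming samtools is installed; the function names and file names are illustrative and not part of BamSlam.

```shell
# Illustrative helper (not part of BamSlam): keep only mapped alignments.
# -F 4 excludes records whose SAM FLAG has the 0x4 (read unmapped) bit set.
drop_unmapped() {
  # $1 = input BAM, $2 = output BAM
  samtools view -b -F 4 "$1" > "$2"
}

# The same check applied to a single FLAG value, using shell arithmetic.
is_unmapped() {
  [ $(( $1 & 4 )) -ne 0 ]
}

# Example usage (hypothetical file names):
#   drop_unmapped aligned_reads.bam mapped_reads.bam
```

Filtering once up front means the unmapped records never have to be read by the R script.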

Requirements:

R (tested with v4.2.2)

R packages:

  • GenomicAlignments (Bioconductor)
  • dplyr
  • tidyr
  • tibble
  • data.table
  • ggplot2
  • viridis
  • hexbin
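The packages above could be installed from the terminal along these lines. This is a sketch rather than an official install procedure: it assumes R and Rscript are on the PATH, that the CRAN mirror URL shown is reachable, and that GenomicAlignments is installed through BiocManager. The helper names are hypothetical.

```shell
# Packages listed in the requirements above.
CRAN_PKGS="dplyr tidyr tibble data.table ggplot2 viridis hexbin"
BIOC_PKGS="GenomicAlignments"

# Turn a space-separated list into an R character vector, e.g. c("a","b").
r_vector() {
  printf 'c('
  sep=''
  for pkg in $1; do
    printf '%s"%s"' "$sep" "$pkg"
    sep=','
  done
  printf ')'
}

# Illustrative installer (not part of BamSlam); needs R and network access.
install_bamslam_deps() {
  Rscript -e "install.packages($(r_vector "$CRAN_PKGS"), repos = 'https://cloud.r-project.org')"
  Rscript -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos = 'https://cloud.r-project.org'); BiocManager::install($(r_vector "$BIOC_PKGS"))"
}

# Example usage:
#   install_bamslam_deps
```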

Usage:

Download or copy the R script (BamSlam.R) from this repository and run it from the terminal as follows:

Rscript BamSlam.R [DATA_TYPE] [BAM_FILE] [OUT_PREFIX]
Rscript BamSlam.R rna undiff1_5Y.bam undiff1

DATA_TYPE: either cdna or rna
BAM_FILE: a BAM file of alignments to the transcriptome
OUT_PREFIX: output file prefix

The script takes approximately 5 minutes per million reads.
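To plan a run over several samples, the figure of roughly 5 minutes per million reads can be turned into an estimate, and the usage line can be wrapped in a loop. The sketch below assumes BamSlam.R is in the current directory and that each output prefix is taken from the BAM file name; all function names are hypothetical.

```shell
# Rough runtime estimate from the figure of ~5 minutes per million reads.
estimate_minutes() {
  # $1 = number of reads
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 5 / 1000000 }'
}

# Derive an output prefix from a BAM path: /data/undiff1_5Y.bam -> undiff1_5Y
prefix_from_bam() {
  basename "$1" .bam
}

# Illustrative batch runner (not part of BamSlam).
# $1 = data type (cdna or rna); remaining arguments = BAM files.
run_bamslam_batch() {
  data_type="$1"
  shift
  for bam in "$@"; do
    Rscript BamSlam.R "$data_type" "$bam" "$(prefix_from_bam "$bam")"
  done
}

# Example usage (hypothetical file names).
# -F 0x900 skips secondary and supplementary records, so the count
# approximates reads rather than alignments.
#   estimate_minutes "$(samtools view -c -F 0x900 undiff1_5Y.bam)"
#   run_bamslam_batch rna undiff1_5Y.bam undiff2_5Y.bam
```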

Outputs:

Data

  • A summary CSV file of your alignments.
  • A CSV file of all input alignments (primary and secondary) for downstream analysis.
  • A CSV file of each unique transcript identified in the data with its corresponding median read coverage.

Plots

  • Histogram of read coverages (full-length reads cutoff/dashed line = 0.95)
  • Histogram of known transcript lengths (per distinct transcript in the data)
  • Histogram of known transcript lengths (per read)
  • Histogram density plot of known transcript length vs coverage fractions
  • Bar chart of the secondary alignments
  • Density plot of the read accuracies

Details of summary file metrics

  • Total number of reads
  • Number of reads representing full-length transcripts (reads with coverage fractions > 0.95)
  • Percentage of reads representing full-length transcripts
  • Median read coverage fraction (primary alignments)
  • Median alignment length (primary alignments)
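  • Median accuracy (primary alignments), calculated from the CIGAR string and the NM (edit distance) tag as: (nbrM+nbrI+nbrD-NM)/(nbrM+nbrI+nbrD), where nbrM, nbrI and nbrD are the total lengths of the M, I and D operations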
  • Number of reads with no secondary alignments
  • Percentage of reads with no secondary alignments
  • Total number of distinct transcripts identified in the data
  • Median coverage fraction of all unique transcripts (median value of the median read coverages per transcript)
  • Median length of all unique transcripts identified (median length in nt of the annotated length of transcripts identified)
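Two of the metrics above can be reproduced by hand for a quick sanity check. The sketch below is not BamSlam's own code: it applies the accuracy formula to a single CIGAR string and NM value, and computes the percentage of coverage fractions above the 0.95 cutoff. The function names are hypothetical, and only M, I and D operations are counted, as in the formula.

```shell
# Accuracy from a CIGAR string and NM value:
# (nbrM + nbrI + nbrD - NM) / (nbrM + nbrI + nbrD)
# Clipping (S, H) and other operations are parsed but not counted.
cigar_accuracy() {
  # $1 = CIGAR string, $2 = NM (edit distance)
  awk -v cigar="$1" -v nm="$2" 'BEGIN {
    total = 0
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      if (op == "M" || op == "I" || op == "D") total += len
      cigar = substr(cigar, RLENGTH + 1)
    }
    if (total == 0) { print "NA"; exit }
    printf "%.4f\n", (total - nm) / total
  }'
}

# Percentage of reads counted as full-length: coverage fraction > cutoff.
# Reads one coverage fraction per line from stdin; $1 = cutoff (default 0.95).
full_length_pct() {
  awk -v cutoff="${1:-0.95}" '
    NF { n++; if ($1 + 0 > cutoff + 0) full++ }
    END { if (n) printf "%.2f\n", 100 * full / n; else print "NA" }'
}

# Example usage:
#   cigar_accuracy 10M2I5M1D3M 4                     # (21 - 4) / 21 -> 0.8095
#   printf '0.99\n0.50\n0.96\n' | full_length_pct    # 2 of 3 reads -> 66.67
```

Note that a read with a coverage fraction of exactly 0.95 is not counted as full-length, because the comparison is strict.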
