Kids First Data Resource Center Oxford Nanopore Technologies Long Reads Alignment and Variant Calling Workflow

The Kids First Data Resource Center (KFDRC) Oxford Nanopore Technologies (ONT) Long Reads Alignment and Variant Calling Workflow is a Common Workflow Language (CWL) implementation of several software tools that take read data generated by ONT long-read sequencers and produce alignment and variant information. This pipeline was made possible thanks to significant software and support contributions from both Sentieon and Wang Genomics Lab. For more information on our collaborators, check out their websites:

Sentieon: https://www.sentieon.com/
Wang Genomics Lab: https://wglab.org/

Relevant Software and Versions

samtools head: 1.17
samtools fastq: 1.15.1
Sentieon Minimap2: 202112.01
Sentieon util sort: 202112.01
Sentieon LongReadSV: 202112.06
LongReadSum: 1.2.0
Sniffles: 2.0.7
CuteSV: 2.0.3
Nanocaller: 3.2.0

Input Files

input_unaligned_bam: The primary input of the ONT Long Reads Workflow is an unaligned BAM and associated index.
indexed_reference_fasta: Any suitable human reference genome. KFDRC uses Homo_sapiens_assembly38.fasta from the Broad Institute.

Output Files

cutesv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by CuteSV on the minimap2_aligned_bam.
longreadsum_bam_metrics: BGZIP TAR containing various metrics collected by LongReadSum from the minimap2_aligned_bam.
minimap2_aligned_bam: Indexed BAM file containing reads from the input_unaligned_bam aligned to the indexed_reference_fasta.
nanocaller_small_variants: BGZIP and TABIX indexed VCF containing small variant calls made by Nanocaller on the minimap2_aligned_bam.
sniffles_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sniffles on the minimap2_aligned_bam.
longreadsv_structural_variants: BGZIP and TABIX indexed VCF containing structural variant calls made by Sentieon LongReadSV on the minimap2_aligned_bam.

Generalized Process

Read group information (@RG) is harvested from the input_unaligned_bam header using samtools head and grep.
If the user provides the biospecimen_name input, that value replaces the SM value pulled in the preceding step.
Align input_unaligned_bam to indexed_reference_fasta with the above @RG information using samtools fastq, Sentieon Minimap2, and Sentieon sort.
Generate long reads alignment metrics from the minimap2_aligned_bam using LongReadSum.
Generate structural variant calls from the minimap2_aligned_bam using CuteSV.
Generate structural variant calls from the minimap2_aligned_bam using Sniffles.
Generate structural variant calls from the minimap2_aligned_bam using Sentieon LongReadSV.
Estimate mean depth of coverage of chr1 and chrX using samtools.
Generate small variant calls from the minimap2_aligned_bam using Nanocaller.
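
The first two steps above (harvesting @RG lines and overriding SM) can be sketched in Python. This is only an illustration of the logic, not the workflow's own code: the workflow uses samtools head and grep, and the function name `harvest_read_groups` and the sample name `BS_123` are hypothetical.

```python
def harvest_read_groups(header_text, biospecimen_name=None):
    """Return the @RG lines of a SAM/BAM header, optionally overriding SM.

    header_text is the plain-text header, e.g. as printed by `samtools head`.
    biospecimen_name, if given, replaces the SM value of every read group
    (assumption: SM is appended when a read group has no SM field).
    """
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # keep only read group records, like `grep '^@RG'`
        fields = line.split("\t")
        if biospecimen_name is not None:
            replaced = False
            for i, field in enumerate(fields):
                if field.startswith("SM:"):
                    fields[i] = "SM:" + biospecimen_name
                    replaced = True
            if not replaced:
                fields.append("SM:" + biospecimen_name)
        read_groups.append("\t".join(fields))
    return read_groups


if __name__ == "__main__":
    header = "@HD\tVN:1.6\n@RG\tID:rg1\tSM:old\tPL:ONT\n@PG\tID:x"
    # BS_123 is a made-up biospecimen name used only for this example.
    for rg in harvest_read_groups(header, "BS_123"):
        print(rg)
```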

Workflow Trivia

Nanocaller runtime is particularly sensitive to one of its inputs: mincov. Users should tune this value based on their understanding of the data, particularly its quality and coverage. In general, as coverage goes up, mincov should also go up to reduce noise. Even in the absence of user input, the value should scale with the input BAM; therefore, the workflow runs samtools coverage on chr1 to assess the mean depth of coverage. From there, mincov is set to meandepth / 4 for SNPs and meandepth / 8 for INDELs. INDELs get the more permissive value for the following reason: the mincov for SNP calling applies to all reads, but for INDEL calling it applies to the reads from each parental haplotype. A mincov of 8 for SNPs means each position needs at least 8 reads to be considered for SNP calling, whereas a mincov of 8 for INDELs requires 8 reads from each parental haplotype, or at least 16 reads in total. Therefore, to keep read support parity between SNPs and INDELs, the INDEL mincov should be half the SNP mincov.
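
The mincov scaling can be sketched as follows. This is a minimal illustration, not the workflow's implementation: the function names are hypothetical, and rounding down to an integer with a floor of 1 is an assumption, since the text above only gives the meandepth / 4 and meandepth / 8 ratios. The parser relies on the tab-separated `samtools coverage` table, whose header line begins with `#rname` and includes a `meandepth` column.

```python
def mean_depth_from_coverage(coverage_text, contig):
    """Extract meandepth for one contig from `samtools coverage` output."""
    lines = [line for line in coverage_text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    name_col = header.index("rname")
    depth_col = header.index("meandepth")
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[name_col] == contig:
            return float(fields[depth_col])
    raise ValueError("contig not found in coverage table: " + contig)


def nanocaller_mincov(mean_depth):
    """Return (snp_mincov, indel_mincov) scaled from the mean depth.

    Assumption: values are floored to integers and never drop below 1.
    """
    snp_mincov = max(1, int(mean_depth // 4))
    indel_mincov = max(1, int(mean_depth // 8))
    return snp_mincov, indel_mincov


if __name__ == "__main__":
    # Made-up coverage table for illustration only.
    table = (
        "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage"
        "\tmeandepth\tmeanbaseq\tmeanmapq\n"
        "chr1\t1\t248956422\t900000\t230000000\t92.4\t32.5\t20.1\t58.3\n"
    )
    depth = mean_depth_from_coverage(table, "chr1")
    print(nanocaller_mincov(depth))
```

With a mean depth of 32.5, this yields a SNP mincov of 8 and an INDEL mincov of 4; because the INDEL threshold applies per haplotype, both settings then require roughly 8 supporting reads in total.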
Input sample sex matters to Nanocaller. With SNP mode and the phase flag set, Nanocaller outputs phased BAM files for all diploid chromosomes in the sample. For male samples, this means phased BAMs are produced for the autosomes (chr1-22); female samples have an additional phased BAM for chrX. If the user does not provide the sex of the sample as an input, the workflow attempts to infer it. The workflow uses samtools coverage to calculate the mean depth of coverage for chrX. It then compares that value with the mean depth of coverage calculated for chr1 (see above): if the chrX/chr1 mean depth ratio is 0.75 or more, the workflow presumes the sample is female and therefore has a diploid X.
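
The chrX/chr1 ratio check amounts to a single comparison. The sketch below is illustrative only; the function name `has_diploid_x` is hypothetical, and rejecting a non-positive chr1 depth is an added assumption to avoid dividing by zero.

```python
def has_diploid_x(chr1_mean_depth, chrx_mean_depth, threshold=0.75):
    """Return True when the chrX/chr1 mean depth ratio meets the threshold.

    A True result corresponds to the workflow presuming a female sample
    with a diploid X; False corresponds to a presumed haploid X.
    """
    if chr1_mean_depth <= 0:
        raise ValueError("chr1 mean depth must be positive")
    return chrx_mean_depth / chr1_mean_depth >= threshold


if __name__ == "__main__":
    # Made-up depths: X at roughly the autosomal depth vs. roughly half of it.
    print(has_diploid_x(30.0, 29.0))
    print(has_diploid_x(30.0, 15.0))
```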

Basic Info

D3b dockerfiles
Testing Tools:
- Seven Bridges Cavatica Platform
- Common Workflow Language reference implementation (cwltool)

References

KFDRC AWS s3 bucket: s3://kids-first-seq-data/broad-references/
Cavatica: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
