PolkaPox: Output

Introduction

This document describes the output produced by the pipeline. Sub-directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Overview

The pipeline is built using Nextflow and processes data into the following outputs:

Kraken2 - Fastq sequence classification
seqTK - Keep only sequences classified as Monkeypox Virus
FastP - Raw read QC and adapter + quality trimming
BWA mem - Read alignment to a reference genome
samtools - Read alignment depth + metrics
iVar - Consensus sequence generation and variant calling from reference-based assembly
Variant summaries - Summary table of variant calls for select coordinates if coords parameter is specified
Unicycler - De novo assembly from Fastp-trimmed reads
QUAST - Assembly quality
Graph_recon - Assembly graph resolution
MUMmer - Quantify assembly corrections
Final assembly - Polish de novo assembly
summarize_qc.py - Aggregate summary metrics in a tsv file per sample
MultiQC - Aggregate report describing QC related to read quality
Pipeline information - Report metrics generated during the workflow execution

Kraken2

Output files

kraken2/
- *.classified.fastq.gz: Classified read pairs.
- *.classifiedreads.txt: Report showing classification output for each read.
- *.report.txt: Taxonomic classification report.

Kraken2 documentation

seqTK

Output files

seqtk/
- *.fq.gz: Fastqs containing only sequences classified as Monkeypox Virus with Kraken2.

seqTK documentation

FastP

Output files

fastp/
- *.fastp.fastq.gz: Trimmed read output. If paired input, a trimmed file will exist for R1 and R2.
- *.fastp.html: QC report containing quality metrics.
- *.fastp.json: QC report in json file.
- *.fastp.txt: QC report in text file.

Fastp documentation

BWA mem

Output files

bwa/
- *.bam: Sorted bam alignment to reference genome.
- bwa/*: Reference genome index files

BWA mem documentation

samtools

Output files

samtools/
- *.depth.tsv: Coverage depth vs reference genome.
- *.flagstat: Alignment stats from samtools flagstat

samtools documentation

iVar

Output files

ivar/
- *.bwa.fa: Consensus generated from BWA MEM alignment to reference genome.
- *.bwa.mpileup*: Mpileup output from BWA MEM alignment to reference genome.
- *.ivar.tsv: Default ivar variant output with variants as tsv table for each sample.
- *.ivar.vcf: VCF converted from ivar variants tsv for each sample.

iVar documentation

Variant summaries

Output files

variant_summaries/
- all_samples.vcf.summary.txt: Concatenation of the sample-level summary files for regions of interest specified by the coords parameter.
- *_ivar_summary.txt: Sample-level summary files for regions of interest specified by the coords parameter.

Unicycler

Output files

unicycler/
- *.assembly.gfa.gz: Assembly graph output.
- *.scaffolds.fa.gz: Scaffold-level de novo genome assembly output.
- *.unicycler.log: Log of unicycler process.

Unicycler documentation

QUAST

Output files

quast/
- quast/*: QUAST output files.
- report.tsv: Summary of assembly metrics for all samples.

QUAST documentation

Graph reconstruction

Output files

graph_recon/
- *.assembly_asm.fasta: Assembled genome.
- *.assembly_longest.fasta: Longest input contig sequence from Unicycler
- *.assembly.summary: Summary QC metrics for graph reconstruction.
- *.assembly.log: Log of assembly graph reconstruction process.
- *contigs.fasta: Reformatted fasta file from the input gfa file.
graph_recon_mapping/
- *.bam: Sorted bam alignment of reads to reconstructed genome.
- *flagstat: Alignment stats from samtools flagstat
- bwa/*: Reconstructed genome index files

documentation

MUMmer dnadiff

Output files

mummer/
- *.report: Summary of assembly corrections detected with dnadiff.

MUMmer documentation

Final assembly

Output files

final_assembly/
- *.final.fa: Final assembly generated by de novo subworkflow
- *.draft.fa: Multi-contig draft assembly written only if Graph_Recon fails to reconstruct the Unicycler graph

QC Summary

Output files

sample_summary.tsv: A tsv file with fields summarizing QC metrics for various steps in the pipeline. Note that columns included varies with subworkflow executed. All columns will be present with --workflow full and --filter true, but other options will only include relevant outputs.
- Column summary:
  1. sample - Sample name
  2. reference_genome - Reference genome defined with --fasta
  3. total_raw_reads - Total count of raw input reads
  4. opx_read_count_kraken - Count of reads classified as orthopox with kraken
  5. opx_percent_kraken - Percentage of total reads classified as orthopox
  6. human_percent_kraken - Percentage of reads classified as human
  7. unclass_percent_kraken - Percentage of reads not classified
  8. kraken_db - kraken database defined with --kraken_db
  9. kraken_tax_ids - List of taxids used for classification with --kraken2_tax_ids
  10. filtered_read_count_fastp - Count of classified reads passing fastp trimming and filtering
  11. percent_reads_passed_fastp - Percentage of classified reads passing fastp
  12. percent_adapter_fastp - Percent of classified reads with adapter contamination
  13. gc_content_postfilter_fastp - Percent QC of filtered reads
  14. q30_rate_postfilter_fastp - Q30 of filtered reads
  15. percent_duplication_fastp - Percent duplication in filtered reads
  16. reads_mapped_bwa - Count of filtered reads mapping to the reference genome
  17. percent_mapped_bwa - Percent of filtered reads mapping to the reference genome
  18. average_depth_bwa - Average coverage depth of filtered reads mapped to the reference genome
  19. count_20xdepth_bwa - Count of reference positions with >20x coverage (breadth 20x)
  20. n_contigs_unicycler - Number of contigs produced by de novo assembly with Unicycler
  21. assembly_length_unicycler - Total length of assembled contigs
  22. n50_unicycler - N50 of Unicycler assembly
  23. mapped_reads_denovo - Count of filtered reads mapped to the final assembly
  24. percent_mapped_denovo - Percent of filtered reads mapped to the final assembly
  25. orientation_copy_number - Final assembly graph path
  26. sequence_length - Total length of final assembly
  27. itr_length - ITR length inferred from the assembly graph
  28. gfa_status - Status of assembly graph resolution (PASS|FAIL)
  29. gfa_notes - Detailed message of graph resolution status
  30. total_snps - Count of detected SNPs relative to the reference genome
  31. COORD1_SNPs - Count of filtered SNPs within defined coordinate range
  32. COORD2_SNPs - Count of filtered SNPs within defined coordinate range
  33. corrected_snps - Count of SNPs corrected when polishing the final assembly
  34. corrected_indels - Count of indels corrected when polishing the final assembly
  35. corrected_Ns - Count of Ns added to the final assembly due to low coverage

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- samplesheet.valid.csv: Reformatted samplesheet files used as input to the pipeline.
- software_versions.yml: Captures all software versions used within the workflow, sub-workflows, and modules.
- pipeline_report_*: Reports generated with --email / --email_on_fail parameters at runtime.
- execution_trace_*: Tracing files that contain information about each process executed in the pipeline.
- pipeline_dag_: Graphical representations of the directed acyclic graph corresponding to the workflow structure defined in the pipeline.
- execution_timeline_: Files that contain information about the execution timeline of tasks completed by the pipeline.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output.md

output.md

PolkaPox: Output

Introduction

Overview

Kraken2

seqTK

FastP

BWA mem

samtools

iVar

Variant summaries

Unicycler

QUAST

Graph reconstruction

MUMmer dnadiff

Final assembly

QC Summary

MultiQC

Pipeline information

Files

output.md

Latest commit

History

output.md

File metadata and controls

PolkaPox: Output

Introduction

Overview

Kraken2

seqTK

FastP

BWA mem

samtools

iVar

Variant summaries

Unicycler

QUAST

Graph reconstruction

MUMmer dnadiff

Final assembly

QC Summary

MultiQC

Pipeline information