This document describes the output produced by the pipeline. Sub-directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data into the following outputs:
- Kraken2 - Fastq sequence classification
- seqTK - Keep only sequences classified as Monkeypox Virus
- FastP - Raw read QC and adapter + quality trimming
- BWA mem - Read alignment to a reference genome
- samtools - Read alignment depth + metrics
- iVar - Consensus sequence generation and variant calling from reference-based assembly
- Variant summaries - Summary table of variant calls for select coordinates if the coords parameter is specified
- Unicycler - De novo assembly from Fastp-trimmed reads
- QUAST - Assembly quality
- Graph_recon - Assembly graph resolution
- MUMmer - Quantify assembly corrections
- Final assembly - Polish de novo assembly
- summarize_qc.py - Aggregate summary metrics in a tsv file per sample
- MultiQC - Aggregate report describing QC related to read quality
- Pipeline information - Report metrics generated during the workflow execution
Output files
kraken2/
*.classified.fastq.gz
: Classified read pairs.
*.classifiedreads.txt
: Report showing classification output for each read.
*.report.txt
: Taxonomic classification report.
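As an illustration of how the taxonomic classification report can be inspected, the sketch below reads per-taxon percentages from a report. It assumes the standard six-column, tab-separated Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, name); the function name and the example rows are hypothetical, and a report written with extra minimizer columns would need different column indices.

```python
import csv
import io


def kraken_report_percentages(handle):
    """Map taxid -> percentage of reads assigned to the clade rooted at that taxon.

    Assumes the standard six-column Kraken2 report layout.
    """
    percentages = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        percentages[row[4].strip()] = float(row[0])
    return percentages


# Made-up rows used only to show the layout; not real pipeline output.
example_report = io.StringIO(
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t0\tR\t1\troot\n"
    " 60.00\t600\t600\tG\t10242\t    Orthopoxvirus\n"
)
pct = kraken_report_percentages(example_report)
print(pct["10242"], pct["0"])
```

For a real run, the handle would be an opened kraken2/*.report.txt file instead of the in-memory example.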
Output files
seqtk/
*.fq.gz
: Fastqs containing only sequences classified as Monkeypox Virus with Kraken2.
Output files
fastp/
*.fastp.fastq.gz
: Trimmed read output. If paired input, a trimmed file will exist for R1 and R2.
*.fastp.html
: QC report containing quality metrics.
*.fastp.json
: QC report in json file.
*.fastp.txt
: QC report in text file.
Output files
bwa/
*.bam
: Sorted bam alignment to reference genome.
bwa/*
: Reference genome index files.
Output files
samtools/
*.depth.tsv
: Coverage depth vs reference genome.
*.flagstat
: Alignment stats from samtools flagstat.
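The depth file can be summarized with a few lines of Python. The sketch below assumes the usual three-column samtools depth layout (reference name, position, depth) and computes a mean depth plus the number of positions above a threshold. The function name is hypothetical, and this is not necessarily how the pipeline derives its own summary metrics: for example, positions with zero coverage may be absent from the file, which would raise the mean.

```python
import io


def depth_summary(handle, min_depth=20):
    """Return (mean depth, number of positions with depth > min_depth)."""
    total = 0
    positions = 0
    above = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip a header line or malformed rows
        depth = int(fields[2])
        total += depth
        positions += 1
        if depth > min_depth:
            above += 1
    mean = total / positions if positions else 0.0
    return mean, above


# Made-up rows used only to show the layout; not real pipeline output.
example_depth = io.StringIO("ref\t1\t10\nref\t2\t30\nref\t3\t50\n")
mean_depth, n_above = depth_summary(example_depth)
print(mean_depth, n_above)
```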
Output files
ivar/
*.bwa.fa
: Consensus generated from BWA MEM alignment to reference genome.
*.bwa.mpileup*
: Mpileup output from BWA MEM alignment to reference genome.
*.ivar.tsv
: Default ivar variant output with variants as tsv table for each sample.
*.ivar.vcf
: VCF converted from ivar variants tsv for each sample.
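To give a sense of how the ivar variants table can be queried, the sketch below counts single-nucleotide changes, optionally within a coordinate range. It assumes the table has the POS and ALT columns that ivar variants normally writes, and that insertions and deletions are encoded with a leading + or - in ALT. The function name and the example rows are hypothetical, no filtering on the PASS column is applied, and the pipeline's own summaries may be computed differently.

```python
import csv
import io


def count_snps(handle, start=None, end=None):
    """Count SNP rows in an ivar variants table, optionally within [start, end]."""
    count = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["ALT"].startswith(("+", "-")):
            continue  # insertion or deletion, not a SNP
        pos = int(row["POS"])
        if start is not None and pos < start:
            continue
        if end is not None and pos > end:
            continue
        count += 1
    return count


# Made-up rows with a subset of columns, used only to show the layout.
example_variants = (
    "REGION\tPOS\tREF\tALT\n"
    "ref\t50\tA\tG\n"
    "ref\t150\tC\tT\n"
    "ref\t160\tG\t+AT\n"
)
total_snps = count_snps(io.StringIO(example_variants))
snps_in_range = count_snps(io.StringIO(example_variants), start=100, end=200)
print(total_snps, snps_in_range)
```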
Output files
variant_summaries/
all_samples.vcf.summary.txt
: Concatenation of the sample-level summary files for regions of interest specified by the coords parameter.
*_ivar_summary.txt
: Sample-level summary files for regions of interest specified by the coords parameter.
Output files
unicycler/
*.assembly.gfa.gz
: Assembly graph output.
*.scaffolds.fa.gz
: Scaffold-level de novo genome assembly output.
*.unicycler.log
: Log of unicycler process.
Output files
quast/
quast/*
: QUAST output files.
report.tsv
: Summary of assembly metrics for all samples.
Output files
graph_recon/
*.assembly_asm.fasta
: Assembled genome.
*.assembly_longest.fasta
: Longest input contig sequence from Unicycler.
*.assembly.summary
: Summary QC metrics for graph reconstruction.
*.assembly.log
: Log of assembly graph reconstruction process.
*contigs.fasta
: Reformatted fasta file from the input gfa file.
graph_recon_mapping/
*.bam
: Sorted bam alignment of reads to reconstructed genome.
*flagstat
: Alignment stats from samtools flagstat.
bwa/*
: Reconstructed genome index files.
Output files
mummer/
*.report
: Summary of assembly corrections detected with dnadiff.
Output files
final_assembly/
*.final.fa
: Final assembly generated by de novo subworkflow.
*.draft.fa
: Multi-contig draft assembly written only if Graph_Recon fails to reconstruct the Unicycler graph.
Output files
sample_summary.tsv
: A tsv file with fields summarizing QC metrics for various steps in the pipeline. Note that the columns included vary with the subworkflow executed. All columns will be present with --workflow full and --filter true, but other options will only include relevant outputs.
- Column summary:
  - sample - Sample name
  - reference_genome - Reference genome defined with --fasta
  - total_raw_reads - Total count of raw input reads
  - opx_read_count_kraken - Count of reads classified as orthopox with kraken
  - opx_percent_kraken - Percentage of total reads classified as orthopox
  - human_percent_kraken - Percentage of reads classified as human
  - unclass_percent_kraken - Percentage of reads not classified
  - kraken_db - kraken database defined with --kraken_db
  - kraken_tax_ids - List of taxids used for classification with --kraken2_tax_ids
  - filtered_read_count_fastp - Count of classified reads passing fastp trimming and filtering
  - percent_reads_passed_fastp - Percentage of classified reads passing fastp
  - percent_adapter_fastp - Percent of classified reads with adapter contamination
  - gc_content_postfilter_fastp - Percent GC of filtered reads
  - q30_rate_postfilter_fastp - Q30 of filtered reads
  - percent_duplication_fastp - Percent duplication in filtered reads
  - reads_mapped_bwa - Count of filtered reads mapping to the reference genome
  - percent_mapped_bwa - Percent of filtered reads mapping to the reference genome
  - average_depth_bwa - Average coverage depth of filtered reads mapped to the reference genome
  - count_20xdepth_bwa - Count of reference positions with >20x coverage (breadth 20x)
  - n_contigs_unicycler - Number of contigs produced by de novo assembly with Unicycler
  - assembly_length_unicycler - Total length of assembled contigs
  - n50_unicycler - N50 of Unicycler assembly
  - mapped_reads_denovo - Count of filtered reads mapped to the final assembly
  - percent_mapped_denovo - Percent of filtered reads mapped to the final assembly
  - orientation_copy_number - Final assembly graph path
  - sequence_length - Total length of final assembly
  - itr_length - ITR length inferred from the assembly graph
  - gfa_status - Status of assembly graph resolution (PASS|FAIL)
  - gfa_notes - Detailed message of graph resolution status
  - total_snps - Count of detected SNPs relative to the reference genome
  - COORD1_SNPs - Count of filtered SNPs within defined coordinate range
  - COORD2_SNPs - Count of filtered SNPs within defined coordinate range
  - corrected_snps - Count of SNPs corrected when polishing the final assembly
  - corrected_indels - Count of indels corrected when polishing the final assembly
  - corrected_Ns - Count of Ns added to the final assembly due to low coverage
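Because the summary is a plain tsv with one row per sample, it can be screened with the Python standard library. The sketch below lists samples whose gfa_status is FAIL. The column names come from the column summary above; the function name and the example rows are hypothetical. Since the columns present depend on the subworkflow executed, the lookup uses .get so that a missing gfa_status column yields an empty result rather than an error.

```python
import csv
import io


def failed_graph_samples(handle):
    """Return names of samples whose assembly graph resolution status is FAIL."""
    return [
        row["sample"]
        for row in csv.DictReader(handle, delimiter="\t")
        if row.get("gfa_status") == "FAIL"
    ]


# Made-up rows with a subset of columns, used only to show the layout.
example_summary = io.StringIO(
    "sample\tgfa_status\tgfa_notes\n"
    "sampleA\tPASS\texample note\n"
    "sampleB\tFAIL\texample note\n"
)
failed = failed_graph_samples(example_summary)
print(failed)
```

For a real run, the handle would be an opened sample_summary.tsv from the results directory.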
Output files
multiqc/
multiqc_report.html
: a standalone HTML file that can be viewed in your web browser.
multiqc_data/
: directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/
: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Output files
pipeline_info/
samplesheet.valid.csv
: Reformatted samplesheet files used as input to the pipeline.
software_versions.yml
: Captures all software versions used within the workflow, sub-workflows, and modules.
pipeline_report_*
: Reports generated with --email / --email_on_fail parameters at runtime.
execution_trace_*
: Tracing files that contain information about each process executed in the pipeline.
pipeline_dag_
: Graphical representations of the directed acyclic graph corresponding to the workflow structure defined in the pipeline.
execution_timeline_
: Files that contain information about the execution timeline of tasks completed by the pipeline.
Nextflow can generate various reports about the running and execution of the pipeline. These reports help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.