A set of tools designed for ONT sequencing analysis, enabling the exploration and comparison of sequencing outcomes across experimental conditions.

phac-nml/sequenoscope

Sequenoscope

A tool for analyzing sequencing run outputs primarily from adaptive sampling experiments and Oxford Nanopore Technology sequencers.

Introduction

Analyzing and interpreting sequencing data is a fundamental task in bioinformatics. With the advent of ONT adaptive sampling, specialized tools are needed to visualize and assess how effectively a sequencing run enriched or depleted its targets. To address this need, we have developed a bioinformatics pipeline consisting of three modules: analyze, plot, and filter_ONT. The pipeline aims to give researchers a fast and intuitive workflow for processing and analyzing sequencing data, especially from ONT adaptive sampling runs, so that they can gain interpretable insights into their datasets with minimal upfront effort.

The analyze module serves as the core component of our pipeline. It takes an input FASTQ file, a reference FASTA file, and an optional sequencing summary file from ONT sequencers or base callers. Leveraging fastp, minimap2, pysam, and mash, the module filters the input FASTQ file, maps it to the reference FASTA file, and generates a sequence manifest txt file and a summary sequence manifest txt file. These files include key sequencing statistics such as read length, read quality (Q score), mapping efficiency, and coverage depth. For an in-depth explanation of all statistics provided, please refer to the report format section below.

The plot module complements the analyze module by rendering its output as interactive plots. It takes as input a "test" and a "control" directory, which represent different testing conditions and contain the manifest and manifest summary txt files generated by the analyze module. From these files, the plot module generates visualizations that aid in interpreting the sequencing data. Please note: this module is designed for comparative analysis where two testing conditions are present and can be compared.

The filter_ONT module is designed for filtering and subsetting raw ONT reads. It uses a sequencing summary file to let researchers precisely filter reads on customized criteria, including channel, sequencing decision, and other parameters.

Our bioinformatics pipeline offers a powerful tool for researchers working with ONT sequencing data. Whether you are exploring metagenomic sample composition, investigating adaptive sampling for your project, or conducting a comparative analysis of different methods in your lab, our pipeline can streamline your analyses and provide valuable insights into your genomic datasets using visual aids and easy-to-understand outputs.

Dependencies

  • Python: >=3.7.12, <4
  • fastp: >=0.22.0
  • mash: >=2.3
  • minimap2: >=2.26
  • seqtk: >=1.4
  • samtools: >=1.6

Python Packages

  • pysam: >=0.16.0
  • plotly: >=5.16.1

Installation

Option 1: As a conda package (Recommended)

Install the latest released version from conda:

    conda create -n sequenoscope -c bioconda -c conda-forge sequenoscope

Option 2: As a PyPI package

Coming soon

Install using pip:

    pip install sequenoscope

Option 3: Install from source

Coming soon

If you wish to install sequenoscope from source, please first ensure these dependencies are installed and configured on your system:

  • python >=3.7.12, <4
  • fastp >=0.22.0
  • mash >=2.3
  • minimap2 >=2.26
  • seqtk >=1.4
  • samtools >=1.6
  • pysam >=0.16.0
  • plotly >=5.16.1

Install the latest commit from the master branch directly from GitHub:

    pip install git+https://github.com/phac-nml/sequenoscope.git

Workflow Example

In this section, we will walk through a simple workflow using mock data to demonstrate how to use each module of sequenoscope. The mock data directory contains the following files:

mock_data/
├── mock_adaptive_sampling.fastq
├── mock_control.fastq
├── mock_sequencing_summary.txt
├── mock.fastq
└── mock_reference.fasta

Our goal is to:

  1. Use the filter_ONT module to subset raw FASTQ reads into two sets representing different channel ranges.
  2. Run the analyze module on both sets (treated as control and adaptive sampling datasets).
  3. Use the plot module to visualize and compare the results.

This workflow is meant to provide a hands-on example that you can easily follow with your own data.


Step 1: Filtering Reads with filter_ONT

First, we will create a dataset that simulates an adaptive sampling scenario by filtering reads by channel. Let’s start by extracting reads from channel 1 to 256 from our mock.fastq dataset using the filter_ONT module. This will give us a subset similar to mock_adaptive_sampling.fastq.

Command:

sequenoscope filter_ONT --input_fastq mock.fastq \
                        --input_summary mock_sequencing_summary.txt \
                        -o mock_filter_ONT \
                        -min_ch 1 \
                        -max_ch 256

What this does:

  • Takes reads from mock.fastq that come from channels 1 to 256.
  • Outputs a filtered subset in mock_filter_ONT/sample_filtered_fastq_subset.fastq, which should be identical to mock_adaptive_sampling.fastq.

If desired, you could similarly generate the control dataset by adjusting the channel range (e.g., -min_ch 257 -max_ch 512) to create a mock_control.fastq. However, since we already have mock_control.fastq available, we’ll skip that step for now to keep things simple.

Output Directory Structure:

filter_ONT module (mock_filter_ONT)

mock_filter_ONT/
├── filter.log
├── sample_filtered_fastq_subset.fastq
└── sample_read_id_list.csv
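
Conceptually, a channel filter keeps the reads whose channel value in the sequencing summary falls inside the requested range. The sketch below illustrates that idea with awk on a tiny made-up summary. It is not sequenoscope's implementation, and it assumes a tab-delimited summary with a header row that names `read_id` and `channel` columns. The file name and read IDs are hypothetical.

```shell
#!/bin/sh
# Illustration only: select read IDs by channel range from a sequencing summary.
# Assumption: tab-delimited file whose header row names read_id and channel.
workdir=$(mktemp -d)
summary="$workdir/toy_sequencing_summary.txt"   # hypothetical file name

# Tiny made-up summary: three reads on channels 12, 256 and 300.
printf 'read_id\tchannel\tstart_time\n'  > "$summary"
printf 'read_a\t12\t1.5\n'              >> "$summary"
printf 'read_b\t256\t2.0\n'             >> "$summary"
printf 'read_c\t300\t3.2\n'             >> "$summary"

# Look up column positions from the header, then keep reads with
# min_ch <= channel <= max_ch (both bounds inclusive).
selected=$(awk -F '\t' -v min_ch=1 -v max_ch=256 '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $col["channel"] >= min_ch && $col["channel"] <= max_ch { print $col["read_id"] }
' "$summary")

echo "$selected"
```

Looking columns up by header name rather than by position keeps the sketch usable when the summary file has a different column order.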

Step 2: Running the analyze Module

Next, we run the analyze module on both the control and adaptive sampling datasets. This step will generate various output files including manifest files, BAM alignments, and summary statistics.

Command for Control Dataset:

sequenoscope analyze --input_fastq mock_control.fastq \
                     --input_reference mock_reference.fasta \
                     -seq_sum mock_sequencing_summary.txt \
                     -o mock_control_results \
                     -seq_type SE \
                     -op control

Explanation:

  • --input_fastq mock_control.fastq: The control dataset FASTQ file.
  • --input_reference mock_reference.fasta: Reference genome or sequence.
  • -seq_sum mock_sequencing_summary.txt: The sequencing summary file from ONT.
  • -o mock_control_results: Output directory.
  • -seq_type SE: Single-end sequencing.
  • -op control: A prefix for output files.

Control Output Directory Structure:

analyze module (mock_control_results)

mock_control_results/
├── analyze.log
├── control_fastp_output.fastp.fastq
├── control_fastp_output.html
├── control_fastp_output.json
├── control_manifest_summary.txt
├── control_manifest.txt
├── control_mapped_bam.bam
├── control_mapped_bam.bam.bai
├── control_mapped_fastq.fastq
├── control_mapped_sam.sam
├── control_mash_hash.msh
└── control_read_list.txt

Command for Adaptive Sampling Dataset:

sequenoscope analyze --input_fastq mock_adaptive_sampling.fastq \
                     --input_reference mock_reference.fasta \
                     -seq_sum mock_sequencing_summary.txt \
                     -o mock_adaptive_sampling_results \
                     -seq_type SE \
                     -op adaptive_sampling

Explanation:

  • mock_adaptive_sampling.fastq represents the dataset filtered by filter_ONT (or provided).
  • The rest of the parameters are analogous to the control dataset.
  • -op adaptive_sampling tags output files with "adaptive_sampling" for clarity.

Adaptive Sampling Output Directory Structure:

analyze module (mock_adaptive_sampling_results)

mock_adaptive_sampling_results/
├── adaptive_sampling_fastp_output.fastp.fastq
├── adaptive_sampling_fastp_output.html
├── adaptive_sampling_fastp_output.json
├── adaptive_sampling_manifest_summary.txt
├── adaptive_sampling_manifest.txt
├── adaptive_sampling_mapped_bam.bam
├── adaptive_sampling_mapped_bam.bam.bai
├── adaptive_sampling_mapped_fastq.fastq
├── adaptive_sampling_mapped_sam.sam
├── adaptive_sampling_mash_hash.msh
├── adaptive_sampling_read_list.txt
└── analyze.log

Step 3: Visualizing Results with the plot Module

Finally, we use the plot module to compare the control and adaptive sampling datasets. For this example, we will use hours as the time bin due to truncated data in the mock dataset.

Command:

sequenoscope plot -T mock_adaptive_sampling_results/ \
                  -C mock_control_results/ \
                  -o mock_comparison_plots \
                  -op mock \
                  -AS \
                  -bin hours

Explanation:

  • -T mock_adaptive_sampling_results/: Test (adaptive sampling) directory.
  • -C mock_control_results/: Control directory.
  • -o mock_comparison_plots: Output directory for plots.
  • -op mock: Prefix for output files.
  • -AS: Enable adaptive sampling decision charts.
  • -bin hours: Use hourly bins for time-based decision charts.

Plot Output Directory Structure:

plot module (mock_comparison_plots)

mock_comparison_plots/
├── mock_control_cumulative_decision_bar_chart.html
├── mock_control_independent_decision_bar_chart.html
├── mock_ratio_bar_chart.html
├── mock_read_len_comparison_plot.html
├── mock_read_qscore_comparison_plot.html
├── mock_source_file_taxon_covered_bar_chart.html
├── mock_stat_results.csv
├── mock_test_cumulative_decision_bar_chart.html
├── mock_test_independent_decision_bar_chart.html
└── plot.log

Summary

In this workflow example, we:

  1. Used filter_ONT to subset reads from a mock dataset by channel number.
  2. Applied analyze to both the control and adaptive sampling datasets, generating manifest files and alignment statistics.
  3. Visualized and compared the results using plot, focusing on adaptive sampling decisions and coverage metrics.

By following these steps, you can quickly get started with sequenoscope and adapt the workflow to suit your own data and research needs.
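
The two analyze runs and the plot step can also be scripted together. The sketch below reuses only the commands shown in Steps 2 and 3. The `run` and `workflow` functions and the `DRY_RUN` variable are illustrative additions, not part of sequenoscope. By default the script only prints the commands so they can be reviewed; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch: run the Step 2 and Step 3 commands of this example from one script.
# DRY_RUN, run() and workflow() are hypothetical helpers, not sequenoscope features.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

workflow() {
  # Step 2: analyze the control and adaptive sampling datasets.
  for sample in control adaptive_sampling; do
    run sequenoscope analyze --input_fastq "mock_${sample}.fastq" \
        --input_reference mock_reference.fasta \
        -seq_sum mock_sequencing_summary.txt \
        -o "mock_${sample}_results" \
        -seq_type SE \
        -op "$sample" || return 1
  done

  # Step 3: compare the two result directories.
  run sequenoscope plot -T mock_adaptive_sampling_results/ \
      -C mock_control_results/ \
      -o mock_comparison_plots \
      -op mock \
      -AS \
      -bin hours
}

workflow
```

Stopping at the first failed analyze run avoids plotting from an incomplete results directory.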

Use-case Example

To demonstrate the practical application of our pipeline, consider a scenario where a researcher conducts adaptive sampling using an ONT sequencer. In this example, the researcher divides the sequencer channels into two sets: one half for adaptive sampling enrichment and the other half for regular sequencing as a control.

  • Utilizing our filter_ONT module, the researcher can create two distinct FASTQ files (one for channels 1-256 and one for channels 257-512), each defined by the minimum and maximum channel of its half of the flow cell.

  • These files are then processed separately through our analyze module, generating two datasets – one for the test (adaptive sampling) and one for the control (regular sequencing).

  • Finally, by employing the plot module, the researcher can visually assess the effectiveness of the adaptive sampling in their experiment. This example shows how Sequenoscope facilitates data processing and analysis, enhancing the researcher's ability to draw meaningful conclusions from their ONT sequencing data.

Usage

If you run sequenoscope, you should see the following usage statement:

    Usage: sequenoscope <command> <required arguments>
    
    To get full help for a command use one of:
    sequenoscope <command> -h
    sequenoscope <command> --help
    
    
    Available commands:
    
    analyze     map reads to a target and produce a report with sequencing statistics
    plot        generate plots based on directories with seq manifest files
    filter_ONT  filter reads from a FASTQ file based on a sequencing summary file

If you run sequenoscope analyze -h or sequenoscope analyze --help, you should see the following options and usage guidelines:

    usage: sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.fasta> -o <out> -seq_type <sr>[options]
    For help use: sequenoscope analyze -h or sequenoscope analyze --help
    
    sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
    
    Arguments:
      -h, --help            show this help message and exit
      --input_fastq  [ ...]
                            [REQUIRED] Path to ***EITHER 1 or 2*** fastq files to process.
      --input_reference     [REQUIRED] Path to a single reference FASTA file to process. the single FASTA file may contain several sequences.
      -seq_sum , --sequencing_summary 
                            Path to sequencing summary for manifest creation
      -start , --start_time 
                            Start time when no seq summary is provided
      -end , --end_time     End time when no seq summary is provided
      -o , --output         [REQUIRED] Output directory designation
      -op , --output_prefix 
                            Output file prefix designation. default is [sample]
      -seq_type , --sequencing_type 
                            [REQUIRED] A designation of the type of sequencing utilized for the input fastq files. SE = single-end reads and PE = paired-end reads.
      -t , --threads        A designation of the number of threads to use
      -min_len , --minimum_read_length 
                            A designation of the minimum read length. reads shorter than the integer specified required will be discarded, default is 15
      -max_len , --maximum_read_length 
                            A designation of the maximum read length. reads longer than the integer specified required will be discarded, default is 0 meaning no limitation
      -trm_fr , --trim_front_bp 
                            A designation of the how many bases to trim from the front of the sequence, default is 0.
      -trm_tail , --trim_tail_bp 
                            A designation of the how many bases to trim from the tail of the sequence, default is 0
      -q , --quality_threshold 
                            Quality score threshold for filtering reads. Reads with an average quality score below this threshold will be discarded. If not specified, no quality filtering will be performed.
      -min_cov , --minimum_coverage 
                            A designation of the minimum coverage for each taxon. Only bases equal to or higher then the designated value will be considered. default is 1
      --minimap2_kmer       A designation of the kmer size when running minimap2
      --force               Force overwite of existing results directory

If you run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help, you should see the following options and usage guidelines:

    usage: sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <out.fastq> [options]
    For help use: sequenoscope filter_ONT -h or sequenoscope filter_ONT --help
    
    sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
    
    Arguments:
      -h, --help            show this help message and exit
      --input_fastq  [ ...]
                            Path to adaptive sequencing fastq files to process. Not required when using --summarize.
      --input_summary       [REQUIRED] Path to ONT sequencing summary file.
      -o , --output         [REQUIRED] Output directory designation
      -op , --output_prefix 
                            Output file prefix designation. default is [sample]
      -cls , --classification 
                            a designation of the adaptive-sampling sequencing decision classification ['unblocked', 'stop_receiving', or 'no_decision']
      -min_ch , --minimum_channel 
                            a designation of the minimum channel/pore number for filtering reads
      -max_ch , --maximum_channel 
                            a designation of the maximum channel/pore number for filtering reads
      -min_dur , --minimum_duration 
                            a designation of the minimum duration of the sequencing run in SECONDS for filtering reads
      -max_dur , --maximum_duration 
                            a designation of the maximum duration of the sequencing run in SECONDS for filtering reads
      -min_start , --minimum_start_time 
                            a designation of the minimum start time of the sequencing run in SECONDS for filtering reads
      -max_start , --maximum_start_time 
                            a designation of the maximum start time of the sequencing run in SECONDS for filtering reads
      -min_q , --minimum_q_score 
                            a designation of the minimum q score for filtering reads
      -max_q , --maximum_q_score 
                            a designation of the maximum q score for filtering reads
      -min_len , --minimum_length 
                            a designation of the minimum read length for filtering reads
      -max_len , --maximum_length 
                            a designation of the maximum read length for filtering reads
      --force               Force overwite of existing results directory
      --summarize           Generate barcode statistics. Must specify an input summary and output directory
      -v, --version         show program's version number and exit

If you run sequenoscope plot -h or sequenoscope plot --help, you should see the following options and usage guidelines:

    usage: sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
    For help use: sequenoscope plot -h or sequenoscope plot --help
    
    sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
    
    Optional Arguments:
      -h, --help            show this help message and exit
    
    Required Paths:
      Specify the necessary directories for the tool.
    
      -T TEST_DIR, --test_dir TEST_DIR
                            Path to test directory.
                            
      -C CONTROL_DIR, --control_dir CONTROL_DIR
                            Path to control directory.
                            
      -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                            Output directory designation.
                            
      --force               Force overwrite of existing results directory.
                            
    
    Plotting Options:
      Customize the appearance and data for plots.
      
    
      -op OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                            Output prefix added before plot names. Default is 'sample'.
                            
      -AS ADAPTIVE_SAMPLING, --adaptive_sampling ADAPTIVE_SAMPLING
                            Generate decision bar charts for adaptive sampling if utilized during sequencing. Specify as True or False.
                            
      -single SINGLE_CHARTS, --single_charts SINGLE_CHARTS
                            Generate charts for data based on selected comparison metric.
                            
      --comparison_metric {est_genome_size,est_kmer_coverage_depth,total_bases,total_fastp_bases,mean_read_length,taxon_length,taxon_%_covered_bases,taxon_mean_read_length}
                            Type of parameter for the box plot and single ratio bar chart. Default parameter is taxon_%_covered_bases.
                            
      -VP VIOLIN_DATA_PERCENT, --violin_data_percent VIOLIN_DATA_PERCENT
                            Fraction of the data to use for the violin plot.
                            
      -bin {seconds,minutes,5m,15m,hours}, --time_bin_unit {seconds,minutes,5m,15m,hours}
                            Time bin used for decision bar charts.
                            
      -legend TAXON_CHART_LEGEND, --taxon_chart_legend TAXON_CHART_LEGEND
                            Generate a legend for the source file taxon covered bar chart.

Handling Multiple FASTQ or FASTQ GZ Files (Single End Read Sets)

Typically, ONT sequencing runs produce multiple FASTQ files for each barcode after base calling. Use the following steps to concatenate those files:

Concatenating FASTQ Files

To concatenate multiple FASTQ files into a single FASTQ file, you can use the following command:

cat file1.fastq file2.fastq > combined.fastq

Concatenating FASTQ GZ Files and Uncompressing

To concatenate multiple FASTQ GZ files and uncompress them into a single FASTQ file, you can use the following commands:

concatenate:

cat file1.fastq.gz file2.fastq.gz > combined.fastq.gz

uncompress:

gzip -d combined.fastq.gz
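
Gzip files can be joined byte for byte and still decompress as one stream, so it can be worth confirming that no reads were lost along the way. The sketch below builds two tiny made-up gzipped FASTQ files, concatenates them, decompresses the result, and counts the reads. It assumes unwrapped FASTQ records of exactly four lines each; the file names and read contents are hypothetical.

```shell
#!/bin/sh
# Sketch: concatenate gzipped FASTQ files and check the resulting read count.
# Assumption: each FASTQ record occupies exactly 4 lines (no wrapped sequences).
workdir=$(mktemp -d)

# Two made-up single-read FASTQ files, gzipped.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/file1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/file2.fastq.gz"

# Join the compressed files, then decompress the combined file in place.
cat "$workdir/file1.fastq.gz" "$workdir/file2.fastq.gz" > "$workdir/combined.fastq.gz"
gzip -d "$workdir/combined.fastq.gz"

# Reads = total lines / 4 under the 4-lines-per-record assumption.
n_reads=$(( $(wc -l < "$workdir/combined.fastq") / 4 ))
echo "reads in combined.fastq: $n_reads"
```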

Paired End Read Sets

Typically, paired-end read sets consist of a forward (R1) and a reverse (R2) FASTQ file. If the files are compressed, you can uncompress them as follows:

gzip -d Illumina_file_R1.fastq.gz

and

gzip -d Illumina_file_R2.fastq.gz

You should end up with two FASTQ files, such as Illumina_file_R1.fastq and Illumina_file_R2.fastq, which can then be run through the sequenoscope analyze module like this:

sequenoscope analyze --input_fastq Illumina_file_R1.fastq Illumina_file_R2.fastq --input_reference ref.fasta -o output -seq_type PE

Quick start

analyze module

The analyze module provides specific sequencing statistics based on the reference FASTA file provided. Refer to the outputs section below for more details.

To quickly get started with the analyze module:

  1. Ensure that you have the necessary input files and reference database prepared:

    • Input FASTQ files: Provide the path to the FASTQ files you want to process using the --input_fastq option.
    • Reference database: Specify the path to the reference database in FASTA format using the --input_reference option.
  2. Choose an output directory for the results:

    • Specify the output directory path using the --output option.
  3. Specify the sequencing type:

    • Set -seq_type to either PE (paired-end) or SE (single-end).
  4. Run the module with the minimally required options:

     sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.FASTA> -o <output_directory> -seq_type <sr>
    

This command will initiate the analysis module using the default settings. The input FASTQ file(s) will be processed, and the results will be saved in the specified output directory.

Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For additional customization options and more detailed information on available options please run sequenoscope analyze -h or sequenoscope analyze --help.

Note: remember to replace <file.fq> with the actual path to your FASTQ file, <ref.FASTA> with the path to your reference database, <output_directory> with the desired location for the output files and <sr> with your sequencing type (SE for single-end and PE for paired-end).

Note: Taxon IDs are used as a naming convention, reflecting the sequence name in the FASTA file. The pipeline can process genes, subspecies, and other identifiers; it doesn't have to be a taxon.

filter_ONT module

To quickly get started with the filter_ONT module:

  1. Ensure that you have the necessary input files prepared:

    • Input FASTQ files: Provide the path to the adaptive sampling FASTQ files from the ONT sequencer that you want to process using the --input_fastq option.
    • ONT sequencing summary file: Specify the path to the ONT sequencing summary file, generated either by MinKNOW or by a base calling tool such as Guppy or Dorado, using the --input_summary option.
  2. Choose an output directory for the filtered reads:

    • Specify the output directory using the --output option.
  3. Set the desired filtering criteria:

    • You can optionally apply various filters to the reads based on the following criteria:
      • Read classification status*: Use the -cls or --classification option to designate the adaptive-sampling sequencing decision classification. Valid options are 'unblocked', 'stop_receiving', or 'no_decision'.
      • Channel range/Pore number: Set the minimum and maximum channel/pore number range for filtering using the -min_ch and -max_ch options.
      • Duration: Define the minimum and maximum duration of the read sequencing time in seconds using the -min_dur and -max_dur options.
      • Run time range: Specify the minimum and maximum start time of the sequencing run in seconds using the -min_start and -max_start options.
      • Q score: Determine the minimum and maximum q score for filtering using the -min_q and -max_q options.
      • Read length range: Set the minimum and maximum read length for filtering using the -min_len and -max_len options.

Note: Some sequence summary files lack the field specifying read classification status. A warning will be raised if this is the case.

  4. Run the command with the basic required options:

     sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <output_directory>
    

This command will filter the reads according to the specified criteria and save the filtered FASTQ subset in the output directory.

Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For more detailed information on available options, you can run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help.

Note: Remember to replace <file.fq> with the actual path to your ONT sequencing FASTQ file, <seq_summary.txt> with the path to your ONT sequencing summary file, and <output_directory> with the desired directory for the filtered reads.

plot module

This module is designed for comparative analysis where two testing conditions are present and can be compared.

Visualize the test and control outputs of the analyze module using interactive graphs. To quickly get started with the plot module:

  1. Required Paths: The plot module is comparative and requires two sets of data outputs from the analyze module to produce meaningful results. Ensure you have provided the necessary directories:

    • Test Directory: Provide the path to the test directory that contains the sequence manifest txt files from the analyze module. -T or --test_dir <test_dir_path>
    • Control Directory: Specify the path to the control directory that contains the sequence manifest txt files from the analyze module. -C or --control_dir <control_dir_path>
    • Output Directory: Choose an output directory for the plots. -o or --output_dir <out_path>
  2. Plotting Options: Customize your plots with various options:

    • Output Prefix: Add a prefix before plot names. -op or --output_prefix <OUTPUT_PREFIX>. Default is 'sample'.
    • Comparison Metric: Select a parameter for the box plot and single ratio bar chart using the --comparison_metric option. Default parameter is taxon_%_covered_bases.
    • Single Charts: Generate an additional box plot and single ratio bar chart based on the selected comparison metric using the --single_charts option. {TRUE, FALSE}. Default value is False.
    • Adaptive Sampling: Generate read classification decision bar charts for adaptive sampling runs, if adaptive sampling was used during sequencing, by specifying the -AS option. Default value is False.
    • Violin Data Fraction: Set the fraction of the sequencing data (total number of reads) to use for the violin plot. -VP or --violin_data_percent <0.1 - 1>. Default fraction is 0.1.
    • Time Bin Unit: Designate the time bin used for read classification decision bar charts. -bin or --time_bin_unit {seconds,minutes,5m,15m,hours}. Default bin is minutes.
    • Taxon Legend: Generate a legend for the source file taxon covered bar chart. -legend or --taxon_chart_legend {TRUE, FALSE}. Default designation is False.
  3. Run the Command: With the basic required options:

     sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
    

Use the --force flag if you wish to force an overwrite of an existing results directory.

Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For more detailed information on plot parameters, please consult the usage section or run sequenoscope plot -h or sequenoscope plot --help.

Remember to replace <test_dir_path>, <control_dir_path>, and <out_path> with the actual paths for your directories.

Note: Taxon is a general term that refers to the reference sequences in the user-provided FASTA file.

Outputs

analyze module outputs

| File | Description |
| --- | --- |
| <prefix>_fastp_output.fastq | The output FASTQ file after processing with fastp. It includes filtered and trimmed sequencing reads. |
| <prefix>_fastp_output.html | An HTML report generated by fastp summarizing the filtering and quality control results. |
| <prefix>_fastp_output.json | A JSON formatted report with detailed fastp quality control statistics. |
| <prefix>_manifest.txt | A sequence manifest file containing various sequencing statistics post-analysis. |
| <prefix>_manifest_summary.txt | A summary of the sequence manifest with key statistics for a quick overview. |
| <prefix>_mapped.bam | The BAM file output from minimap2, containing aligned sequences to the reference FASTA. |
| <prefix>_mapped.bam.bai | An index file for the BAM file to enable quick read access. |
| <prefix>_mapped_fastq.fastq | The FASTQ file containing reads that have been mapped to the reference. |
| <prefix>_mapped.sam | The SAM file equivalent of the BAM file, containing human-readable alignment data. |
| <prefix>_mash.hash.msh | A MASH sketch file used for rapid genome distance estimation. |
| <prefix>_read_list.txt | A text file list of reads, potentially used for further downstream analysis. |

Note: Replace <prefix> with the user-specified prefix that precedes all output filenames.

sample manifest report format

| Column ID | Description |
| --- | --- |
| sample_id | Identifier for the sample to which the read belongs. |
| read_id | Unique identifier for the sequencing read. |
| read_len | Length of the sequencing read in base pairs. |
| read_qscore | Quality score of the sequencing read. |
| channel | The channel on the sequencing device from which the read was recorded. |
| start_time | Time when the sequencing of the read started. |
| end_time | Time when the sequencing of the read ended. |
| decision | Indicates the final decision on the sequencing read. Decisions are categorized into three main types: stop_receiving (the sequencing is allowed to continue, represented by signal_positive), unblocked (the read is ejected from sequencing, indicated by data_service_unblock_mux_change), and no_decision (no definitive action was taken, denoted by either signal_negative or unblock_mux_change). |
| fastp_status | Indicates whether the read passed the filtering and trimming process by fastp. |
| is_mapped | Indicates whether the read is mapped to any sequence in the provided multi-sequence FASTA reference file (TRUE if mapped, also see note 1 below). |
| is_uniq | Indicates whether the read is unique within the sample manifest file (TRUE if unique, also see note 2 below). |
| contig_id | Identifier for the contig to which the read is mapped, if applicable. |

Notes:

  1. is_mapped refers to whether or not a read is mapped to any sequence in the multi-sequence FASTA reference file provided by the user. If true, the contig_id is provided.
  2. is_uniq refers to whether or not a read is unique throughout the sample manifest file. In ONT sequencing, a read may be processed multiple times if its decision is labelled signal_negative or no_decision before a final decision is made on whether to let the read continue sequencing.
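
As a rough illustration of how the manifest columns above can be used downstream, the sketch below counts reads per decision and computes the fraction of mapped reads. It assumes the manifest is tab-delimited with a header row matching the column IDs; the delimiter, the in-memory sample rows, and the helper name summarize_manifest are assumptions made for this example, not part of the tool.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row manifest. The tab delimiter and all values are
# assumptions for illustration; check them against a real manifest file.
SAMPLE_MANIFEST = (
    "sample_id\tread_id\tread_len\tread_qscore\tchannel\tstart_time\tend_time\t"
    "decision\tfastp_status\tis_mapped\tis_uniq\tcontig_id\n"
    "s1\tr1\t1200\t12.5\t7\t10.0\t14.0\tstop_receiving\tTRUE\tTRUE\tTRUE\tcontigA\n"
    "s1\tr2\t450\t9.8\t7\t15.0\t16.0\tunblocked\tTRUE\tFALSE\tTRUE\t\n"
    "s1\tr3\t800\t11.1\t9\t20.0\t23.0\tno_decision\tFALSE\tTRUE\tFALSE\tcontigB\n"
)


def summarize_manifest(handle):
    """Count reads per decision and return the fraction with is_mapped == TRUE."""
    decisions = Counter()
    mapped = 0
    total = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        total += 1
        decisions[row["decision"]] += 1
        if row["is_mapped"].strip().upper() == "TRUE":
            mapped += 1
    mapped_fraction = mapped / total if total else 0.0
    return decisions, mapped_fraction


decisions, mapped_fraction = summarize_manifest(io.StringIO(SAMPLE_MANIFEST))
print(dict(decisions), round(mapped_fraction, 3))
```

To run this against a real file, replace the io.StringIO object with an opened manifest file handle.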

sample manifest summary report format

Column ID Description
sample_id Identifier for the sample.
est_genome_size Estimated size of the genome.
est_coverage Estimated coverage of the genome.
total_bases Total number of bases in the sample.
total_fastp_bases Total number of bases after processing with fastp.
mean_read_length Mean read length of the sequencing reads.
taxon_id Identifier for the taxon. Obtained from the user-provided FASTA file.
taxon_length Length of the taxon's genome.
taxon_mean_coverage Mean coverage across the taxon's genome.
taxon_covered_bases_<prefix>X Number of bases in the taxon's genome covered at user-specified coverage threshold.
taxon_%_covered_bases Percentage of the taxon's genome that is covered by reads at the user-specified coverage threshold.
total_taxon_mapped_bases Total number of bases mapped to the taxon.
taxon_mean_read_length Mean read length of the reads mapped to the taxon.

Note: Replace <prefix> with the user-specified threshold coverage.
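
To make the relationship between taxon_length, taxon_mean_coverage, taxon_covered_bases_<prefix>X, and taxon_%_covered_bases concrete, here is a minimal sketch computed from a per-base depth list. It assumes a base counts as covered when its depth is at least the threshold; that interpretation, the function name covered_stats, and the toy depth values are assumptions, and the tool's own calculation may differ.

```python
def covered_stats(depths, threshold):
    """Return (mean coverage, bases with depth >= threshold, percent covered).

    `depths` holds one depth value per reference position, so its length
    plays the role of taxon_length in this sketch.
    """
    length = len(depths)
    if length == 0:
        return 0.0, 0, 0.0
    covered = sum(1 for d in depths if d >= threshold)
    return sum(depths) / length, covered, 100.0 * covered / length


# Toy per-base depths for an 8-base "genome" (illustrative values only).
depths = [0, 2, 5, 1, 3, 0, 4, 1]
mean_cov, covered, pct = covered_stats(depths, threshold=2)
print(mean_cov, covered, pct)
```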

filter_ONT module outputs

File Description
<user_prefix>_filtered_fastq_subset.fastq The subset of FASTQ reads that have been filtered based on the user-defined criteria within the filter_ONT module.
<user_prefix>_read_id_list.csv A CSV file containing the list of read identifiers that correspond to the filtered subset. This may be used for further reference or analysis.

Note: Replace <user_prefix> with the user-specified prefix that precedes all output filenames.
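
One possible sanity check on these two outputs is to confirm that the read IDs in the filtered FASTQ match the read ID list. The sketch below assumes four-line FASTQ records and a CSV with one read ID in the first column and no header row; both are assumptions about the file layout, and the helper names are hypothetical.

```python
import io


def fastq_read_ids(handle):
    """Yield read IDs from a FASTQ stream with four lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # Header looks like "@read_id optional description".
            yield line[1:].split()[0]


def ids_match(fastq_handle, id_handle):
    """True if the FASTQ read IDs equal the IDs in the first CSV column."""
    # Assumption: no header row in the CSV; skip blank lines.
    listed = {line.strip().split(",")[0] for line in id_handle if line.strip()}
    return set(fastq_read_ids(fastq_handle)) == listed


# In-memory stand-ins for the two output files (illustrative data only).
fastq = io.StringIO("@r1 ch=7\nACGT\n+\nIIII\n@r2 ch=9\nGGCC\n+\nIIII\n")
id_list = io.StringIO("r1\nr2\n")
result = ids_match(fastq, id_list)
print(result)
```

If the real CSV carries a header row or extra columns, adjust the parsing before relying on the comparison.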

plot module outputs

File Description Triggered by Command
<prefix>_ratio_bar_chart.html An HTML file containing a bar chart that displays the ratio statistics of the manifest summary file. Default behavior
<prefix>_source_file_taxon_covered_bar_chart.html An HTML file containing a bar chart displaying the coverage of taxa in the source files. Default behavior; --taxon_chart_legend controls whether a legend is included
<prefix>_stat_results.csv A CSV file with statistical results of the analysis, such as taxa coverage percentages. Default behavior
<prefix>_cumulative_decision_bar_chart.html An HTML file containing a bar chart with cumulative decision metrics over time for either test or control datasets. Adaptive sampling enabled (-AS) and time bin specified (--time_bin_unit)
<prefix>_independent_decision_bar_chart.html An HTML file containing a bar chart with independent decision metrics over time for either test or control datasets. Adaptive sampling enabled (-AS) and time bin specified (--time_bin_unit)
read_len_<prefix>_violin_comparison_plot.html An HTML file containing a violin plot comparing log-transformed read lengths between the test and control datasets. Default behavior; --violin_data_percent sets the fraction of data to plot
read_qscore_<prefix>_violin_comparison_plot.html An HTML file containing a violin plot comparing Q-score distributions between the test and control datasets. Default behavior; --violin_data_percent sets the fraction of data to plot
<prefix>_box_plot.html An HTML file containing a box plot comparing a specific parameter between the test and control files. --comparison_metric specified with --single_charts enabled
<prefix>_single_ratio_bar_chart.html An HTML file containing a single bar chart comparing a specific parameter between the test and control files. --comparison_metric specified with --single_charts enabled

Note: Replace <prefix> with the user-specified prefix that precedes all output filenames from the plot module. This prefix is set with the --output_prefix option when running the command.

Note: When the adaptive sampling plots are enabled with -AS, two files (test and control) are produced for each type of bar chart (independent and cumulative).
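
The difference between the independent and cumulative decision charts can be illustrated with a small sketch: independent counts are per time bin, while cumulative counts are the running total across bins. The code assumes start_time values in seconds and a fixed bin width; the function name bin_decisions and the sample reads are hypothetical, and the plot module's actual binning may differ.

```python
from collections import Counter
from itertools import accumulate


def bin_decisions(reads, bin_width, decision):
    """Return (independent, cumulative) counts of `decision` per time bin.

    `reads` is an iterable of (start_time, decision) pairs; `bin_width`
    uses the same time unit as start_time (assumed seconds here).
    """
    counts = Counter(int(t // bin_width) for t, d in reads if d == decision)
    n_bins = max(counts) + 1 if counts else 0
    independent = [counts.get(i, 0) for i in range(n_bins)]
    cumulative = list(accumulate(independent))
    return independent, cumulative


# Illustrative reads: (start_time in seconds, decision).
reads = [
    (5, "unblocked"),
    (50, "unblocked"),
    (70, "stop_receiving"),
    (130, "unblocked"),
]
independent, cumulative = bin_decisions(reads, bin_width=60, decision="unblocked")
print(independent, cumulative)
```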

Citation

A manuscript is currently in preparation. This section will be updated with the publication reference once it is available.

Legal

Copyright Government of Canada 2023

Written by: National Microbiology Laboratory, Public Health Agency of Canada

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contact

Abdallah Meknas: [email protected]
