A tool for analyzing sequencing run outputs, primarily from adaptive sampling experiments and Oxford Nanopore Technologies (ONT) sequencers.
- Introduction
- Dependencies
- Installation
- Workflow Example
- Use-case Example
- Usage
- Quick Start
- Outputs
- Citation
- Legal
- Contact
Analyzing and interpreting sequencing data is a fundamental task in bioinformatics. With the advent of ONT adaptive-sampling sequencing, specialized tools are needed to visualize and assess how effective enrichment or depletion was in a given run. Adaptive sampling data are difficult to visualize and assess in terms of key run parameters, which calls for tailored analytical approaches and visual analytics. To address these challenges, we have developed a bioinformatics pipeline consisting of three modules: analyze, plot, and filter_ONT. The pipeline aims to give researchers a fast and intuitive workflow for processing and analyzing sequencing data, especially from ONT adaptive sampling runs, so that they can gain interpretable insights into their datasets with minimal upfront effort.
The analyze module serves as the core component of our pipeline. It takes an input FASTQ file, a reference FASTA file, and an optional sequencing summary file from ONT sequencers or base callers. Leveraging tools such as fastp, minimap2, pysam, and mash, it filters the input FASTQ file, maps it to the reference FASTA file, and generates a sequence manifest txt file and a summary sequence manifest txt file. These files include key sequencing statistics such as read length, read quality (Q score), mapping efficiency, and coverage depth. For an in-depth explanation of all statistics provided, please refer to the Outputs section below.
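To make the per-read statistics concrete, here is a minimal Python sketch that computes read length and a mean Q score from four-line FASTQ records. It is not taken from sequenoscope: the function name `read_stats` is hypothetical, the quality encoding is assumed to be Phred+33, and averaging error probabilities before converting back to Q is only one common convention, which may differ from what the pipeline reports.

```python
import io
import math


def read_stats(fastq_handle):
    """Yield (read_id, length, mean_q) for each four-line FASTQ record."""
    while True:
        header = fastq_handle.readline().strip()
        if not header:
            break  # end of input (a blank line also stops this simple parser)
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # skip the '+' separator line
        quality = fastq_handle.readline().strip()
        if quality:
            # Assumed Phred+33: average error probabilities, then convert to Q.
            error_probs = [10 ** (-(ord(char) - 33) / 10) for char in quality]
            mean_q = -10 * math.log10(sum(error_probs) / len(error_probs))
        else:
            mean_q = 0.0
        yield header[1:].split()[0], len(sequence), mean_q


# Two hypothetical reads held in memory instead of a file on disk.
example_fastq = io.StringIO(
    "@read_1 ch=5\nACGTAC\n+\nIIIIII\n"
    "@read_2\nACGT\n+\nI5I5\n"
)
stats = list(read_stats(example_fastq))
for read_id, length, mean_q in stats:
    print(read_id, length, round(mean_q, 2))
```

Averaging probabilities rather than raw Q values keeps a few low-quality bases from being hidden by many high-quality ones, which is why the second read scores well below the arithmetic mean of its Q values.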
The plot module complements the analyze module by rendering its output as interactive plots. It takes as input a "test" and a "control" directory, which represent different testing conditions and contain the manifest and manifest summary txt files generated by the analyze module. With these files, the plot module generates visualizations that aid in interpreting the sequencing data. Please note: this module is designed for comparative analysis where two testing conditions are present and can be compared.
The filter_ONT module is designed for filtering and subsetting ONT raw reads. It uses a sequencing summary file to let researchers filter reads precisely on customized criteria, including channel, sequencing decision, and other parameters.
Our bioinformatics pipeline is aimed at researchers working with ONT sequencing data. Whether you are exploring metagenomic sample composition, investigating adaptive sampling for your project, or conducting a comparative analysis of different methods in your lab, the pipeline can streamline your analyses and help you interpret your genomic datasets through visual aids and easy-to-understand outputs.
- Python: >=3.7.12, <4
- fastp: >=0.22.0
- mash: >=2.3
- minimap2: >=2.26
- seqtk: >=1.4
- samtools: >=1.6
- pysam: >=0.16.0
- plotly: >=5.16.1
Install the latest released version from conda:
conda create -c bioconda -c conda-forge -n sequenoscope
Coming soon
Install using pip:
pip install sequenoscope
Coming soon
If you wish to install sequenoscope from source, please first ensure these dependencies are installed and configured on your system:
python>=3.7.12,<4
fastp >=0.22.0
mash >=2.3
minimap2 >=2.26
seqtk >=1.4
samtools >=1.6
pysam >=0.16.0
plotly >=5.16.1
Install the latest commit from the master branch directly from GitHub:
pip install git+https://github.com/phac-nml/sequenoscope.git
In this section, we will walk through a simple workflow using mock data to demonstrate how to use each module of sequenoscope. The mock data directory contains the following files:
mock_data/
├── mock_adaptive_sampling.fastq
├── mock_control.fastq
├── mock_sequencing_summary.txt
├── mock.fastq
└── mock_reference.fasta
Our goal is to:
- Use the filter_ONT module to subset raw FASTQ reads into two sets representing different channel ranges.
- Run the analyze module on both sets (treated as control and adaptive sampling datasets).
- Use the plot module to visualize and compare the results.
This workflow is meant to provide a hands-on example that you can easily follow with your own data.
First, we will create a dataset that simulates an adaptive sampling scenario by filtering reads by channel. Let’s start by extracting reads from channels 1 to 256 from our mock.fastq dataset using the filter_ONT module. This will give us a subset similar to mock_adaptive_sampling.fastq.
Command:
sequenoscope filter_ONT --input_fastq mock.fastq \
--input_summary mock_sequencing_summary.txt \
-o mock_filter_ONT \
-min_ch 1 \
-max_ch 256
What this does:
- Takes reads from mock.fastq that come from channels 1 to 256.
- Outputs a filtered subset in mock_filter_ONT/sample_filtered_fastq_subset.fastq, which should be identical to mock_adaptive_sampling.fastq.
If desired, you could similarly generate the control dataset by adjusting the channel range (e.g., -min_ch 257 -max_ch 512) to create a mock_control.fastq. However, since we already have mock_control.fastq available, we’ll skip that step for now to keep things simple.
Output Directory Structure:
mock_filter_ONT/
├── filter.log
├── sample_filtered_fastq_subset.fastq
└── sample_read_id_list.csv
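The channel-based selection can be pictured with a short Python sketch. This is an illustration of the idea only, not sequenoscope's code: it assumes a tab-separated sequencing summary with `read_id` and `channel` columns, treats both channel bounds as inclusive, and uses a hypothetical helper name `read_ids_in_channel_range`.

```python
import csv
import io


def read_ids_in_channel_range(summary_handle, min_ch, max_ch):
    """Return read IDs whose channel lies in [min_ch, max_ch] (inclusive)."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return [
        row["read_id"]
        for row in reader
        if min_ch <= int(row["channel"]) <= max_ch
    ]


# Tiny in-memory stand-in for a sequencing summary file (hypothetical values).
example_summary = io.StringIO(
    "read_id\tchannel\n"
    "read_a\t12\n"
    "read_b\t300\n"
    "read_c\t256\n"
)
selected = read_ids_in_channel_range(example_summary, 1, 256)
print(selected)
```

A list of read IDs like this one is the kind of information a read ID list file could hold, and it is enough to pull the matching records out of a FASTQ file in a second pass.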
Next, we run the analyze module on both the control and adaptive sampling datasets. This step will generate various output files including manifest files, BAM alignments, and summary statistics.
Command for Control Dataset:
sequenoscope analyze --input_fastq mock_control.fastq \
--input_reference mock_reference.fasta \
-seq_sum mock_sequencing_summary.txt \
-o mock_control_results \
-seq_type SE \
-op control
Explanation:
- --input_fastq mock_control.fastq: The control dataset FASTQ file.
- --input_reference mock_reference.fasta: Reference genome or sequence.
- -seq_sum mock_sequencing_summary.txt: The sequencing summary file from ONT.
- -o mock_control_results: Output directory.
- -seq_type SE: Single-end sequencing.
- -op control: A prefix for output files.
Control Output Directory Structure:
mock_control_results/
├── analyze.log
├── control_fastp_output.fastp.fastq
├── control_fastp_output.html
├── control_fastp_output.json
├── control_manifest_summary.txt
├── control_manifest.txt
├── control_mapped_bam.bam
├── control_mapped_bam.bam.bai
├── control_mapped_fastq.fastq
├── control_mapped_sam.sam
├── control_mash_hash.msh
└── control_read_list.txt
Command for Adaptive Sampling Dataset:
sequenoscope analyze --input_fastq mock_adaptive_sampling.fastq \
--input_reference mock_reference.fasta \
-seq_sum mock_sequencing_summary.txt \
-o mock_adaptive_sampling_results \
-seq_type SE \
-op adaptive_sampling
Explanation:
- mock_adaptive_sampling.fastq represents the dataset filtered by filter_ONT (or provided).
- The rest of the parameters are analogous to the control dataset.
- -op adaptive_sampling tags output files with "adaptive_sampling" for clarity.
Adaptive Sampling Output Directory Structure:
mock_adaptive_sampling_results/
├── adaptive_sampling_fastp_output.fastp.fastq
├── adaptive_sampling_fastp_output.html
├── adaptive_sampling_fastp_output.json
├── adaptive_sampling_manifest_summary.txt
├── adaptive_sampling_manifest.txt
├── adaptive_sampling_mapped_bam.bam
├── adaptive_sampling_mapped_bam.bam.bai
├── adaptive_sampling_mapped_fastq.fastq
├── adaptive_sampling_mapped_sam.sam
├── adaptive_sampling_mash_hash.msh
├── adaptive_sampling_read_list.txt
└── analyze.log
Finally, we use the plot module to compare the control and adaptive sampling datasets. For this example, we will use hours as the time bin due to truncated data in the mock dataset.
Command:
sequenoscope plot -T mock_adaptive_sampling_results/ \
-C mock_control_results/ \
-o mock_comparison_plots \
-op mock \
-AS \
-bin hours
Explanation:
- -T mock_adaptive_sampling_results/: Test (adaptive sampling) directory.
- -C mock_control_results/: Control directory.
- -o mock_comparison_plots: Output directory for plots.
- -op mock: Prefix for output files.
- -AS: Enable adaptive sampling decision charts.
- -bin hours: Use hourly bins for time-based decision charts.
Plot Output Directory Structure:
mock_comparison_plots/
├── mock_control_cumulative_decision_bar_chart.html
├── mock_control_independent_decision_bar_chart.html
├── mock_ratio_bar_chart.html
├── mock_read_len_comparison_plot.html
├── mock_read_qscore_comparison_plot.html
├── mock_source_file_taxon_covered_bar_chart.html
├── mock_stat_results.csv
├── mock_test_cumulative_decision_bar_chart.html
├── mock_test_independent_decision_bar_chart.html
└── plot.log
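To illustrate what independent and cumulative decision charts binned by hour represent, the following Python sketch groups reads by start time and decision. It is a conceptual example under stated assumptions, not the plot module's implementation: start times are assumed to be seconds since the start of the run, and `bin_decisions` is a hypothetical name.

```python
from collections import Counter


def bin_decisions(reads, bin_seconds=3600):
    """Count reads per (time bin, decision) and build running totals."""
    independent = Counter()
    for start_time, decision in reads:
        independent[(int(start_time // bin_seconds), decision)] += 1

    # Walk the bins in order and accumulate a running total per decision.
    cumulative = {}
    running = Counter()
    for bin_index, decision in sorted(independent):
        running[decision] += independent[(bin_index, decision)]
        cumulative[(bin_index, decision)] = running[decision]
    return dict(independent), cumulative


# Hypothetical (start_time_in_seconds, decision) pairs.
example_reads = [
    (120.0, "unblocked"),
    (1800.5, "stop_receiving"),
    (3700.0, "unblocked"),
    (7300.0, "unblocked"),
]
independent, cumulative = bin_decisions(example_reads)
print(independent)
print(cumulative)
```

With a coarse bin such as hours, a short or truncated run still yields a handful of populated bins, whereas second-level bins would mostly hold a single read each.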
In this workflow example, we:
- Used filter_ONT to subset reads from a mock dataset by channel number.
- Applied analyze to both the control and adaptive sampling datasets, generating manifest files and alignment statistics.
- Visualized and compared the results using plot, focusing on adaptive sampling decisions and coverage metrics.
By following these steps, you can quickly get started with sequenoscope and adapt the workflow to suit your own data and research needs.
To demonstrate the practical application of our pipeline, consider a scenario where a researcher conducts adaptive sampling using an ONT sequencer. In this example, the researcher divides the sequencer channels into two sets: one half for adaptive sampling enrichment and the other half for regular sequencing as a control.
- Using our filter_ONT module, the researcher can create two distinct FASTQ files, one containing reads from channels 1-256 and one containing reads from channels 257-512, by setting the minimum and maximum channel for each subset.
- These files are then processed separately through our analyze module, generating two datasets: one for the test (adaptive sampling) and one for the control (regular sequencing).
- Finally, by employing the plot module, the researcher can visually assess the effectiveness of the adaptive sampling in their experiment.

This example shows how Sequenoscope facilitates data processing and analysis, enhancing the researcher's ability to draw meaningful conclusions from their ONT sequencing data.
If you run sequenoscope, you should see the following usage statement:
Usage: sequenoscope <command> <required arguments>
To get full help for a command use one of:
sequenoscope <command> -h
sequenoscope <command> --help
Available commands:
analyze map reads to a target and produce a report with sequencing statistics
plot generate plots based on directories with seq manifest files
filter_ONT filter reads from a FASTQ file based on a sequencing summary file
If you run sequenoscope analyze -h or sequenoscope analyze --help, you should see the following options and usage guidelines:
usage: sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.fasta> -o <out> -seq_type <sr>[options]
For help use: sequenoscope analyze -h or sequenoscope analyze --help
sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Arguments:
-h, --help show this help message and exit
--input_fastq [ ...]
[REQUIRED] Path to ***EITHER 1 or 2*** fastq files to process.
--input_reference [REQUIRED] Path to a single reference FASTA file to process. the single FASTA file may contain several sequences.
-seq_sum , --sequencing_summary
Path to sequencing summary for manifest creation
-start , --start_time
Start time when no seq summary is provided
-end , --end_time End time when no seq summary is provided
-o , --output [REQUIRED] Output directory designation
-op , --output_prefix
Output file prefix designation. default is [sample]
-seq_type , --sequencing_type
[REQUIRED] A designation of the type of sequencing utilized for the input fastq files. SE = single-end reads and PE = paired-end reads.
-t , --threads A designation of the number of threads to use
-min_len , --minimum_read_length
A designation of the minimum read length. reads shorter than the integer specified required will be discarded, default is 15
-max_len , --maximum_read_length
A designation of the maximum read length. reads longer than the integer specified required will be discarded, default is 0 meaning no limitation
-trm_fr , --trim_front_bp
A designation of the how many bases to trim from the front of the sequence, default is 0.
-trm_tail , --trim_tail_bp
A designation of the how many bases to trim from the tail of the sequence, default is 0
-q , --quality_threshold
Quality score threshold for filtering reads. Reads with an average quality score below this threshold will be discarded. If not specified, no quality filtering will be performed.
-min_cov , --minimum_coverage
A designation of the minimum coverage for each taxon. Only bases equal to or higher then the designated value will be considered. default is 1
--minimap2_kmer A designation of the kmer size when running minimap2
--force Force overwite of existing results directory
If you run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help, you should see the following options and usage guidelines:
usage: sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <out.fastq> [options]
For help use: sequenoscope filter_ONT -h or sequenoscope filter_ONT --help
sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Arguments:
-h, --help show this help message and exit
--input_fastq [ ...]
Path to adaptive sequencing fastq files to process. Not required when using --summarize.
--input_summary [REQUIRED] Path to ONT sequencing summary file.
-o , --output [REQUIRED] Output directory designation
-op , --output_prefix
Output file prefix designation. default is [sample]
-cls , --classification
a designation of the adaptive-sampling sequencing decision classification ['unblocked', 'stop_receiving', or 'no_decision']
-min_ch , --minimum_channel
a designation of the minimum channel/pore number for filtering reads
-max_ch , --maximum_channel
a designation of the maximum channel/pore number for filtering reads
-min_dur , --minimum_duration
a designation of the minimum duration of the sequencing run in SECONDS for filtering reads
-max_dur , --maximum_duration
a designation of the maximum duration of the sequencing run in SECONDS for filtering reads
-min_start , --minimum_start_time
a designation of the minimum start time of the sequencing run in SECONDS for filtering reads
-max_start , --maximum_start_time
a designation of the maximum start time of the sequencing run in SECONDS for filtering reads
-min_q , --minimum_q_score
a designation of the minimum q score for filtering reads
-max_q , --maximum_q_score
a designation of the maximum q score for filtering reads
-min_len , --minimum_length
a designation of the minimum read length for filtering reads
-max_len , --maximum_length
a designation of the maximum read length for filtering reads
--force Force overwite of existing results directory
--summarize Generate barcode statistics. Must specify an input summary and output directory
-v, --version show program's version number and exit
If you run sequenoscope plot -h or sequenoscope plot --help, you should see the following options and usage guidelines:
usage: sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
For help use: sequenoscope plot -h or sequenoscope plot --help
sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Optional Arguments:
-h, --help show this help message and exit
Required Paths:
Specify the necessary directories for the tool.
-T TEST_DIR, --test_dir TEST_DIR
Path to test directory.
-C CONTROL_DIR, --control_dir CONTROL_DIR
Path to control directory.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory designation.
--force Force overwrite of existing results directory.
Plotting Options:
Customize the appearance and data for plots.
-op OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Output prefix added before plot names. Default is 'sample'.
-AS ADAPTIVE_SAMPLING, --adaptive_sampling ADAPTIVE_SAMPLING
Generate decision bar charts for adaptive sampling if utilized during sequencing. Specify as True or False.
-single SINGLE_CHARTS, --single_charts SINGLE_CHARTS
Generate charts for data based on selected comparison metric.
--comparison_metric {est_genome_size,est_kmer_coverage_depth,total_bases,total_fastp_bases,mean_read_length,taxon_length,taxon_%_covered_bases,taxon_mean_read_length}
Type of parameter for the box plot and single ratio bar chart. Default parameter is taxon_%_covered_bases.
-VP VIOLIN_DATA_PERCENT, --violin_data_percent VIOLIN_DATA_PERCENT
Fraction of the data to use for the violin plot.
-bin {seconds,minutes,5m,15m,hours}, --time_bin_unit {seconds,minutes,5m,15m,hours}
Time bin used for decision bar charts.
-legend TAXON_CHART_LEGEND, --taxon_chart_legend TAXON_CHART_LEGEND
Generate a legend for the source file taxon covered bar chart.
Typically, ONT sequencing runs produce multiple FASTQ files for each barcode after base calling. Use the following steps to concatenate those files:
To concatenate multiple FASTQ files into a single FASTQ file, you can use the following command:
cat file1.fastq file2.fastq > combined.fastq
To concatenate multiple gzip-compressed FASTQ files and uncompress them into a single FASTQ file, you can use the following commands:
concatenate:
cat file1.fastq.gz file2.fastq.gz > combined.fastq.gz
uncompress:
gzip -d combined.fastq.gz
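If you prefer to do the same from Python, the sketch below decompresses several gzip-compressed FASTQ files in order and writes one plain FASTQ file. It uses only the standard library; the helper name `merge_gzipped_fastq` and the file names are hypothetical, and the demonstration data are throwaway reads written to a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def merge_gzipped_fastq(gz_paths, combined_path):
    """Decompress each .gz input in order and append it to one plain FASTQ."""
    with open(combined_path, "wb") as out_handle:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)


# Demonstration on two tiny throwaway files (hypothetical reads).
workdir = Path(tempfile.mkdtemp())
records = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nTTGA\n+\nIIII\n"]
inputs = []
for index, record in enumerate(records, start=1):
    path = workdir / f"file{index}.fastq.gz"
    with gzip.open(path, "wb") as handle:
        handle.write(record)
    inputs.append(path)

combined = workdir / "combined.fastq"
merge_gzipped_fastq(inputs, combined)
merged_bytes = combined.read_bytes()
print(merged_bytes.decode())
```

Streaming with `shutil.copyfileobj` avoids loading whole files into memory, which matters for large sequencing runs.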
Typically, paired-end read sets consist of a forward and a reverse FASTQ file, both of which are compressed. If the files are compressed, you can uncompress them as follows:
gzip -d Illumina_file_R1.fastq.gz
and
gzip -d Illumina_file_R2.fastq.gz
You should end up with two FASTQ files, such as Illumina_file_R1.fastq and Illumina_file_R2.fastq, which can then be run through the sequenoscope analyze module like this:
sequenoscope analyze --input_fastq Illumina_file_R1.fastq Illumina_file_R2.fastq --input_reference ref.fasta -o output -seq_type PE
The analyze module provides specific sequencing statistics based on the reference FASTA file provided. Refer to the outputs section below for more details.
To quickly get started with the analyze module:
- Ensure that you have the necessary input files and reference database prepared:
  - Input FASTQ files: Provide the path to the FASTQ files you want to process using the --input_fastq option.
  - Reference database: Specify the path to the reference database in FASTA format using the --input_reference option.
- Choose an output directory for the results:
  - Specify the output directory path using the --output option.
- Specify the sequencing type:
  - Set -seq_type to either PE (paired-end) or SE (single-end).
- Run the module with the minimally required options:
sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.FASTA> -o <output_directory> -seq_type <sr>
This command will initiate the analysis module using the default settings. The input FASTQ file(s) will be processed, and the results will be saved in the specified output directory.
Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For additional customization options and more detailed information on available options, please run sequenoscope analyze -h or sequenoscope analyze --help.
Note: Remember to replace <file.fq> with the actual path to your FASTQ file, <ref.FASTA> with the path to your reference database, <output_directory> with the desired location for the output files, and <sr> with your sequencing type (SE for single-end and PE for paired-end).
Note: Taxon IDs are used as a naming convention, reflecting the sequence name in the FASTA file. The pipeline can process genes, subspecies, and other identifiers; it doesn't have to be a taxon.
To quickly get started with the filter_ONT module:
- Ensure that you have the necessary input files prepared:
  - Input FASTQ files: Provide the path to the adaptive sequencing FASTQ files from the ONT sequencer that you want to process using the --input_fastq option.
  - ONT sequencing summary file: Specify the path to the ONT sequencing summary file using the --input_summary option. This file is generated either by MinKNOW or by a base calling tool such as Guppy or Dorado.
- Choose an output file and directory for the filtered reads:
  - Specify the output file path and directory using the --output option.
- Set the desired filtering criteria:
  - You can optionally apply various filters to the reads based on the following criteria:
    - Read classification status*: Use the -cls or --classification option to designate the adaptive-sampling sequencing decision classification. Valid options are 'unblocked', 'stop_receiving', or 'no_decision'.
    - Channel range/Pore number: Set the minimum and maximum channel/pore number range for filtering using the -min_ch and -max_ch options.
    - Duration: Define the minimum and maximum duration of the read sequencing time in seconds using the -min_dur and -max_dur options.
    - Run time range: Specify the minimum and maximum start time of the sequencing run in seconds using the -min_start and -max_start options.
    - Q score: Determine the minimum and maximum Q score for filtering using the -min_q and -max_q options.
    - Read length range: Set the minimum and maximum read length for filtering using the -min_len and -max_len options.
  - *Note: Some sequencing summary files lack the field specifying read classification status. A warning will be raised if this is the case.
- Run the command with the basic required options:
sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <output.FASTQ>
This command will initiate the filtering process based on the specified criteria and save the filtered reads to the output FASTQ file.
Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For more detailed information on available options, you can run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help.
Note: Remember to replace <file.fq> with the actual path to your ONT sequencing FASTQ file, <seq_summary.txt> with the path to your ONT sequencing summary file, and <output.FASTQ> with the desired path and filename for the filtered reads.
This module is designed for comparative analysis where two testing conditions are present and can be compared.
Visualize the test and control output directories of the analyze module using interactive graphs. To quickly get started with the plot module:
- Required Paths: The plot module is comparative and requires two sets of data outputs from the analyze module to produce meaningful results. Ensure you have provided the necessary directories:
  - Test Directory: Provide the path to the test directory that contains the sequence manifest txt files from the analyze module. -T or --test_dir <test_dir_path>
  - Control Directory: Specify the path to the control directory that contains the sequence manifest txt files from the analyze module. -C or --control_dir <control_dir_path>
  - Output Directory: Choose an output directory for the plots. -o or --output_dir <out_path>
- Plotting Options: Customize your plots with various options:
  - Output Prefix: Add a prefix before plot names. -op or --output_prefix <OUTPUT_PREFIX>. Default is 'sample'.
  - Comparison Metric: Select a parameter for the box plot and single ratio bar chart using the --comparison_metric option. Default parameter is taxon_%_covered_bases.
  - Single Charts: Generate an additional box plot and single ratio bar chart based on the selected comparison metric using the --single_charts option. {TRUE, FALSE}. Default value is False.
  - Adaptive Sampling: Generate read classification decision bar charts for adaptive sampling runs, if adaptive sampling was used during sequencing, by specifying the -AS option. Default value is False.
  - Violin Data Fraction: Set the fraction of the sequencing data (total number of reads) to use for the violin plot. -VP or --violin_data_percent <0.1 - 1>. Default fraction is 0.1.
  - Time Bin Unit: Designate the time bin used for read classification decision bar charts. -bin or --time_bin_unit {seconds,minutes,5m,15m,hours}. Default bin is minutes.
  - Taxon* Legend: Generate a legend for the source file taxon covered bar chart. -legend or --taxon_chart_legend {TRUE, FALSE}. Default designation is False.
- Run the Command: With the basic required options:
sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
Use the --force flag if you wish to force an overwrite of an existing results directory.
Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For more detailed information on plot parameters, please consult the Usage section or run sequenoscope plot -h or sequenoscope plot --help.
Remember to replace <test_dir_path>, <control_dir_path>, and <out_path> with the actual paths for your directories.
Note: Taxon is a general term that refers to the reference sequences in the user-provided FASTA file.
File | Description |
---|---|
<prefix>_fastp_output.fastq | The output FASTQ file after processing with fastp. It includes filtered and trimmed sequencing reads. |
<prefix>_fastp_output.html | An HTML report generated by fastp summarizing the filtering and quality control results. |
<prefix>_fastp_output.json | A JSON formatted report with detailed fastp quality control statistics. |
<prefix>_manifest.txt | A sequence manifest file containing various sequencing statistics post-analysis. |
<prefix>_manifest_summary.txt | A summary of the sequence manifest with key statistics for a quick overview. |
<prefix>_mapped.bam | The BAM file output from minimap2, containing aligned sequences to the reference FASTA. |
<prefix>_mapped.bam.bai | An index file for the BAM file to enable quick read access. |
<prefix>_mapped_fastq.fastq | The FASTQ file containing reads that have been mapped to the reference. |
<prefix>_mapped.sam | The SAM file equivalent of the BAM file, containing human-readable alignment data. |
<prefix>_mash.hash.msh | A MASH sketch file used for rapid genome distance estimation. |
<prefix>_read_list.txt | A text file list of reads, potentially used for further downstream analysis. |
Note: Replace <prefix> with the user-specified prefix that precedes all output filenames.
Column ID | Description |
---|---|
sample_id | Identifier for the sample to which the read belongs. |
read_id | Unique identifier for the sequencing read. |
read_len | Length of the sequencing read in base pairs. |
read_qscore | Quality score of the sequencing read. |
channel | The channel on the sequencing device from which the read was recorded. |
start_time | Time when the sequencing of the read started. |
end_time | Time when the sequencing of the read ended. |
decision | Indicates the final decision on the sequencing read. Decisions are categorized into three main types: stop_receiving (the sequencing is allowed to continue, represented by signal_positive), unblocked (the read is ejected from sequencing, indicated by data_service_unblock_mux_change), and no_decision (no definitive action was taken, denoted by either signal_negative or unblock_mux_change). Each term explains the action taken or not taken based on the read's signal detection and processing status. |
fastp_status | Indicates whether the read passed the filtering and trimming process by fastp. |
is_mapped | Indicates whether the read is mapped to any sequence in the provided multi-sequence FASTA reference file (TRUE if mapped; see note 1 below). |
is_uniq | Indicates whether the read is unique within the sample manifest file (TRUE if unique; see note 2 below). |
contig_id | Identifier for the contig to which the read is mapped, if applicable. |
Notes:
1. is_mapped refers to whether or not a read is mapped to any sequence in the multi-sequence FASTA reference file provided by the user. If true, the contig_id is provided.
2. is_uniq refers to whether or not a read is unique throughout the sample manifest file. In ONT sequencing, a read may be processed multiple times if the decision is labelled as signal_negative or no_decision before a final decision is made on whether to allow the read to continue sequencing or not.
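The relationship between the raw signal labels and the three decision categories described for the decision column can be written as a small lookup. The Python sketch below is illustrative only: the mapping is copied from the table description, while the name `classify_end_reason` is hypothetical and the assumption that these labels arrive as end-reason strings from an ONT sequencing summary is ours, not a statement about how sequenoscope reads them.

```python
from collections import Counter

# Mapping taken from the description of the decision column.
END_REASON_TO_DECISION = {
    "signal_positive": "stop_receiving",
    "data_service_unblock_mux_change": "unblocked",
    "signal_negative": "no_decision",
    "unblock_mux_change": "no_decision",
}


def classify_end_reason(end_reason):
    """Translate a raw label into a decision category; None if unknown."""
    return END_REASON_TO_DECISION.get(end_reason)


# Hypothetical labels for five reads.
example_end_reasons = [
    "signal_positive",
    "data_service_unblock_mux_change",
    "signal_negative",
    "unblock_mux_change",
    "data_service_unblock_mux_change",
]
decision_counts = Counter(classify_end_reason(r) for r in example_end_reasons)
print(decision_counts)
```

Returning `None` for unrecognized labels keeps unexpected values visible in the counts instead of silently folding them into one of the three categories.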
Column ID | Description |
---|---|
sample_id | Identifier for the sample. |
est_genome_size | Estimated size of the genome. |
est_coverage | Estimated coverage of the genome. |
total_bases | Total number of bases in the sample. |
total_fastp_bases | Total number of bases after processing with fastp. |
mean_read_length | Mean read length of the sequencing reads. |
taxon_id | Identifier for the taxon. Obtained from the user-provided FASTA file. |
taxon_length | Length of the taxon's genome. |
taxon_mean_coverage | Mean coverage across the taxon's genome. |
taxon_covered_bases_<prefix>X | Number of bases in the taxon's genome covered at the user-specified coverage threshold. |
taxon_%_covered_bases | Percentage of the taxon's genome that is covered by reads at the user-specified coverage threshold. |
total_taxon_mapped_bases | Total number of bases mapped to the taxon. |
taxon_mean_read_length | Mean read length of the reads mapped to the taxon. |
Note: Replace <prefix> with the user-specified threshold coverage.
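As a worked illustration of the covered-bases columns, the Python sketch below counts positions whose depth meets a minimum coverage threshold and expresses them as a percentage of the sequence length. It assumes a plain list of per-base depths as input and a hypothetical function name `covered_bases`; how sequenoscope derives depths from the alignment is not shown here.

```python
def covered_bases(depths, min_coverage=1):
    """Return (covered base count, percent covered) for per-base depths."""
    if not depths:
        return 0, 0.0
    # A position counts as covered when its depth is >= the threshold.
    covered = sum(1 for depth in depths if depth >= min_coverage)
    return covered, 100.0 * covered / len(depths)


# Hypothetical depths for a 10-base reference sequence.
example_depths = [0, 0, 1, 3, 5, 5, 2, 0, 1, 4]
count_1x, percent_1x = covered_bases(example_depths, min_coverage=1)
count_3x, percent_3x = covered_bases(example_depths, min_coverage=3)
print(count_1x, percent_1x)
print(count_3x, percent_3x)
```

Raising the threshold from 1X to 3X lowers the covered fraction for the same data, so percentages are only comparable between runs analyzed with the same minimum coverage.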
File | Description |
---|---|
<user_prefix>_filtered_fastq_subset.fastq | The subset of FASTQ reads that have been filtered based on the user-defined criteria within the filter_ONT module. |
<user_prefix>_read_id_list.csv | A CSV file containing the list of read identifiers that correspond to the filtered subset. This may be used for further reference or analysis. |
Note: Replace <user_prefix> with the user-specified prefix that precedes all output filenames.
File | Description | Triggered by Command |
---|---|---|
<prefix>_ratio_bar_chart.html | An HTML file containing a bar chart that displays the ratio statistics of the manifest summary file. | Default behavior |
<prefix>_source_file_taxon_covered_bar_chart.html | An HTML file containing a bar chart displaying the coverage of taxa in the source files. | Default behavior and --taxon_chart_legend specifying the inclusion of a legend |
<prefix>_stat_results.csv | A CSV file with statistical results of the analysis, such as taxa coverage percentages. | Default behavior |
<prefix>_cumulative_decision_bar_chart.html | An HTML file containing a bar chart with cumulative decision metrics over time for either test or control datasets. | Adaptive sampling enabled (-AS) and time bin specified (--time_bin_unit) |
<prefix>_independent_decision_bar_chart.html | An HTML file containing a bar chart with independent decision metrics over time for either test or control datasets. | Adaptive sampling enabled (-AS) and time bin specified (--time_bin_unit) |
read_len_<prefix>_violin_comparison_plot.html | An HTML file containing a violin plot comparing log-transformed data between the test and control datasets. | Default behavior and --violin_data_percent specifying the fraction of data to plot |
read_qscore_<prefix>_violin_comparison_plot.html | An HTML file containing a violin plot comparing Q score distributions between test and control datasets. | Default behavior and --violin_data_percent specifying the fraction of data to plot |
<prefix>_box_plot.html | An HTML file containing a box plot comparing a specific parameter from test and control files. | --comparison_metric specified with --single_charts enabled |
<prefix>_single_ratio_bar_chart.html | An HTML file containing a single bar chart comparing a specific parameter from test and control files. | --comparison_metric specified with --single_charts enabled |
Note: Replace <prefix> with the user-specified prefix that precedes all output filenames from the plot module. This prefix is set with the --output_prefix option when running the command.
Note: For the adaptive sampling plots enabled with the -AS option, there will be 2 files, test and control, for each type of bar chart, independent and cumulative.
A manuscript is currently in preparation. This section will be updated with the publication reference once it is available.
Copyright Government of Canada 2023
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Abdallah Meknas: [email protected]