About METEORE
METEORE provides snakemake pipelines for various tools to detect DNA methylation from Nanopore sequencing reads. Additionally, it provides new predictive models (random forest and multiple linear regression) that combine the outputs from the tools to produce a consensus prediction with higher accuracy than the individual tools.
NEW UPDATES (Mar-2021)
METEORE can now produce two per-site result files in an augmented BED format for each tool except for DeepMod (which will be updated very soon). The first output file contains the following fields:
- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
- Strandedness
In the second output file, we combine the methylation predictions from both strands on CpG sites by averaging the methylation frequencies and adding up the coverage. This output file contains the following fields:
- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
- Pipeline
- Installation
- Tutorial on an example dataset
- Combined model (random forest) usage
- Combined model (multiple linear regression) usage
Fig 1. Pipeline for CpG methylation detection form nanopore sequencing data. All tools take the input fast5 files, detect modified bases (5-methylcytosine at CG dinucleotides in this case) in reads and predict per-site methylation frequency at genome level.
We recommend to install software dependencies via Conda
on Linux. You can find Miniconda installation instructions for Linux here.
Make sure you install the Miniconda Python3 distribution.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Accept the license terms during installation.
For performance and compatibility reasons you should install Mamba
via conda to install Snakemake for each pipeline later. See Snakemake documentation for more details.
conda install -c conda-forge mamba
Once you have installed Conda and Mamba, you can download the Snakemake pipelines and the example datasets.
git clone https://github.com/comprna/METEORE.git
cd METEORE/
We provide an example dataset data/example
along with a genome reference data/ecoli_k12_mg1655.fasta
for you to try the pipelines with. The example contains 50 single-read fast5 files from the positive control dataset for E.coli generated by Simpson et al. (2017).
Run the pipelines with your own data:
- You can run the pipeline with your own dataset by replacing
example
folder in thedata
directory with your folder containing the fast5 files. You will use the fast5 folder name to specify your target output file in the Snakemake pipeline. Simply replace example in the output file with your fast5 folder name in the command line below. - You should place the reference genome file in .fasta format in a folder named
data
, and re-define the reference genome file within the Snakefile (Nanopolish
,Deepsignal1
,Tombo
,Guppy
) by replacingecoli_k12_mg1655.fasta
with your specified reference genome.
To install packages for Nanopolish pipeline, run one of the following:
- Installing packages via Mamba
# Create an environment with Snakemake installed
mamba create -c conda-forge -c bioconda -n meteore_nanopolish_env snakemake
# Activate
conda activate meteore_nanopolish_env
# Install all required conda packages with mamba
mamba install -c bioconda nanopolish minimap2 samtools r-data.table r-dplyr r-plyr
- Installing packages using .yml file**
mamba env create -f nanopolish.yml
conda activate meteore_nanopolish_env
Before executing the workflow below, make sure you have the basecalled fastq file in the METEORE
directory. Nanopolish needs to link the read ids from the fastq file with their signal-level data in the fast5 files. An example fastq file example.fastq
is provided.
A Snakefile named Nanopolish
contains all rules for the Snakemake workflow. Run the snakemake to create the output files:
snakemake -s Nanopolish nanopolish_results/example_nanopolish-freq-perCG.tsv --cores all
This will produce four index files example.fastq.index
, example.fastq.index.fai
, example.fastq.index.gzi
and example.fastq.index.readdb
, and the nanopolish_results
output directory containing all output files.
example_nanopolish-log.tsv
is the raw output after runningnanopolish call-methylation
.example_nanopolish-log-perCG.tsv
contains per-read per-site data, which splits up the CpG group containing multiple nearby sites into its constituent CpG sites.
Chr Pos Strand Log.like.ratio Read_ID
NC_000913.3 3499494 + -0.62 094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3 3499526 + -0.33 094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3 3499546 + -0.12 094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3 3499563 + 8.26 094dfe6b-23ed-4195-8876-805a399fade5
-
example_nanopolish-freq-perCG.tsv
stores the final per-site data in a augmented BED format where the columns represent:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
- Strandedness
Chr Pos_start Pos_end Coverage Methylation Strand
NC_000913.3 3503839 3503840 7 1 +
NC_000913.3 3503840 3503841 7 1 -
NC_000913.3 3503849 3503850 7 1 +
NC_000913.3 3503850 3503851 7 1 -
-
example_nanopolish-freq-perCG-combStrand.tsv
also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. Each column represents:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
Chr Pos_start Pos_end Coverage Methylation
NC_000913.3 3503839 3503840 14 1
NC_000913.3 3503849 3503850 14 1
You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).
snakemake -s Nanopolish nanopolish_results/example_nanopolish-perRead-score.tsv --cores all
The output is in TSV format and contains the following fields: ID
, Chr
, Pos
, Strand
and Score
, e.g.:
ID Chr Pos Strand Score
094dfe6b-23ed-4195-8876-805a399fade5 NC_000913.3 3499655 - 20.79
...
Note: Two environments with Snakemake installed are required. The first is for Deepsignal itself. The second is to produce output files that are in the same format generated by other tools.
We will create and first Conda environment:
# Here we install Python 3.6 with older version of Snakemake so we can install tensorflow==1.13.1 later
mamba create -n meteore_deepsignal_env1 -c bioconda -c conda-forge python=3.6.1 snakemake=5.3.0
# Activate
conda activate meteore_deepsignal_env1
# Install all required packages using pip
# We will run DeepSignal on CPUs so we install the CPU version of tensorflow
pip install numpy deepsignal 'tensorflow==1.13.1' ont-tombo ont-fast5-api
Alternatively, you can run DeepSignal on a GPU-enabled machine. Then you can install the GPU version of tensorflow. Please check out DeepSignal Github Page for more information.
Please download DeepSignal's trained model model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz
here.
To extract the model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz
to the METEORE/data
directory:
tar xvzf model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz -C <path_to_METEORE/data_directory>
Note:
- DeepSignal does not support multi-read fast5 files. Please use the
multi_to_single_fast5
command from the ont_fast5_api package to convert the fast5 files to single-read fast5 format before running the snakemake. - Raw read fast5 files must contain basecall information from the fastq files. If not, please use the
tombo preprocess annotate_raw_with_fastqs
command from tombo preprocess to add basecalls to the fast5 files.
A Snakefile named Deepsignal1
contains rules to generate default output files from DeepSignal in the Snakemake workflow. Run it to create these files:
snakemake -s Deepsignal1 deepsignal_results/example_deepsignal-freq-perCG-raw.tsv
This will produce the deepsignal_results
output directory containing the following output files:
example_deepsignal-prob.tsv
contains the raw per-read results including the position of the CpG site, read ID, strand, methylated probability, unmethylated probability etc.example_deepsignal-freq-perCG-raw.tsv
contains the default per-site results generated by a Python script call_modification_frequency.py provided byDeepSignal
. The file contains 11 columns:
#chr pos strand 0-based_pos prob_unmethy_sum prob_methyl_sum count_modified count_unmodified coverage mod_freq k_mer
NC_000913.3 3501290 + 3501290 1.947 5.053 7 0 7 1.000 AAAAGCACCGTGGACTT
NC_000913.3 3501291 - 1140360 3.103 1.897 1 4 5 0.200 AAAGTCCACGGTGCTTT
NC_000913.3 3501308 + 3501308 1.032 5.968 7 0 7 1.000 CTGGTCACCGAAAATAT
NC_000913.3 3501309 - 1140342 1.493 3.507 3 2 5 0.600 CATATTTTCGGTGACCA
Deactivate the meteore_deepsignal_env environment before creating a new environment. We use Conda rather than Mamba to avoid any caching issues (known bug in Mamba).
conda deactivate
conda create -n meteore_deepsignal_env2 -c bioconda -c conda-forge r-base=4.0.3 r-data.table r-dplyr snakemake
conda activate meteore_deepsignal_env2
Another Snakefile named Deepsignal2
contains the rules for the Snakemake workflow to produce per-site result files and prepare the input file for combined model usage. Note the uses of --cores in the newer version of Snakemake.
snakemake -s Deepsignal2 deepsignal_results/example_deepsignal-freq-perCG.tsv --cores 1
Two output files will be produced and saved to the deepsignal_results
directory:
-
example_deepsignal-freq-perCG.tsv
stores the final per-site data in a augmented BED format where the columns represent:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
- Strandedness
Chr Pos_start Pos_end Coverage Methylation Strand
NC_000913.3 3501827 3501828 9 0.889 +
NC_000913.3 3501828 3501829 8 1 -
NC_000913.3 3501840 3501841 9 1 +
NC_000913.3 3501841 3501842 8 0.875 -
NC_000913.3 3501861 3501862 9 0.889 +
NC_000913.3 3501862 3501863 8 0.875 -
-
example_deepsignal-freq-perCG-combStrand.tsv
also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. Each column represents:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
Chr Pos_start Pos_end Coverage Methylation
NC_000913.3 3501827 3501828 17 0.9445
NC_000913.3 3501840 3501841 17 0.9375
NC_000913.3 3501861 3501862 17 0.882
You can now generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below). Note the uses of --cores in the newer version of Snakemake.
snakemake -s Deepsignal2 deepsignal_results/example_deepsignal-perRead-score.tsv --cores 1
The output is in TSV format and contains the following fields: ID
, Chr
, Pos
, Strand
and Score
.
# Create an environment with Snakemake installed
mamba create -c conda-forge -c bioconda -n meteore_tombo_env snakemake==5.3.0 python==3.6.10 ont-tombo r-data.table r-dplyr
# Activate
conda activate meteore_tombo_env
# Install all required packages using pip
pip install wiggelen ont-fast5-api h5py==2.10.0
Note:
- Tombo does not support multi-read fast5 files. Please use the
multi_to_single_fast5
command from the ont_fast5_api package to convert the fast5 files to single-read fast5 format before running the snakemake. - Raw read fast5 files must contain basecall information from the fastq files. If not, please use the
tombo preprocess annotate_raw_with_fastqs
command from tombo preprocess to add basecalls to the fast5 files
A Snakefile named Tombo
contains all rules in the Snakemake workflow. Run it to create the output files:
snakemake -s Tombo tombo_results/example_tombo-freq-perCG.tsv
This will produce the following original modified base prediction results generated by Tombo
. You can find the detailed information for all Tombo commands and outputs on the tombo documentation page.
example.CpG.tombo.stats
: a binary Tombo statistics fileexample.CpG.tombo.per_read_stats
: a per-read statistics fileexample.fraction_modified_reads.plus.wig
: a wiggle output file which stores the raw fraction of significantly modified reads mapped on +'ve strandexample.fraction_modified_reads.minus.wig
: a wiggle output file which stores the raw fraction of significantly modified reads mapped on -'ve strandexample.valid_coverage.plus.wig
: a wiggle output file which stores the coverage data for reads on +'ve strand that are mapped, re-squiggled and outside the interval of the default thresholdsexample.valid_coverage.minus.wig
: a wiggle output file which stores the coverage data for reads on -'ve strand that are mapped, re-squiggled and outside the interval of the default thresholds
There are rules in the Snakemake workflow for downstream processing of the output files generated by Tombo. The above command also generates a tombo_results
output directory which contains the following output files:
-
example_tombo-freq-only.tsv
: contains methylation frequency results from the wiggle files -
example_tombo-cov-only.tsv
: contains coverage data from the wiggle files -
example_deepsignal-freq-perCG.tsv
stores the final per-site data in a augmented BED format including the following fields:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
- Strandedness
-
example_deepsignal-freq-perCG-combStrand.tsv
also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. Each column represents:- Reference chromosome
- Start position in chromosome
- End position in chromosome
- Read coverage
- Methylation (i.e. methylation frequency)
You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).
snakemake -s Tombo tombo_results/example_tombo-perRead-score.tsv
The output is in TS format and contains the following fields: ID
, Chr
, Pos
, Strand
and Score
.
Note:
- This command can only extract per-read data for a region of interest. You need to go to script_in_snakemake/extract_tombo_per_read_results.py to specify your region of interest (chromosome, start position and end position) in the Python script and rerun the results.
- Open the Python script extract_tombo_per_read_results.py with an editor of your choice and go to line 21 to 23 of the file:
######## specify region of interest below: ########
chromosome = 'NC_000913.3'
start_position = 412305
end_position = 4584088
You need to basecall with the standalone Guppy basecaller before running the Snakemake pipeline. The pipeline was only designed to process and analyze the Guppy's fast5 output using the open-source custom scripts available from https://github.com/kpalin/gcf52ref. This was only tested with Guppy (version 3.6) when modified basecalling was introduced. We highly recommend to use Megalodon (ONT-developed tool) coupled with Guppy for methylation calling (see Megalodon section below).
Guppy basecaller is only available to Oxford Nanopore Technologies' customers via the community site. Please check the community page for download and installation instructions.
Once you have installed Guppy, you can perform modified basecalling from the signal data using the dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg
guppy config.
# For Guppy (CPU on Linux)
./<path_to_ont-guppy-cpu>/bin/guppy_basecaller --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --fast5_out --input_path data/example/ --save_path guppy_results/ --cpu_threads_per_caller 10
# For Guppy (GPU on Linux)
./<path_to_ont-guppy-gpu>/bin/guppy_basecaller --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --fast5_out --input_path data/example/ --save_path guppy_results/ --device auto
Once you have done modified basecalling with Guppy, the output files will be saved to the guppy_results
directory where you will find the logs, basecalled fastq file(s) and a folder named workspace
which contains basecalled fast5 files with modified base information.
Before running the Snakmake pipeline, you need to prepare the following files:
- If you have more than 1 fastq files generated, please concatenate these files first
cd guppy_results/
cat *.fastq > example.fq
- If there is only one fastq file, rename that file instead
mv [original_fastq_file_with_long_filename] example.fq
- Download the scripts that convert CpG methylation from fast5s to reference anchored calls
git clone https://github.com/kpalin/gcf52ref.git
Then you can create and activate the environment from the guppy.yml
file:
mamba env create -f guppy.yml
conda activate guppy_cpg_snakemake
A Snakefile named Guppy contains all rules in the Snakemake workflow. Run the snakemake to create the output files:
cd ../
snakemake -s Guppy guppy_results/example_guppy-freq-perCG.tsv
This will produce two output files in the guppy_results
directory:
example_guppy-log-perCG.tsv
contains per-read per-site data.
chromosome strand start end read_name log_lik_ratio log_lik_methylated log_lik_unmethylated num_calling_strands num_motifs sequence
NC_000913.3 + 3505711 3505712 8386596c-ff56-4032-b54e-8f062c194b16 -2.41 -2.41 0.0 1 1 CG
NC_000913.3 + 3505714 3505715 8386596c-ff56-4032-b54e-8f062c194b16 -0.575 -0.676 -0.101 1 1 CG
NC_000913.3 + 3505726 3505727 8386596c-ff56-4032-b54e-8f062c194b16 1.49 -0.012 -1.51 1 1 CG
NC_000913.3 + 3505728 3505729 8386596c-ff56-4032-b54e-8f062c194b16 0.361 -0.155 -0.516 1 1 CG
NC_000913.3 + 3505745 3505746 8386596c-ff56-4032-b54e-8f062c194b16 -0.112 -0.359 -0.247 1 1 CG
example_guppy-freq-perCG.tsv
stores the final per-site data in a augmented BED format including the following fields:Chromosome
,Start position
,End position
,Coverage
,Methylation frequency
andStrand
, which is in the same format generated by other tools.example_guppy-freq-perCG-combStrand.tsv
also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. This output contains the following fields:Chromosome
,Start position
,End position
,Coverage
andMethylation frequency
, which is in the same format generated by other tools.example_guppy-perRead-score.tsv
is a per-read prediction file containing five columns:Read_ID
,Chromosome
,Position
,Strand
andScore
. This file can be used later in METEORE to generate a consensus prediction.
You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).
snakemake -s Guppy guppy_results/example_guppy-perRead-score.tsv
The output is in tsv format and contains the following fields: ID
, Chr
, Pos
, Strand
and Score
.
We did not develop a snakemake pipeline for Megalodon as it can basecall, map, detect base modifications with Guppy (GPU on Linux) and Remora running backend, using a single command.
Please check out Megalodon GitHub Page and Megalodon's documentation for more details.
Here we used the most recent modified basecalling models in Remora that provides the most accurate methylation result. See Remora for installation and usage.
We tested with Megalodon (v2.5.0), Guppy (v6.1.1, GPU on linux), Remora (v1.0.0). GPU resources are required.
megalodon data/example/ \
--outputs basecalls mappings mod_mappings per_read_mods mods \
--guppy-config dna_r9.4.1_450bps_hac.cfg \
--guppy-server-path <path_to_ont-guppy-gpu>/bin/guppy_basecall_server \
--reference data/ecoli_k12_mg1655.fasta \
--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 \
--overwrite --mod-motif m CG 0 --write-mods-text \
--devices "cuda:all" --processes 20
The megalodon
command produces the megalodon_results
directory containing logs, per-read modified base output per_read_modified_base_calls.txt
, per-site modified base output modified_bases.5mC.bed
.
We provide a bash script to generate the methylation frequency file and the input file for combined model usage, which are in the same format used in every method. Run the bash script with your preferred filename (eg. example) for the output files:
./script/megalodon.sh example
You will get the following files:
example_megalodon-freq-perCG.tsv
stores the final per-site data in a augmented BED format including the following fields:Chromosome
,Start position
,End position
,Coverage
,Methylation frequency
andStrandedness
, which is in the same format generated by other tools.example_deepsignal-freq-perCG-combStrand.tsv
also stores the final per-site data in the same augmented BED format but the methylation calls from both strands are merged into a single strand by averaging the methylation frequencies and adding up the coverage for a CpG site. This output contains the following fields:Chromosome
,Start position
,End position
,Coverage
andMethylation frequency
, which is in the same format generated by other tools.example_megalodon-perRead-score.tsv
is a per-read prediction file containing five columns:Read_ID
,Chromosome
,Position
,Strand
andScore
. This file can be used later in METEORE to generate a consensus prediction.
We did not develop a snakemake pipeline for DeepMod as it can be run with a single command.
Please check out DeepMod GitHub Page for installation instructions and usage.
# Set the working directory where you install DeepMod
python bin/DeepMod.py detect --wrkBase <path_to_METEORE>/data/example/ --Ref <path_to_METEORE>/data/ecoli_k12_mg1655.fasta --outFolder <path_to_METEORE>/deepmod_results --Base C --modfile train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 --FileID example_deepmod --threads 10
You will find a deepmod_results
output folder which contains a .done file and a example_deepmod
result folder. Inside the example_deepmod
folder, you will find two output files in a bed format mod_pos.NC_000913.3-.C.bed
and mod_pos.NC_000913.3+.C.bed
, which per-site data including genomic position of the CpG site, strand, base, coverage, methylation percentage and methylation coverage. The format of the output of DeepMod is described here.
We provide a R script to generate a methylation frequency file in the same format used in other methods. Run Rscript script/run_deepmod.R deepmod_output_4_plus_strand.bed deepmod_output_4_minus_strand.bed output_file.tsv
# Set the working directory where you install METEORE
Rscript script/run_deepmod.R deepmod_results/example_deepmod/mod_pos.NC_000913.3+.C.bed deepmod_results/example_deepmod/mod_pos.NC_000913.3-.C.bed example_deepmod-freq-perCG.tsv
The example_deepmod-freq-perCG.tsv
output file contains the per-site data including the genomic position of the CpG site, methylation frequency and coverage. The methylation calls from both strands are merged into a single strand.
Note that DeepMod does not produce per-read predictions, so we could not produce an output file in the same format as those described above (.tsv) to be later combined to build a consensus.
We have trained Random Forest models that combine the outputs from two of the methods above to produce consensus predictions with improved accuracy. We also provide the scripts to train new models with two or more methods.
First, please make sure you install the required libraries in the Conda environment
pip install -r requirements.txt
To make the predictions using the combination model (Here we use DeepSignal and Nanopolish as an example), you can generate the per-read prediction input
file example_deepsignal-perRead-score.tsv
and example_nanopolish-perRead-score.tsv
as described before in our snakemake pipelines.
The input (.tsv) file from each method must be formatted as below:
ID Chr Pos Strand Score
094dfe6b-23ed-4195-8876-805a399fade5 NC_000913.3 3499495 - 2.99
094dfe6b-23ed-4195-8876-805a399fade5 NC_000913.3 3499527 - 3.55
094dfe6b-23ed-4195-8876-805a399fade5 NC_000913.3 3499547 - 6.19
094dfe6b-23ed-4195-8876-805a399fade5 NC_000913.3 3499564 - 4.45
where the score is the significance score given by each method. Nanopolish and Guppy provide a log-likelihood ratio value per site and per read. Similarly, Tombo produces a significance value per site and per read. On the other hand, for DeepSignal we give the log-ratio of the probabilities for a site to be methylated over the probablity of being unmethylated for each read. Similarly, for Megalodon we used the difference of the log-probabilities to obtain a log-ratio.
Given the already pre-trained model (models available in ./saved_models/
), to run the model you need the following command:
python combination_model_prediction.py -i [path to the tsv file storing the methods' names and the paths to the per-read score file] -m [model_to_use (default or optimized)] -o [output_file]
The options (-i
, -m
and -o
) are explained below:
Inputs (-i)
The file for option -i
contains the methods you choose to use (1st column) and the path to the respective per-read score files (2nd column):
deepsignal ./example_results/deepsignal_results/example_deepsignal-perRead-score.tsv
nanopolish ./example_results/nanopolish_results/example_nanopolish-perRead-score.tsv
Here we call this file model_content.tsv
(provided as an example with the package) and use the results example_results/
produced by the method from the example dataset.
Models (-m)
You can choose to use either -m default
or -m optimised
:
default
refers to the parameters used to build the Random Forest, which in this case are the default parameters from the sklearn library (n_estimator = 100 and max_dep = None). The command below will use the model namedrf_model_default_deepsignal_nanopolish.model
to score the sites and reads that are common between the two filesexample_deepsignal-perRead-score.tsv
andexample_nanopolish-perRead-score.tsv
, as these files are defined in themodel_content.tsv
.optimised
refers to the adjusted parameters for n_estimator and max_dep used to build the Random Forest from sklern (n_estimator = 3 and max_dep = 10). The model used in the command below will berf_model_max_depth_3_n_estimator_10_deepsignal_nanopolish.model
.
# Use the default model
python combination_model_prediction.py -i model_content.tsv -m default -o example_deepsignal_nanopolish-default-model-perRead.tsv
# Use the optimized model
python combination_model_prediction.py -i model_content.tsv -m optimized -o example_deepsignal_nanopolish-optimized-model-perRead.tsv
Run with your own files using a different combination of tools
- If you want a different combination, e.g. deepsignal+guppy, you can replace the "nanopolish" line in the
model_content.tsv
file with guppy and its input file path. You can also add other methods paths as well in the same file. - The order of method's names in the model_content.tsv file should be same as the order used to generate the combined model. For example, we provide the models with name 'rf_model_default_deepsignal_nanopolish.model' so the order in the
model_content.tsv
file should be deepsignal and then nanopolish and not the other way round. You can check the models we trained and the order of the names in thesaved_models
directory.
Output (-o)
Use -o option to write the per-read consensus prediction output to the specified file.
All outputs are written into the directory called combined_model_results
. New results from subsequent runs will be saved into
the same output directory.
The output will contain per-read predictions for the reads common to both deepsignal and nanopolish method. For example, in the example_deepsignal-nanopolish-optimized-model-perRead.tsv
ID Chr Pos Prediction Prob_methylation
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c NC_000913.3 3508489 0 0.41759488451060933
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c NC_000913.3 3508467 0 0.41759488451060933
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c NC_000913.3 3508454 1 0.9565835858934412
The column prediction
(0 refers to unmethylated and 1 refers to methylated) is based on a threshold of 0.5 score. That is, if the P(methylation) is <= 0.5,
it is predicted as unmethylated (0), otherwise as methylated (1). Predictions for a different thresholds can be obtained by parsing the column Prob_methylation
.
We also provide a Python script to convert the per-site predictions for each individual read (methylated / unmethylated) into per-site predictions at genome level (methylation frequency) by summarizing the predictions on individual reads from the model.
prediction_to_mod_frequency.py [-h] -i INPUT [-t THRESHOLD] -o OUTPUT
where the input is the file produced from the previous step, and the output is the per-site result at genome level. You can specify a threshold which applies on the Prob_methylation column in the per-read combined prediction file (see above) to define the site on the read as methylated or unmethylated.
Run python prediction_to_mod_frequency.py -h
for help.
For example:
# You can specify a threshold when -t option is used
python prediction_to_mod_frequency.py -i combined_model_results/example_deepsignal_nanopolish-optimized-model-perRead.tsv -t 0.46 -o combined_model_results/example_deepsignal_nanopolish-optimized-model-freq-perCG.tsv
# Or you can use the default threshold of 0.5
python prediction_to_mod_frequency.py -i combined_model_results/example_deepsignal_nanopolish-default-model-perRead.tsv -o combined_model_results/example_deepsignal_nanopolish-default-model-freq-perCG.tsv
The per-site prediction output contains the following fields:
Chr Pos Unmodified_read Modified_read Coverage Methylation_frequency
NC_000913.3 3504995 6 10 16 0.625
NC_000913.3 3505005 1 15 16 0.9375
NC_000913.3 3505010 1 15 16 0.9375
NC_000913.3 3505034 3 13 16 0.8125
We provide the script to train a combined model from the per-read and per-site output from any number of methods (from 2 to 5). For instance, the command to train a model with 5 methods would be:
python combination_model_train.py -d [path of deepsignal file] -n [path of nanopolish file] -g [path of guppy file] -m [path of megalodon file] -t [path of tombo file] -c [number of methods to combine together for training (range from 2-5)] -o [output_path_to_save_model]
The order of the methods in the command is not relevant, but it will determine how to specify the model when making predictions, as described above. For example, the command to train a model with DeepSignal, Nanopolish, and guppy, could take the following form:
python combination_model_train.py -d deepsignal.tsv -n nanopolish.tsv -g guppy.tsv -c 3 -o output_models
This command will create an output directory called output_models
, and the models named rf_model_default_deepsignal_nanopolish_guppy.model
and rf_model_max_depth_3_n_estimator_10_deepsignal_nanopolish_guppy.model
will be saved inside this directory. To run this new model, you will need to create a new file
new_model_content
of the form, e.g.:
deepsignal deepsignal_test2.tsv
nanopolish nanopolish_test2.tsv
guppy guppy_test2.tsv
where the _test2.tsv
files contain the reads and scores that you want to use for consensus prediction in the format
ID Pos Strand Score
b9fdd6aa-ba93-4424-8f4b-c632e4d16d2e 1817032 - 2.45
...
The command to obtain the consensus predictions could be of the form
python combination_model_prediction.py -i new_model_content.tsv -m optimized -o output2
This command will use the model rf_model_max_depth_3_n_estimator_10_deepsignal_nanopolish_guppy.model
to predict on the sites and reads common in the input files
deepsignal_test2.tsv
, nanopolish_test2.tsv
, and guppy_test2.tsv
.
An alternative combination model based on multiple linear regression (REG) is available through the meteore_reg.py
script. This is a standalone script that first builds a model training data and then applies it to unknown test data generated from the same tools.
It's a simple script to run but requires upwards of 6 GB of RAM, and approximately five minutes for the data sets used in our publication.
The METEORE REG model was trained on the E.coli datasets with 11 different mixture sets of methylated and unmethylated reads (0%, 10%,..., 90%, 100% of methylated reads), each set with ~2400 reads. Here we provide the training datasets of the three best performing tools (Nanopolish, DeepSignal, Megalodon) where you can download them from figshare. The minimum needed for training is to have two mixture proportions that are known.
The example usage below would train the REG model on Nanopolish and DeepSignal outputs from the "mix1" dataset, and apply this model to the data in "mix2" for the same tools. Inputs and outputs rely on the ordering of arguments to remain consistent.
python meteore_reg.py --train set1_nanopolish.tsv set1_deepsignal.tsv \
--tool_names nanopolish deepsignal --train_desc mix1 --test_desc mix2 \
--test set2_nanopolish.tsv set2_deepsignal.tsv --testIsTraining \
--trunc_min -50 -5 --trunc_max 50 5 --filter 0.1
python meteore_reg.py -h
usage: meteore_reg.py [-h] --train TRAIN [TRAIN ...] --train_desc TRAIN_DESC [--test [TEST [TEST ...]]]
[--test_desc TEST_DESC] --tool_names TOOL_NAMES [TOOL_NAMES ...]
[--trunc_max [TRUNC_MAX [TRUNC_MAX ...]]] [--trunc_min [TRUNC_MIN [TRUNC_MIN ...]]] [--debug]
[--inc_pred] [--filter FILTER] [--testIsTraining] [--no_scaling]
optional arguments:
-h, --help show this help message and exit
--train TRAIN [TRAIN ...]
Path to training data
--train_desc TRAIN_DESC
Describe the training data set
--test [TEST [TEST ...]]
Path to test data. Must be omitted or same number of tool files as as --train (and in the same order)
--test_desc TEST_DESC
Describe the test data set
--tool_names TOOL_NAMES [TOOL_NAMES ...]
Names of each tool in the same order as given for --train (and --test if provided)
--trunc_max [TRUNC_MAX [TRUNC_MAX ...]]
Truncate maximum values. Must match order of training data
--trunc_min [TRUNC_MIN [TRUNC_MIN ...]]
Truncate minimum values. Must match order of training data
--debug Turn on debugging output (STDERR)
--inc_pred Include tool prediction categories in regression. NOTE: must be present in all input files
--filter FILTER Percentage of total reads to remove around mean score, default -1 (disabled)
--testIsTraining Use functions for handling training data for the test parameters
--no_scaling Disable score scaling
The trunc_min and trunc_max parameters can be used to restrict the useful range of log difference scores on a per-tool basis. Most of the existing methylation calling tools are well behaved but Nanopolish in particular can output extreme scores. By default REG scales each tool's scores linearly between 0 and 1. Extreme positive or negative values will reduce the contribution by scores that are less extreme but still meaningful indictors of methylation status. Values less than trunc_min INT for tool N will be replaced with the user provided score, and vice versa for the scores larger than trunc_max INT.
Enables methylation labels to be considered as a contributor to the prediction model. This requires the test data to have the same format as the test data, with "group" and "label" columns. In some case it can improve prediction if the tool makes good classifications in spite of oddly distributed scores.
Filters out a fixed proportion of reads, evenly distributed either side of the scaling mid-point. If scaling is disabled then the mean score is considered the mid-point.
Informs REG to use the input format corresponding to training data for test data as well.
Disables MinMax (0..1) scaling. Not recommended in general but might be necessary for tools with wildly different scoring systems.
python meteore_reg.py --train set1_deepsignal.tsv.gz set1_nanopolish_updated.tsv.gz --tool_names deepsignal nanopolish \
--train_desc mix1 --test set2_deepsignal.tsv.gz set2_nanopolish.tsv.gz --test_desc mix2 --testIsTraining \
--trunc_min -5 -50 --trunc_max 5 50
Filter Ignored deepsignal_pass deepsignal_prop nanopolish_pass nanopolish_prop Model_succ Model_prop
-1 0 2981104 0.9105415715991803 2729348 0.8336457961081127 3029092 0.9251989163070138
Filter Ignored deepsignal_pass deepsignal_prop nanopolish_pass nanopolish_prop Model_succ Model_prop
-1 0 5728226 0.9151707857313122 5332267 0.8519103436420188 5778967 0.923277428318178
Filter is the proportion of (presumably low-accuracy) reads to be removed, with -1 indicated it is disabled. Likewise the Ignored column is the corresponding number of reads removed by filtering.
Each tool is then reported with the number of site observations that match the expected methylation status (not accurate for hemi-methylated sites), and the overall proportion accuracy.
Finally the combined regression model numbers are reported for training data where labels are available to ascertain truth.