Running SQANTI3 rescue
As of SQANTI3 v5.1, a new module has been added to the SQANTI3 workflow for transcriptome characterization and quality control: SQANTI3 rescue.
The SQANTI3 rescue algorithm is designed to be run after transcriptome filtering and uses the long read-based evidence provided by discarded isoforms (i.e. artifacts) to recover transcripts in the associated reference transcriptome. The idea behind this strategy is to avoid losing transcripts/genes that are detected as expressed by long read sequencing, but whose start/end/junctions could not be confidently validated using orthogonal data, resulting in the removal of those genes/transcripts from the transcriptome. More details about this can be found in the Motivation section below.
In particular, during the rescue, SQANTI3 will try to confidently assign each discarded artifact to the best matching reference transcript. As a result, SQANTI3 rescue will generate an expanded transcriptome GTF including a set of reference transcripts as well as the long read-defined isoforms that passed the filter.
Similarly to the SQANTI3 filter, SQANTI3 rescue is designed as a dual implementation, depending on whether the rules or the machine learning filter was previously run. Therefore, the sqanti3_rescue.py script requires an argument, either ml or rules, to activate the corresponding rescue mode.
usage: sqanti3_rescue.py [-h] {ml,rules} ...
Rescue artifacts discarded by the SQANTI3 filter, i.e. find closest match for
the artifacts in the reference transcriptome and add them to the
transcriptome.
positional arguments:
{ml,rules}
optional arguments:
-h, --help show this help message and exit
To be completed: explain why transcript rescue is required after filtering.
The SQANTI3 rescue algorithm consists of the following general steps:
As explained above, the rescue strategy in SQ3 was conceived to recover transcriptome diversity lost during filtering. This, among other things, means verifying that none of the reference-supported junction chains that were initially detected by long reads are lost due to stringent artifact removal.
To achieve this during automatic rescue, all reference transcripts that were represented by at least one FSM in the original post-QC transcriptome are first retrieved (note that this information is available in the associated_transcript column of the *_classification.txt file). Then, those reference transcripts for which all FSM representatives were removed by the filter are rescued.
The previous analytic decision is justified because, in practice, any case where all FSMs with the same associated_transcript are removed can be interpreted as follows: 1) the TSS and/or TTS of the long read-defined transcript differs from that of the matching reference transcript, but could not be validated by the orthogonal data supplied to SQ3 QC; and 2) the junctions are identical to those found in the reference, which can be interpreted as evidence that this isoform is real. As a result, SQ3 will not rescue any of the discarded FSMs, but rather the associated_transcript from the reference.
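The automatic rescue step described above can be sketched as follows. This is a minimal illustration and not the SQANTI3 implementation: it assumes that each row of the filter output classification is available as a dictionary, that FSMs are labelled full-splice_match in the structural_category column, and that a filter_result column marks each isoform as Isoform or Artifact. The function name automatic_rescue and the example identifiers are hypothetical.

```python
from collections import defaultdict


def automatic_rescue(rows):
    """Return reference transcripts whose FSM representatives were all removed.

    `rows` is an iterable of dicts with the keys `structural_category`,
    `associated_transcript` and `filter_result` (assumed column names).
    """
    # Collect the filter outcome of every FSM, grouped by reference transcript.
    fsm_results = defaultdict(list)
    for row in rows:
        if row["structural_category"] == "full-splice_match":
            fsm_results[row["associated_transcript"]].append(row["filter_result"])

    # A reference is rescued only if none of its FSMs passed the filter.
    return sorted(
        ref
        for ref, results in fsm_results.items()
        if all(result == "Artifact" for result in results)
    )


# Toy classification rows (hypothetical identifiers).
example_rows = [
    {"isoform": "PB.1.1", "structural_category": "full-splice_match",
     "associated_transcript": "ENST_A", "filter_result": "Artifact"},
    {"isoform": "PB.1.2", "structural_category": "full-splice_match",
     "associated_transcript": "ENST_A", "filter_result": "Artifact"},
    {"isoform": "PB.2.1", "structural_category": "full-splice_match",
     "associated_transcript": "ENST_B", "filter_result": "Isoform"},
    {"isoform": "PB.2.2", "structural_category": "full-splice_match",
     "associated_transcript": "ENST_B", "filter_result": "Artifact"},
    {"isoform": "PB.3.1", "structural_category": "incomplete-splice_match",
     "associated_transcript": "ENST_C", "filter_result": "Artifact"},
]

print(automatic_rescue(example_rows))
```

In this toy input only ENST_A is returned: ENST_B keeps one FSM that passed the filter, and ENST_C has no FSM support, so it is not handled by the automatic rescue.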
In spite of its potential to achieve the goals that we set for the rescue, the previous strategy does not consider ISM, NIC or NNC artifacts. These will be included in the rescue candidate group, i.e. transcripts classified as artifacts for which SQANTI3 will try to find a matching reference transcript to include in the final, curated transcriptome.
- For ISM artifact transcripts, there are two possible situations:
  - There will be cases where the same discarded reference transcript is supported by FSM and ISM. ISM artifacts with an FSM artifact counterpart will therefore be "collapsed" into the rescued reference transcript during the automatic rescue step.
  - Conversely, there will be associated_transcript references that are only supported by one or more ISM. Those ISM artifacts that constituted evidence of a non-FSM supported reference will therefore be included in the rescue candidate list.
- For novel transcripts from the NIC and NNC categories, since there is no associated transcript information, all transcripts classified as artifacts will be included in the rescue candidate list.
As a result, we consider all reference or long read-defined transcripts from genes that have at least one rescue candidate to be rescue targets.
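The selection of rescue candidates, and of the genes whose transcripts become rescue targets, can be sketched as below. This is an illustrative reading of the rules above, not the SQANTI3 code: the column names (isoform, structural_category, associated_transcript, associated_gene, filter_result), the category labels and both function names are assumptions made for the example, and the sketch treats an ISM artifact as a candidate whenever its associated reference has no FSM in the original transcriptome.

```python
def select_rescue_candidates(rows):
    """Return the IDs of artifacts that enter the rescue candidate list."""
    rows = list(rows)
    # References represented by at least one FSM in the original transcriptome.
    fsm_supported = {
        row["associated_transcript"]
        for row in rows
        if row["structural_category"] == "full-splice_match"
    }
    candidates = []
    for row in rows:
        if row["filter_result"] != "Artifact":
            continue  # only discarded isoforms can be candidates
        category = row["structural_category"]
        if category == "incomplete-splice_match":
            # ISM artifacts are candidates only when no FSM supports the reference.
            if row["associated_transcript"] not in fsm_supported:
                candidates.append(row["isoform"])
        elif category in ("novel_in_catalog", "novel_not_in_catalog"):
            # NIC and NNC artifacts are always candidates.
            candidates.append(row["isoform"])
    return candidates


def genes_with_candidates(rows, candidates):
    """Return the genes that have at least one rescue candidate."""
    wanted = set(candidates)
    return {row["associated_gene"] for row in rows if row["isoform"] in wanted}


def make_row(isoform, category, transcript, gene, result):
    """Build one toy classification row (hypothetical helper)."""
    return {"isoform": isoform, "structural_category": category,
            "associated_transcript": transcript, "associated_gene": gene,
            "filter_result": result}


toy_rows = [
    make_row("PB.1.1", "full-splice_match", "ENST_A", "GENE1", "Artifact"),
    make_row("PB.1.2", "incomplete-splice_match", "ENST_A", "GENE1", "Artifact"),
    make_row("PB.2.1", "incomplete-splice_match", "ENST_B", "GENE2", "Artifact"),
    make_row("PB.3.1", "novel_in_catalog", "novel", "GENE3", "Artifact"),
    make_row("PB.3.2", "novel_not_in_catalog", "novel", "GENE3", "Isoform"),
    make_row("PB.4.1", "novel_not_in_catalog", "novel", "GENE4", "Artifact"),
]

toy_candidates = select_rescue_candidates(toy_rows)
print(toy_candidates)
print(sorted(genes_with_candidates(toy_rows, toy_candidates)))
```

Here PB.1.2 is left out because its reference (ENST_A) is FSM-supported and is therefore handled by the automatic rescue, while PB.3.2 is left out because it passed the filter. Every reference or long read-defined transcript belonging to the returned genes would then be treated as a rescue target.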
SQ3 rescue next tries to find matches between each rescue target and its same-gene candidates based on sequence similarity. To achieve this, we perform an internal mapping step using minimap2. In it, rescue candidates are considered to be "reads" and rescue targets are used as a "reference genome" in which each transcript sequence constitutes a different "chromosome".
To map candidates, we run minimap2 with the map-hifi preset (-x), SAM output (-a) and secondary alignments enabled (--secondary=yes):
minimap2 --secondary=yes -ax map-hifi rescue_targets.fasta rescue_candidates.fasta > mapped_rescue.sam
Finally, all candidate-target pairs produced by the mapping (referred to as mapping hits) are retrieved from the output SAM file, regardless of whether they are primary or secondary alignments.
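Retrieving mapping hits from the SAM output can be sketched as follows. This is a minimal illustration that relies only on the standard SAM column layout (query name, flag and reference name in the first three fields) and is not the parser used by SQANTI3; the function name mapping_hits and the example records are hypothetical.

```python
def mapping_hits(sam_lines):
    """Return (candidate, target) pairs from an iterable of SAM lines.

    Primary and secondary alignments are both kept; unmapped records are
    skipped. Supplementary alignments are not treated specially here.
    """
    hits = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # candidate did not map to any target
        hits.append((qname, rname))
    return hits


# Toy SAM content: one header line, a primary hit (flag 0),
# a secondary hit (flag 256) and an unmapped candidate (flag 4).
example_sam = [
    "@SQ\tSN:ENST_B\tLN:1500",
    "PB.2.1\t0\tENST_B\t1\t60\t100M\t*\t0\t0\t*\t*",
    "PB.2.1\t256\tENST_C\t1\t0\t100M\t*\t0\t0\t*\t*",
    "PB.9.1\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

print(mapping_hits(example_sam))
```

Since the function accepts any iterable of lines, an open file handle for mapped_rescue.sam can be passed to it directly.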
Validation of transcripts using orthogonal sources of data is an important part of the SQANTI3 philosophy. In consequence, the rescue strategy in SQ3 includes a validation step for all reference rescue targets before considering them for inclusion in the transcriptome, since no QC information is available for them (in contrast to long read-defined targets).
This requires users to run SQANTI3 quality control on the reference transcriptome and supply the output *_classification.txt file to SQANTI3 rescue using the --refClassif (-k) flag. This must be done using the same orthogonal data files that were used when running SQANTI3 QC for the long read-transcriptome, since the rescue is based on the assumption that the same evidence is required to validate all rescue targets.
These are the arguments accepted by sqanti3_rescue.py rules:
usage: sqanti3_rescue.py rules [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF]
[-f REFGENOME] [-k REFCLASSIF]
[-e {all,fsm,none}] [-o OUTPUT] [-d DIR]
[--skip_report] [-v] [-j JSON]
sqanti_filter_classif
Rescue for rules-filtered transcriptomes.
positional arguments:
sqanti_filter_classif
SQANTI filter (ML or rules) output classification file.
optional arguments:
-h, --help show this help message and exit
--isoforms ISOFORMS FASTA file output by SQANTI3 QC (*_corrected.fasta),
i.e. the full long read transcriptome.
--gtf GTF GTF file output by SQANTI3 filter (*.filtered.gtf).
-g REFGTF, --refGTF REFGTF
Full path to reference transcriptome GTF used when
running SQANTI3 QC.
-f REFGENOME, --refGenome REFGENOME
Full path to reference genome FASTA used when
running SQANTI3 QC.
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when
running SQANTI3 QC on the reference transcriptome.
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in
the rescue. Options include: none, fsm and all (default).
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where
the script was run.
--skip_report Skip creation of a report about the filtering
-v, --version Display program version number.
-j JSON, --json JSON Full path to the JSON file including the rules used when
running the SQANTI3 rules filter.
These are the arguments accepted by sqanti3_rescue.py ml:
usage: sqanti3_rescue.py ml [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF]
[-f REFGENOME] [-k REFCLASSIF]
[-e {all,fsm,none}] [-o OUTPUT] [-d DIR]
[--skip_report] [-v] [-r RANDOMFOREST] [-j THRESHOLD]
sqanti_filter_classif
Rescue for ML-filtered transcriptomes.
positional arguments:
sqanti_filter_classif
SQANTI filter (ML or rules) output classification file.
optional arguments:
-h, --help show this help message and exit
--isoforms ISOFORMS FASTA file output by SQANTI3 QC (*_corrected.fasta),
i.e. the full long read transcriptome.
--gtf GTF GTF file output by SQANTI3 filter (*.filtered.gtf).
-g REFGTF, --refGTF REFGTF
Full path to reference transcriptome GTF used when
running SQANTI3 QC.
-f REFGENOME, --refGenome REFGENOME
Full path to reference genome FASTA used when
running SQANTI3 QC.
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when
running SQANTI3 QC on the reference transcriptome.
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in
the rescue. Options include: none, fsm and all (default).
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where
the script was run.
--skip_report Skip creation of a report about the filtering
-v, --version Display program version number.
-r RANDOMFOREST, --randomforest RANDOMFOREST
Full path to the randomforest.RData object obtained when
running the SQANTI3 ML filter.
-j THRESHOLD, --threshold THRESHOLD
Default: 0.7. Machine learning probability threshold to
filter elegible rescue targets (mapping hits).