-
Notifications
You must be signed in to change notification settings - Fork 51
Running SQANTI3 rescue
As of SQANTI3 v5.1, a new module has been added to the SQANTI3 workflow for transcriptome characterization and quality control: SQANTI3 rescue.
The SQANTI3 rescue algorithm is designed to be run after transcriptome filtering and uses the long read-based evidence provided by discarded isoforms (i.e. artifacts) to recover transcripts in the associated reference transcriptome. The idea behind this strategy is to avoid losing transcripts/genes that are detected as expressed by long read sequencing, but whose start/end/junctions could not be confidently validated using orthogonal data, resulting in the removal of those genes/transcripts from the transriptome. More details about this can be found in the Motivation section below.
In particular, during the rescue, SQANTI3 will try to confidently assign each discarded artifact to the best matching reference transcript. As a result, SQANTI3 rescue will generate an expanded transcriptome GTF including a set of reference transcripts as well as the long read-defined isoforms that passed the filter.
Here is a summary of the SQ3 rescue workflow:
The SQANTI3 rescue algorithm consists in the following steps:
As explained above, the rescue strategy in SQ3 was conceived to recover transcriptome diversity lost during filtering. This, among other things, means verifying that mone of the reference-supported junction chains that were initially detected by long-reads are lost due to stringent artifact removal.
To achieve this during automatic rescue, all reference transcripts that were represented by at least one FSM in the original post-QC transcriptome are first retrieved -note that this information is available in the associated_transcript
column of the *_classification.txt
file. Then, those reference transcripts for which all FSM representatives were removed by the filter are rescued.
The previous analytic decision is justified because, in practice, any case where all FSMs with the same associated_transcript
are removed can be interpreted as follows: 1) the TSS and/or TTS of the long read-defined transcript is different from that of the matching reference transcript, however, it could not be validated by SQ3 QC-supplied orthogonal data; and 2) the junctions are identical to those found in the reference, which can be interpreted as evidence that this isoform is real. As a result, SQ3 will not rescue any of the discarded FSMs, but the associated_transcript
from the reference.
In spite of its potential to achieve the goals that we set for the rescue, the previous strategy does not consider ISM, NIC or NNC artifacts. These will be included in the rescue candidate group, i.e. transcripts classified as artifacts for which SQANTI3 will try to find a matching reference transcript to include in the final, curated transcriptome.
-
For ISM artifact transcripts, there are two possible situations:
-
There will be cases where the same discarded reference transcript is supported by FSM and ISM. ISM artifacts with an FSM artifact counterpart will therefore be "collapsed" into the rescued reference transcript during the automatic rescue step.
-
Conversely, there will be
associated_transcript
references that are only supported by one or more ISM. Those ISM artifacts that constituted evidence of a non-FSM supported reference will therefore be included in the rescue candidate list.
-
-
For novel transcripts from the NIC and NNC categories, since there is no associated transcript information, all transcripts classified as artifacts will be included in the rescue candidate list.
As a result, we consider all reference or long read-defined transcripts from genes that have at least one rescue candidate to be rescue targets.
SQ3 rescue next tries to find matches between each rescue target and its same-gene candidates based on sequence similarity. To achieve this, we perform an internal mapping step using minimap2. In it, rescue candidates are considered to be "reads" and rescue targets are used as a "reference genome" in which each transcript sequence constitutes a different "chromosome".
To map candidates, we use the map-hifi
option in minimap2 and the -a -x
parameters:
minimap2 --secondary=yes -ax map-hifi rescue_targets.fasta rescue_candidates.fasta > mapped_rescue.sam
Finally, all candidate-target pairs obtained during mapping -referred to as mapping hits- are obtained from the output SAM file regardless of whether they are primary or secondary alignments.
Validation of transcripts using orthogonal sources of data is an important part of the SQANTI3 philosophy. In consequence, the rescue strategy in SQ3 includes a validation step for all reference rescue targets before considering them for inclusion in the transcriptome, since no QC information is available for them (in contrast to long read-defined targets).
This requires users to run SQANTI3 quality control on the reference transcriptome and supply the output *_classification.txt
file to SQANTI3 rescue using the --refClassif
(-k
) flag. This must be done using the same orthogonal data files that were used when running SQANTI3 QC for the long read-transcriptome, since the rescue is based on the assumption that the same evidence is required to validate all rescue targets.
Using the supplied reference classification file, SQ3 rescue will next apply SQ3 filter to the reference transcriptome. The filter to be applied will be specified by the rules
or ml
flags used when running the rescue. This means that, if you run SQ3 machine learning-based filter, you should also run the rescue using the ml
option (and the same is true for the rules filter).
In the final rescue step, hereby referred to as rescue-by-mapping, mapping hits (i.e. candidate-target pairs obtained during mapping) and reference transcriptome filter results are combined to generate a final list of reference transcripts to be rescued.
This part of the rescue can be divided into the following tasks/criteria:
-
SQ3 filtering of targets: first, candidate-target pairs are removed if the rescue target did not pass the filter (ML or rules). If the user is working with the rules filter, this will mean that the reference (or long-read) target transcript did not pass the rules defined in the JSON file. If the user is working with the ML filter, this will mean that the target transcript did not pass the specified ML probability threshold. In the case of the ML filter, however, only the target transcript(s) with maximum ML probability are selected to continue in the rescue.
-
Select reference targets: as explained above, both long read-defined and reference target transcripts are considering during mapping. As a result, rescue candidates (i.e. artifacts) can match to either of these transcripts. From this step onwards, however, only targets selected from the reference transcriptome are considered. This is done to avoid introducing redundancy in the final transcriptome: if a rescue candidate's perfect match was another long read-defined counterpart from the same gene, there is no need to include another transcript as a rescued representative, since the best matching transcript is already part of the transcriptome.
-
Remove redundant reference transcripts: during rescue, there is a chance that some of the reference transcripts that we wish to include in our transcriptome is already represented by an FSM from the same gene (or that it has been already retrieved during automatic rescue). In this situation, target reference transcripts that are already represented by a long read-defined isoform are not to be included in the transcriptome in order to avoid increasing redundancy. In other words -and similarly to what was previously stated-, these are cases in which the artifact's (i.e. rescue candidate) best matching transcript is already represented by an isoform in the transcriptome.
After performing this last filter of the rescue target list, SQ3 rescue outputs a list of rescued reference transcripts, which are then added to the long-read transcriptome GTF.
Similarly to the SQANTI3 filter, the SQANTI3 rescue is designed as a
dual implementation, depending on whether the rules or the machine learning filter was previously run. Therefore, the sqanti3_rescue.py
script requires a flag to be provided to activate either the ml
or rules
specific rescue.
usage: sqanti3_rescue.py [-h] {ml,rules} ...
Rescue artifacts discarded by the SQANTI3 filter, i.e. find closest match for
the artifacts in the reference transcriptome and add them to the
transcriptome.
positional arguments:
{ml,rules}
optional arguments:
-h, --help show this help message and exit
Regardless of the rescue mode that is selected, SQ3 has the following common arguments, all of which are mandatory:
-
Long read-defined isoforms: should be the FASTA file generated by SQANTI3 QC (
*_corrected.fasta
). The isoform sequences will be used to extract rescue candidates and targets for mapping, and must be supplied via the--isoforms
argument. -
Long read-defined transcriptome annotation (filtered): should be the GTF file output by SQANTI3 filter (
*.filtered.gtf
). Rescued transcripts will be appended to this GTF file to generate the final curated transcriptome. This file must be supplied via the--gtf
argument. -
Reference transcriptome annotation: reference GTF used to run SQANTI3 QC. This file will be used to extract rescue targets for mapping, and must be supplied via the
--refGTF
(or-g
) argument. -
Reference transcriptome classification file generated after running SQANTI3 QC on the reference transcriptome, which must be done previously to running the rescue and using the same orthogonal data as for long read-defined transcriptome QC. This file must be supplied via the
--refClassif
(or-k
) argument and will be used to evaluate reference rescue target support (see details above).
Additionally, the following parameters can be set to modify the behavior of the rescue algorithm:
-
Define rescue for mono-exon transcripts: the
-e
flag can be used to determine whether (and how) to perform the rescue on mono-exonic transcripts. The following options can be supplied:-
all
(default): all mono-exon transcripts classified as artifacts will be considered as rescue candidates, regardless of the category that they are classified into. -
fsm
: mono-exonic transcripts will only be considered for the rescue if they belong to the full-splice match (FSM) category. In this case, mono-exons will be rescued via the automatic rescue strategy. -
none
: do not consider mono-exonic transcripts for the rescue (all mono-exons classified as artifacts will be discarded and no reference representatives included in the final transcriptome).
-
-
Output directory and outfile prefix: the output directory can be set via the
-d
flag (default: current directory). Output files will be named using the prefix provided using the-o
argument (default: SQANTI3). -
Skip report creation: when the
--skip_report
is supplied, no summary rescue report will be generated.
In addition to the common arguments, the rules rescue requires the following specific files:
-
Rules JSON file: using the
-j
flag, the user must provide the JSON file used for running SQANTI3 rules filter on the long read-defined transcriptome. This same set of rules will be applied to the reference transcriptome to evaluate the reliability of reference rescue targets. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the--refClassif
argument.
All in all, these are the arguments accepted by sqanti3_rescue.py rules
:
usage: sqanti3_rescue.py rules [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF]
[-f REFGENOME] [-k REFCLASSIF]
[-e {all,fsm,none}] [-o OUTPUT] [-d DIR]
[--skip_report] [-v] [-j JSON]
sqanti_filter_classif
Rescue for rules-filtered transcriptomes.
positional arguments:
sqanti_filter_classif
SQANTI filter (ML or rules) output classification file.
optional arguments:
-h, --help show this help message and exit
--isoforms ISOFORMS FASTA file output by SQANTI3 QC (*_corrected.fasta),
i.e. the full long read transcriptome.
--gtf GTF GTF file output by SQANTI3 filter (*.filtered.gtf).
-g REFGTF, --refGTF REFGTF
Full path to reference transcriptome GTF used when
running SQANTI3 QC.
-f REFGENOME, --refGenome REFGENOME
Full path to reference genome FASTA used when
running SQANTI3 QC.
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when
running SQANTI3 QC on the reference transcriptome.
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in
the rescue. Options include: none, fsm and all (default).
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where
the script was run.
--skip_report Skip creation of a report about the filtering
-v, --version Display program version number.
-j JSON, --json JSON Full path to the JSON file including the rules used when
running the SQANTI3 rules filter.
In addition to the common arguments, the machine learning rescue requires the following specific files:
-
Pre-trained random forest classifier: using the
-r
flag, users must provide therandomforest.RData
object that was obtained when running the machine learning filter on the long read-defined transcriptome. This classifier will be used to run the ML filter on reference transcriptome isoforms, i.e. obtain an ML probability value for reference rescue targets and evaluate their likelihood to be a true isoform based on orthogonal data supplied to SQANTI3 QC. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the--refClassif
argument. -
ML probability threshold: minimum probability value that is required in order to consider a transcript to be a true isoform. It should be set using the
-j
flag (default: 0.7). We recommend setting the same threshold that was used when running SQANTI3 filter.
All in all, these are the arguments accepted by sqanti3_rescue.py ml
:
usage: sqanti3_rescue.py ml [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF]
[-f REFGENOME] [-k REFCLASSIF]
[-e {all,fsm,none}] [-o OUTPUT] [-d DIR]
[--skip_report] [-v] [-r RANDOMFOREST] [-j THRESHOLD]
sqanti_filter_classif
Rescue for ML-filtered transcriptomes.
positional arguments:
sqanti_filter_classif
SQANTI filter (ML or rules) output classification file.
optional arguments:
-h, --help show this help message and exit
--isoforms ISOFORMS FASTA file output by SQANTI3 QC (*_corrected.fasta),
i.e. the full long read transcriptome.
--gtf GTF GTF file output by SQANTI3 filter (*.filtered.gtf).
-g REFGTF, --refGTF REFGTF
Full path to reference transcriptome GTF used when
running SQANTI3 QC.
-f REFGENOME, --refGenome REFGENOME
Full path to reference genome FASTA used when
running SQANTI3 QC.
-k REFCLASSIF, --refClassif REFCLASSIF
Full path to the classification file obtained when
running SQANTI3 QC on the reference transcriptome.
-e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
Whether or not to include mono-exonic artifacts in
the rescue. Options include: none, fsm and all (default).
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where
the script was run.
--skip_report Skip creation of a report about the filtering
-v, --version Display program version number.
-r RANDOMFOREST, --randomforest RANDOMFOREST
Full path to the randomforest.RData object obtained when
running the SQANTI3 ML filter.
-j THRESHOLD, --threshold THRESHOLD
Default: 0.7. Machine learning probability threshold to
filter elegible rescue targets (mapping hits).
The final result of SQANTI3 rescue is a transcriptome GTF file - named prefix_rescued.gtf
- including all the transcripts that were classified as isoforms by either the rules or the ML filter (already included in the supplied prefix.filtered.gtf
file) and the rescued transcript isoforms that were selected from the reference transcriptome (see rescue strategy section above).
Other additional output files that are generated in the rescue process will be explained below.
-
Automatic rescue result: a single-column file named
prefix_automatic_rescued_list.tsv
containing the IDs of reference transcripts that were selected during automatic rescue. -
Rescue candidate files. Briefly, rescue candidates are a group artifact transcripts from the ISM, NIC and NNC categories (which did not undergo automatic rescue) for which a reference transcript match needs to be found. Two rescue candidate files are generated:
- A rescue candidate ID list, named
prefix_rescue_candidates.tsv
. - A rescue candidate FASTA file named
prefix_rescue_candidates.fasta
containing the sequences of rescue candidates for mapping.
- A rescue candidate ID list, named
-
Rescue target files. As explained above, rescue targets are those reference and long read-defined transcripts for which there is a same-gene rescue candidate counterpart. During the rescue, candidates are mapped to targets using minimap2. Similarly to rescue candidates, two rescue target files are generated:
- ID list:
prefix_rescue_targets.tsv
. - FASTA file:
prefix_rescue_targets.fasta
.
- ID list:
-
Mapping results: as a result of mapping with minimap2, two files are generated by SQ3 rescue:
- Minimap2 SAM file containing the mapping results, named
prefix_mapped_rescue.sam
. - A three-column table in which the mapping results are summarized, named
prefix_rescue_mapping_hits.tsv
. This table has the following format:
- Minimap2 SAM file containing the mapping results, named
PB.1.3 0 PB.7707.5
PB.1.3 256 ENST00000557932.5
PB.1.3 256 ENST00000558784.5
PB.1.3 256 ENST00000442898.5
PB.1.3 256 ENST00000692602.1
PB.1.3 256 ENST00000686495.1
Here, the first column contains the rescue candidate ID, the second column includes the SAM flag indicating the type of alignment and the last column indicates the ID of the rescue target that the candidate was successfully mapped to.
-
Rescue summary table: the
prefix_rescue_table.tsv
file contains a table in which rescue outcome for all the target-candidate pairs included in the mapping hit table is detailed.-
Column 1 (
rescue_candidate
): rescue candidate ID. -
Column 2 (
sam_flag
): SAM flag indicating the type of alignment. -
Column 3 (
mapping_hit
): ID of the rescue target that the candidate was successfully mapped to. -
Column 4 shows the filter result for the mapping hit (or rescue candidate), and may be named
hit_POS_MLprob
if theml
mode was used, orhit_filter_result
if therules
mode was used.- For ML, the
hit_POS_MLprob
column will contain the probability that the hit is a true isoform provided by the pre-trained random forest classifier. - For rules, the
hit_filter_result
column will read Isoform or Artifact depending on whether or not the hit passed the rules defined in the JSON file.
- For ML, the
-
Column 5 (
candidate_structural_category
): SQANTI3 category that the rescue candidate (i.e. the artifact that is being considered for rescue) belongs to. -
Column 6 (
rescue_result
): final rescue decision. Will berescued
ornot_rescued
depending on whether or not the isoform was finally included in the rescued transcriptome GTF -
Column 7 (
exclusion_reason
): ifrescue_result == not_rescued
, this column will include a flag explaining why the rescue target was not included in the rescue. On the contrary, ifrescue_result == rescued
, the column will containNA
. Possible exclusion reasons are:-
long_read_transcript
if the mapping hit is a long read-defined isoform, and is therefore already present in the transcriptome. -
reference_already_present
if the mapping hit is a reference transcript that is already represented by an isoform, be it an FSM that passed the filter or a transcript that was obtained during automatic rescue. - For ML rescue,
ML_probability
if the mapping hit did not pass the supplied ML probability threshold. Alternatively, for rules rescue, the flag will readartifact_by_rules
, meaning that the mapping hit did not pass the rules specified in the JSON file.
-
-
Column 8 (
best_match_for_candidate
): rescue candidate-level (i.e. transcript-level) summary of the rescue. For each rescue candidate ID, a flag is included summarizing all the decisions that were made regarding all potential rescue targets (mapping hits):-
reference_transcript
: in cases where one or more reference transcripts have been effectively rescued, or where the rescue target was already represented by another transcsript (i.e. thereference_already_present
case described above). -
long_read_transcript
: whenever no reference transcripts were validated for a given rescue candidate, but at least one good long read-defined match (i.e. passing ML probability/rules) was detected -even if not rescued. -
unknown
: this flag is assigned if none of the rescue targets selected for the artifact (i.e. the candidate) pass the ML probability threshold or the specified rules.
-
-
-
Column 9 (
best_match_id
): for each rescue candidate ID, this column will contain the transcript ID of the candidate's best match, as found by the rescue. This will be defined by the following rules:- Normally, the best match will be the rescue target with the maximum ML filter probability (above the specified threshold) or the one that passed the rules defined in the JSON file, depending on whether the rescue was run in the
ml
orrules
modes. - In case of ties (i.e. multiple targets have scored the maximum ML probability or passed the filter rules), the primary alignment obtained during mapping of candidates vs targets will be selected, if present. If none of the ties include the primary alignment, all transcripts that passed the probability/rules criteria will be reported.
- Finally, in case of
best_match_for_candidate == unknown
, the best match ID will also beunknown
.
- Normally, the best match will be the rescue target with the maximum ML filter probability (above the specified threshold) or the one that passed the rules defined in the JSON file, depending on whether the rescue was run in the
An example of what this table would look like for one rescue candidate and all its assigned mapping hits/rescue targets (in this case for the ml
rescue mode) can be viewed here:
rescue_candidate sam_flag mapping_hit hit_POS_MLprob candidate_structural_category rescue_result exclusion_reason best_match_for_candidate best_match_id
PB.1.3 0 PB.7707.5 0.754 novel_not_in_catalog not_rescued long_read_transcript reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000557932.5 0.93 novel_not_in_catalog not_rescued reference_already_present reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000558784.5 0.098 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000442898.5 0.18 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000692602.1 0.998 novel_not_in_catalog not_rescued reference_already_present reference_transcript ENST00000692602.1
PB.1.3 256 ENST00000686495.1 0.066 novel_not_in_catalog not_rescued ML_probability reference_transcript ENST00000692602.1
PB.1.4 0 PB.7707.5 0.754 novel_not_in_catalog not_rescued long_read_transcript reference_transcript ENST00000692602.1
-
Rescue inclusion list: a single-column file containing the IDs of all rescued transcripts, named
prefix_rescue_inclusion-list.tsv
. In this list, both the IDs of rescued transcripts obtained via automatic rescue and via rescue-by-mapping are combined into a single inclusion list. This list is used to generate the final rescued transcriptome GTF, i.e.prefix_rescued.gtf
.
In addition to the common output files, the rules rescue will create a new folder within the designated output directory to save rules filter output files, i.e. the results of running the rules filer on the reference transcriptome. This folder is named reference_rules_filter
in all cases, while all files within it will have reference_
as prefix. The reference_RulesFilter_result_classification.txt
file will be used by SQANTI3 rescue to get the rules filter result for reference transcripts.
In addition to the common output files listed above, the ML rescue will also output a table including the result obtained after running the pre-trained ML classifier on the reference transcriptome. This file, named prefix_reference_isoform_predict.tsv
, will contain reference transcript IDs in the first column and random forest classifier-generated probability values on the second. This probability can be interpreted as the extent to which the transcript is supported by SQANTI3 QC orthogonal data (short-reads, CAGE/polyA peaks, etc.). An example of this output is represented below:
isoform POS_MLprob
DQ459430 0.25
DQ516784 0.262
DQ516752 0.27
DQ668364 0.262
DQ883670 0.804
EF011062 0.512
DQ875385 0.758
Wiki index
- Introduction to SQANTI3
- Dependencies and installation
- Version history
- Isoform classification: categories and subcategories
- Running SQANTI3 quality control
- Understanding the output of SQANTI3 QC
- IsoAnnotLite
- Running SQANTI3 filter
- Running SQANTI3 rescue
- Tutorial: running SQANTI3 on an example dataset
- Running SQANTI-reads
- Memory requirements to use parallelization