Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The isoform classification is not accurate for Genic Intron and Intergenic #255

Closed
dolittle007 opened this issue Feb 23, 2024 · 18 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@dolittle007
Copy link

dolittle007 commented Feb 23, 2024

Hi SQANTI3 developers,

I found that SQANTI3 usually identifies genic intron as intergenic by mistake. Could you please fix this issue?

SQANTI3/sqanti3_qc.py

Lines 1335 to 1381 in a9b6604

def associationOverlapping(isoforms_hit, trec, junctions_by_chr):
# at this point: definitely not FSM or ISM or NIC or NNC
# possibly (in order of preference assignment):
# - antisense (on opp strand of a known gene)
# - genic (overlaps a combination of exons and introns), ignore strand
# - genic_intron (completely within an intron), ignore strand
# - intergenic (does not overlap a gene at all), ignore strand
isoforms_hit.str_class = "intergenic"
isoforms_hit.transcripts = ["novel"]
isoforms_hit.subtype = "mono-exon" if trec.exonCount==1 else "multi-exon"
#if trec.id.startswith('PB.37872.'):
# pdb.set_trace()
if len(isoforms_hit.genes) == 0:
# completely no overlap with any genes on the same strand
# check if it is anti-sense to a known gene, otherwise it's genic_intron or intergenic
if len(isoforms_hit.AS_genes) == 0:
if trec.chrom in junctions_by_chr:
# no hit even on opp strand
# see if it is completely contained within a junction
da_pairs = junctions_by_chr[trec.chrom]['da_pairs']
i = bisect.bisect_left(da_pairs, (trec.txStart, trec.txEnd))
while i < len(da_pairs) and da_pairs[i][0] <= trec.txStart:
if da_pairs[i][0] <= trec.txStart <= trec.txEnd <= da_pairs[i][1]:
isoforms_hit.str_class = "genic_intron"
break
i += 1
else:
pass # remain intergenic
else:
# hits one or more genes on the opposite strand
isoforms_hit.str_class = "antisense"
isoforms_hit.genes = ["novelGene_{g}_AS".format(g=g) for g in isoforms_hit.AS_genes]
else:
# (Liz) used to put NNC here - now just genic
isoforms_hit.str_class = "genic"
# overlaps with one or more genes on the same strand
#if trec.exonCount >= 2:
# # multi-exon and has a same strand gene hit, must be NNC
# isoforms_hit.str_class = "novel_not_in_catalog"
# isoforms_hit.subtype = "at_least_one_novel_splicesite"
#else:
# # single exon, must be genic
# isoforms_hit.str_class = "genic"
return isoforms_hit

Thanks a lot.

@alexpan00
Copy link
Collaborator

Hi @dolittle007,

Could you provide an example of this bug? Or highlight/do a pull request of the part of the code that is causing this bug?

Thanks,
Alejandro.

@dolittle007
Copy link
Author

dolittle007 commented Feb 26, 2024

Example:

sqanti3_qc.py test.gtf RNF115_MRNIP_SEC13.gtf hg38.fa --report skip --skipORF -o test

example.zip

Hi @alexpan00, I am providing the example data in the example.zip file.
It includes

  • test.sh: test command line
  • test.gtf: Input transcripts in GTF file
  • RNF115_MRNIP_SEC13.gtf: reference annotation file. I extracted three genes from NCBI refseq GTF file.
  • test_classification.txt: output classification file.

According to the test_classification.txt, all three test transcripts will be reported as intergenic ones.
However, the ground truth for these three transcripts should be, genic_intron, genic_intron and antisense.

Thanks a lot.

@alexpan00
Copy link
Collaborator

Hi @dolittle007,

Thanks for your detailed example. I am looking at this right now and let you now as soon as possible.

Alejandro

@alexpan00
Copy link
Collaborator

Hi @dolittle007,

We have identified the cause of the problem and working on a fix. However, the third isoform of your example would be classified as a genic intron, because it does not overlap with any exon of the transcript in the other strand. This is necessary for an isoform to be classified as antisense.

Thank you,
Alejandro

@dolittle007
Copy link
Author

Hi @alexpan00,
Thanks a lot for your help. I am looking forward to the bug-fixed release, thus I can apply SQANTI3 to my project soon. 😄

@Tang-pro
Copy link

Tang-pro commented Mar 8, 2024

Hi, @alexpan00
image
image
The isoform here is obviously the same as the reference, why is it classified as intergenic?

@alexpan00
Copy link
Collaborator

Hi @Tang-pro

Is the reference shorter than 200bp?

Alejandro

@Tang-pro
Copy link

Tang-pro commented Mar 8, 2024

Hi, @alexpan00
No, Some are over 200bp, but most are short

@alexpan00
Copy link
Collaborator

alexpan00 commented Mar 8, 2024

Hi @Tang-pro
Try using the option --min_ref_len 0, or just pull the repository and run SQANTI3 again. The default value was changed from 200 to 0 a month ago.

In the example that you are showing the reference transcript seems to be quite short, probably under 200 bp, so that isoform is discarded while parsing the reference annotation with the previous --min_ref_len default of 200. If the issue persists, please, open a new issue.

@Tang-pro
Copy link

Tang-pro commented Mar 8, 2024

Hi,@alexpan00
Yes, indeed, regions greater than 200bp are intergenic, while regions less than 200bp are considered overlapping. Here is another question: Why was the 200bp cutoff chosen at that time? Is there any significance behind this particular number?

@alexpan00
Copy link
Collaborator

Sorry, @Tang-pro, but I am not sure if I fully understood your answer. You are saying that there are isoforms in your gtf that overlap a transcript in the reference annotation that is longer than 200bp while the isoform is being classified as intergenic, is this correct? If that is the case, please, open a new issue and provide some examples (in gtf format if possible).

The idea of the 200bp cutoff was introduced to avoid hits with families of small RNAs. After getting a couple of issues reporting incorrect classification on isoforms as intergenic (when the reference is shorter than 200bp), we decided that it may be better to set the default value to 0 and give the user full control of the cutoff.

@Tang-pro
Copy link

Tang-pro commented Mar 8, 2024

Hi, @alexpan00
So far, no isoforms larger than 200bp have been found to overlap with the reference. Most of the isoforms smaller than 200bp have this kind of situation.
Thank you very much for your reply

@carolinamonzo carolinamonzo added the bug Something isn't working label Mar 21, 2024
alexpan00 added a commit that referenced this issue Apr 2, 2024
Fixed a bug on classification of genic_intron (issue #255)
@alexpan00
Copy link
Collaborator

This issue with the genic intron has been fixed in the last commit #283

@MengjunWu
Copy link

Hi, I want to ask if this bug is fixed in the latest SQANTI version 5.2.1? I have updated my previous 5.2 to 5.2.1, however this bug still persists

@alexpan00
Copy link
Collaborator

Hi,
I think this bug was fixed after the release of 5.2.1, could you clone the repository and give it a try?

Alejandro

@MoMoTPark
Copy link

Hi team, I'm using SQ 5.2.1 and noticed I still get misclassified transcripts as 'intergenic' when they should be 'genic_intronic'. Attached screen shot of both PB.6617.24 and PB.6611.1 which are categorized as intergenic.

Many thanks for your help

Mo

PB.6617.24      chr1    +       1876    1       intergenic      novelGene_0_40581       novel   NA      NA      NA      NA      NA      NA      mono-exon       FALSE   NA      NA      NA      NANA      15      NA      NA      NA      NA      NA      NA      A       non_coding      NA      NA      NA      NA      NA      NA      NA      75      TTAAAAAAATAAATAAATAA    NA      FALSE   NANA      AGTAAA  -13     TRUE    NA      NA      Artifact
PB.6611.1       chr1    -       1671    1       intergenic      novelGene_0_40582       novel   NA      NA      NA      NA      NA      NA      mono-exon       FALSE   NA      NA      NA      NANA      3       NA      NA      NA      NA      NA      NA      A       non_coding      NA      NA      NA      NA      NA      NA      NA      50      AAGGAAATAAAATCATGTTC    NA      FALSE   NANA      NA      NA      FALSE   NA      NA      Artifact
Screenshot 2024-07-11 at 13 39 59

@alexpan00
Copy link
Collaborator

Hi @MoMoTPark,

This change has not been included in a release yet. You should clone the repo in order to get this change.

Sorry for the inconvenience,
Alejandro

@MoMoTPark
Copy link

Hi Alejandro,

I see, thank you very much for the prompt reply. I'll do that then.

Many thanks,
Mo

pabloati pushed a commit that referenced this issue Dec 16, 2024
Fixed a bug on classification of genic_intron (issue #255)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants