stringtie2utr generates multiple five_prime_UTRs and three_prime_UTRs within a gene #723

Closed
spoonbender76 opened this issue Dec 14, 2023 · 6 comments
Assignees
Labels
question Further information is requested

Comments

@spoonbender76

Hi,

I ran stringtie2utr.py with the GeneMark-ETP/rnaseq/stringtie/transcripts_merged.gff file to add UTRs to braker.gtf.

However, I encountered the problem that multiple five_prime_UTRs and three_prime_UTRs are generated within a gene, the same issue as #716 (comment).

Here are some examples.

chr01	AUGUSTUS	gene	1265570	1337711	.	+	.	g25
chr01	AUGUSTUS	transcript	1265570	1337711	1	+	.	g25.t1
chr01	stringtie2utr	five_prime_UTR	1265570	1265641	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1267245	1267346	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1268300	1268427	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1268857	1269048	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1270085	1270362	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1271057	1271273	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1273003	1273117	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1274180	1274306	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1275368	1275508	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1276316	1276514	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1277498	1277613	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1279421	1279738	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1281067	1281465	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1283176	1283443	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1284457	1284568	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1287752	1287821	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1288516	1288661	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1289245	1289401	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1289880	1290078	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1290804	1291036	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1291414	1292379	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	start_codon	1292380	1292382	.	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	CDS	1292380	1292518	1	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	exon	1292380	1292518	.	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	intron	1292519	1293897	1	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	CDS	1293898	1294256	1	+	2	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	exon	1293898	1294256	.	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	stop_codon	1294254	1294256	.	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	three_prime_UTR	1294257	1294738	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	three_prime_UTR	1337331	1337711	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	gmst	gene	1600956	1659382	.	-	.	g39
chr01	gmst	transcript	1600956	1659382	.	-	.	g39.t1
chr01	stringtie2utr	three_prime_UTR	1600956	1601209	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1601856	1601983	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1602513	1602581	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1603205	1603301	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1612960	1613142	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1613778	1613862	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1630424	1630588	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1641347	1641473	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	stop_codon	1641474	1641476	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1641474	1641483	24.335131	-	1	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1641474	1641483	24.335131	-	1	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	intron	1641484	1643629	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1643630	1643765	24.335131	-	2	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1643630	1643765	24.335131	-	2	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	intron	1643766	1646726	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1646727	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1646727	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	start_codon	1646896	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1646899	1646901	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1656810	1656979	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1659327	1659382	1000	-	.	transcript_id "g39.t1"; gene_id "g39"
@KatharinaHoff
Member

KatharinaHoff commented Dec 14, 2023 via email

@KatharinaHoff KatharinaHoff self-assigned this Dec 14, 2023
@KatharinaHoff KatharinaHoff added the question Further information is requested label Dec 14, 2023
@spoonbender76
Author

Thank you for your response. I'm still a bit puzzled and would appreciate further clarification. As I understand it (I may be wrong), a transcript should have a single continuous 5' UTR, starting at the beginning of the transcript and ending just before the start codon, and similarly a single continuous 3' UTR, beginning right after the stop codon and extending to the end of the transcript. Does this output mean that transcript variants have different UTRs (I'm not sure whether they really exist or are an assembly artifact) and that all of these UTRs were added to the annotation? Or are the multiple 5' UTR features just parts of one large 5' UTR? Should I keep only one 5' UTR and one 3' UTR, or is it fine to leave the annotation as it is?
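
To see how many UTR features each transcript actually carries, one option is to count them per `transcript_id`. The following is only a minimal sketch, assuming tab-separated GTF lines with nine columns and `transcript_id "..."` attributes as in the excerpt above. The function name `count_utr_segments` is hypothetical and is not part of BRAKER or stringtie2utr.py; the example lines are copied from the g39 records in this issue.

```python
import re
from collections import Counter

# Feature types written by stringtie2utr.py in the excerpt above.
UTR_TYPES = {"five_prime_UTR", "three_prime_UTR"}
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_utr_segments(gtf_lines):
    """Count UTR features per (transcript_id, feature type).

    Hypothetical helper: lines that do not have nine tab-separated
    columns, or that are not UTR features, are skipped.
    """
    counts = Counter()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in UTR_TYPES:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[(match.group(1), fields[2])] += 1
    return counts


# Four records of g39.t1 taken from the example in this issue.
example = [
    'chr01\tgmst\tstart_codon\t1646896\t1646898\t24.335131\t-\t0\ttranscript_id "g39.t1"; gene_id "g39";',
    'chr01\tstringtie2utr\tfive_prime_UTR\t1646899\t1646901\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39";',
    'chr01\tstringtie2utr\tfive_prime_UTR\t1656810\t1656979\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39";',
    'chr01\tstringtie2utr\tfive_prime_UTR\t1659327\t1659382\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39"',
]

utr_counts = count_utr_segments(example)
for (transcript, feature), n in sorted(utr_counts.items()):
    print(transcript, feature, n)  # g39.t1 five_prime_UTR 3
```

For a real file, the lines could be passed in by iterating over an open file handle instead of the `example` list.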

@KatharinaHoff
Member

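In eukaryotes, UTRs can be spliced, so the UTR of a single transcript may appear as several five_prime_UTR or three_prime_UTR features, one per exon segment. Splicing is less frequent in the 3' UTR, but it also happens there.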

This is not to say that all the stringtie assemblies and all the genes are correct. Everything in structural genome annotation may contain errors.
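
As an illustration of the spliced-UTR reading, the gaps between consecutive UTR features of one transcript can be listed as the introns they would imply. This is a minimal sketch under the assumption that coordinates are 1-based and inclusive, as in GTF; the helper names `implied_utr_introns` and `utr_length` are hypothetical and not part of stringtie2utr.py. The coordinates are the three five_prime_UTR features of g39.t1 from the example above.

```python
def implied_utr_introns(segments):
    """Return the gaps between consecutive UTR segments.

    `segments` is a list of (start, end) pairs, 1-based and inclusive,
    all of one feature type and one transcript. Each gap is the intron
    that a spliced UTR would imply. Overlapping segments raise an error
    because they cannot belong to a single spliced UTR.
    """
    ordered = sorted(segments)
    introns = []
    for (start1, end1), (start2, end2) in zip(ordered, ordered[1:]):
        if start2 <= end1:
            raise ValueError(f"overlapping segments: {(start1, end1)} and {(start2, end2)}")
        if start2 > end1 + 1:
            introns.append((end1 + 1, start2 - 1))
    return introns


def utr_length(segments):
    """Total number of UTR bases across all segments."""
    return sum(end - start + 1 for start, end in segments)


# five_prime_UTR features of g39.t1 (minus strand) from the example above.
g39_five_prime = [(1646899, 1646901), (1656810, 1656979), (1659327, 1659382)]

print(implied_utr_introns(g39_five_prime))
# [(1646902, 1656809), (1656980, 1659326)]
print(utr_length(g39_five_prime))
# 229
```

Under this reading, g39.t1 has one 5' UTR of 229 bases split over three exon segments, and the segment at 1646899-1646901 directly adjoins the start codon at 1646896-1646898. Whether the underlying StringTie assembly is correct is a separate question that this check does not answer.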

@ChuanzhengWei

I guess the issue arose because I used transcriptome data from different varieties of the same species (I didn't perform transcriptome sequencing on my sequenced material). After read mapping, the edges of transcripts of the same gene may have come out different. Of course, this is just speculation, and I haven't checked it in IGV.

@KatharinaHoff
Member

KatharinaHoff commented Dec 15, 2023 via email

@KatharinaHoff
Member

I will close this issue because I believe there is nothing wrong with the software.
