stringtie2utr generates multiple five_prime_UTRs and three_prime_UTRs within a gene #723

spoonbender76 · 2023-12-14T08:04:19Z

Hi,

I ran stringtie2utr.py with the GeneMark-ETP/rnaseq/stringtie/transcripts_merged.gff file to add UTRs to braker.gtf.

However, I encountered the problem that multiple five_prime_UTRs and three_prime_UTRs are generated within a gene, the same issue as #716 (comment).

Here are some examples.

chr01	AUGUSTUS	gene	1265570	1337711	.	+	.	g25
chr01	AUGUSTUS	transcript	1265570	1337711	1	+	.	g25.t1
chr01	stringtie2utr	five_prime_UTR	1265570	1265641	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1267245	1267346	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1268300	1268427	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1268857	1269048	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1270085	1270362	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1271057	1271273	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1273003	1273117	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1274180	1274306	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1275368	1275508	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1276316	1276514	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1277498	1277613	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1279421	1279738	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1281067	1281465	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1283176	1283443	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1284457	1284568	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1287752	1287821	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1288516	1288661	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1289245	1289401	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1289880	1290078	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1290804	1291036	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	five_prime_UTR	1291414	1292379	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	start_codon	1292380	1292382	.	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	CDS	1292380	1292518	1	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	exon	1292380	1292518	.	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	intron	1292519	1293897	1	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	CDS	1293898	1294256	1	+	2	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	exon	1293898	1294256	.	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	AUGUSTUS	stop_codon	1294254	1294256	.	+	0	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	three_prime_UTR	1294257	1294738	1000	+	.	transcript_id "g25.t1"; gene_id "g25";
chr01	stringtie2utr	three_prime_UTR	1337331	1337711	1000	+	.	transcript_id "g25.t1"; gene_id "g25";

chr01	gmst	gene	1600956	1659382	.	-	.	g39
chr01	gmst	transcript	1600956	1659382	.	-	.	g39.t1
chr01	stringtie2utr	three_prime_UTR	1600956	1601209	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1601856	1601983	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1602513	1602581	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1603205	1603301	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1612960	1613142	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1613778	1613862	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1630424	1630588	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	three_prime_UTR	1641347	1641473	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	stop_codon	1641474	1641476	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1641474	1641483	24.335131	-	1	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1641474	1641483	24.335131	-	1	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	intron	1641484	1643629	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1643630	1643765	24.335131	-	2	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1643630	1643765	24.335131	-	2	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	intron	1643766	1646726	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	CDS	1646727	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	exon	1646727	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	gmst	start_codon	1646896	1646898	24.335131	-	0	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1646899	1646901	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1656810	1656979	1000	-	.	transcript_id "g39.t1"; gene_id "g39";
chr01	stringtie2utr	five_prime_UTR	1659327	1659382	1000	-	.	transcript_id "g39.t1"; gene_id "g39"

KatharinaHoff · 2023-12-14T12:03:29Z

Each of these transcripts has two UTRs, one 5' and one 3'. The UTRs are spliced, so each one is listed as several segments. This is not an error. It results from the stringtie assembly and from the location of the protein-coding gene within that assembled transcript. Or are there overlapping coordinates that I have overlooked?
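The overlap question can be checked mechanically. Below is a minimal sketch, not part of stringtie2utr.py; the helper names `parse_gtf_features` and `find_overlaps` are made up for illustration. It groups UTR lines by transcript and feature type and reports neighbouring segments whose coordinates overlap. The sample lines are three UTR lines of g39.t1 copied from the report above.

```python
from collections import defaultdict


def parse_gtf_features(lines, wanted=("five_prime_UTR", "three_prime_UTR")):
    """Group (start, end) intervals by (transcript_id, feature type)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in wanted:
            continue
        # Extract transcript_id from the attribute column.
        tid = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id"):
                tid = attr.split('"')[1]
        if tid is None:
            continue
        groups[(tid, fields[2])].append((int(fields[3]), int(fields[4])))
    return groups


def find_overlaps(intervals):
    """Return neighbouring intervals (sorted by start) that overlap.

    Coordinates are 1-based and inclusive, as in GTF. An empty list
    means that no two intervals in the input overlap at all.
    """
    ordered = sorted(intervals)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1]]


# Hypothetical usage with lines taken from the g39.t1 example.
attrs = 'transcript_id "g39.t1"; gene_id "g39";'
sample = [
    "\t".join(["chr01", "stringtie2utr", "five_prime_UTR", "1646899", "1646901", "1000", "-", ".", attrs]),
    "\t".join(["chr01", "stringtie2utr", "five_prime_UTR", "1656810", "1656979", "1000", "-", ".", attrs]),
    "\t".join(["chr01", "stringtie2utr", "five_prime_UTR", "1659327", "1659382", "1000", "-", ".", attrs]),
]
for key, intervals in parse_gtf_features(sample).items():
    print(key, len(intervals), "segments, overlaps:", find_overlaps(intervals))
```

In practice the lines would come from reading the output GTF file; any non-empty overlap list would point at a real problem rather than a spliced UTR.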

spoonbender76 · 2023-12-14T16:43:12Z

Thank you for your response. I'm still a bit puzzled and would appreciate further clarification. As I understand it (I could be wrong), a transcript should have a single continuous 5' UTR, starting at the beginning of the transcript and ending just before the start codon, and similarly a single continuous 3' UTR, beginning right after the stop codon and extending to the end of the transcript. Does this situation mean that transcript variants have different UTRs (I'm not sure whether they really exist or are assembly artifacts) and that these UTRs are all added to the annotation? Or are these multiple 5' UTRs just parts of one large 5' UTR? Should I keep only one 5' UTR and one 3' UTR, or is it okay to leave them as they are?

KatharinaHoff · 2023-12-14T17:00:11Z

In eukaryotes, UTRs can be spliced. Less frequently so in the 3'UTR, but it also happens there.

This is not to say that all the stringtie assemblies and all the genes are correct. Everything in structural genome annotation may contain errors.
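To illustrate how a spliced UTR ends up as several five_prime_UTR or three_prime_UTR lines, here is a minimal sketch. It is not the stringtie2utr.py implementation, and the function name `split_utr_segments` is hypothetical. It takes the exons of an assembled transcript and cuts away the part covered by the coding region. The exon coordinates are an assumption: they are reconstructed from the g39.t1 example by joining each UTR segment with the CDS segment it touches, since the original StringTie exons are not shown in this thread.

```python
def split_utr_segments(exons, cds_start, cds_end, strand):
    """Split transcript exons into UTR segments around the coding region.

    exons: list of (start, end), 1-based inclusive.
    cds_start, cds_end: leftmost and rightmost coding coordinates,
    including the start and stop codons.
    """
    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exon part lying to the left of the coding region.
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Exon part lying to the right of the coding region.
            right.append((max(start, cds_end + 1), end))
    if strand == "+":
        return {"five_prime_UTR": left, "three_prime_UTR": right}
    # On the minus strand the 5' end is on the right.
    return {"five_prime_UTR": right, "three_prime_UTR": left}


# Assumed exons for part of g39.t1 (minus strand), see the note above.
exons = [
    (1630424, 1630588),
    (1641347, 1641483),
    (1643630, 1643765),
    (1646727, 1646901),
    (1656810, 1656979),
]
segments = split_utr_segments(exons, 1641474, 1646898, "-")
for feature, parts in segments.items():
    print(feature, parts)
```

Every exon that lies entirely outside the coding region becomes one UTR segment, and an exon that straddles the start or stop codon contributes only its non-coding part. A transcript with many non-coding exons upstream of the start codon therefore gets many five_prime_UTR lines that together form one spliced 5' UTR.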

ChuanzhengWei · 2023-12-15T02:53:01Z

I guess the issue arose because I used transcriptome data from different varieties of the same species (I did not sequence the transcriptome of the material whose genome I sequenced). After read mapping, the boundaries of transcripts of the same gene may differ. Of course, this is just speculation, and I haven't checked it in IGV.

KatharinaHoff · 2023-12-15T05:24:12Z

UTRs inferred from evidence often look different from reference annotation UTRs and from evidence in an independent experiment.

KatharinaHoff · 2023-12-22T13:45:07Z

I will close this issue because I believe there is nothing wrong with the software.

KatharinaHoff self-assigned this Dec 14, 2023

KatharinaHoff added the "question" label (Further information is requested) Dec 14, 2023

KatharinaHoff closed this as completed Dec 22, 2023
