No stop codons in spaln.gff #13

Open
cdmmoeller opened this issue Aug 16, 2020 · 8 comments

Comments

@cdmmoeller

Dear Thomas,

I am trying to run ProtHint on an assembly of a plant genome (~13k contigs, ~1.9Gbp). As a protein set, I used the OrthoDB Viridiplantae set (https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz) supplemented by proteins from four more organisms of the same taxonomic family (Compositae).

My command looks like this: ~/software/ProtHint-2.5.0/bin/prothint.py --threads 4 ~/projects/ascaba/genomes/Ascaber/assembly/Ascaber_100k.fa.masked /home/christian/projects/ascaba/BRAKER_protein_sets/plant.proteins.fasta

ProtHint fails to complete the run and prints the following: [Sun Aug 16 21:20:35 2020] error: ProtHint exited due to an error in command: grep stop_codon Spaln/spaln.gff > stops.gff || [[ $? == 1 ]]

So I checked spaln.gff and indeed there are no stop codons, only CDS and intron features. Moreover, the file seems surprisingly small (184 lines). I am not sure what this is caused by.
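One quick way to confirm which feature types a hints file contains is to tally its third column. The following is a minimal Python sketch, not part of ProtHint; the function name `count_features` is made up for illustration, and it assumes a tab-separated GFF/GTF file with nine columns.

```python
from collections import Counter


def count_features(gff_lines):
    """Count entries per feature type (column 3) in GFF/GTF lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/pragma lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[fields[2]] += 1
    return counts
```

Calling `count_features(open("Spaln/spaln.gff"))` and printing the result would show at a glance whether `stop_codon` entries are missing and how many `CDS` and `intron` entries exist.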

Best,
Christian

@tomasbruna
Contributor

Hello Christian,

as you correctly found, the real problem here is that there are almost no hints in spaln.gff; there should be tens of thousands of lines.

My suspicion is that something went wrong with GeneMark-ES, the step of ProtHint which generates gene seeds for alignment. Let's check that first.

Can you check how many genes are in the GeneMark_ES/genemark.gtf file, and whether the gene predictions in that file make at least some sense?

Can you also share the full ProtHint log?

Thanks!
Tomas

@cdmmoeller
Author

Hi Tomas,

Thanks for taking the time to help.

In genemark.gtf there are 5696 unique genes. This seems quite low, right?

Some basic stats about the genemark.gtf file: Average of ~15 exons/gene. Mean exon length: 117bp. Mean intron length: 103bp.
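For reference, statistics like these can be derived from the `exon` and `intron` records of a GTF file. This is a rough Python sketch under stated assumptions: the file is tab-separated, gene membership is read from the `gene_id` attribute, and lengths are computed as `end - start + 1`. The function name `gtf_stats` is hypothetical, and this is not necessarily how the numbers above were obtained.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def gtf_stats(gtf_lines):
    """Summarise gene count, exons per gene, and mean exon/intron length."""
    exon_lengths, intron_lengths = [], []
    exons_per_gene = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        length = int(fields[4]) - int(fields[3]) + 1
        if fields[2] == "exon":
            exon_lengths.append(length)
            exons_per_gene[match.group(1)] += 1
        elif fields[2] == "intron":
            intron_lengths.append(length)
    return {
        "genes": len(exons_per_gene),
        "exons_per_gene": _mean(list(exons_per_gene.values())),
        "mean_exon_length": _mean(exon_lengths),
        "mean_intron_length": _mean(intron_lengths),
    }
```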

For your information, here are the first two genes of genemark.gtf. Besides the first line (a single stop codon following an intron), I don't see any outstanding issues.

ctg13   GeneMark.hmm    exon    805     807     0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    stop_codon      805     807     .       -       0       gene_id "1_g"; transcript_id "1_t"; count "1_1";
ctg13   GeneMark.hmm    CDS     805     807     .       -       0       gene_id "1_g"; transcript_id "1_t"; cds_type "Terminal"; count "7_7";
ctg13   GeneMark.hmm    intron  808     1214    0       -       0       gene_id "1_g"; transcript_id "1_t"; count "5_6";
ctg13   GeneMark.hmm    exon    1215    1342    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     1215    1342    .       -       2       gene_id "1_g"; transcript_id "1_t"; cds_type "Internal"; count "6_7";
ctg13   GeneMark.hmm    intron  1343    1363    0       -       1       gene_id "1_g"; transcript_id "1_t"; count "4_6";
ctg13   GeneMark.hmm    exon    1364    1796    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     1364    1796    .       -       0       gene_id "1_g"; transcript_id "1_t"; cds_type "Internal"; count "5_7";
ctg13   GeneMark.hmm    intron  1797    1862    0       -       0       gene_id "1_g"; transcript_id "1_t"; count "3_6";
ctg13   GeneMark.hmm    exon    1863    2386    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     1863    2386    .       -       2       gene_id "1_g"; transcript_id "1_t"; cds_type "Internal"; count "4_7";
ctg13   GeneMark.hmm    intron  2387    2415    0       -       1       gene_id "1_g"; transcript_id "1_t"; count "2_6";
ctg13   GeneMark.hmm    exon    2416    2518    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     2416    2518    .       -       0       gene_id "1_g"; transcript_id "1_t"; cds_type "Internal"; count "3_7";
ctg13   GeneMark.hmm    intron  2519    2591    0       -       0       gene_id "1_g"; transcript_id "1_t"; count "1_6";
ctg13   GeneMark.hmm    exon    2592    2782    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     2592    2782    .       -       2       gene_id "1_g"; transcript_id "1_t"; cds_type "Internal"; count "2_7";
ctg13   GeneMark.hmm    intron  2783    2938    0       -       1       gene_id "1_g"; transcript_id "1_t"; count "0_6";
ctg13   GeneMark.hmm    exon    2939    2945    0       -       .       gene_id "1_g"; transcript_id "1_t";
ctg13   GeneMark.hmm    CDS     2939    2945    .       -       0       gene_id "1_g"; transcript_id "1_t"; cds_type "Initial"; count "1_7";
ctg13   GeneMark.hmm    start_codon     2943    2945    .       -       0       gene_id "1_g"; transcript_id "1_t"; count "1_1";
ctg13   GeneMark.hmm    exon    3027    3053    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    start_codon     3027    3029    .       +       0       gene_id "2_g"; transcript_id "2_t"; count "1_1";
ctg13   GeneMark.hmm    CDS     3027    3053    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Initial"; count "1_17";
ctg13   GeneMark.hmm    intron  3054    3123    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "1_16";
ctg13   GeneMark.hmm    exon    3124    3263    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3124    3263    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "2_17";
ctg13   GeneMark.hmm    intron  3264    3297    0       +       2       gene_id "2_g"; transcript_id "2_t"; count "2_16";
ctg13   GeneMark.hmm    exon    3298    3446    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3298    3446    .       +       1       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "3_17";
ctg13   GeneMark.hmm    intron  3447    3472    0       +       1       gene_id "2_g"; transcript_id "2_t"; count "3_16";
ctg13   GeneMark.hmm    exon    3473    3546    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3473    3546    .       +       2       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "4_17";
ctg13   GeneMark.hmm    intron  3547    3589    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "4_16";
ctg13   GeneMark.hmm    exon    3590    3644    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3590    3644    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "5_17";
ctg13   GeneMark.hmm    intron  3645    3712    0       +       1       gene_id "2_g"; transcript_id "2_t"; count "5_16";
ctg13   GeneMark.hmm    exon    3713    3762    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3713    3762    .       +       2       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "6_17";
ctg13   GeneMark.hmm    intron  3763    3838    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "6_16";
ctg13   GeneMark.hmm    exon    3839    3931    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3839    3931    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "7_17";
ctg13   GeneMark.hmm    intron  3932    3951    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "7_16";
ctg13   GeneMark.hmm    exon    3952    3998    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     3952    3998    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "8_17";
ctg13   GeneMark.hmm    intron  3999    4345    0       +       2       gene_id "2_g"; transcript_id "2_t"; count "8_16";
ctg13   GeneMark.hmm    exon    4346    4519    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     4346    4519    .       +       1       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "9_17";
ctg13   GeneMark.hmm    intron  4520    4633    0       +       2       gene_id "2_g"; transcript_id "2_t"; count "9_16";
ctg13   GeneMark.hmm    exon    4634    4745    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     4634    4745    .       +       1       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "10_17";
ctg13   GeneMark.hmm    intron  4746    4815    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "10_16";
ctg13   GeneMark.hmm    exon    4816    4963    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     4816    4963    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "11_17";
ctg13   GeneMark.hmm    intron  4964    5022    0       +       1       gene_id "2_g"; transcript_id "2_t"; count "11_16";
ctg13   GeneMark.hmm    exon    5023    5186    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5023    5186    .       +       2       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "12_17";
ctg13   GeneMark.hmm    intron  5187    5238    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "12_16";
ctg13   GeneMark.hmm    exon    5239    5319    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5239    5319    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "13_17";
ctg13   GeneMark.hmm    intron  5320    5354    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "13_16";
ctg13   GeneMark.hmm    exon    5355    5495    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5355    5495    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "14_17";
ctg13   GeneMark.hmm    intron  5496    5516    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "14_16";
ctg13   GeneMark.hmm    exon    5517    5699    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5517    5699    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "15_17";
ctg13   GeneMark.hmm    intron  5700    5722    0       +       0       gene_id "2_g"; transcript_id "2_t"; count "15_16";
ctg13   GeneMark.hmm    exon    5723    5801    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5723    5801    .       +       0       gene_id "2_g"; transcript_id "2_t"; cds_type "Internal"; count "16_17";
ctg13   GeneMark.hmm    intron  5802    5827    0       +       1       gene_id "2_g"; transcript_id "2_t"; count "16_16";
ctg13   GeneMark.hmm    exon    5828    5865    0       +       .       gene_id "2_g"; transcript_id "2_t";
ctg13   GeneMark.hmm    CDS     5828    5865    .       +       2       gene_id "2_g"; transcript_id "2_t"; cds_type "Terminal"; count "17_17";
ctg13   GeneMark.hmm    stop_codon      5863    5865    .       +       0       gene_id "2_g"; transcript_id "2_t"; count "1_1";

Regarding the ProtHint log, are you referring to the stdout of the ProtHint command? I'm afraid I didn't redirect it to a file. If you need it, I can rerun ProtHint to generate it.

Many thanks for your help!
Christian

@tomasbruna
Contributor

Hi Christian,

this definitely seems like a problem with GeneMark-ES: the number of genes is too low and the genes have too many introns.

Can you share more details about the genome itself? For example, how many contigs are longer than 50 Kb and how much of the sequence is soft-masked? Undermasking of repeats can cause issues with large genomes such as yours.
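Both numbers asked for here can be read directly from the FASTA file, since soft-masked bases are stored in lowercase. Below is a minimal Python sketch; the function names and the 50,000 bp default are illustrative choices, not part of ProtHint or GeneMark-ES, and for a multi-gigabase assembly a dedicated tool will be much faster.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masking_report(lines, min_length=50_000):
    """Count contigs longer than min_length and the soft-masked fraction."""
    n_contigs = n_long = total = soft = 0
    for _, seq in read_fasta(lines):
        n_contigs += 1
        total += len(seq)
        # Lowercase letters mark soft-masked (repeat) bases.
        soft += sum(1 for base in seq if base.islower())
        if len(seq) > min_length:
            n_long += 1
    return {
        "contigs": n_contigs,
        "contigs_longer_than_min": n_long,
        "total_bases": total,
        "soft_masked_fraction": soft / total if total else 0.0,
    }
```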

To debug issues with GeneMark-ES, it will be helpful to run it separately and report what the log files and standard output say. You can run it like this:

gmes_petap.pl --verbose --cores 4 --ES --seq Ascaber_100k.fa.masked --soft auto

Specifically, the contents of the standard output and of the gmes.log file could be useful for determining why it fails.

Best,
Tomas

@cdmmoeller
Author

Hi Tomas,

The total assembly is ~220k contigs, with a total size of ~7Gb. Due to limited computing resources and time, I have selected as a subset all contigs of length >100kb (~13k contigs), with a combined length of ~1.9Gb, for this primary analysis.

For repeat masking, RepeatModeler was run on a subset of these contigs (to limit runtime) to create a custom repeat library. Using this library and the Dfam library, 70% of the bases of the selected contigs were masked (~42% LTR elements, ~27% unclassified). RepeatMasker was run with the -nolow option to avoid masking low complexity regions.

I have now run GeneMark-ES as you suggested. gmes.log is big (18M), so I have attached it here:
gmes.log
Here is the standard output:

check before run
create directories
commit input data
soft_mask is in the 'auto' mode. soft_mask was set to: 100
data report
commit training data
training data report
prepare initial model
get GC of sequence
GC 37
build initial ES model
running step ES_A
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_A_1
training level ES_A: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_A_1
Initial	3656
Internal	10973
Terminal	3790
Single	120
Intron:	14946	30690277
Intergenic:	1920	5346220
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_A_2
training level ES_A: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_A_2
Initial	3434
Internal	16281
Terminal	3527
Single	170
Intron:	19877	9654638
Intergenic:	2941	2146290
running step ES_B
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_B_1
training level ES_B: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_B_1
Initial	2611
Internal	12060
Terminal	2663
Single	116
Intron:	14785	9031638
Intergenic:	2186	2524947
running step ES_C
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_1
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_1
Initial	2992
Internal	27363
Terminal	3015
Single	36
Intron:	30455	7048146
Intergenic:	2437	2541647
SvsM  0.00935388879265547 vs 0.990646111207345
genes in IvsT: 2994
1, -3.78485785980347
2, -2.69609786757837
3, -2.32419295596251
4, -2.19023503315451
5, -2.35891866733634
6, -2.36245849404146
7, -2.71609853428504
8, -2.72625090574906
9, -3.06989163184888
10, -3.04152093471967
11, -3.30388519918716
12, -3.39919537899148
13, -3.39919537899148
14, -3.7003004717754
15, -3.74168568793826
16, -4.07253993225525
17, -4.26669594669621
18, -4.24316544928601
19, -4.47800504036341
20, -4.34080391884993
21, -4.74626902695809
22, -4.60316818331742
23, -5.43941620751804
24, -5.36530823536432
25, -5.05992658581313
26, -5.23177684273979
27, -5.11399380708341
28, -5.11399380708341
29, -5.23177684273979
31, -5.6064702921812
32, -5.70178047198553
IvsT  0.892549410494127 vs 0.107450589505873
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_2
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_2
Initial	3402
Internal	38955
Terminal	3440
Single	9
Intron:	42448	6775199
Intergenic:	2432	2127119
SvsM  0.00249337696743026 vs 0.99750662303257
genes in IvsT: 3365
1, -5.0766608043554
2, -3.91649062268786
3, -3.11723693613337
4, -2.50441214441226
5, -2.34353091885617
6, -2.40415554067261
7, -2.50805513569076
8, -2.59972232421658
9, -2.75054521395117
10, -2.93939969178674
11, -3.12397096831471
12, -3.11054794798257
13, -3.42070287628641
14, -3.43905201495461
15, -3.50606272523757
16, -3.61032373556198
17, -3.7904499017925
18, -3.88707673748157
19, -4.1699395234974
20, -4.01030937790552
21, -4.33699360816057
22, -4.38351362379546
23, -4.6554473392791
24, -4.5102653294346
25, -4.68719603759368
26, -5.03014078872051
27, -5.48212591246357
28, -5.17674426291239
29, -5.23081148418266
30, -5.0766608043554
31, -5.41313304097662
32, -5.41313304097662
33, -5.63627659229083
34, -5.72328796928046
35, -5.81859814908478
IvsT  0.913752085900204 vs 0.0862479140997964
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_3
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_3
Initial	3562
Internal	48746
Terminal	3584
Single	3
Intron:	52386	6680378
Intergenic:	2365	1446607
SvsM  0.000906070673512534 vs 0.999093929326487
genes in IvsT: 3492
1, -5.26785815906333
2, -4.79093408697302
3, -3.85416482375532
4, -3.25295513852106
5, -2.69864440281533
6, -2.60915383206427
7, -2.42813013398592
8, -2.5937095096368
9, -2.66929219080281
10, -2.80164364228748
11, -2.96527306606928
12, -3.07063358172711
13, -3.07682555197503
14, -3.18841661738349
15, -3.47609868983527
16, -3.50426956680197
17, -3.81442449510581
18, -3.8677704758111
19, -3.96857517493307
20, -3.84074180342318
21, -3.93872221178339
22, -3.78878206449247
23, -4.35156742718917
24, -4.30808231524943
25, -4.37404028304123
26, -4.39702980126593
27, -4.60288185547008
28, -4.57471097850338
29, -4.90013337893801
30, -5.51917258734423
31, -5.26785815906333
32, -5.06718746360118
33, -5.21379093779305
34, -5.06718746360118
36, -5.26785815906333
37, -5.51917258734423
40, -5.85564482396545
42, -5.85564482396545
IvsT  0.934407364190733 vs 0.0655926358092666
running gm.hmm on local multi-core system
7314 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_4
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES/run/ES_C_4
Initial	3659
Internal	57967
Terminal	3685
Single	3
Intron:	61704	6846620
Intergenic:	2361	1284676
SvsM  0.000887442685993196 vs 0.999112557314007
genes in IvsT: 3594
1, -5.47897086624129
2, -5.41443234510372
3, -4.40283143342524
4, -3.72111294868892
5, -3.3587073300412
6, -2.90381733860552
7, -2.62633943632798
8, -2.63019300564397
9, -2.5849022464638
10, -2.78885836582575
11, -2.8787533699423
12, -2.92952569531572
13, -3.05712235242043
14, -3.11184725210968
15, -3.19658848056477
16, -3.39123052174676
17, -3.57190055050224
18, -3.63314417574296
19, -3.78030182007925
20, -4.14396979950895
21, -3.85628772705717
22, -4.07614720317019
23, -3.95291456274624
24, -4.38035857757318
25, -4.14396979950895
26, -4.16166937660836
27, -4.38035857757318
28, -4.52345942121386
29, -4.66066054272734
30, -4.81972523735703
31, -4.66066054272734
32, -4.9681452424753
33, -4.92892452932202
34, -4.92892452932202
35, -5.35380772328729
36, -5.00896723699556
37, -5.41443234510372
38, -5.29664930944734
39, -5.41443234510372
41, -5.78912579454513
43, -5.47897086624129
45, -5.78912579454513
46, -5.78912579454513
53, -5.7021144175555
IvsT  0.946514457051644 vs 0.0534855429483558
predict final gene set
running gm.hmm on local multi-core system
64370 contigs in training
64370 contigs in training
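The per-iteration exon counts in this output are easier to compare once tabulated. The sketch below is a hypothetical Python helper (names such as `parse_training` are made up here) that assumes the stdout layout shown above, where each `training level` line is followed by `Initial`, `Internal`, `Terminal` and `Single` counts. The ratio of internal to initial exons is used only as a rough proxy for internal exons per multi-exon gene.

```python
import re

LEVEL = re.compile(r"^training level (\w+):\s*(\S+)$")
COUNT = re.compile(r"^(Initial|Internal|Terminal|Single)\s+(\d+)$")


def parse_training(stdout_lines):
    """Collect exon-type counts for each training iteration."""
    rounds = []
    for line in stdout_lines:
        line = line.strip()
        level = LEVEL.match(line)
        if level:
            # The last path component (e.g. ES_C_4) names the iteration.
            run = level.group(2).rstrip("/").rsplit("/", 1)[-1]
            rounds.append({"level": level.group(1), "run": run})
            continue
        count = COUNT.match(line)
        if count and rounds:
            rounds[-1][count.group(1)] = int(count.group(2))
    return rounds


def internal_per_initial(training_round):
    """Internal exons per initial exon (rough proxy, see text above)."""
    initial = training_round.get("Initial", 0)
    return training_round.get("Internal", 0) / initial if initial else 0.0
```

Applied to the output above, the ratio grows from about 3 in ES_A_1 (10973 / 3656) to almost 16 in ES_C_4 (57967 / 3659), while the number of single-exon genes falls from 120 to 3. That trend appears consistent with the "too many introns" symptom noted earlier in the thread.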

Thanks,
Christian

@tomasbruna
Contributor

Hello Christian,

thanks for the useful report.

> RepeatMasker was run with the -nolow option to avoid masking low complexity regions.

This could actually be the cause of this problem. In large genomes, low complexity repeats can "confuse" GeneMark-ES and make it converge to rather repeat-rich regions instead of real genes. I have seen the same behavior before when low-complexity repeats were not properly masked (on a similar-sized genome of X. tropicalis) so I hope this could also be the solution to your issue.

You do not have to worry about genes being missed or split due to masking when they contain a short low-complexity stretch. With the option --soft auto, only repeats longer than 100 bp are actually hard-masked within GeneMark-ES. This threshold can be changed if you provide a number instead of auto, but we've had the best experience with using the threshold 100 in large genomes.
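The length threshold described here can be pictured as follows. This is only an illustration of the rule as stated in this comment, not GeneMark-ES code: replacing masked runs with `N`, the strict "longer than" comparison, leaving shorter runs untouched, and the function name are all assumptions made for the sketch.

```python
import re

LOWERCASE_RUN = re.compile(r"[a-z]+")


def hard_mask_long_repeats(seq, threshold=100):
    """Replace soft-masked runs longer than threshold with N."""

    def replace(match):
        run = match.group(0)
        # Short soft-masked runs are kept, so a brief low-complexity
        # stretch inside a gene does not interrupt it.
        return "N" * len(run) if len(run) > threshold else run

    return LOWERCASE_RUN.sub(replace, seq)
```

With `threshold=3`, the sequence `ACgtACgtacAC` becomes `ACgtACNNNNAC`: the two-base run survives, the four-base run is hard-masked.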

Also, are you hard-masking or soft-masking the repeats when running RepeatMasker? Please make sure to use soft-masking; hard-masking would disrupt many coding genes, especially with the recommended low-complexity masking.

Best,
Tomas

@cdmmoeller
Author

Hi Tomas,

I have now masked the genome again, this time without the -nolow option. The genome was soft-masked, as in the earlier runs.

It seems however that not that many additional bases were masked with this option (Simple repeats: 0.80%, Low complexity: 0.18%). Am I right that this is quite low?

With this newly masked genome I ran GeneMark-ES again. 5653 unique genes are predicted, with an average of 13 introns per gene, which is quite similar to the previous run. Could it be that I'm not correctly masking low complexity regions? Here are
gmes.log and the stdout:

check before run
create directories
commit input data
soft_mask is in the 'auto' mode. soft_mask was set to: 100
data report
commit training data
training data report
prepare initial model
get GC of sequence
GC 37
build initial ES model
running step ES_A
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_A_1
training level ES_A: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_A_1
Initial	3608
Internal	10900
Terminal	3732
Single	117
Intron:	14816	30415127
Intergenic:	1898	5300461
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_A_2
training level ES_A: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_A_2
Initial	3394
Internal	16091
Terminal	3487
Single	165
Intron:	19646	9680977
Intergenic:	2903	2185213
running step ES_B
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_B_1
training level ES_B: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_B_1
Initial	2601
Internal	11931
Terminal	2649
Single	110
Intron:	14640	9051719
Intergenic:	2185	2561882
running step ES_C
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_1
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_1
Initial	2986
Internal	27416
Terminal	3004
Single	39
Intron:	30497	6935176
Intergenic:	2442	2479781
SvsM  0.010752688172043 vs 0.989247311827957
genes in IvsT: 2993
1, -3.79933888846173
2, -2.72082777911471
3, -2.37641039416206
4, -2.17214903056918
5, -2.27393172487912
6, -2.43187747567493
7, -2.75175807980607
8, -2.72591684862218
9, -2.89204371949616
10, -3.21653976507065
11, -3.2678330594582
12, -3.32190028072848
13, -3.49317200133585
14, -3.72736538883664
15, -4.05278778927127
16, -3.81437676582627
17, -4.39311359520847
18, -4.29045944114839
19, -4.24283139215914
20, -4.42051256939659
21, -4.67182699767749
22, -4.7851556829845
23, -5.29598130675049
24, -5.29598130675049
25, -4.86853729192355
26, -5.05959252868626
27, -5.11365974995653
28, -5.17081816379648
29, -5.36497417823744
30, -5.60613623505433
31, -5.70144641485865
36, -5.60613623505433
IvsT  0.896341787738525 vs 0.103658212261475
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_2
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_2
Initial	3466
Internal	39178
Terminal	3494
Single	9
Intron:	42727	6517737
Intergenic:	2506	1988121
SvsM  0.00214592274678112 vs 0.997854077253219
genes in IvsT: 3456
1, -5.50880980030869
2, -3.70521587343363
3, -3.02987331750719
4, -2.52024601623331
5, -2.37331558437954
6, -2.39212491633703
7, -2.51665534810258
8, -2.67980698878881
9, -2.72291711244254
10, -2.95491027903374
11, -3.07896292770371
12, -3.19204007232269
13, -3.54269694393585
14, -3.33568277455153
15, -3.54269694393585
16, -3.74114788265969
17, -3.70521587343363
18, -3.94317451053298
19, -4.17757521637182
20, -4.21604149719962
21, -4.34120464015363
22, -4.31922573343485
23, -4.65135956845747
24, -4.62150660530778
25, -4.4843054837943
26, -5.15213485636996
27, -5.15213485636996
28, -5.25749537202778
29, -5.25749537202778
30, -5.37527840768417
31, -5.37527840768417
32, -5.37527840768417
33, -5.8452820369299
34, -5.43981692882174
35, -5.74997185712558
36, -5.8452820369299
39, -5.66296048013595
41, -5.8452820369299
IvsT  0.917045770942256 vs 0.0829542290577441
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_3
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_3
Initial	3517
Internal	48092
Terminal	3537
Single	0
Intron:	51684	6570545
Intergenic:	2384	1527004
SvsM  0 vs 1
genes in IvsT: 3445
2, -4.7434818017856
3, -3.9105726788505
4, -3.23202429771171
5, -2.62722628698305
6, -2.56873008030144
7, -2.39528619753951
8, -2.62722628698305
9, -2.58785112174822
10, -2.90293216838812
11, -3.01478046852469
12, -3.08843337809945
13, -3.01478046852469
14, -3.32439761784272
15, -3.53950899745967
16, -3.39974705508451
17, -3.76265254877388
18, -3.82719106991145
19, -3.95502444142133
20, -4.01754479840267
21, -3.85421974229937
22, -4.17438726989564
23, -4.06714173954204
24, -4.3834790677542
25, -4.3604895495295
26, -4.43110711674345
27, -4.50709302372137
28, -4.64817162198128
29, -4.96662535309981
30, -5.05363673008944
31, -5.05363673008944
32, -4.88658264542628
33, -5.31146583939154
34, -5.5056218538325
35, -5.5056218538325
36, -5.65977253365976
37, -5.5056218538325
40, -5.84209409045371
48, -5.84209409045371
IvsT  0.92530320969649 vs 0.0746967903035102
running gm.hmm on local multi-core system
7317 contigs in training
concatenate predictions: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_4
training level ES_C: /home/christian/software/BRAKER/Ascaber100k_GenemarkES_new/run/ES_C_4
Initial	3624
Internal	56323
Terminal	3647
Single	0
Intron:	60018	6594022
Intergenic:	2333	1399435
SvsM  0 vs 1
genes in IvsT: 3551
2, -5.53592720332783
3, -4.43731491465972
4, -3.59001705427252
5, -3.27714473299218
6, -2.88671750224855
7, -2.54019492977384
8, -2.61046412562039
9, -2.58399755243223
10, -2.73256682242129
11, -2.96549838010167
12, -2.99883480036926
13, -2.993200982651
14, -3.10608033072286
15, -3.09981071770926
16, -3.45648566164799
17, -3.60027355443971
18, -3.8311791110894
19, -3.85749641940678
20, -3.95547682776698
21, -3.94087802834583
22, -4.047850147898
23, -4.08063997072099
24, -4.13193326510854
25, -4.16765134771062
26, -4.32483693123303
27, -4.51142288681344
28, -4.32483693123303
29, -4.67847697147661
30, -4.84278002276788
31, -4.87914766693876
32, -4.67847697147661
33, -5.28461277504692
34, -5.08394207958477
35, -5.40239581070331
36, -5.08394207958477
37, -5.23054555377665
38, -5.34177118888687
39, -5.87239943994904
42, -5.53592720332783
43, -5.69007788315509
45, -5.77708926014472
46, -5.87239943994904
49, -5.77708926014472
IvsT  0.936642037319033 vs 0.0633579626809669
predict final gene set
running gm.hmm on local multi-core system
64378 contigs in training
64378 contigs in training

Thank you for your continued support,
Christian

@tomasbruna
Contributor

Hello Christian,

the amount of simple repeats indeed seems low.

You could try to do simple repeat masking separately, outside of RepeatMasker, with Tandem Repeats Finder (TRF). RepeatMasker internally runs TRF as well, but you get more control over the results when you run it on your own. We described this procedure on page 15 of the Supplemental Materials of this document: https://www.biorxiv.org/content/10.1101/2020.08.10.245134v1.supplementary-material
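Once tandem-repeat coordinates are available, adding them to an already soft-masked sequence amounts to lowercasing the corresponding intervals. The Python sketch below shows only that final step and does not reproduce the linked procedure. It assumes the repeat intervals have already been extracted from the TRF output and converted to 0-based, half-open `(start, end)` pairs; the function name is hypothetical.

```python
def soft_mask(seq, intervals):
    """Lowercase the given 0-based, half-open intervals of seq."""
    masked = seq
    for start, end in intervals:
        # Clamp to the sequence bounds so stray coordinates are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        if start < end:
            masked = masked[:start] + masked[start:end].lower() + masked[end:]
    return masked
```

Existing lowercase bases are left as they are, so repeats masked earlier by RepeatMasker stay masked.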

I will also ask Alexandre Lomsadze, the original author of GeneMark-ES, to take a look at this issue; maybe he will have other ideas about what is wrong.

Best,
Tomas

@cdmmoeller
Author

Hi Tomas,

Thanks for this document. I will try following the approach you linked, and then run GeneMark-ES again. I will let you know how it goes.

Best,
Christian
