Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failing from flye assembly graph of a phage genome with 4 circular paths #319

Closed
splaisan opened this issue Mar 15, 2024 · 5 comments
Closed

Comments

@splaisan
Copy link

splaisan commented Mar 15, 2024

app installed with conda on ubuntu server
I then defined GETORG_PATH and ran get_organelle_config.py --add embplant_pt,embplant_mt with success

Hi,

I want to obtain the 4 possible circular paths from my 6 contig assembly (flye) to validate which ones are real.
I could do this with bandage easily by hand but would like to automate the process in a pipeline, for which I need some automation.

I was told (pangenome/odgi#563 (comment)) that this software should do the trick but my attempt failed (see below)

Can you please help me figure aout what is wrong here

thanks in advance

get_organelle_from_assembly.py -F anonym -g ./assembly_graph.gfa -o test -t 48 --no-slim

GetOrganelle v1.7.7.0

get_organelle_from_assembly.py isolates organelle genomes from assembly graph.
Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10)  [GCC 12.3.0]
PLATFORM: Linux myhost 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64
PYTHON LIBS: GetOrganelleLib 1.7.7.0; numpy 1.26.4; sympy 1.12; scipy 1.11.4
DEPENDENCIES: Bandage 0.9.0
GETORG_PATH=/opt/miniconda3/envs/flye/.GetOrganelle
WORKING DIR: /data/graph2fasta
/opt/miniconda3/envs/flye/bin/get_organelle_from_assembly.py -F anonym -g ./assembly_graph.gfa -o test -t 48 --no-slim

Traceback (most recent call last):
  File "/opt/miniconda3/envs/flye/bin/get_organelle_from_assembly.py", line 1072, in <module>
    main()
  File "/opt/miniconda3/envs/flye/bin/get_organelle_from_assembly.py", line 958, in main
    options, log_handler = get_options(description=title, version=get_versions())
  File "/opt/miniconda3/envs/flye/bin/get_organelle_from_assembly.py", line 453, in get_options
    ref_seqs = read_fasta(options.genes_fasta[got_t])[1]
IndexError: list index out of range

my flye asm gfa has the following content (truncated at 40 char)

H	VN:Z:1.0
S	1	CTATCTTGGTTCCACAAATCTCATTACACCAATATT
S	2	AAGTGGGAGTGGATTAACAGAAATGGCCCCGTACGG
S	3	AAGTGGGGGTGGAGTAAGAGAAATTGCCCCGTACTG
S	4	TAATTGAGTTCCGTGTTCCGGGCAGCACCACCACTG
S	5	GCACTCGAACGACGAAGTAAAGAACGCGAAAAAGCG
S	6	CACTCTGACAATTCGTTGATCAAGTCACGGTATTTA
L	1	+	6	-	0M	RC:i:37
L	1	-	5	+	0M	RC:i:43
L	2	+	6	+	0M	RC:i:96
L	2	-	5	-	0M	RC:i:91
L	3	+	6	+	0M	RC:i:60
L	3	-	5	-	0M	RC:i:58
L	4	-	5	+	0M	RC:i:108
L	4	+	6	-	0M	RC:i:115
P	contig_5	5+	*
P	contig_6	6+	*
P	contig_1	1+	*
P	contig_2	2+	*
P	contig_3	3+	*
P	contig_4	4+	*
@splaisan splaisan changed the title failing to use fromassembly on a phage genome with several paths failing from flye assembly graph of a phage genome with 4 circular paths Mar 15, 2024
@splaisan
Copy link
Author

I think I am on the right track to understand this error.

A similar command also fails using the embplant_pt.K115.complete.graph1.selected_graph.gfa file produced by the tutorial when I choose -F anonym BUT it succeeds when I use the correct -F embplant_pt option

Is -F anonym not a valid argument?
What -F should I use when my gfa file is not an organnelle but a random genome like in my phage example?

get_organelle_from_assembly.py -F embplant_pt -g ./embplant_pt.K115.complete.graph1.selected_graph.gfa -o test -t 48 --no-slim

GetOrganelle v1.7.7.0

get_organelle_from_assembly.py isolates organelle genomes from assembly graph.
Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10)  [GCC 12.3.0]
PLATFORM: Linux gbw-s-pacbio01 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64
PYTHON LIBS: GetOrganelleLib 1.7.7.0; numpy 1.26.4; sympy 1.12; scipy 1.11.4
DEPENDENCIES: Bandage 0.9.0
GETORG_PATH=/opt/miniconda3/envs/flye/.GetOrganelle
WORKING DIR: /data/NC_projects/4637_JVandierendonck_ONT/flye_results_native/phi315_30kasm/graph2fasta/Arabidopsis_simulated.plastome
/opt/miniconda3/envs/flye/bin/get_organelle_from_assembly.py -F embplant_pt -g ./embplant_pt.K115.complete.graph1.selected_graph.gfa -o test -t 48 --no-slim

2024-03-15 15:30:37,521 - INFO: Processing assembly graph ...
2024-03-15 15:30:37,522 - INFO: Processing assembly graph finished.

2024-03-15 15:30:37,522 - INFO: Extracting embplant_pt from the assemblies ...
2024-03-15 15:30:37,522 - INFO: Disentangling test/initial_assembly_graph.gfa as a circular genome ... 
2024-03-15 15:30:37,605 - INFO: Writing output ...
2024-03-15 15:30:37,668 - WARNING: More than one circular genome structure produced ...
2024-03-15 15:30:37,668 - WARNING: Please check the final result to confirm whether they are simply different in SSC direction (two flip-flop configurations)!
2024-03-15 15:30:38,447 - INFO: Detecting large repeats (>1000 bp) in PATH1 with IRs detected, Total:LSC:SSC:Repeat(bp) = 154478:84170:17780:26264
2024-03-15 15:30:38,448 - INFO: Writing PATH1 of complete embplant_pt to test/embplant_pt.complete.graph1.1.path_sequence.fasta
2024-03-15 15:30:38,450 - INFO: Writing PATH2 of complete embplant_pt to test/embplant_pt.complete.graph1.2.path_sequence.fasta
2024-03-15 15:30:38,450 - INFO: Writing GRAPH to test/embplant_pt.complete.graph1.selected_graph.gfa
2024-03-15 15:30:38,863 - INFO: Writing GRAPH image to test/embplant_pt.complete.graph1.selected_graph.png
2024-03-15 15:30:38,863 - INFO: Result status of embplant_pt: circular genome
2024-03-15 15:30:38,863 - INFO: Writing output finished.
2024-03-15 15:30:38,863 - INFO: Extracting embplant_pt from the assemblies finished.


Total cost 2.46 s
Thank you!

@splaisan
Copy link
Author

OK, lets end here, apparently this program is only applicabel to organelles and not to other type of chromosomes

"or anonym (uncertain organelle genome type). "
                                 "The anonym should be used with customized seed and label databases "
                                 "('-s' and '--genes').")

sad as it was doing exactly what I need: produce all possible closed FASTA paths from a gfa file

@JianjunJin
Copy link
Collaborator

I apologize for the delay in responding to your issue due to prior commitments, which was further complicated by a hard drive damage, including unpushed fixes for GetOrganelle, on my Macbook. It has indeed been a challenging period for me.

That being said, regarding the problem you've encountered, it appears to stem from a straightforward bug of GetOrganelle related to the option settings. I will give this matter immediate attention and aim to provide a fix as soon as possible, hoping it will still be of assistance to you.

JianjunJin added a commit that referenced this issue Apr 3, 2024
1. fix a import bug of 'from scipy import stat, log, inf' issue (issue #132 #315)
2. fix a ZeroDivisionError bug when the estimated coverage is 0 (issue #311)
3. Disentangling failed -> Disentangling unsuccessful to avoid panic (issue #308)
4. fix a bug in parsing options when '-F anonym' is used (issue #319)
5. have max_multiplicity passed to no-slim case
6. minor adjustment
@JianjunJin
Copy link
Collaborator

Hello,

I've just released an update (version 1.7.7.1) which includes a fix for the option issue you mentioned. You can find the tagged release here. Please note that it might take some time for the update to be available through conda.
After reviewing your objectives and testing the data you provided, it appears that GetOrganelle might not be the best fit for your project. This is due to the tool's three foundational assumptions, which are detailed in our publication. These assumptions are designed for the retrieval of paths from cleaned organelle assembly graph and each path should contain all contigs, making the resulting paths to be:

>1+,6-,2-,5-,4+,6-,3-,5-(circular)
>1+,6-,3-,5-,4+,6-,2-,5-(circular)

However, based on your description, it seems you are seeking to identify four distinct paths:

>1+,6-,2-,5-(circular)
>4+,6-,3-,5-(circular)
>1+,6-,3-,5-(circular)
>,4+,6-,2-,5-(circular)

Creating a custom script to automate the extraction of four paths from the graph should be straightforward if the graph topology is fixed and your goal is fixed. Nonetheless, distinguishing between all potential candidates, including the above six ones and those not mentioned, is complex based solely on the graph. For instance, if contigs 5/6 are relatively short and have double the coverage of 1/2/3/4 and only 1/2/3/4 contain essential genes, then a longer concatenated path might seem more plausible. This illustrates why certain assumptions are necessary in the absence of additional evidence.

If you have long-read data for your sample, which it seems you might, there is a possibility to identify the most likely combination of variants, by heuristically proposing possible paths (based on the evidence of both graph and long reads) and then looking for the best model according to the distribution of long read alignments. My other tool in development, Traversome, is designed to perform such analyses, and will generate relative frequencies for the result.

Please let me know.

@splaisan
Copy link
Author

splaisan commented Apr 5, 2024

Hi @JianjunJin,
I tried traversome after creating a gaf file from my gfa with GraphAligner
I get a run which seems to succeed but ends with 3 empty sequences
I post on the traversome issue page

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants