Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALG with odp_nway_rbh fails with blast #21

Closed
tauanajc opened this issue Apr 26, 2023 · 10 comments
Closed

ALG with odp_nway_rbh fails with blast #21

tauanajc opened this issue Apr 26, 2023 · 10 comments

Comments

@tauanajc
Copy link

Dear Darrin,

I contact Oleg about doing synteny analyses similar to the 2022 metazoan paper and he told me about odp. It looks like a fantastic tool and I'm excited to be trying it.

I'm trying the second use case to get ALGs from 3 NCBI genomes, the lancelet, scallop and jellyfish. My first attempt was with diamond, but there was an error pointing to line 242 of the odp_functions.py script. I fixed it by replacing illegal = set("prot_to_loc", "prot-to-loc") with illegal = set().

Steps 1 and 2 then ran fine! However, the list of orthogroups in the reciprocal_best_hits.rbh.groupby was relatively short compared to the metazoan paper (2648 groups, instead of >6000). I decided to try blast instead of diamond, and also to increase the number of permutations (there was no example in the github, so I originally used num_permutations: 1000000, and now increased to 100000000). Running with blast quickly fails. The end of the log file is below, and I attach the config and log files:

Waiting at most 5 seconds for missing files.
MissingOutputException in rule diamond_blast_x_to_y in file /home/FM/tcunha/scripts/odp/scripts/odp_nway_rbh, line 165:
Job 9 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
odp_nway_rbh/step0-blastp_results/Bfloridae_against_Myessoensis.blastp
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-04-26T163218.427529.snakemake.log
Here is an example of the first few analyses: [['Bfloridae', 'Hvulgaris', 'Myessoensis']]
There are 1 possible combinations.

step1.log
config.yaml.txt

I would love to have it running on blast, but also any ideas about how to reproduce the >6000 groups would be appreciated!
Thanks!

@conchoecia
Copy link
Owner

Hi Tauana,

Is your only goal to recreate the analyses of Simakov et al 2022? If I know what your goals are maybe I can better answer your question.

@tauanajc
Copy link
Author

Thanks for the quick reply Darrin!
I actually have a new genome that I would like to add to the analyses. It has only 4 chromosomes, so my ultimate goal is to have a ribbon plot (let's say my species + scallop, lancelet, hydra, perhaps Celegans), and a dotplot of my species against the BLG (like the nematodes in Figure S10 of the 2022 paper), so that I can understand how the ancestral bilaterian linkage groups have evolved to form these 4 chromosomes.

@tauanajc
Copy link
Author

Hi again Darrin,
I run the macrosynteny use case and that actually gave me most of what I wanted! I thought I had to infer the ALG from scratch, but using that example gave me the dotplots against the BCnS_LGs (bilaterian linkage groups correct?) as desired. One issue is that I did set up the config.yaml parameters for genus, species, and plotorder, but it doesn't seem to have had any effect: the plots still have the abbreviation names and all of the scaffolds (instead of the 4 I specified) in some other arrangement. Any ideas of how to fix that?

Command:
snakemake --cores 16 --snakefile ../../scripts/odp/scripts/odp > minimal-ppacificus.log 2>&1 &
config.yaml.txt

Is there another parameter for the config file that would generate ribbon plots?

Thanks!

@conchoecia
Copy link
Owner

conchoecia commented Apr 27, 2023

I thought I had to infer the ALG from scratch, but using that example gave me the dotplots against the BCnS_LGs (bilaterian linkage groups correct?) as desired.

Yes, that's right - the way the software is implemented now there is a set of BCnS ALGs that each analysis with the odp script is compared to using HMMs. This isn't the same exact set of proteins as the Simakov et al. 2022 paper - this set is more restrictive in that I defined it using strict reciprocal best orthologs from more species.

One issue is that I did set up the config.yaml parameters for genus, species, and plotorder, but it doesn't seem to have had any effect: the plots still have the abbreviation names and all of the scaffolds (instead of the 4 I specified) in some other arrangement. Any ideas of how to fix that?

I think plotorder is broken at the moment due to a recent change in how chromosomes are sorted along the x-axis. It sounds like maybe the software is plotting a lot of small scaffolds, in addition to the 4 chromosome scale ones. Can you figure out your smallest scaffold that you want to be plotted, then add this bit to that species' entry? minscafsize: 5000000 will only plot scaffolds 5 Mbp or larger.

  AAU:
    proteins: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU-proteins.fa
    chrom: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU.chrom
    genome: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU-assembly.fa
    minscafsize: 5000000

Command:
snakemake --cores 16 --snakefile ../../scripts/odp/scripts/odp > minimal-ppacificus.log 2>&1 &

Is there another parameter for the config file that would generate ribbon plots?

It is a bit cumbersome at the moment to make these plots, but easier than other solutions out there. Working on making this easier though! Please look at this config file and it should be enough to run the script. Make sure you point to the directory with the BCnS_LGs .rbh files to see the provenance of your species' 4 chromosomes in relation to those ALGs.

cp ../../example_configs/CONFIG_rbh_to_subway.yaml ./
mv CONFIG_rbh_to_subway.yaml config.yaml
# make your edits to the config.yaml file
snakemake --cores 16 --snakefile ../../scripts/odp/scripts/rbh_to_subway

Thanks!

Sure, please let me know if you have suggestions for the documentation or anything else to improve the user experience.

@tauanajc
Copy link
Author

Thank you very much Darrin!
I do have some small suggestions for documentation, but I will come back for that once I'm done with the analysis.

I am running odp again on the set of species I want to use in the ribbon plot: Hydra, lancelet, drosophila, Celegans and my genome. Basically same command and config file as before, but now I am getting a new error related to colors in the plot (below). Is there a way I could bypass this?

Error in rule plot_synteny_of_ALGs_plus_species:
jobid: 0
input: odp/step1-rbh/UnicellMetazoanLgs_Aaustraliensis_reciprocal_best_hits.hmm.D.FET.rbh, odp/step0-chromsize/species/Aaustraliensis.chromsi
ze
output: odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_xy_reciprocal_best_hits.plotted.rbh, odp/step2-figures/ALG-spec
ies_plots/UnicellMetazoanLgs_Aaustraliensis_xy_synteny.pdf, odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_yx_synteny.pdf

RuleException:
ValueError in file /home/FM/tcunha/scripts/odp/scripts/odp, line 1551:
Invalid RGBA argument: '#8e7ec26'
File "/home/FM/tcunha/scripts/odp/scripts/odp", line 2132, in __rule_plot_synteny_of_ALGs_plus_species
File "/home/FM/tcunha/scripts/odp/scripts/odp", line 1551, in synteny_plot_sheet
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 806, in table
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 343, in add_cell
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 94, in init
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 714, in init
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 85, in init
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 359, in set_facecolor
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 347, in _set_facecolor
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/colors.py", line 299, in to_rgba
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/colors.py", line 374, in _to_rgba_no_colorcycle
File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job plot_synteny_of_ALGs_plus_species since they might be corrupted:
odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_xy_reciprocal_best_hits.plotted.rbh
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

@beroe
Copy link
Collaborator

beroe commented Apr 28, 2023

The color Invalid RGBA argument: '#8e7ec26' needs to be 8 digits long. It should probably have another 6 at the end.

@conchoecia
Copy link
Owner

@tauanajc - I found the source of the error. I think it was caused by some software automatically incrementing the numbers at the end of the color string in one of the ALG database files.

The easiest fix right now is to delete the files called UnicellMetazoanLgs and UnicellMetazoanLgsOrthofinder from the LG_db folder. After doing this it should work.

@tauanajc
Copy link
Author

tauanajc commented May 4, 2023

Hi Darrin, sorry for the late reply. I don't have a LG_db folder, so I removed all files with UnicellMetazoanLgs* in the entire analysis, but the run still failed (perhaps I deleted files that shouldn't have). The dot plots should be sufficient for what I need at the moment, so I will try the ribbon plots again later on. Thanks again for all the help.

How should I cite opd at the moment?
And you had mentioned that the set of species used to obtain the ALGs is slightly different than that in the Simakov 2022 paper right? Just for reference, could you please share which taxa odp uses? I am assuming the same phyla in the paper, Porifera, Cnidaria, and representatives of Bilateria.

Thanks!

@conchoecia
Copy link
Owner

conchoecia commented May 10, 2023

Hi Tauana, the folder will be where you installed odp, so odp/LG_db/UnicellMetazoan*. I think they should be there, because I the BCnS_LGs plots would not have appeared otherwise. I am guessing that you following the install instructions and ran make.

I will work on making the interface easier to work with, but for now please let me know if you would like to get the subway plots/ribbon diagrams working properly for your species. I am also working on improving the implementation of this tool to make it easier for people to use.

The taxa used to generate the BCnS_LGs database were Branchiostoma floridae, Ephydatia muelleri, Hydra vulgaris, Pecten yessoensis, and Rhopilema esculentum.

@conchoecia
Copy link
Owner

Closed the issue, but please open again with new relevant info if need be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants