-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge dev into master #1
Conversation
Cecilia-Sensalari
commented
Mar 8, 2021
- Rename terms according to those used in the preprint (documentation and code lines)
- Catch missing species in configuration file fields fasta_filenames and latin_names
Rename 'species of interest' and 'main species' in documentation RST files, function docstrings, comments in code, in config_elaeis.txt, but also in some actual code lines: in generate_config.py and fc_configfile.py ('focal_species' field), in main.nf (parse config file in setup process).
Exit if latin_names is empty or one or more species are missing. The new function check_complete_dictionary() lists the missing species in latin_names and exits. Check and warn if FASTA of GFF files are missing; exit after all the files have been checked.
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
Co-authored-by: lohausr <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Accepted suggested code blocks (they are in new commits or in a "batch" of commits)
- Pushed new changes based on your comments
- Labeled many conversations as resolved
- Responded to others
doc/source/configuration.rst
Outdated
* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species. \[yes/no\] | ||
* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species. \[yes/no\] | ||
* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "yes"] | ||
* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "no"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to use the same sentence but with a more generic GFF instead of GFF3.
README.md
Outdated
To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest. | ||
|
||
- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers. | ||
- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, let's change it! (Will be added as a new commit to this pull request).
ksrates/setup_correction.py
Outdated
gff_dict = config.get_gff_dict(warn_empty_dict=False) | ||
latin_names = config.get_latin_names() | ||
if latin_names == {}: | ||
logging.error("Exiting") | ||
sys.exit(1) | ||
# If a GFF is provided, check existence and content | ||
if species_of_interest in gff_dict: | ||
gff = config.get_gff_name(gff_dict, species_of_interest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed a dictionary with only one element, just for analogy with the FASTA files dictionary.
wgd/ks_distribution.py
Outdated
@@ -636,7 +636,7 @@ def ks_analysis_one_vs_one( | |||
:param n_threads: number of CPU cores to use | |||
:return: data frame | |||
""" | |||
# Filter families with one vs one orthologs for the species of interest. --- | |||
# Filter families with one vs one orthologs for the focal species. --- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed as Filter families with one vs one orthologs for the species pair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more things...
README.md
Outdated
To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest. | ||
|
||
- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers. | ||
- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still "species of interest"?
doc/source/configuration.rst
Outdated
* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300] | ||
* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5] | ||
* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5] | ||
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"] | |
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution(s) with non-default mixture model methods. [Default: "no"] |
Either explain the non-defaults or link to a place where they are explained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add an internal link to the Paralog analyses page of the documentation and I'll say also to check the Supplementary Materials.
ksrates/fc_configfile.py
Outdated
If the value is an empty string, it applies the fallback filename "species + .gff". | ||
|
||
:param fasta_dict: Python dictionary that associates each informal species name to the path of its FASTA file | ||
:param species: the informal name of the species of interest | ||
:param gff_dict: Python dictionary that associates the focal species informal name to the path of its GFF file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant why is a dictionary/association necessary here? It's always a single value that is always for the focal species. Using the informal name as a key is unnecessary.
ksrates/setup_correction.py
Outdated
gff_dict = config.get_gff_dict(warn_empty_dict=False) | ||
latin_names = config.get_latin_names() | ||
if latin_names == {}: | ||
logging.error("Exiting") | ||
sys.exit(1) | ||
# If a GFF is provided, check existence and content | ||
if species_of_interest in gff_dict: | ||
gff = config.get_gff_name(gff_dict, species_of_interest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See an earlier comment. I don't think the analogy here warrants the usage of a dict, seems unnecessarily convoluted.
Co-authored-by: lohausr <[email protected]>
if colinearity: # if colinearity analysis is required, load related parameters | ||
gff = config.get_gff(species) | ||
if fcCheck.check_file_nonexistent_or_empty(gff, "GFF file"): | ||
trigger_exit = True | ||
|
||
gff_feature = config.get_feature() | ||
gff_gene_attribute = config.get_attribute() | ||
if gff_feature == "": | ||
logging.error("No GFF attribute provided in configuration file.") | ||
logging.error("No GFF attribute provided in configuration file. Will exit.") | ||
trigger_exit = True | ||
if gff_gene_attribute == "": | ||
logging.error("No GFF feature provided in configuration file.") | ||
if gff_feature == "" or gff_gene_attribute == "": | ||
logging.error("Will exit the colinearity analysis.") | ||
sys.exit(1) | ||
# Checking if IDs in FASTA (and in GFF if applicable) are compatible with wgd pipeline (paml) | ||
logging.info("") | ||
if colinearity: | ||
fcCheck.check_IDs(species_fasta_file, latin_names[species], species_gff_file) | ||
else: | ||
fcCheck.check_IDs(species_fasta_file, latin_names[species]) | ||
logging.error("No GFF feature provided in configuration file. Will exit.") | ||
trigger_exit = True | ||
|
||
# Checking if FASTA file exists and if sequence IDs are compatible with wgd pipeline (paml) | ||
fasta_names_dict = config.get_fasta_dict() | ||
species_fasta_file = config.get_fasta_name(fasta_names_dict, species) | ||
if fcCheck.check_file_nonexistent_or_empty(species_fasta_file, "FASTA file"): # if missing/empty | ||
trigger_exit = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have probably left doing the FASTA file check first before the GFF file check.