Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge dev into master #1

Merged
merged 64 commits into from
Mar 18, 2021
Merged

Merge dev into master #1

merged 64 commits into from
Mar 18, 2021

Conversation

Cecilia-Sensalari
Copy link
Collaborator

  • Rename terms according to those used in the preprint (documentation and code lines)
  • Catch missing species in configuration file fields fasta_filenames and latin_names

cesen and others added 28 commits March 3, 2021 14:12
Rename 'species of interest' and 'main species' in documentation RST files,
function docstrings, comments in code, in config_elaeis.txt, but also
in some actual code lines: in generate_config.py and fc_configfile.py
('focal_species' field), in main.nf (parse config file in setup process).
Exit if latin_names is empty or one or more species are missing.
The new function check_complete_dictionary() lists the missing
species in latin_names and exits.

Check and warn if FASTA of GFF files are missing; exit after all
the files have been checked.
Copy link
Collaborator Author

@Cecilia-Sensalari Cecilia-Sensalari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Accepted suggested code blocks (they are in new commits or in a "batch" of commits)
  • Pushed new changes based on your comments
  • Labeled many conversations as resolved
  • Responded to others

* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species. \[yes/no\]
* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species. \[yes/no\]
* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "yes"]
* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "no"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to use the same sentence but with a more generic GFF instead of GFF3.

README.md Outdated
Comment on lines 35 to 37
To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest.

- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let's change it! (Will be added as a new commit to this pull request).

Comment on lines 27 to 30
gff_dict = config.get_gff_dict(warn_empty_dict=False)
latin_names = config.get_latin_names()
if latin_names == {}:
logging.error("Exiting")
sys.exit(1)
# If a GFF is provided, check existence and content
if species_of_interest in gff_dict:
gff = config.get_gff_name(gff_dict, species_of_interest)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed a dictionary with only one element, just for analogy with the FASTA files dictionary.

@@ -636,7 +636,7 @@ def ks_analysis_one_vs_one(
:param n_threads: number of CPU cores to use
:return: data frame
"""
# Filter families with one vs one orthologs for the species of interest. ---
# Filter families with one vs one orthologs for the focal species. ---
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed as Filter families with one vs one orthologs for the species pair.

Copy link
Contributor

@lohausr lohausr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more things...

README.md Outdated
Comment on lines 35 to 37
To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest.

- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still "species of interest"?

* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"]
* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution(s) with non-default mixture model methods. [Default: "no"]

Either explain the non-defaults or link to a place where they are explained.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an internal link to the Paralog analyses page of the documentation and I'll say also to check the Supplementary Materials.

If the value is an empty string, it applies the fallback filename "species + .gff".

:param fasta_dict: Python dictionary that associates each informal species name to the path of its FASTA file
:param species: the informal name of the species of interest
:param gff_dict: Python dictionary that associates the focal species informal name to the path of its GFF file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant why is a dictionary/association necessary here? It's always a single value that is always for the focal species. Using the informal name as a key is unnecessary.

Comment on lines 27 to 30
gff_dict = config.get_gff_dict(warn_empty_dict=False)
latin_names = config.get_latin_names()
if latin_names == {}:
logging.error("Exiting")
sys.exit(1)
# If a GFF is provided, check existence and content
if species_of_interest in gff_dict:
gff = config.get_gff_name(gff_dict, species_of_interest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See an earlier comment. I don't think the analogy here warrants the usage of a dict, seems unnecessarily convoluted.

Comment on lines +27 to +45
if colinearity: # if colinearity analysis is required, load related parameters
gff = config.get_gff(species)
if fcCheck.check_file_nonexistent_or_empty(gff, "GFF file"):
trigger_exit = True

gff_feature = config.get_feature()
gff_gene_attribute = config.get_attribute()
if gff_feature == "":
logging.error("No GFF attribute provided in configuration file.")
logging.error("No GFF attribute provided in configuration file. Will exit.")
trigger_exit = True
if gff_gene_attribute == "":
logging.error("No GFF feature provided in configuration file.")
if gff_feature == "" or gff_gene_attribute == "":
logging.error("Will exit the colinearity analysis.")
sys.exit(1)
# Checking if IDs in FASTA (and in GFF if applicable) are compatible with wgd pipeline (paml)
logging.info("")
if colinearity:
fcCheck.check_IDs(species_fasta_file, latin_names[species], species_gff_file)
else:
fcCheck.check_IDs(species_fasta_file, latin_names[species])
logging.error("No GFF feature provided in configuration file. Will exit.")
trigger_exit = True

# Checking if FASTA file exists and if sequence IDs are compatible with wgd pipeline (paml)
fasta_names_dict = config.get_fasta_dict()
species_fasta_file = config.get_fasta_name(fasta_names_dict, species)
if fcCheck.check_file_nonexistent_or_empty(species_fasta_file, "FASTA file"): # if missing/empty
trigger_exit = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have probably left doing the FASTA file check first before the GFF file check.

@Cecilia-Sensalari Cecilia-Sensalari merged commit a401f78 into master Mar 18, 2021
@Cecilia-Sensalari Cecilia-Sensalari deleted the dev branch March 25, 2021 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants