Merge dev into master #1

Cecilia-Sensalari · 2021-03-08T18:13:54Z

Rename terms according to those used in the preprint (documentation and code lines)
Catch missing species in configuration file fields fasta_filenames and latin_names

Rename 'species of interest' and 'main species' to 'focal species' in the documentation RST files, function docstrings, code comments and config_elaeis.txt, as well as in some actual code lines: generate_config.py and fc_configfile.py (the 'focal_species' field) and main.nf (config file parsing in the setup process).

Exit if latin_names is empty or if one or more species are missing from it. The new function check_complete_dictionary() lists the species missing from latin_names and exits. Check and warn if FASTA or GFF files are missing; exit only after all the files have been checked.
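
A minimal sketch of the missing-species check described above. Only the name check_complete_dictionary() comes from this pull request; its signature, the helper find_missing_species() and the log wording are assumptions made for illustration and may differ from the actual ksrates code.

```python
import logging
import sys


def find_missing_species(species_list, dictionary):
    """Return the species that have no (or an empty) entry in the dictionary.

    Hypothetical helper, not part of ksrates.
    """
    return [sp for sp in species_list if not dictionary.get(sp)]


def check_complete_dictionary(species_list, latin_names):
    """List every species missing from latin_names, then exit if any is missing.

    The signature is assumed; only the function name is taken from the PR.
    """
    missing = find_missing_species(species_list, latin_names)
    if missing:
        logging.error("Species missing from latin_names: %s", ", ".join(missing))
        logging.error("Exiting")
        sys.exit(1)
```

Reporting all missing species in one message, rather than stopping at the first, lets the user fix the configuration file in a single pass.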

Co-authored-by: lohausr <[email protected]>

Cecilia-Sensalari

Accepted suggested code blocks (they are in new commits or in a "batch" of commits)
Pushed new changes based on your comments
Labeled many conversations as resolved
Responded to others

Cecilia-Sensalari · 2021-03-11T09:36:58Z

doc/source/configuration.rst

-* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species. \[yes/no\]
-* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species. \[yes/no\]
+* **paranome**: whether to build/plot the whole-paranome *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "yes"]
+* **colinearity**: whether to build/plot the anchor pair *K*:sub:`S` distribution of the focal species (options: "yes" and "no"). [Default: "no"]


I'm going to use the same sentence but with a more generic GFF instead of GFF3.
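
For illustration, the "yes"/"no" options and defaults documented for the paranome and colinearity fields could be read as sketched below. The section name ANALYSIS, the helper get_yes_no() and the rule that an empty value falls back to the default are assumptions, not the actual ksrates implementation.

```python
import configparser

# Defaults as stated in the configuration.rst diff quoted above.
DEFAULTS = {"paranome": "yes", "colinearity": "no"}


def get_yes_no(config, section, option):
    """Return True for "yes" and False for "no"; use the default if left empty.

    Hypothetical helper: section and function names are illustrative only.
    """
    value = config.get(section, option, fallback="").strip().lower()
    if value == "":
        value = DEFAULTS[option]
    if value not in ("yes", "no"):
        raise ValueError(f'Field "{option}" accepts only "yes" or "no", got "{value}"')
    return value == "yes"
```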

Cecilia-Sensalari · 2021-03-11T12:22:54Z

README.md

+To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest.

- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
+- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.


I see, let's change it! (Will be added as a new commit to this pull request).

ksrates/cluster_anchor_ks.py

ksrates/plot_paralogs.py

Cecilia-Sensalari · 2021-03-11T13:10:49Z

ksrates/setup_correction.py

    gff_dict = config.get_gff_dict(warn_empty_dict=False)
-    latin_names = config.get_latin_names()
-    if latin_names == {}:
-        logging.error("Exiting")
-        sys.exit(1)
+    # If a GFF is provided, check existence and content
+    if species_of_interest in gff_dict:
+        gff = config.get_gff_name(gff_dict, species_of_interest)


It is indeed a dictionary with only one element, just for analogy with the FASTA files dictionary.

wgd/colinearity.py

Cecilia-Sensalari · 2021-03-11T14:56:15Z

wgd/ks_distribution.py

@@ -636,7 +636,7 @@ def ks_analysis_one_vs_one(
    :param n_threads: number of CPU cores to use
    :return: data frame
    """
-    # Filter families with one vs one orthologs for the species of interest. ---
+    # Filter families with one vs one orthologs for the focal species. ---


Renamed to "Filter families with one vs one orthologs for the species pair".

wgd/utils.py

ksrates/plot_paralogs.py

ksrates/plot_tree.py

lohausr

A few more things...

lohausr · 2021-03-12T20:23:53Z

README.md

+To illustrate how to use *ksrates*, two example datasets are provided for a simple example use case analyzing WGD signatures in monocot plants with oil palm (*Elaeis guineensis*) as the species of interest.

- [`example`](example): a full dataset which contains the complete sequence data for the focal species and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.
+- [`example`](example): a full dataset which contains the complete sequence data for the species of interest and two other species and may require hours of computations depending on the available computing resources. We advice to run this dataset on a compute cluster and using the *ksrates* Nextflow pipeline should make it fairly easy to configure this for a variety of HPC schedulers.


It's still "species of interest"?

lohausr · 2021-03-12T20:38:47Z

doc/source/configuration.rst

+* **max_mixture_model_iterations**: maximum number of EM iterations for mixture modeling. [Default: 300]
+* **max_mixture_model_components**: maximum number of components considered during execution of the mixture models. [Default: 5]
+* **max_mixture_model_ks**: upper limit for the *K*:sub:`S` range in which the exponential-lognormal and lognormal-only mixture models are performed. [Default: 5]
+* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"]


Suggested change

-* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution with non default mixture model methods. [Default: "no"]
+* **extra_paralogs_analyses_methods**: flag to toggle the optional analysis of the paralog *K*:sub:`S` distribution(s) with non-default mixture model methods. [Default: "no"]

Either explain the non-defaults or link to a place where they are explained.

I'll add an internal link to the Paralog analyses page of the documentation and also point readers to the Supplementary Materials.

ksrates/fc_check_input.py

lohausr · 2021-03-12T20:55:50Z

ksrates/fc_configfile.py

        If the value is an empty string, it applies the fallback filename "species + .gff".

-        :param fasta_dict: Python dictionary that associates each informal species name to the path of its FASTA file 
-        :param species: the informal name of the species of interest
+        :param gff_dict: Python dictionary that associates the focal species informal name to the path of its GFF file 


I meant why is a dictionary/association necessary here? It's always a single value that is always for the focal species. Using the informal name as a key is unnecessary.
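
A rough sketch of what the simpler, string-based accessor could look like. The gff_filename field name and the "species + .gff" fallback are taken from this pull request; the section name SPECIES and the function signature are assumptions.

```python
import configparser


def get_gff(config, species):
    """Return the GFF path of the focal species as a plain string.

    If the field is empty, fall back to "<species>.gff". No dictionary keyed
    on the informal species name is needed, because there is only one GFF.
    """
    value = config.get("SPECIES", "gff_filename", fallback="").strip()
    return value if value else f"{species}.gff"
```

Callers then receive the path directly instead of first building a one-element dictionary and looking up the focal species in it.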

ksrates/fc_lognormal_mixture.py

ksrates/plot_paralogs.py

ksrates/plot_tree.py

lohausr · 2021-03-12T21:07:11Z

ksrates/setup_correction.py

    gff_dict = config.get_gff_dict(warn_empty_dict=False)
-    latin_names = config.get_latin_names()
-    if latin_names == {}:
-        logging.error("Exiting")
-        sys.exit(1)
+    # If a GFF is provided, check existence and content
+    if species_of_interest in gff_dict:
+        gff = config.get_gff_name(gff_dict, species_of_interest)


See an earlier comment. I don't think the analogy here warrants the usage of a dict, seems unnecessarily convoluted.

lohausr · 2021-03-14T16:49:04Z

ksrates/wgd_paralogs.py

+    if colinearity:  # if colinearity analysis is required, load related parameters
+        gff = config.get_gff(species)
+        if fcCheck.check_file_nonexistent_or_empty(gff, "GFF file"):
+            trigger_exit = True

        gff_feature = config.get_feature()
        gff_gene_attribute = config.get_attribute()
        if gff_feature == "":
-            logging.error("No GFF attribute provided in configuration file.")
+            logging.error("No GFF attribute provided in configuration file. Will exit.")
+            trigger_exit = True
        if gff_gene_attribute == "":
-            logging.error("No GFF feature provided in configuration file.")
-        if gff_feature == "" or gff_gene_attribute == "":
-            logging.error("Will exit the colinearity analysis.")
-            sys.exit(1)
-    # Checking if IDs in FASTA (and in GFF if applicable) are compatible with wgd pipeline (paml)
-    logging.info("")
-    if colinearity:
-        fcCheck.check_IDs(species_fasta_file, latin_names[species], species_gff_file)
-    else:
-        fcCheck.check_IDs(species_fasta_file, latin_names[species])
+            logging.error("No GFF feature provided in configuration file. Will exit.")
+            trigger_exit = True
+
+    # Checking if FASTA file exists and if sequence IDs are compatible with wgd pipeline (paml)
+    fasta_names_dict = config.get_fasta_dict()
+    species_fasta_file = config.get_fasta_name(fasta_names_dict, species)
+    if fcCheck.check_file_nonexistent_or_empty(species_fasta_file, "FASTA file"):  # if missing/empty
+        trigger_exit = True


I would probably have kept the FASTA file check before the GFF file check.
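
A possible ordering, sketched under assumptions: the FASTA check runs first, every check only sets trigger_exit, and the caller exits once after all checks have run. check_file_nonexistent_or_empty() mirrors the name used in the diff, but its body here is a guess; check_inputs() is a hypothetical wrapper, and each log message is paired with the variable it names.

```python
import logging
import os
import sys


def check_file_nonexistent_or_empty(path, label):
    """Return True (and log an error) if the file is missing or has zero size."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        logging.error("%s is missing or empty: %s", label, path)
        return True
    return False


def check_inputs(fasta, gff=None, gff_feature="", gff_attribute=""):
    """Run all checks, FASTA first, and return True if any of them failed."""
    trigger_exit = check_file_nonexistent_or_empty(fasta, "FASTA file")
    if gff is not None:  # only needed when the colinearity analysis is requested
        if check_file_nonexistent_or_empty(gff, "GFF file"):
            trigger_exit = True
        if gff_feature == "":
            logging.error("No GFF feature provided in configuration file.")
            trigger_exit = True
        if gff_attribute == "":
            logging.error("No GFF attribute provided in configuration file.")
            trigger_exit = True
    return trigger_exit
```

A caller would run `if check_inputs(...): sys.exit(1)`, so the user sees every problem in the log before the program stops.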

cesen and others added 28 commits March 3, 2021 14:12

renaming gff_filename field

0472842

Rename (adjustment arrows, bootstrap iterations)

c1327d3

Improve parameter description in configuration.rst

a91c097

Rename "adjustment" in filenames and python output

3efd4b0

Rename rates: branch contribution or ks distances

10709e2

Minor changes

4c75c6d

Rename "adjustment" (leftover)

381ebbc

Describe logging level in configuration.rst

8bc58e2

Add database names in input_output.rst

ccce389

Rename peak_stats

ede5a3a

Rename focal species in readme

29fc752

Catch missing species in fasta_filenames

bf8f4ac

Minor change in readme

3d4f275

Catch missing species in latin_names and exit

2037338

Merge branch 'catch_missing' into dev

cd5ffe3

Bugfix for catch missing species in latin_names

5f3696f

Bugifx to catch missing species in latin_names

1efbf3b

Rename max_mixture_model_ks

bf46200

Add parameter default values in configuration.rst

4fcecaf

Merge branch 'master' into dev;

0f8c2f2

Correct RST syntax

055775e

Add output directory and filenames to docs

7035d29

Remove AIC and BIC PDFs

f1a6ef2

Catching missing species WIP

44e591f

Remove unnecessary newick_tree variable

b495222

Update test_pipeline.yml

622f881

Cecilia-Sensalari requested a review from lohausr March 8, 2021 22:45

Update test_pipeline.yml

7360750

Cecilia-Sensalari and others added 13 commits March 11, 2021 11:57

Update doc/source/configuration.rst

d535a42

Co-authored-by: lohausr <[email protected]>

Update doc/source/paralogs_analyses.rst

9d18852

Co-authored-by: lohausr <[email protected]>

Update ksrates/correct.py

97426e5

Co-authored-by: lohausr <[email protected]>

Update ksrates/fc_rrt_correction.py

bf25a05

Co-authored-by: lohausr <[email protected]>

Accept suggestions by reviewer about renaming

15122aa

Co-authored-by: lohausr <[email protected]>

Accept suggestions by reviewer about renaming

1e98aa4

Co-authored-by: lohausr <[email protected]>

Accept suggestions by reviewer about renaming

c73a726

Co-authored-by: lohausr <[email protected]>

Rename function "decompose_ortholog_ks"

ebcc1f2

Rename function to check file content

37f0589

Rename function to check complete latin_names

72115ec

Removing comment about Nextflow in plot_tree.py

091e5c4

Rename back to "species of interest" in wgd files

59d2c3f

Mention GFF file in "colinearity" config field

f263244

Cecilia-Sensalari commented Mar 11, 2021

cesen and others added 4 commits March 12, 2021 12:24

Move logging message in setup_correction.py

8ceaf1b

Rename "colinearity" to "anchors" in LMM filenames

4d8472e

Change figure title of anchor Ks clustering picture

94835ba

Merge branch 'master' into dev

e85a444

lohausr suggested changes Mar 12, 2021

cesen and others added 7 commits March 12, 2021 22:26

Rename to "focal species" in readme

343c1b9

Merge branch 'dev' of github.com:VIB-PSB/ksrates into dev

b33465b

Fix RST title

f7bc601

Link to "paralogs analyses" from expert config

c280385

Apply suggestions from code review

a31ab62

Co-authored-by: lohausr <[email protected]>

Merge branch 'dev' of github.com:VIB-PSB/ksrates into dev

bd9b3a8

Replace GFF dict with string; improve checkpoints

bb7ebbd

lohausr reviewed Mar 14, 2021

Remove unnecessary function argument

5020f2e

Cecilia-Sensalari merged commit a401f78 into master Mar 18, 2021

Cecilia-Sensalari deleted the dev branch March 25, 2021 15:56
