Section 4 Custom Wrangling

richard morris edited this page Nov 21, 2023 · 1 revision

In this section, we will take a close look at the data used in a publication. The first two parts introduce the background to data wrangling and the problems caused by unexpected file formats. The last part then addresses these problems using Cogent3 apps.

Aims of the data sampling

The data sets

For my (Yapeng) study, I needed published data sets to develop my novel model-based measure of the limits of phylogenetic inference in relation to site-saturation. (More information will be provided in my Phylomania talk this week!)

Divergent sequences in these published data have been previously obtained and curated to develop the measure of site-saturation (Duchêne et al. 2022).

In total, there are six alignments from six previous studies all with different formats (fasta, NEXUS, phylip) and different levels of site-saturation required for my method development:

image

For this section, we will focus on one of the PHYLIP alignments from one study and the issues associated with it.

Parsing the alignment files

The properties of PHYLIP alignments

The alignments were stored and downloaded as a compressed file. The key issue was that some files did not follow the PHYLIP format definition, even though their file extension (.phy) indicated that format. For example, the strict PHYLIP format definition from the PHYLIP documentation requires that a file:

  • begins with a line giving the number of taxa and the sequence length
  • limits taxon names to 10 characters, padded with blanks
  • may continue each sequence over more than one line

Here is an example of one of the incorrectly formatted files from the study 😞😢😭:

image

The taxon names exceed the 10-character limit and therefore violate the strict PHYLIP requirements.

So why is this an issue for data analysis?
🚨 When attempting to load this using a PHYLIP parser, you are very likely to encounter an error 🚨
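
To see why a strict reader struggles, here is a minimal sketch in plain Python (not the code of any particular PHYLIP parser). It assumes a strict reader takes the first 10 characters of a line as the name and treats the rest as sequence, and that each sequence fits on one line. The function names and the two example lines are hypothetical and are not taken from the study.

```python
def split_strict_phylip_line(line, name_width=10):
    """Split a line as a strict reader would: fixed-width name, then sequence."""
    name = line[:name_width].strip()
    seq = line[name_width:].replace(" ", "")
    return name, seq


def check_length(line, seq_len):
    """Raise ValueError if the sequence part does not match the header length."""
    name, seq = split_strict_phylip_line(line)
    if len(seq) != seq_len:
        raise ValueError(f"{name!r}: expected {seq_len} characters, got {len(seq)}")
    return name, seq


# hypothetical example lines, not taken from the study
ok_line = "Human     ACGTACGT"           # name padded with blanks to 10 characters
bad_line = "Homo_sapiens_chr1 ACGTACGT"  # name longer than 10 characters

print(check_length(ok_line, 8))
try:
    check_length(bad_line, 8)
except ValueError as err:
    print("strict parsing failed:", err)
```

With the long name, the tail of the name ("ns_chr1") is read as part of the sequence, so the sequence has the wrong length and contains characters that are not valid nucleotides.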

Demo for parsing bad Phylip by creating a Cogent3 App

image

This workflow gives an overview of constructing Cogent3 apps.

The goal of this section is to construct an app that achieves:

  1. Loading incorrectly formatted PHYLIP alignments
  2. Sequentially estimating phylogenetic models

The data for one example study are here: phy_data.zip

  • Step1: Write a Python function to read the relaxed PHYLIP format, with these key components:
Show me the steps:
    1. read the first line to get the stated number of taxa and sequence length
    2. read the taxon names at the start of the first several lines
    3. concatenate the sequences that follow the names
  • Step2: Wrap the function in a Cogent3 app using the decorator @define_app(app_type=LOADER).

    The script can be downloaded here: parse_bad_phylip.py.zip

  • Step3: Construct the whole phylogenetic analysis with the composable apps.
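
The three reading steps in Step1 can be sketched in plain Python as follows. This is a minimal illustration, not the contents of parse_bad_phylip.py. It assumes an interleaved layout in which each of the first lines after the header holds a name, whitespace, then sequence, that names contain no spaces, and that later lines continue the sequences in the same taxon order. The function name `parse_relaxed_phylip` and the example data are hypothetical.

```python
def parse_relaxed_phylip(text):
    """Parse relaxed, interleaved PHYLIP text into a {name: sequence} dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")

    # 1. the first line gives the number of taxa and the sequence length
    num_taxa, seq_len = (int(value) for value in lines[0].split()[:2])
    body = lines[1:]
    if len(body) < num_taxa:
        raise ValueError(f"expected {num_taxa} taxa, found {len(body)} lines")

    # 2. the first num_taxa lines start with the taxon name (any length)
    names, chunks = [], []
    for line in body[:num_taxa]:
        name, *parts = line.split()
        names.append(name)
        chunks.append(["".join(parts)])

    # 3. remaining lines continue the sequences, cycling through the taxa
    for index, line in enumerate(body[num_taxa:]):
        chunks[index % num_taxa].append("".join(line.split()))

    seqs = {name: "".join(parts) for name, parts in zip(names, chunks)}
    for name, seq in seqs.items():
        if len(seq) != seq_len:
            raise ValueError(f"{name!r}: expected {seq_len}, got {len(seq)}")
    return seqs


# hypothetical data with names longer than 10 characters
example = """2 8
Homo_sapiens_long ACGT
Mus_musculus_long TTGA

AAAA
CCCC
"""
print(parse_relaxed_phylip(example))
```

A function like this would be the starting point for Step2. The real loader would also need to read from a data store member and return a Cogent3 alignment rather than a dict.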

Show me the code:
```python
from cogent3 import get_app, open_data_store

# load_bad_phylip is the custom loader app from Step2; this import assumes
# parse_bad_phylip.py is in the working directory
from parse_bad_phylip import load_bad_phylip

# change "data/phy_data.zip" to your data path
in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")

# "data/outdb.sqlitedb" is where the output database is written
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")

# the way to construct the prior trees for each locus
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree

loader = load_bad_phylip()  # your customised cogent3 app
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes the estimates into the database

# construct the whole process of the phylogenetic analysis
process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change the number it is applied to
```


</details>