Section 4 Custom Wrangling
In this section, we take a close look at data taken from a publication. The first two parts introduce the background to data wrangling and the problem of unexpected file formats. The last part addresses these problems using Cogent3 apps.
For my (Yapeng) study, I needed published data sets to develop a novel model-based measure of the limits of phylogenetic inference in relation to site saturation. (More information will be provided in my Phylomania talk this week!)
The divergent sequences in these published data sets were previously obtained and curated to develop a measure of site saturation (Duchêne et al. 2022).
In total, there are six alignments from six previous studies, in different formats (FASTA, NEXUS, PHYLIP) and with the different levels of site saturation required for my method development.
For this section, we will focus on one of the PHYLIP alignments from one study and the issues associated with it.
The alignments were stored and downloaded as a compressed file. The key issue was that some files did not follow the PHYLIP format definition, even though their file extension (.phy) indicated that format. The strict PHYLIP format definition from the PHYLIP documentation requires that a file:
- begins with a line giving the number of taxa and the sequence length
- uses exactly 10 characters for each taxon name, padded with blanks if the name is shorter
- may continue a sequence over more than one line
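To make these three rules concrete, here is a minimal sketch of a writer for the sequential variant of the strict format. The function name `format_strict_phylip`, its `width` parameter and the taxon names are illustrative assumptions; they are not part of PHYLIP or Cogent3.

```python
def format_strict_phylip(seqs: dict[str, str], width: int = 60) -> str:
    """Return equal-length sequences as sequential strict-PHYLIP text."""
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    for name in seqs:
        if len(name) > 10:
            raise ValueError(f"name {name!r} exceeds 10 characters")

    (length,) = lengths
    # Rule 1: the first line gives the number of taxa and the sequence length
    lines = [f"{len(seqs)} {length}"]
    for name, seq in seqs.items():
        chunks = [seq[i : i + width] for i in range(0, length, width)] or [""]
        # Rule 2: the name occupies exactly 10 columns, padded with blanks
        lines.append(f"{name:<10}{chunks[0]}")
        # Rule 3: the sequence may continue on the following lines
        lines.extend(chunks[1:])
    return "\n".join(lines) + "\n"


# hypothetical taxa, used only for illustration
print(format_strict_phylip({"human": "ACGTACGT", "chimp": "ACGTACGA"}, width=4))
```

A name longer than 10 characters makes the function raise a `ValueError`, which is exactly the restriction that the problem files break.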
Here is an example of one of the incorrectly formatted files from the study 😞😢😭:
The taxon names exceed the 10-character limit and therefore violate the strict PHYLIP requirements.
So why is this an issue for data analysis?
🚨 When attempting to load this using a PHYLIP parser, you are very likely to encounter an error 🚨
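Why an error is likely can be illustrated with a deliberately simple strict parser. This is only a sketch under the assumption of the sequential layout; it is not the parser used by PHYLIP or Cogent3, and a real parser may fail in a different way. The function name `parse_strict_phylip` and the taxon names are invented for the example.

```python
def parse_strict_phylip(text: str) -> dict[str, str]:
    """Parse sequential strict PHYLIP: the name is always the first 10 columns."""
    lines = [line for line in text.splitlines() if line.strip()]
    num_taxa, seq_len = (int(x) for x in lines[0].split())
    body = iter(lines[1:])
    seqs: dict[str, str] = {}
    for _ in range(num_taxa):
        line = next(body, None)
        if line is None:
            raise ValueError("fewer taxa than stated in the header")
        # strict rule: columns 1-10 are the name, everything after is sequence
        name, seq = line[:10].strip(), line[10:].replace(" ", "")
        while len(seq) < seq_len:
            extra = next(body, None)
            if extra is None:
                raise ValueError(f"{name!r}: sequence is shorter than {seq_len}")
            seq += extra.replace(" ", "")
        if len(seq) != seq_len:
            raise ValueError(
                f"{name!r}: expected {seq_len} characters, found {len(seq)}"
            )
        seqs[name] = seq
    return seqs


# hypothetical file content with names longer than 10 characters
bad = "2 8\nHomo_sapiens_1 ACGTACGT\nPan_troglodytes ACGTACGA\n"
try:
    parse_strict_phylip(bad)
except ValueError as err:
    print(f"strict parser failed: {err}")
```

Because the parser cuts the name at column 10, the tail of the long name spills into the sequence, so the sequence no longer matches the length given in the header.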
This workflow is an overview of constructing Cogent3 apps.
The goal of this section is to construct an app that:
- loads incorrectly formatted PHYLIP alignments
- sequentially estimates phylogenetic models
An example data set from one study is available here: phy_data.zip
- Step1: Write a Python function to read the relaxed PHYLIP format, with these key components:
Show me the steps:
1. read the first line to get the stated number of taxa and sequence length
2. read the taxon names at the start of the first several lines
3. concatenate the sequences that follow the names
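The three steps of Step1 can be sketched in plain Python as follows. This is not the downloadable script. It assumes that names contain no whitespace, that whitespace separates a name from its sequence, and that any continuation lines are interleaved in the same taxon order without repeating the names. The function name `read_relaxed_phylip` and the taxon names are hypothetical.

```python
def read_relaxed_phylip(text: str) -> dict[str, str]:
    """Read relaxed PHYLIP text where names may exceed 10 characters."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")

    # 1. the first line states the number of taxa and the sequence length
    num_taxa, seq_len = (int(x) for x in lines[0].split()[:2])
    body = lines[1:]
    if len(body) < num_taxa:
        raise ValueError("fewer lines than the stated number of taxa")

    # 2. the first num_taxa lines each start with a name
    names: list[str] = []
    chunks: list[list[str]] = []
    for line in body[:num_taxa]:
        parts = line.split(maxsplit=1)
        names.append(parts[0])
        chunks.append(["".join(parts[1].split())] if len(parts) > 1 else [])
    if len(set(names)) != num_taxa:
        raise ValueError("duplicated taxon names")

    # 3. concatenate the sequence pieces that follow the names
    for i, line in enumerate(body[num_taxa:]):
        chunks[i % num_taxa].append("".join(line.split()))
    seqs = {name: "".join(pieces) for name, pieces in zip(names, chunks)}

    # check the result against the header before returning it
    for name, seq in seqs.items():
        if len(seq) != seq_len:
            raise ValueError(
                f"{name!r}: expected {seq_len} characters, found {len(seq)}"
            )
    return seqs


# hypothetical content: long names, sequences split over two blocks
text = "2 8\nHomo_sapiens_1 ACGT\nPan_troglodytes ACGT\n\nACGT\nACGA\n"
print(read_relaxed_phylip(text))
```

Checking each sequence against the length in the header is a cheap guard against files that break the format in other ways.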
- Step2: Wrap the function into a Cogent3 app by using the decorator `@define_app(app_type=LOADER)`. The script can be downloaded here: parse_bad_phylip.py.zip
- Step3: Construct the whole phylogenetic analysis with the composable apps.
Show me the code:
```python
from cogent3 import get_app, open_data_store

# Assumption: parse_bad_phylip.py from Step2 is importable and defines load_bad_phylip
from parse_bad_phylip import load_bad_phylip

# change "data/phy_data.zip" to your data path
in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")
# "data/outdb.sqlitedb" is the database where the output is stored
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")

# estimate a tree from pairwise distances
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree

loader = load_bad_phylip()  # your customised cogent3 app
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes the estimates to the database

process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change the number of alignments it is applied to
```
</details>