Section 3: The road from orthologs to alignment
- Workshop data available here (link takes you to Dropbox).
- Decompress the file and move it to your Cogent3_Workshop_2023 folder.
- Refer to this wiki page for instructions on importing data into your Docker environment.
Please open a blank notebook in your Docker container and set the data directory as follows. Note that if you are using a different directory structure within Docker, you will need to change the path accordingly.
from pathlib import Path
data_dir = Path("LO3/data/") # change path here if different.
Now, we are ready to go!
Image credit: the original vector used above was created by Elena Akifeva.
Curating data is like navigating a treacherous road. In this section we will demonstrate strategies for data exploration, and show how to use their outcomes to construct well-informed data-wrangling pipelines using cogent3 apps. Once we have constructed a pipeline, filtering data is like driving a well-maintained highway: efficient and straightforward. More importantly, these pipelines can be integrated into the alignment process, ensuring automated data provenance and reproducibility.
Let's begin with small issues that can be easily fixed, but might slow you down if you don't know how to fix them.
Sometimes a fasta file will have an overloaded sequence label. For example, the label of this tardigrade mitochondrial sequence contains a lot of information! Let's have a look using load_seq.
from cogent3 import load_seq
overloaded_fasta = data_dir / "EU244602.fa"
tardigrade_mt_seq = load_seq(overloaded_fasta, moltype="dna")
tardigrade_mt_seq.name
- Downstream analysis may require consistent naming to collate genomes from one species.
- In cogent3, annotating a Sequence (allowing you the functionality of features) requires an exact match between the names in the sequence and annotation files.
Take a look at the help for load_seq and see if there is a way to get around this issue (without changing the name in the file, which would be tedious if you had a lot of files!).
Tip: Execute load_seq? in a notebook cell to see the help for the function.
👀 Show solution
- To change the name of the tardigrade sequence, use the label_to_name argument with a lambda function.
- Lambda function: lambda x: x.split()[0], i.e., split the label on whitespace and take the first element.
seq_stripped = load_seq(
overloaded_fasta, moltype="dna", label_to_name=lambda x: x.split()[0]
)
seq_stripped.name
Next are issues that you want to avoid; but if you hit one, it shouldn't be too hard to get out of if you are prepared.
Sometimes a file's contents do not match the format implied by its suffix. For example, a file with the suffix .txt may contain sequence data in fasta format. AM088141.txt is an example of this. Let's see what happens if we try to load it using load_seq.
hidden_fasta = data_dir / "AM088141.txt"
tardigrade_rRNA = load_seq(hidden_fasta, moltype="dna")
If you had a lot of files, it would be a pain to change the suffix for all of them. Take a look at the help for load_seq and see if there is a way to get around this issue.
👀 Show solution
We can use the format argument to specify the format of the file.
tardigrade_rRNA = load_seq(hidden_fasta, format="fasta", moltype="dna")
tardigrade_rRNA
Note: If the content of the file isn't in a supported format, that is a different can of worms… we will discuss this problem later.
Roadblocks are issues that require a bit more thought to solve. There is often more than one way to get around a roadblock; you just need to find the right detour.
Imagine you have just sequenced (and annotated) the genome of a new soil bacterium 🦠! A core step in science is to put your research into the context of what is known in the field. The context in this case is a comparison to other soil bacteria, available in the REFSOIL reference dataset. For any sequence comparison or molecular evolution analysis, we will need information on which genes/sequences are homologous (we will be referring to orthologs, since that is what is desired for reconstructing species relationships).
We have a collection of genomes and their annotations. We want to know what genes are related to each other. Can you come up with a strategy to solve this problem?
👀 Possible strategy
This is a hypothetical phylogenomic study based on the ~1k whole microbial genomes downloaded in GenBank format from NCBI.
To do phylogenetics, we need to know orthologs. A naive strategy for identifying these is to select identically named genes from the genomes. Because of gene content variability (typical of microbes), there will be incomplete coverage.
We are aiming to identify a set of genes present in as many species from this collection as possible.
Using cogent3, and in particular the annotation_db module, we can select the maximum number of genes (identified by identical gene names) from the maximum number of species and write them to file. These ~90 genes in 370 species are in the data directory linked at the top of the section (in data/raw/) and will be used as the example for the rest of this section.
You can load the AnnotationDb for yourself as follows:
from cogent3.core.annotation_db import GenbankAnnotationDb
path = data_dir / "features.gbdb"
db = GenbankAnnotationDb(source=path)
db
Take a look at the documentation for annotation handling here and explore! Ask someone for tips if needed, and tell us about the experience.
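For example, here is a minimal sketch of some queries you might try. The method names follow the cogent3 annotation_db documentation (check the docs if your version differs); "alaS" is just one gene from this dataset.
# count the CDS records annotated with the gene name "alaS"
db.num_matches(biotype="CDS", name="alaS")
# pull the matching records (a generator of dicts) and inspect the first
records = list(db.get_features_matching(biotype="CDS", name="alaS"))
records[0]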
Finally, some problems are genuinely hard to solve.
Following on from the previous step: if we were to blindly assume that identical gene identifiers indicate that sequences are orthologous, we would be putting a lot of trust in the person who annotated the sequences. The metadata associated with a sequence is frequently dubious. How can we verify that the sequences are actually related? Although this is a specific example, the question of whether sequences purported to be orthologous actually are is an extremely common one!
👀 Possible strategy
There is no "one size fits all" solution to this problem. So let's start exploring.
Let's start with just one gene selected from the previous step, "alaS", which we can load into a SequenceCollection using load_unaligned_seqs. Accessing the .num_seqs attribute, we can see that there are 370 sequences in the collection.
from cogent3 import load_unaligned_seqs
alaS_path = data_dir / "raw/alaS.fa"
seq_coll = load_unaligned_seqs(alaS_path, moltype="dna")
seq_coll.num_seqs
In cogent3, a k-mer-based approach offers a quick estimate of sequence relationships without requiring alignment.
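Before reaching for the apps, it can help to see the idea in plain Python. The following sketch (not the cogent3 implementation; k=3 and the toy sequences are chosen just for readability) computes the Jaccard distance between the k-mer sets of two sequences:
def kmer_set(seq, k=3):
    # the set of all overlapping k-mers in the sequence
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}

a = kmer_set("ATGGCGTTAGC")
b = kmer_set("ATGGCCTTAGC")
# Jaccard distance = 1 - |intersection| / |union|
1 - len(a & b) / len(a | b)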
This k-mer approach is an example of cogent3 "apps"! Apps are special functions designed to simplify complex tasks. Apps can be used individually, or combined to define a "composed function" (aka pipeline). Composed functions can be applied to a single data file, or to thousands!
We need two apps for this task:
- jaccard_dist(): calculates pairwise Jaccard distances between the sets of k-mers in each sequence within a SequenceCollection.
- approx_pdist(): transforms the matrix of pairwise Jaccard distances into an approximation of the "proportion of sites different", a measure of sequence divergence.
Note that before we can apply the apps to our data, we need to initialise them.
from cogent3.app.dist import jaccard_dist, approx_pdist
# create instances of the apps
p_dist = approx_pdist()
j_dist = jaccard_dist()
To combine these apps into a pipeline, so that the output of the first is the input of the second, we add them together using the + operator.
pipeline = j_dist + p_dist
If we are applying the pipeline to a single dataset, we can simply provide the dataset as the argument to the app. Later in this workshop, we will learn how to apply a pipeline to a collection of datasets using the apply_to() method.
Applying the pipeline to our sequence collection returns a DistanceMatrix object with an approximation of the distance between each pair of sequences.
pw_pdists = pipeline(seq_coll)
pw_pdists
A way to get a feel for the data is to visualise the distribution of all pairwise distances. Are there any outliers?
from itertools import combinations
import plotly.express as px
# get distances into list for plotting
dists = [pw_pdists[pair] for pair in combinations(pw_pdists.keys(), 2)]
fig = px.histogram(dists, nbins=50, labels={"value": "p-distance"})
fig.show(width=500, height=500)
How would you describe the distribution❓
Which sequences are (approximately) the most divergent?
import numpy as np
# find max index (using numpy magic)
max_index_flat = np.argmax(pw_pdists)
max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape)
# index <DistanceMatrix>.names to get the names of these sequences
max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2])
max_pair
A dotplot is a visualisation of the relationship between two sequences. The x-axis is the first sequence, the y-axis is the second sequence. Where the sequences have segments that are similar beyond a threshold, a line is drawn.
The visualisation backend for cogent3 is plotly, which means that the dotplot is ✨interactive✨! Try zooming in by clicking and dragging on the plot. Double-click to reset the zoom. Click on the legend to toggle the visibility of the reverse-complement matches.
dp = seq_coll.dotplot(max_pair[0], max_pair[1], rc=True)
dp.show(width=500, height=500)
❓Are you convinced that these sequences are orthologous?
What does Kath think...
I'm not convinced! But this is an open question.
When comparing sequences, a fundamental question arises: what's the probability of obtaining a similarity score purely by chance? Karlin and Altschul (1990) derived the theoretical null distribution of local alignment scores¹, enabling the calculation of the probability of a score equal to or higher than the obtained score (a p-value). This theory forms the basis of the BLAST algorithms.
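To make the theory concrete, here is an illustrative sketch of the Karlin-Altschul calculation. The parameters K and λ depend on the scoring scheme and the background nucleotide composition; the values below are placeholders for illustration only.
import math

def karlin_altschul_pvalue(score, m, n, K=0.1, lam=0.5):
    # E-value: expected number of chance local alignments scoring >= score
    # between sequences of lengths m and n
    e_value = K * m * n * math.exp(-lam * score)
    # p-value: probability of at least one such alignment by chance
    return 1 - math.exp(-e_value)

# e.g. an ungapped local alignment score of 40 between two 1 kb sequences
karlin_altschul_pvalue(40, m=1000, n=1000)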
We can directly use this theory to calculate the probability that the observed similarity between two sequences is due to chance. Inside the script filter.py² are a few functions that implement this logic. One of them is theoretical_null_filter(), an app that filters a SequenceCollection based on the theoretical null distribution, retaining only the sequences that satisfy a threshold on the probability of being related!³
To use theoretical_null_filter(), we need to set a reference sequence for comparison against all other sequences. Here, let's choose E. coli as the reference. Additionally, we'll specify a desired p-value threshold.
from LO3.filter import theoretical_null_filter
# we want E. coli as our reference
ref_seq = "NC_000913"
# create instance of the app, with a threshold of 0.01
tnf = theoretical_null_filter(ref_seq, threshold=0.01)
tnf
We can see that the app has been initialised with E. coli as the reference; there are other default parameters that we could have changed if we wanted to.
Now we can apply the app to seq_coll to filter out likely non-homologous genes.
seq_coll_filtered = tnf(seq_coll)
seq_coll_filtered.num_seqs
We started with 370 sequences, so the filtering process removed several sequences!
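If you're curious which sequences were dropped, a set difference on the sequence names does the trick:
# names present before filtering but not after
removed = set(seq_coll.names) - set(seq_coll_filtered.names)
len(removed)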
# recompute pairwise distances on the filtered collection
pw_pdists = seq_coll_filtered.distance_matrix()
# find the most divergent remaining pair, as before
max_index_flat = np.argmax(pw_pdists)
max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape)
max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2])
dp = seq_coll_filtered.dotplot(max_pair[0], max_pair[1], rc=True)
dp.show(width=500, height=500)
❓Are you convinced that these sequences are orthologous? (this is again an open question!)
Why use cogent3 for alignment?
A quick note on why to use cogent3 for alignments:
- Even once sequences have been aligned, annotated features remain on the sequences and can be mapped onto the alignment. This allows you to do things like slice your new sequence based on a gene annotated on another sequence, via the alignment.
- Novel codon aligners!
Prior to aligning, we need to remove the terminal stop codons. Interestingly, some (CDS) sequences have internal in-frame stop codons (@genbank 😬) that we need to remove altogether! There is a custom app in the filter.py script that will do this.
from LO3.filter import remove_stop_codons
# it's an app, so initialise it
no_stop_codons = remove_stop_codons(check_inframe=True)
# apply it to our collection
seq_coll_filtered = no_stop_codons(seq_coll_filtered)
# how many sequences did we lose?
seq_coll_filtered.num_seqs
Now we should be good to align! cogent3 has a progressive alignment app that can align coding sequences using a codon mode. We can use get_app to grab the app, and then apply it to our collection.
from cogent3 import get_app
codon_aligner = get_app("progressive_align", "codon")
aln = codon_aligner(seq_coll_filtered)
aln
As you can see, there are plenty of gaps! To take a look at non-degenerate positions, execute the following
aln.no_degenerates()
❓ How can we assess the quality of this alignment? cogent3 has three different alignment quality metrics: the information content score ("ic_score"), the sum of pairs score ("sp_score"), and the log-likelihood of the alignment ("cogent3_score"). You can access them on an alignment in the following way:
aln.alignment_quality("ic_score")
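To compare the metrics side by side, a simple loop over the metric names works (remembering, per the warning below, that "cogent3_score" is only meaningful for alignments produced by cogent3, as ours was):
# compute each of the three quality scores for our alignment
for metric in ("ic_score", "sp_score", "cogent3_score"):
    print(metric, aln.alignment_quality(metric))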
Warning: Alignment quality scores are imperfect, and there are a few things to keep in mind:
- What does a single score tell you?
- Describing the quality of an alignment with a score that has good properties is an open problem.
- The cogent3_score is only available for alignments produced using cogent3. It is a likelihood measure, but suffers from underflow issues.
Are there other ways we can assess the quality of an alignment?
👀 Possible strategies
We can assess alignment quality via visualisation. This could be with dotplots, as we did earlier. We can also visualise the number of gaps and the entropy using the information_plot method.
aln.information_plot()
There are a lot of gaps! In some regions, almost all sequences have a gap; in others, there are very few.
To see whether certain sequences are contributing the gaps, we can use aln.count_gaps_per_seq(), which returns the number of gaps per sequence. Plotting this as a violin plot shows the distribution of gaps per sequence.
gaps = aln.count_gaps_per_seq()
fig = px.violin(gaps, box=True)
fig.update_layout(width=500, height=500)
fig.show()
To remove sequences that contribute a significant number of gaps, we can use the omit_bad_seqs method. omit_bad_seqs returns a new alignment, excluding sequences for which the number of uniquely introduced gaps exceeds the specified quantile of the distribution of gaps per sequence (shown in the violin plot).
aln_omitted = aln.omit_bad_seqs(quantile=0.99)
aln_omitted.num_seqs
Or use omit_gap_seqs, which removes sequences that have more than a specified fraction of gaps.
aln_omitted = aln_omitted.omit_gap_seqs(allowed_gap_frac=0.65)
aln_omitted.num_seqs
All that we've learned about our data can be seamlessly woven into a pipeline.
We've already seen that apps can be composed to form a pipeline (recall the jaccard_dist and approx_pdist pipeline). However, we then applied this pipeline to a single SequenceCollection. We can also apply a pipeline to many files at once. This requires us to collect the files that we want to apply the pipeline to in a DataStore. If our files are already in a directory, this is easy! We just use open_data_store, with the following information:
- the path to the data
- the suffix of the data of interest (indicating that all files with this suffix are a "DataMember" of the DataStore)
- the mode (read/write)
- optionally, an argument to limit the number of files to read/write. In this case, because we're building a prototype and want things to be fast, we will limit the number of files to 5.
from cogent3.app.io import open_data_store
in_dstore = open_data_store(data_dir / "raw", suffix=".fa", mode="r", limit=5)
Similarly, we need a data store for the output, where the pipeline's results will be written. Once more, we specify the path, suffix, and mode using open_data_store:
# open data stores for the output
out_dstore = open_data_store(data_dir / "aligned", suffix=".fa", mode="w")
Now let's build our pipeline. Directly from our exploration, we know that we need to do the following:
- filter out sequences that are likely not orthologous
- remove stop codons (and check for internal stop codons!)
- remove sequences that contribute a significant number of gaps
- align with the default parameters of the progressive codon aligner, which appear to be reasonable.
Additionally, there are filtering steps common to many analyses; many of these exist as predefined apps. To take a look at all predefined apps, execute available_apps() in a notebook cell.
from cogent3.app import available_apps
available_apps()
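To read the documentation for a specific app by name, cogent3 also provides app_help:
from cogent3 import app_help
# display the docstring and parameters for the min_length app
app_help("min_length")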
Our methods may require a minimum alignment length; for this, we would use the min_length app to filter out alignments that are too short. Perhaps we wish to apply methods to data where the impact of natural selection is minimised, for which we would use the take_codon_positions app to select only the third codon position.
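As a minimal sketch of the latter, applied to the alignment from earlier (assuming the app's default genetic code):
from cogent3 import get_app
# keep only third codon positions from the alignment
third_pos = get_app("take_codon_positions", 3)
aln_third = third_pos(aln)
aln_third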
# we can use ``get_app`` to grab the apps that we need
from cogent3 import get_app
reader = get_app("load_unaligned", moltype="dna")
filterer = get_app("theoretical_null_filter", ref_seq)
trim_stops = remove_stop_codons(check_inframe=True)
no_shorties = get_app("min_length", length=300)
codon_aligner = get_app("progressive_align", "codon")
writer = get_app("write_seqs", out_dstore)
# Now we can create the pipeline
align_pipeline = reader + filterer + trim_stops + no_shorties + codon_aligner + writer
Apply the pipeline using the apply_to method, providing the input data store as the first argument. The pipeline can easily be parallelised using the parallel=True argument. More complex configuration is possible if required (like telling it to use MPI), but we won't go into that here! Importantly, assigning the result of apply_to to a variable allows us to interrogate the results of the pipeline run.
r = align_pipeline.apply_to(
in_dstore,
show_progress=True,
parallel=True
)
We can look at a summary of the pipeline run by accessing the describe attribute of the result object.
r.describe
Conveniently, failures do not cause the whole pipeline to fail. Importantly, they are tracked: if we have failures, we can interrogate them in the not_completed directory in the data store. To get a summary of failures, we can use the summary_not_completed attribute.
r.summary_not_completed
Look inside the output directory to see the results of the pipeline run. In addition to the "curated" data, there is a directory with a detailed file for each failure, and a directory collecting the log files. Take a look at these to see a summary of the pipeline: they provide all the information needed to recreate the exact apps and the exact pipeline you ran, i.e., it is reproducible! 🎉
1: The theorem assumes that nucleotides in the sequence are independent and identically distributed variables, something that is not true of real sequences!
2: Functions in the filter.py script are not yet part of the cogent3 package, as they are still in development. Use at your own risk!
3: This approach works best for sequences with a reasonably even distribution of nucleotides. For sequences with a skewed nucleotide composition, using numerical simulation to generate the null is a better approach; see empirical_null_filter!