Add separate columns for subject and query alignment lengths in --run_results
#422
The additional columns in the When running the
Rerunning the analysis under the
The code, input, and output data are provided in aln_length_issue.zip. As we are now running two comparisons instead of just one (both forward and reverse), we also had to update the matrices. However, since this was initially explained in more detail in issue #151, I have discussed it in more detail there. |
Looking through the changes, I see that the calculation looks sensible, and the changes to reporting etc. seem OK - good job! There are a few snags, some of which are being picked up when running the tests:

```python
run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        query_identity=query_pid,
        subject_identity=subject_pid,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",
    )
)
```

In the new code we're storing a single value:

```python
run.comparisons.append(
    Comparison(
        query=job.query,
        subject=job.subject,
        query_aln_length=qaln_length,
        subject_aln_length=saln_length,
        sim_errs=sim_errs,
        perc_id=pc_id,
        cov_query=qcov,
        cov_subject=scov,
        program="nucmer",
    )
)
```

but the database schema and
There are other side-effects (e.g. presence/absence of the |
We're getting quite a bit of discrepancy in the concordance test for ANIm:
We'll need to look into this to see what's going on. |
I've modified the expected output in

UPDATE: this induces a new error in
The test results matrix is on top (expected values are underneath). The lower triangle there is all The other values are more worrying. They agree in only one case, and we'll need to investigate what that is. |
These changes were made following our conversation last week about how we want to calculate the percentage identity. Our previous assumption was that the percentage identity should be calculated separately for the query and subject, as follows:

However, we later agreed that the best option would be to calculate the average across all alignments. Assuming that there are two alignments between the genomes we are comparing, this would be calculated as follows:

Alignment 1 identity weighted = (query length alignment 1 + subject length alignment 1) - (2 * similarity errors)

With this, I assumed that there would only be one calculation for percentage identity, hence just one column rather than two. |
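As a minimal sketch of the weighted-average calculation described above (the function and variable names are illustrative, not from the pyani codebase):

```python
def average_identity(alignments):
    """Weighted-average percentage identity across alignments.

    Each alignment is a (query_len, subject_len, sim_errs) tuple; its
    weighted identity is (query_len + subject_len) - 2 * sim_errs, i.e.
    the identical columns counted over both aligned sequences.
    """
    weighted = sum((q + s) - 2 * e for q, s, e in alignments)
    total = sum(q + s for q, s, _ in alignments)
    return 100.0 * weighted / total if total else 0.0


# A single alignment covering a 100 bp query and a 98 bp subject,
# with 2 similarity errors:
print(average_identity([(100, 98, 2)]))
```

Because every alignment contributes in proportion to its length, long alignments dominate the average, which is the intended behaviour of the weighting.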
Yes - the point now is rather that the call to create a |
I'm looking into this just now, and running this on my local machine I get the following matrix_identity:
Am I right to assume that you have run these comparisons under branch
I've run
|
Yes - which branch should I be looking at? |
Branch |
So we're running under the same branch but getting different results? In both cases the calculated identities do not correspond to the expected identity, or to applying |
Considering our discussion yesterday @kiepczi - I've updated the reference matrix for ANIm identity consistency tests in 97e5570. This now reflects (i) the new method for calculating identity, and (ii) running A vs B and B vs A (i.e. nonsymmetry). A couple of things worth noting:
|
I have updated more tests and fixtures:

i) Adding more commands to the

```
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_multiple
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_mummer_multiple PASSED [100%]

==================================================================== 1 passed in 0.12s ====================================================================
```

```
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_job_generation
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_mummer_job_generation PASSED [100%]

==================================================================== 1 passed in 0.13s ====================================================================
```

ii) Modification of the

```
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_deltafile_parsing
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_deltafile_parsing PASSED [100%]

==================================================================== 1 passed in 0.13s ====================================================================
```
|
Although the majority of tests pass, three are currently failing:

```
================================================================= short test summary info =================================================================
FAILED tests/test_anim.py::test_genome_sorting - AssertionError: assert ('nucmer --mu...first.filter') == ('nucmer --mu...econd.filter')
FAILED tests/test_parsing.py::test_anim_delta - AssertionError: assert (4016947, 401...7031752, 2191) == (4016947, 401...4447228, 2191)
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
======================================= 3 failed, 115 passed, 1 xfailed, 1 xpassed, 3 warnings in 389.17s (0:06:29) =======================================
```

Here are my current thoughts on them:

@widdowquinn, if you have any ideas or comments regarding this, let me know. |
It would be worth looking back through the commit logs to be sure, but my memory is that sorting was introduced because…
Now, with |
Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of |
One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. |
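To make the polymorphism idea concrete, here is a sketch (class and method names are hypothetical, not from the pyani codebase) in which each tool-specific result class decides for itself what the single stored alignment length should be:

```python
from dataclasses import dataclass


@dataclass
class ComparisonResult:
    """Fields common to every pairwise comparison."""
    query: str
    subject: str

    def db_aln_length(self) -> int:
        """Single alignment length to store; subclasses define the policy."""
        raise NotImplementedError


@dataclass
class NucmerResult(ComparisonResult):
    """nucmer reports distinct query and subject alignment lengths."""
    query_aln_length: int
    subject_aln_length: int

    def db_aln_length(self) -> int:
        # Policy chosen here (an assumption for illustration): store the
        # query alignment length.
        return self.query_aln_length


@dataclass
class FastANIResult(ComparisonResult):
    """fastANI-style tools report a single matched length."""
    aln_length: int

    def db_aln_length(self) -> int:
        return self.aln_length
```

Code that populates the database row would then call `result.db_aln_length()` without needing to know which tool produced the result.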
I looked into how the Average Identity is calculated. To begin, it initialises accumulators for the combined alignment lengths and the weighted alignment identities:

```perl
my ($rqSumLen1, $rqSumLenM) = (0,0); # combined alignment length sum
my ($rqSumIdy1, $rqSumIdyM) = (0,0); # weighted alignment identity sum
```

The weighted alignment identities are then accumulated by dividing the average %ID of each alignment by 100 and multiplying it by the sum of the lengths of both the reference and query sequences (note that these values come from the parsed columns `$A[4]`, `$A[5]` and `$A[6]`):

```perl
$rqSumIdy1 += ($A[6] / 100.0) * ($A[4] + $A[5]);
```

The length of alignments is accumulated as the sum of the lengths of both the reference and query sequences:

```perl
$rqSumLen1 += ($A[4] + $A[5]);
```

The Average Identity is calculated by dividing the sum of all weighted alignment identities by the sum of all combined alignment lengths and then multiplying that value by 100:

```perl
($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0);
```

I manually calculated this for a small comparison of two viral genomes, where according to
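For clarity, the same accumulation can be expressed in Python (a sketch; each row mirrors the `$A[4]`, `$A[5]` and `$A[6]` columns used above):

```python
def average_identity_from_coords(rows):
    """Average Identity computed by the same accumulation as above.

    Each row is (ref_len, qry_len, pct_idy), mirroring the $A[4],
    $A[5] and $A[6] columns of a parsed alignment line.
    """
    sum_len = 0.0  # combined alignment length sum   ($rqSumLen1)
    sum_idy = 0.0  # weighted alignment identity sum ($rqSumIdy1)
    for ref_len, qry_len, pct_idy in rows:
        sum_idy += (pct_idy / 100.0) * (ref_len + qry_len)
        sum_len += ref_len + qry_len
    return sum_idy / sum_len * 100.0 if sum_len else 0.0


# The two-alignment viral-genome example discussed in this thread,
# using unrounded per-alignment %ID values:
rows = [
    (37629, 37636, 100.0 * 74835 / 75265),  # alignment 1
    (21545, 21551, 100.0 * 43082 / 43096),  # alignment 2
]
print(average_identity_from_coords(rows))
```

Feeding in unrounded %ID values reproduces the ~99.6249% figure; feeding in %IDs rounded to two decimal places (as in the coords output) shifts the result slightly, which is the rounding effect under discussion.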
The
Then, the calculations I get are: I attempted to replicate this value using So, let's calculate these values without rounding the individual sequence %IDs from the
For the first alignment calculations:

For the second alignment calculations:

Average %ID = (74834.99999999999 + 43082.0) / (37629 + 37636 + 21545 + 21551) * 100 = 99.62487643733999

@widdowquinn suggested that there could be a quicker way to calculate this by skipping the intermediate calculation of identity, as follows:

alignment 1 identity weighted = (37629 + 37636) - (2 * 215) = 74835

This example demonstrates that the differences in the reported numbers could arise from a rounding error, and it appears that our calculations would be more accurate. I have written a Python script that calculates these values using these approaches (available here). I ran this on a few examples and found that in some cases, even when we calculate the Average %ID with the individual aligned-sequence %IDs rounded, the values differ from the ones reported by

When the values are calculated in Python, I obtain the following results:
The average %ID reported by
I checked if the sequence percentage identities calculated for each alignment match the ones reported in the

One example where the percentage identity for the alignment does not match is:
Using the above information, I found the alignment in the
When calculating the percentage identity manually, I found that these do not match. In this example, the difference is almost 0.035%. As this happens in most of the alignments in this comparison, the differences add up to the point that the overall Average %ID differs significantly.

Reference alignment length = 3045 - 1486 + 1 = 1560 |
I should probably mention that all code and data used for the investigation can be found here. |
After discussion between @kiepczi and me, and inspecting the code of

We cannot then guarantee that our

At this point, my view is that the

The least time-consuming, and probably most useful, way forward is, I think, this:
|
Adding the |
To expand on this… the error below
arises because the keywords provided from a

If we require only one of those two lengths for the ANI result, then we should prefer one over the other, and we ought to be able to provide it to the
But, if we require two lengths, then we will have to change the schema/reporting/ORM to accommodate this. Which do we think we want? |
If we only require one alignment length per comparison for our reporting purposes, then the first solution is sufficient. Having both results from a tool like |
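Under that first solution, a small helper (hypothetical, for illustration only) could collapse nucmer's two lengths to the single value the current schema stores, before the `Comparison` row is created:

```python
def representative_aln_length(
    query_aln_length: int,
    subject_aln_length: int,
    prefer: str = "query",
) -> int:
    """Pick the one alignment length to store for a comparison.

    The database schema has a single alignment-length column, but
    nucmer reports separate query and subject lengths, so one must be
    chosen; the preference policy here is an assumption.
    """
    lengths = {"query": query_aln_length, "subject": subject_aln_length}
    try:
        return lengths[prefer]
    except KeyError:
        raise ValueError(f"unknown preference: {prefer!r}")
```

The ANIm code path would then pass, e.g., `aln_length=representative_aln_length(qaln_length, saln_length)` instead of forwarding both tool-specific keywords to the ORM constructor.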
Summary:
When calling the report for specific runs with the --run_results argument, a single value for the length of alignment is given in the alignment length column. However, the lengths of the alignments for the subject and query might be different.
Description:
When two genomes are aligned with `nucmer`, the length of the query alignment and the length of the subject alignment might differ. For example, suppose we have two genomes: genome A with 100 bases and genome B with 98 bases. We compare them with pyANI anim, and when we inspect the .filter files, we see that there was a deletion of the 11th and 16th bases in genome B.
However, when calling pyani report with the --run_results argument, the .tab file reported by pyANI contains a single column for alignment length, which in this case seems to report the length of the subject alignment.
To fix this issue, I propose that the above `.tab` output is modified to provide two columns for alignment lengths, e.g. `subject alignment length` and `query alignment length`.

NOTE: We are also aware that nucmer is not symmetrical (see #151). Therefore, we have decided to run two comparisons (forward and reverse), which we are currently working on. Since these issues are related, the output for matrices and reports will need to change to reflect this. Some work has already been done on this under the branch issue_421, and I will continue working on this issue there as well.
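The genome A/B example in the description can be sketched numerically (toy numbers, not real data):

```python
# Genome A (query): 100 bases. Genome B (subject): the same sequence
# with the 11th and 16th bases deleted, so 98 bases.
query_aln_length = 100   # alignment columns present in the query
subject_aln_length = 98  # alignment columns present in the subject

# The current --run_results .tab output has one alignment-length
# column, so one of these two values is necessarily dropped:
single_column_value = subject_aln_length  # what the report appears to show

print(query_aln_length, subject_aln_length, single_column_value)
```

With two columns (`query alignment length`, `subject alignment length`) both values would survive into the report, and the asymmetry would be visible rather than hidden.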
Reproducible Steps:
All input, outputs and code are provided here:
aln_length_issue.zip.
Current Output:
Expected Output:
pyani Version:
pyani v0.3