-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure ANIm identity scores are true metrics #151
Comments
Since this will affect the |
One way to check is to (i) read the algorithm to understand whether it is guaranteed to be symmetrical, (ii) read the MUMmer source to check that the implementation is guaranteed to be symmetrical, and (iii) perform a series of comparisons to confirm that there are no practicl counterexamples in those comparisons. I think MUMmer/nucmer is symmetrical in principle. But if it is not symmetrical in practice there are multiple alternative ways to force symmetry, including:
N.B. option 2 there is the way I've been thinking about for ensuring symmetry for a set of n>2 genomes:
|
My understanding of the issue is that regardless of the order in which the genomes are processed, we should obtain the same values for aligned bases. For example, comparing genome 1 versus genome 2 should return the same number of total aligned bases as comparing genome 2 versus genome 1. After fixing issue 340 with IntervalTree, where overlapped regions are no longer counted twice, I decided to run an analysis to see whether nucmer is symmetrical in practice, just as @widdowquinn suggested. Reproducibility steps:
The results of the analysis are attached in mummer_symetry_comparisions.csv. Each row in the CSV file provides the following information:
I think
Next step:
|
I looked further into this issue to see what really happens when we swap the order of the genomes. First, I ran At first glance, it looks like they have the same problem. It might seem like the order in which genomes are processed does not matter for viral comparisons, but a different value for I reproduced this issue on much smaller datasets and examined the delta-filter files directly to observe differences in the alignment of genome segments. I expected to find instances where certain sequences were extended either to the left or to the right, leading to discrepancies in the reported number of AlignedBases. For instance, Genome 1 might be aligned from 1bp to 450bp when compared to Genome 2, but the alignment of Genome 2 to Genome 1 might span from 1bp to 464bp. However, this wasn't the case. It appears that the problem arises when segments of Genome 1 are aligned to different regions of Genome 2. I attempted to visualize this issue, and you can view it here: nucmer_symmetry_investigation.pdf |
My initial thoughts - on a read of the literature - were not just that we would obtain the same number of aligned bases, but that we would obtain the same alignments between genomes A and B, regardless of which was the query and which was the reference. That seems like it may not be the case. In This would not necessarily be expected to be symmetrical, so the AlignedBases reported value would not necessarily be expected to be symmetrical (A vs B giving the same value as B vs A). The main implication of this, I think, is that we would need to modify |
That may be true, but I would be interested to know whether the IntervalTree count of aligned bases is symmetrical, where AlignedBases (reported by |
My apologies for not being clear earlier. All counts of aligned bases provided in The only thing we need to think about is what to do with the alignments that are not the same when the order in which genomes are processed is changed. Should we include them in the count of aligned bases, or not? |
Earlier this week, @widdowquinn and I discussed this issue and the necessary steps to address it. As part of our documentation and to keep track of current progress, it would be beneficial to write a few notes here. When running two comparisons (genome A vs genome B and genome B vs genome A), we are faced with choices regarding how to calculate and report the total number of aligned bases. The reported aligned bases can be:
Generally speaking, alignment results may vary depending on factors like the software and parameters used, the order of genomes etc. The returned alignment is an estimation rather than a ground truth. Therefore, there's no definitive answer on how to report the total number of aligned bases. It's important to make a consistent choice in reporting methods across all comparisons to ensure reproducibility and reliability of results. We agreed that calculating a summary between A:B and B:A might not be the best choice, as differences in reported alignments when changing the order of genomes may come from sequences of one genome aligning to different parts of the other genome, rather than an extended alignment. This leads us to the choice of reporting aligned bases calculated for either the query or reference genome. We agreed that for To achieve this, we need to make the following modifications to the current code:
|
NOTE: There are related issues to this one (see: #421), where work has already been in progress under the branch The work is still in progress, and it is possible that I have made some things worse, but here's a quick update on what we have so far.
The code was modified, and pyani available on branch
Here is a partially working example on two viral genomes where current pyani (0.3v) returns genome coverage of more than 1.0 using the default
[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(citation=False, classes=PosixPath('symmetry_data/input/test_1/classes.txt'), dbpath=PosixPath('.pyani/pyanidb'), debug=True, disable_tqdm=False, filter_exe=PosixPath('delta-filter'), func=<function subcmd_anim at 0x7fec08257d30>, indir=PosixPath('symmetry_data/input/test_1'), jobprefix='PYANI', labels=PosixPath('symmetry_data/input/test_1/labels.txt'), logfile=PosixPath('test_1.log'), maxmatch=False, name='test_1', nofilter=False, nucmer_exe=PosixPath('nucmer'), outdir=PosixPath('symmetry_data/output/test_1'), recovery=False, scheduler='multiprocessing', sgeargs=None, sgegroupsize=10000, verbose=False, version=False, workers=None)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_issue_421/bin/pyani anim -i symmetry_data/input/test_1 -o symmetry_data/output/test_1 -l test_1.log --name test_1 --labels symmetry_data/input/test_1/labels.txt --classes symmetry_data/input/test_1/classes.txt --debug
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: & Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 25:3389-3402
[INFO] [pyani.scripts.pyani_script]: Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: (2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.pyani_script]: fastANI: Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis K, and
[INFO] [pyani.scripts.pyani_script]: Aluru S (2018) 'High throughput ANI analysis of 90K prokaryotic
[INFO] [pyani.scripts.pyani_script]: genomes reveals clear species boundaries.' Nature Communications 9, 5114
[INFO] [pyani.scripts.pyani_script]: Checking for database file: .pyani/pyanidb
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running ANIm analysis
[INFO] [pyani.scripts.subcommands.subcmd_anim]: MUMMer nucmer version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Analysis name: test_1
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Connecting to database .pyani/pyanidb
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding run info to database .pyani/pyanidb...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: ...added run ID: 1 to the database
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding genomes for run 1 to database...
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0264574.fna.md5.
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0266457.fna.md5.
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: ...added genome IDs: [None, None]
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Generating ANIm command-lines
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: NUCmer output will be written temporarily to symmetry_data/output/test_1/nucmer_output
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Creating output directory symmetry_data/output/test_1/nucmer_output
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling genomes for comparison
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Collected 2 genomes for this run
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling pairwise comparisons (this can take time for large datasets)...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22192.08it/s]
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...total pairwise comparisons to be performed: 1
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Checking database for existing comparison data...
[DEBUG] [pyani.pyani_orm]: Existing comparisons
{}
[DEBUG] [pyani.pyani_orm]: Checking for existing comparisons, with unique constraints
program: nucmer
version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
fragsize: None
maxmatch: False
kmersize: None
minmatch: None
[DEBUG] [pyani.pyani_orm]: Checking for existing comparison: Genome 1: MGV_MGV-GENOME-0264574 (1) vs Genome 2: MGV_MGV-GENOME-0266457 (2)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...after check, still need to run 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Assuming no pre-existing output files
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Creating NUCmer jobs for ANIm
0%| | 0/1 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
Genome 1: MGV_MGV-GENOME-0264574, Genome 2: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna
delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna
delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Results not found for 2 comparisons; 2 new jobs built.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 986.20it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Generated 2 jobs, 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Passing 2 jobs to multiprocessing...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Scheduler: multiprocessing
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running jobs with multiprocessing
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: (using maximum number of worker threads)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Multiprocessing run completed without error
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...jobs complete
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Adding comparison results to database...
0%| | 0/2 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]: MGV_MGV-GENOME-0266457 vs MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov 0.997860036175579, query length: 39253, query aln length: 39169, query descripton: MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov 0.9894428448754862, subject length: 39594, subject aln length: 39176, subject descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: MGV_MGV-GENOME-0264574 vs MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov 0.9894428448754862, query length: 39594, query aln length: 39176, query descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov 0.997860036175579, subject length: 39253, subject aln length: 39169, subject descripton: MGV_MGV-GENOME-0264574
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 378.51it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Committing results to database
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...database updated.
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 1.632
When trying to generate reports with the following command: pyani report --runs -o <path_to_the_output> --formats=stdout --run_results 1 The reports are also generated without any errors, and the values are improved and as expected when overlaps are not counted twice. When calling the report for specific runs, we might want to add separate columns for the
The only issue, we have is that we cannot generate matrices (eg. coverage_matrix, identity_matrix) as it results in an error being raised: (pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro issue_421 % pyani report -o symmetry_data/output/test_1 --formats=stdout --run_matrices 1
Traceback (most recent call last):
File "/opt/anaconda3/envs/pyani_issue_421/bin/pyani", line 33, in <module>
sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
returnval = args.func(args)
File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/subcommands/subcmd_report.py", line 265, in subcmd_report
matlabel_dict = get_matrix_labels_for_run(session, run_id)
File "/Users/angelikakiepas/Desktop/pyani/pyani/pyani_orm.py", line 386, in get_matrix_labels_for_run
session.query(Genome.genome_id, Label.label)
File "<string>", line 2, in join
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/base.py", line 280, in _generative
x = fn(self, *args, **kw)
File "<string>", line 2, in join
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/base.py", line 303, in generate
fn(self, *args, **kw)
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2428, in join
onclause_element = coercions.expect(
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 396, in expect
resolved = impl._literal_coercion(
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 899, in _literal_coercion
self._raise_for_expected(element)
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 518, in _raise_for_expected
raise exc.ArgumentError(msg, code=code) from err
sqlalchemy.exc.ArgumentError: ON clause, typically a SQL expression or ORM relationship attribute expected, got <class 'pyani.pyani_orm.Run'>. I understand that, I will need to update the matrices to reflect the difference between the two comparisons, but I can't get to figure out what is causing the above error. |
I thought I'd fixed that error on |
There you go: run_fail.zip. It made me wonder if maybe I haven't rebase to |
Just a quick update on the matter. The issue turned out to be caused by incorrect rebasing on my end. The code is working as expected under the branch Before we can implement the changes in the
|
After working on issue #422 and adding additional columns for the I have been working on
For example, running
However, running the comparisons under branch
For reference, here is a table output for the parameter
The only issue I am currently facing is that no graphical output is provided when running the analysis under branch [INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(version=False, citation=False, logfile=None, verbose=True, debug=False, disable_tqdm=False, outdir=PosixPath('output'), run_ids=['1'], dbpath=PosixPath('.pyani/pyanidb'), formats=['pdf'], method='seaborn', workers=None, func=<function subcmd_plot at 0x1137e8d30>)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_v3/bin/pyani plot -o output --run_id 1 -v --formats pdf
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: & Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 25:3389-3402
[INFO] [pyani.scripts.pyani_script]: Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: (2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Generating graphical output for analyses
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Writing output to: output
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Rendering method: seaborn
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 6.745 |
I'm marking this for closure, because my understanding is that this issue was fixed with 1f10a6c. |
We've established that ANIm - as originally implemented - is not a true metric. But we still have to identify a method that does constitute an actual metric (though it may be constrained to the specific genome input set…). |
Summary:
It's not precisely established that ANIm identities are true metrics - we need to demonstrate and ensure this, otherwise the clique-based classification is not as sound as we'd hope.
The text was updated successfully, but these errors were encountered: