Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ANIm identity scores are true metrics #151

Closed
widdowquinn opened this issue Oct 2, 2019 · 16 comments
Closed

Ensure ANIm identity scores are true metrics #151

widdowquinn opened this issue Oct 2, 2019 · 16 comments
Assignees
Labels
enhancement something we'd like pyani to do that it doesn't already Marked for closure Ready to close method the issue relates to how results are calculated
Milestone

Comments

@widdowquinn
Copy link
Owner

Summary:

It's not precisely established that ANIm identities are true metrics - we need to demonstrate and ensure this, otherwise the clique-based classification is not as sound as we'd hope.

@widdowquinn widdowquinn added the enhancement something we'd like pyani to do that it doesn't already label Oct 2, 2019
@widdowquinn widdowquinn self-assigned this Oct 2, 2019
@widdowquinn widdowquinn added this to the 0.3.0 milestone May 28, 2020
@widdowquinn widdowquinn added the method the issue relates to how results are calculated label May 29, 2020
@baileythegreen
Copy link
Contributor

Since this will affect the pyani classify stuff, do you have a method in mind for ensuring this, or is that another thing ot think about?

@widdowquinn
Copy link
Owner Author

One way to check is to (i) read the algorithm to understand whether it is guaranteed to be symmetrical, (ii) read the MUMmer source to check that the implementation is guaranteed to be symmetrical, and (iii) perform a series of comparisons to confirm that there are no practicl counterexamples in those comparisons.

I think MUMmer/nucmer is symmetrical in principle. But if it is not symmetrical in practice there are multiple alternative ways to force symmetry, including:

  1. Align A:B and B:A and then take a summary (max, min, mean)
  2. Align A:B and B:A and then use, e.g. BedTools to identify the common intervals in the alignments, and report values for that

N.B. option 2 there is the way I've been thinking about for ensuring symmetry for a set of n>2 genomes:

  • carry out all pairwise comparisons
  • on each genome, identify all intervals that participate in all pairwise comparisons (these are the regions of the genome that are common to all genomes in the input set)
  • calculate identity/coverage etc. only for this set of regions (coverage here has a different interpretation than in pairwise-only comparisons)

@baileythegreen
Copy link
Contributor

Discussions in #340, #342, and #373 are relevant here.

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 9, 2024

One way to check is to (i) read the algorithm to understand whether it is guaranteed to be symmetrical, (ii) read the MUMmer source to check that the implementation is guaranteed to be symmetrical, and (iii) perform a series of comparisons to confirm that there are no practicl counterexamples in those comparisons.

My understanding of the issue is that regardless of the order in which the genomes are processed, we should obtain the same values for aligned bases. For example, comparing genome 1 versus genome 2 should return the same number of total aligned bases as comparing genome 2 versus genome 1.

After fixing issue 340 with IntervalTree, where overlapped regions are no longer counted twice, I decided to run an analysis to see whether nucmer is symmetrical in practice, just as @widdowquinn suggested.

Reproducibility steps:

  1. Generated 20 test sets (1 to 10 viral and 11 to 20 bacterial genomes).
  2. Ran nucmer and compared genomes 1 vs genome 2 (G1_vs_G2) and genome 2 vs genome 1 (G2_vs_G1).
  3. Compared the values returned.

The results of the analysis are attached in mummer_symetry_comparisions.csv. Each row in the CSV file provides the following information:

  • Test number provided in test_number column.
  • Accession numbers for genome 1 (Genome_1) and genome 2 (Genome_2).
  • Parameters used in the analysis (e.g., default nucmer settings, maxmatch, noextend, etc.) are given in cmd_param.
  • Values for the comparisons are given in columns with the following format G<num>_vs_G<num>|Genome_<num>, where the order of the genomes used in the analysis is given before |, and the remaining part specifies the genome, e.g., Genome_1 for genome 1 and Genome_2 for genome 2.
  • Differences between the two comparisons are given in G1_diff (for genome 1) and G2_diff (for genome 2).

I think MUMmer/nucmer is symmetrical in principle.

I think nucmer is not symmetrical, at least not in practice. This is based on the following observations:

  • Excluding overlaps from the calculation of the number of aligned bases, the value might differ depending on the order in which the genomes are processed, suggesting that nucmer is not symmetrical in practice.
  • The difference between the values is smaller when just the --maxmatch parameter is used.
  • The difference is much smaller, in fact mostly 0 for viral genomes.

Next step:

  • Determine where these differences come from, perhaps by checking which aligned regions do not match when the order of the genome is changed.

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 9, 2024

I looked further into this issue to see what really happens when we swap the order of the genomes.

First, I ran dnadiff on all test sets as before (e.g., genome 1 vs genome 2 and genome 2 vs genome 1), and I compared the number of reported AlignedBases. I updated the mummer_symmetry_comparisons.csv file. All values and differences in reported values can be found in rows where the column cmd_param is dnadiff.

At first glance, it looks like they have the same problem. It might seem like the order in which genomes are processed does not matter for viral comparisons, but a different value for AlignedBases is reported when much bigger bacterial genomes are compared.

I reproduced this issue on much smaller datasets and examined the delta-filter files directly to observe differences in the alignment of genome segments. I expected to find instances where certain sequences were extended either to the left or to the right, leading to discrepancies in the reported number of AlignedBases. For instance, Genome 1 might be aligned from 1bp to 450bp when compared to Genome 2, but the alignment of Genome 2 to Genome 1 might span from 1bp to 464bp. However, this wasn't the case. It appears that the problem arises when segments of Genome 1 are aligned to different regions of Genome 2. I attempted to visualize this issue, and you can view it here: nucmer_symmetry_investigation.pdf
. The raw data is provided here small_test.zip.

@widdowquinn
Copy link
Owner Author

widdowquinn commented Feb 10, 2024

My understanding of the issue is that regardless of the order in which the genomes are processed, we should obtain the same values for aligned bases.

My initial thoughts - on a read of the literature - were not just that we would obtain the same number of aligned bases, but that we would obtain the same alignments between genomes A and B, regardless of which was the query and which was the reference. That seems like it may not be the case.

In mummer_symmetry_comparisons.csv you establish that the reported AlignedBases measure from dnadiff.pl is not symmetrical. However, as we discovered from reading the source code for show-diff, the value reported from dnadiff.pl is not a straight count of the total number of aligned bases, but a count of the number of bases aligned in a path that is chosen through multiple possible pairwise alignments and ordered into an alignment of the genomes.

This would not necessarily be expected to be symmetrical, so the AlignedBases reported value would not necessarily be expected to be symmetrical (A vs B giving the same value as B vs A).

The main implication of this, I think, is that we would need to modify pyani so that we run A vs B and B vs A for all pairwise comparisons, if reporting/using this figure.

@widdowquinn
Copy link
Owner Author

I think nucmer is not symmetrical, at least not in practice.

That may be true, but I would be interested to know whether the IntervalTree count of aligned bases is symmetrical, where AlignedBases (reported by dnadiff.pl) is not. Did you happen to try that?

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 10, 2024

That may be true, but I would be interested to know whether the IntervalTree count of aligned bases is symmetrical, where AlignedBases (reported by dnadiff.pl) is not. Did you happen to try that?

My apologies for not being clear earlier. All counts of aligned bases provided in mummer_symmetry_comparisons.csv were obtained with our newest code, which relies on IntervalTree to correct for overlapped regions. So, I guess that's bad news... I think you're right that we will need to run A vs B and B vs A for all pairwise comparisons. Hopefully, this will not slow pyANI as much.

The only thing we need to think about is what to do with the alignments that are not the same when the order in which genomes are processed is changed. Should we include them in the count of aligned bases, or not?

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 14, 2024

Earlier this week, @widdowquinn and I discussed this issue and the necessary steps to address it. As part of our documentation and to keep track of current progress, it would be beneficial to write a few notes here.

When running two comparisons (genome A vs genome B and genome B vs genome A), we are faced with choices regarding how to calculate and report the total number of aligned bases. The reported aligned bases can be:

  • Calculation of a summary (max, min, mean) between A:B and B:A.
  • Reporting for the query, such as reporting for genome B in A:B comparisons and genome A in B:A comparisons.
  • Reporting for the reference, for example, reporting for genome A in A:B comparisons and genome B in B:A comparisons.

Generally speaking, alignment results may vary depending on factors like the software and parameters used, the order of genomes etc. The returned alignment is an estimation rather than a ground truth. Therefore, there's no definitive answer on how to report the total number of aligned bases. It's important to make a consistent choice in reporting methods across all comparisons to ensure reproducibility and reliability of results.

We agreed that calculating a summary between A:B and B:A might not be the best choice, as differences in reported alignments when changing the order of genomes may come from sequences of one genome aligning to different parts of the other genome, rather than an extended alignment. This leads us to the choice of reporting aligned bases calculated for either the query or reference genome. We agreed that for pyANI, we will report the number of aligned bases for the query.

To achieve this, we need to make the following modifications to the current code:

  1. Obtain two comparisons (A:B and B:A) for nucmer and delta-filter. To ensure backward compatibility, it was agreed that no command-line parameters would be changed (e.g., ensuring 1-to-1 alignments that allow for rearrangements, using the --mum parameter as default, but allowing --maxmatch as an option).
  2. Read generated files and add them to the database.
  3. Calculate the values for the query using IntervalTree to correct for overlapping regions.

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 22, 2024

NOTE: There are related issues to this one (see: #421), where work has already been in progress under the branch issue_421. So, the progress can be found/tested there.

The work is still in progress, and it is possible that I have made some things worse, but here's a quick update on what we have so far.

  1. Obtain two comparisons (A:B and B:A) for nucmer and delta-filter. To ensure backward compatibility, it was agreed that no command-line parameters would be changed (e.g., ensuring 1-to-1 alignments that allow for rearrangements, using the --mum parameter as default, but allowing --maxmatch as an option).
  2. Read generated files and add them to the database.
  3. Calculate the values for the query using IntervalTree to correct for overlapping regions.

The code was modified, and pyani available on branch issue_421, will now result in two comparisons (forward and reverse). The output of nucmer for both comparisons is now provided in nucmer_output, which will have the following structure (example for comparisons of 2 genomes):

nucmer_output
├── Genome_A
│   ├── Genome_A_vs_Genome_B.delta
│   └── Genome_A_vs_Genome_B.filter
└── Genome_B
    ├── Genome_B_vs_Genome_A.delta
    └── Genome_B_vs_Genome_A.filter 

.filter files from both comparisons are read and added to the database. Additionally, I have also implemented IntervalTree to exclude overlaps when calculating genome coverage.

Here is a partially working example on two viral genomes where current pyani (0.3v) returns genome coverage of more than 1.0 using the default pyani anim command:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(citation=False, classes=PosixPath('symmetry_data/input/test_1/classes.txt'), dbpath=PosixPath('.pyani/pyanidb'), debug=True, disable_tqdm=False, filter_exe=PosixPath('delta-filter'), func=<function subcmd_anim at 0x7fec08257d30>, indir=PosixPath('symmetry_data/input/test_1'), jobprefix='PYANI', labels=PosixPath('symmetry_data/input/test_1/labels.txt'), logfile=PosixPath('test_1.log'), maxmatch=False, name='test_1', nofilter=False, nucmer_exe=PosixPath('nucmer'), outdir=PosixPath('symmetry_data/output/test_1'), recovery=False, scheduler='multiprocessing', sgeargs=None, sgegroupsize=10000, verbose=False, version=False, workers=None)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_issue_421/bin/pyani anim -i symmetry_data/input/test_1 -o symmetry_data/output/test_1 -l test_1.log --name test_1 --labels symmetry_data/input/test_1/labels.txt --classes symmetry_data/input/test_1/classes.txt --debug
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]:    Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]:    & Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]:    food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]:    Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]:    MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]:    C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]:    for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]:    BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]:    Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]:    BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]:    BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]:    Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]:    a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]:    25:3389-3402
[INFO] [pyani.scripts.pyani_script]:    Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]:    Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]:    (2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]:    molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.pyani_script]:    fastANI: Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis K, and
[INFO] [pyani.scripts.pyani_script]:    Aluru S (2018) 'High throughput ANI analysis of 90K prokaryotic
[INFO] [pyani.scripts.pyani_script]:    genomes reveals clear species boundaries.' Nature Communications 9, 5114
[INFO] [pyani.scripts.pyani_script]: Checking for database file: .pyani/pyanidb
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running ANIm analysis
[INFO] [pyani.scripts.subcommands.subcmd_anim]: MUMMer nucmer version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Analysis name: test_1
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Connecting to database .pyani/pyanidb
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding run info to database .pyani/pyanidb...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: ...added run ID: 1 to the database
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding genomes for run 1 to database...
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0264574.fna.md5.
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0266457.fna.md5.
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]:        ...added genome IDs: [None, None]
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Generating ANIm command-lines
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: NUCmer output will be written temporarily to symmetry_data/output/test_1/nucmer_output
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Creating output directory symmetry_data/output/test_1/nucmer_output
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling genomes for comparison
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Collected 2 genomes for this run
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling pairwise comparisons (this can take time for large datasets)...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22192.08it/s]
[INFO] [pyani.scripts.subcommands.subcmd_anim]:         ...total pairwise comparisons to be performed: 1
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Checking database for existing comparison data...
[DEBUG] [pyani.pyani_orm]: Existing comparisons
{}
[DEBUG] [pyani.pyani_orm]: Checking for existing comparisons, with unique constraints 
        program: nucmer
        version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
        fragsize: None
        maxmatch: False
        kmersize: None
        minmatch: None

[DEBUG] [pyani.pyani_orm]: Checking for existing comparison: Genome 1: MGV_MGV-GENOME-0264574 (1) vs Genome 2: MGV_MGV-GENOME-0266457 (2)
[INFO] [pyani.scripts.subcommands.subcmd_anim]:         ...after check, still need to run 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]:        Assuming no pre-existing output files
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Creating NUCmer jobs for ANIm
  0%|                                                                                                                                                                         | 0/1 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
        Genome 1: MGV_MGV-GENOME-0264574, Genome 2: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
        nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna
        delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
        nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna
        delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Results not found for 2 comparisons; 2 new jobs built.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 986.20it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Generated 2 jobs, 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Passing 2 jobs to multiprocessing...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Scheduler: multiprocessing
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running jobs with multiprocessing
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: (using maximum number of worker threads)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Multiprocessing run completed without error
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...jobs complete
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Adding comparison results to database...
  0%|                                                                                                                                                                         | 0/2 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]:        MGV_MGV-GENOME-0266457 vs MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov      0.997860036175579, query length: 39253, query aln length: 39169, query descripton: MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov    0.9894428448754862, subject length: 39594, subject aln length: 39176, subject descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]:        MGV_MGV-GENOME-0264574 vs MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov      0.9894428448754862, query length: 39594, query aln length: 39176, query descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov    0.997860036175579, subject length: 39253, subject aln length: 39169, subject descripton: MGV_MGV-GENOME-0264574
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 378.51it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Committing results to database
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...database updated.
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 1.632

When trying to generate reports with the following command:

pyani report --runs -o <path_to_the_output> --formats=stdout --run_results 1

The reports are also generated without any errors, and the values are improved and as expected when overlaps are not counted twice. When calling the report for specific runs, we might want to add separate columns for the subject alignment length and query alignment length, as these are not the same. Currently, the alignment length column holds the value for query alignment length.

TABLE: symmetry_data/output/test_1/runs
   run ID    name method                   date run                                                                                                                                                                                                                                 command-line
0       1  test_1   ANIm 2024-02-22 17:46:55.213302  /opt/anaconda3/envs/pyani_issue_421/bin/pyani anim -i symmetry_data/input/test_1 -o symmetry_data/output/test_1 -l test_1.log --name test_1 --labels symmetry_data/input/test_1/labels.txt --classes symmetry_data/input/test_1/classes.txt

TABLE: symmetry_data/output/test_1/results_1
   Comparison ID  Query ID       Query description  Subject ID     Subject description  % identity  % query coverage  % subject coverage  alignment length  similarity errors program                                                      version fragment size  maxmatch  Run ID
0              1         1  MGV_MGV-GENOME-0264574           2  MGV_MGV-GENOME-0266457    0.994332          0.997860            0.989443             39169                222  nucmer  Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)          None     False       1
1              2         2  MGV_MGV-GENOME-0266457           1  MGV_MGV-GENOME-0264574    0.994333          0.989443            0.997860             39176                222  nucmer  Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)          None     False       1

The only issue, we have is that we cannot generate matrices (eg. coverage_matrix, identity_matrix) as it results in an error being raised:

(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro issue_421 % pyani report -o symmetry_data/output/test_1 --formats=stdout --run_matrices 1
Traceback (most recent call last):
  File "/opt/anaconda3/envs/pyani_issue_421/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
    returnval = args.func(args)
  File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/subcommands/subcmd_report.py", line 265, in subcmd_report
    matlabel_dict = get_matrix_labels_for_run(session, run_id)
  File "/Users/angelikakiepas/Desktop/pyani/pyani/pyani_orm.py", line 386, in get_matrix_labels_for_run
    session.query(Genome.genome_id, Label.label)
  File "<string>", line 2, in join
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/base.py", line 280, in _generative
    x = fn(self, *args, **kw)
  File "<string>", line 2, in join
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/base.py", line 303, in generate
    fn(self, *args, **kw)
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2428, in join
    onclause_element = coercions.expect(
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 396, in expect
    resolved = impl._literal_coercion(
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 899, in _literal_coercion
    self._raise_for_expected(element)
  File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 518, in _raise_for_expected
    raise exc.ArgumentError(msg, code=code) from err
sqlalchemy.exc.ArgumentError: ON clause, typically a SQL expression or ORM relationship attribute expected, got <class 'pyani.pyani_orm.Run'>.

I understand that, I will need to update the matrices to reflect the difference between the two comparisons, but I can't get to figure out what is causing the above error.

@widdowquinn
Copy link
Owner Author

widdowquinn commented Feb 22, 2024

I thought I'd fixed that error on master - do you have a minimal reproducible example as a script and set of input files I can try? You can upload the inputs/script as a .zip file here, if you like.

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 23, 2024

There you go: run_fail.zip.

It made me wonder if maybe I haven't rebase to master properly.

@kiepczi
Copy link
Collaborator

kiepczi commented Feb 23, 2024

Just a quick update on the matter. The issue turned out to be caused by incorrect rebasing on my end. The code is working as expected under the branch issue_421.

Before we can implement the changes in the master branch, we first need to:

  • Update matrices and reports to reflect the changes made (eg. forward and reverse comparisons)
  • Modify tests so the expected values/objects reflect the changes in anim.py and subcmd_anim.py
  • Test locally with pytest (all tests should pass)

@kiepczi
Copy link
Collaborator

kiepczi commented Mar 4, 2024

After working on issue #422 and adding additional columns for the --run_results output table, I have also worked on things relevant to this issue.

I have been working on pyANI under branch issue_421, and the code will now result in:

  • Running 2 comparisons (forward and reverse)
  • Overlapping regions are no longer counted twice (excluded with the IntervalTree Python package)
  • Matrices are now updated to reflect these changes, where the values are provided for the query

For example, running pyANI analysis on 2 troublesome viral genomes using the current code of pyANI available under the master branch (all code, input, and output available here test_run_pyani_current.zip) will result in:

  • Genome coverage more than 1.0:
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	1.0	1.5078337961
MGV_MGV-GENOME-0266457:2	1.4948477042000001	1.0
  • Alignment lengths longer than either genome, and symmetrical values (e.g., the length of the alignment provided in matrix_aln_lengths.tab are the same for A vs B, and B vs A):
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	39253	59187
MGV_MGV-GENOME-0266457:2	59187	39594
  • Symmetrical values for genome identity:
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	1.0	0.9962491763
MGV_MGV-GENOME-0266457:2	0.9962491763	1.0

However, running the comparisons under branch issue_421 corrected for these issues (all code, input, and output data are provided here: test_run.zip). For example,

  • Coverage is no longer more than 1:
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	1.0	0.9978600362000001
MGV_MGV-GENOME-0266457:2	0.9894428449	1.0
  • Alignment lengths are no longer the same and are reported for the query:
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	39253	39169
MGV_MGV-GENOME-0266457:2	39176	39594
  • Genome identity values are no longer symmetrical and are given for the query:
	MGV_MGV-GENOME-0264574:1	MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1	1.0	0.9943322525
MGV_MGV-GENOME-0266457:2	0.9943332653	1.0

For reference, here is a table output for the parameter --run_results:

	Comparison ID	Query ID	Query description	Subject ID	Subject description	% query identity	% subject identity	% query coverage	% subject coverage	query alignment length	subject alignment length	similarity errors	program	version	fragment size	maxmatch	Run ID
0	1	2	MGV_MGV-GENOME-0266457	1	MGV_MGV-GENOME-0264574	0.9943332652644477	0.9943322525466568	0.9894428448754862	0.997860036175579	39176	39169	222	nucmer	Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)		False	1
1	2	1	MGV_MGV-GENOME-0264574	2	MGV_MGV-GENOME-0266457	0.9943322525466568	0.9943332652644477	0.997860036175579	0.9894428448754862	39169	39176	222	nucmer	Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)		False	1

The only issue I am currently facing is that no graphical output is provided when running the analysis under branch issue_421. I'm currently trying to investigate what causes this. Interestingly, there is no error raised:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(version=False, citation=False, logfile=None, verbose=True, debug=False, disable_tqdm=False, outdir=PosixPath('output'), run_ids=['1'], dbpath=PosixPath('.pyani/pyanidb'), formats=['pdf'], method='seaborn', workers=None, func=<function subcmd_plot at 0x1137e8d30>)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_v3/bin/pyani plot -o output --run_id 1 -v --formats pdf
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: 	Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: 	& Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: 	food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: 	Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: 	MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: 	C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: 	for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: 	BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: 	Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: 	BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: 	BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: 	Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: 	a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 	25:3389-3402
[INFO] [pyani.scripts.pyani_script]: 	Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: 	Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: 	(2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: 	molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Generating graphical output for analyses
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Writing output to: output
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Rendering method: seaborn
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 6.745

@kiepczi
Copy link
Collaborator

kiepczi commented Apr 18, 2024

I'm marking this for closure, because my understanding is that this issue was fixed with 1f10a6c.

@kiepczi kiepczi added the Marked for closure Ready to close label Apr 18, 2024
@widdowquinn
Copy link
Owner Author

We've established that ANIm - as originally implemented - is not a true metric. But we still have to identify a method that does constitute an actual metric (though it may be constrained to the specific genome input set…).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement something we'd like pyani to do that it doesn't already Marked for closure Ready to close method the issue relates to how results are calculated
Projects
Status: Done
Development

No branches or pull requests

3 participants