Roadmap
This page contains notes on the planned future development of pyani.
The current interface for pyani is to call either the average_nucleotide_identity.py or the genbank_get_genomes_by_taxon.py script with a combination of arguments. For the average_nucleotide_identity.py script in particular, some arguments perform a stage of the total analysis and others prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.
More specifically, I would like to enable operations such as:
- pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
- pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
- pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism
- pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism
- pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
- pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
- pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
- pyani.py db --update my_organism_ANIm: update the current comparison data database with the results contained in my_organism_ANIm; this might be useful after a partial run/failure.
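A minimal sketch of how the COMMAND OPTIONS structure above could be laid out with argparse subparsers. The commands and flags are the ones listed above; the destination names (taxon, outdir, indir) and the build_parser function are assumptions made for illustration, not an existing pyani interface.

```python
import argparse


def build_parser():
    """Build a git-style parser of the form: pyani.py COMMAND OPTIONS."""
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # download: fetch NCBI assemblies under a taxon ID into a directory
    p_download = subparsers.add_parser("download")
    p_download.add_argument("-t", dest="taxon", required=True)
    p_download.add_argument("-o", dest="outdir", required=True)

    # index: generate a hash for each genome in a directory
    p_index = subparsers.add_parser("index")
    p_index.add_argument("indir")

    # anim and anib take the same options in the examples above
    for name in ("anim", "anib"):
        p_ani = subparsers.add_parser(name)
        p_ani.add_argument("indir")
        p_ani.add_argument("-o", dest="outdir", required=True)
        p_ani.add_argument("--scheduler", default=None)

    # render: draw graphical output for an analysis directory
    p_render = subparsers.add_parser("render")
    p_render.add_argument("indir")
    p_render.add_argument("--gmethod", default=None)

    # classify: classification analysis of an analysis directory
    p_classify = subparsers.add_parser("classify")
    p_classify.add_argument("indir")

    # db: set the current database, or update it from a results directory
    p_db = subparsers.add_parser("db")
    p_db.add_argument("--setdb", default=None)
    p_db.add_argument("--update", default=None)

    return parser


# Parse one of the example command lines from the list above
args = build_parser().parse_args(
    ["anim", "my_organism", "-o", "my_organism_ANIm", "--scheduler", "SGE"]
)
print(args.command, args.indir, args.outdir, args.scheduler)
```

Each subcommand gets its own option namespace, so stage-specific arguments no longer have to share one flat list of flags.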
Some modifications to the options are also desirable:
- specify multiple input directories
- specify multiple class/label files
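One way these two modifications might look with argparse, as a sketch only. The positional name indirs and the --classes and --labels flags are assumptions for illustration; the roadmap does not fix how multiple inputs would be spelled on the command line.

```python
import argparse

# Hypothetical option layout: one or more input directories as positionals,
# and repeatable flags for class and label files.
parser = argparse.ArgumentParser(prog="pyani.py anim")
parser.add_argument("indirs", nargs="+", help="one or more input directories")
parser.add_argument("--classes", action="append", default=[],
                    help="class file (may be given more than once)")
parser.add_argument("--labels", action="append", default=[],
                    help="label file (may be given more than once)")

multi_args = parser.parse_args(
    ["genomes_a", "genomes_b", "--classes", "a_classes.tab",
     "--classes", "b_classes.tab"]
)
print(multi_args.indirs)   # list of input directories
print(multi_args.classes)  # list of class files
```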
I aim to store all comparison results in a persistent database, so that incremental additions to existing analyses are easier and partially complete jobs can be resumed.
- A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
- The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
- The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
  - The indexing may be performed during download with pyani download
  - Indexing may be forced with pyani index <directory>
- Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
- Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
- For each comparison, we will record in another table the values that are currently recorded in the output .tab files
- Anticipated tables:
  - genomes: hashes of genome sequences
  - paths: known paths for each hash, keyed by hash from genomes
  - comparisons: pairwise comparisons conducted, multikeyed by query and subject genomes from genomes, with a column describing the comparison (and options used)
  - data: pairwise comparison results (identity, coverage, mismatches, etc.), i.e. what is currently reported in .tab files
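The indexing step and the anticipated tables could be sketched roughly as follows, using only the standard library. The table names come from the list above; every column name, the choice to hash raw file bytes rather than parsed sequence, the file suffixes, and the function names open_db, md5_of_file and index_directory are assumptions for illustration.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema for the four anticipated tables
SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT NOT NULL REFERENCES genomes(hash),
    path TEXT NOT NULL,
    UNIQUE (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY,
    query_hash TEXT NOT NULL REFERENCES genomes(hash),
    subject_hash TEXT NOT NULL REFERENCES genomes(hash),
    tool TEXT NOT NULL,
    options TEXT,
    run_date TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS data (
    comparison_id INTEGER NOT NULL REFERENCES comparisons(comparison_id),
    identity REAL,
    coverage REAL,
    mismatches INTEGER
);
"""


def open_db(location):
    """Open the database at location, creating it and its tables if needed."""
    if location != ":memory:":
        Path(location).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(location)
    conn.executescript(SCHEMA)
    return conn


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file's bytes, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def index_directory(conn, indir, suffixes=(".fna", ".fasta", ".fas")):
    """Hash each genome file in indir; record the hash and where it was seen."""
    hashes = {}
    for path in sorted(Path(indir).iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        digest = md5_of_file(path)
        # A previously-seen genome keeps one row in genomes; each new
        # location for it adds a row to paths.
        conn.execute("INSERT OR IGNORE INTO genomes (hash) VALUES (?)", (digest,))
        conn.execute(
            "INSERT OR IGNORE INTO paths (hash, path) VALUES (?, ?)",
            (digest, str(path.resolve())),
        )
        hashes[path.name] = digest
    conn.commit()
    return hashes
```

Under these assumptions, index_directory(open_db(".pyani/pyanidb"), "my_organism") would record one genomes row per distinct file content and one paths row per file location, so a genome copied into a second directory is recognised rather than treated as new.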
This database will allow rapid identification of which analyses have been performed before, negating the need to redo comparisons.
It will also provide a persistent record of comparisons which can be accessed for downstream analyses using, e.g. pyani render and a set of genome files (or list of their hashes?). This will allow ready subsetting of outputs.
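A rough sketch of the two lookups described here: finding which comparisons still need to be run, and pulling out the results for a subset of genomes by hash. The table layout and column names are assumed for illustration (a cut-down comparisons and data table), as are the function names and the treatment of comparisons as ordered query/subject pairs.

```python
import itertools
import sqlite3


def pending_pairs(conn, hashes, tool):
    """Return ordered (query, subject) hash pairs not yet compared with tool."""
    done = set(
        conn.execute(
            "SELECT query_hash, subject_hash FROM comparisons WHERE tool = ?",
            (tool,),
        )
    )
    wanted = itertools.permutations(sorted(hashes), 2)
    return [pair for pair in wanted if pair not in done]


def subset_results(conn, hashes, tool):
    """Return stored results where both genomes are in the given hash list."""
    hashes = list(hashes)
    marks = ",".join("?" for _ in hashes)
    sql = (
        "SELECT c.query_hash, c.subject_hash, d.identity, d.coverage "
        "FROM comparisons AS c JOIN data AS d USING (comparison_id) "
        f"WHERE c.tool = ? AND c.query_hash IN ({marks}) "
        f"AND c.subject_hash IN ({marks}) "
        "ORDER BY c.query_hash, c.subject_hash"
    )
    return conn.execute(sql, [tool, *hashes, *hashes]).fetchall()


# Small in-memory demonstration with one stored comparison (made-up values)
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE comparisons (
        comparison_id INTEGER PRIMARY KEY,
        query_hash TEXT, subject_hash TEXT, tool TEXT
    );
    CREATE TABLE data (comparison_id INTEGER, identity REAL, coverage REAL);
    INSERT INTO comparisons VALUES (1, 'a', 'b', 'ANIm');
    INSERT INTO data VALUES (1, 0.98, 0.9);
    """
)
todo = pending_pairs(demo, ["a", "b", "c"], "ANIm")
stored = subset_results(demo, ["a", "b"], "ANIm")
print(len(todo), stored)
```

Only the pairs returned by pending_pairs would need to be submitted to the scheduler, which is what makes incremental additions and resumed jobs cheap.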