Roadmap
- new command-line API
- functionality similar to v0.2.x
Issue | PR | |
---|---|---|
123 | 338 | ANIb(lastall) |
124 | | TETRA |
150 | | Classify |
378 | 387 | Alembic |
-- | 364 | Compare |
Issue | PR | |
---|---|---|
146 | | Config file |
215 | | SLURM support |
147 | | Pipe 3rd party output to temp location |
Issue | PR | |
---|---|---|
151 | | ANIm metric status |
Issue | PR | |
---|---|---|
373 | 376 | ANIm should not be symmetric |
Issue | PR | |
---|---|---|
145 | | Warnings for 0-identity comparisons |
188 | | Propagate labels for taxon determination |
392 | | Rationalise documentation |
152 | | Update logging exceptions |
248 | | Make v0.2 default branch; rename master to v0.3 |
194 | | Adopt concurrent.futures in place of multiprocessing |
Issue | PR | |
---|---|---|
221 | | Missing labels and captions in plots with default settings |
129 | | ANIm: check class/label files before loading sequences |
- Extension of pyani v0.3.0 to add new functionality and outputs
Issue | PR | |
---|---|---|
187 | 370 | Tree |
180 | | Evolve |
135 | | Subsample |
Issue | PR | |
---|---|---|
136 | | Use JSON for labels/classes files |
116 | | Order rows and columns in clustering order like images |
94 | | Fetching only N genomes |
343 | | --dry-run flag |
Issue | PR | |
---|---|---|
14 | | Collating results is slow for large datasets (>1500 genomes) |
- Extension of pyani v0.3.1 to accommodate alternative measures of similarity
Issue | PR | |
---|---|---|
156 | | wANI |
155 | | gANI |
137 | | mash |
16 | | AAI |
- Flask interface onto pyani database.
Issue | PR | |
---|---|---|
148 | | Flask interface onto SQLite3 backend |
This page contains notes for the planned future development of `pyani`.

The current interface for `pyani` scripts is to call either the `average_nucleotide_identity.py` or `genbank_get_genomes_by_taxon.py` scripts with a combination of arguments. For the `average_nucleotide_identity.py` script in particular there are arguments that either perform a stage in the total analysis, or prevent a stage from executing. I would like to change this interface to a `pyani.py COMMAND OPTIONS` structure, similar to `git` and other tools.
More specifically, I would like to enable operations such as:
- `pyani.py download -t 931 -o my_organism`: download all NCBI assemblies under taxon 931 to the directory `my_organism`
- `pyani.py index my_organism`: generate MD5 or other hashes for each genome in the directory `my_organism`
- `pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE`: conduct ANIm analysis on the genomes in the directory `my_organism`
- `pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE`: conduct ANIb analysis on the genomes in the directory `my_organism`
- `pyani.py render my_organism_ANIm --gmethod seaborn`: draw graphical output for the ANIm analysis in the directory `my_organism_ANIm`
- `pyani.py classify my_organism_ANIm`: conduct classification analysis of ANIm results in the directory `my_organism_ANIm`
- `pyani.py db --setdb my_db`: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
- `pyani.py db --update my_organism_ANIm`: update the current comparison data database with the results contained in `my_organism_ANIm`; this might be useful after a partial run/failure.
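
As a rough illustration of the proposed `pyani.py COMMAND OPTIONS` structure, the sketch below wires the subcommands listed above into `argparse` subparsers. It is not the planned implementation: the function name `build_parser`, the destination names (`taxon`, `outdir`, `indir`), and the defaults for `--scheduler` and `--gmethod` are assumptions made only for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a git-style parser of the form: pyani.py COMMAND OPTIONS."""
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # download: fetch NCBI assemblies under a taxon ID into a directory
    download = subparsers.add_parser("download")
    download.add_argument("-t", dest="taxon", required=True)
    download.add_argument("-o", dest="outdir", required=True)

    # index: generate hashes for each genome in a directory
    index = subparsers.add_parser("index")
    index.add_argument("indir")

    # anim and anib take the same options in the examples above
    for name in ("anim", "anib"):
        sub = subparsers.add_parser(name)
        sub.add_argument("indir")
        sub.add_argument("-o", dest="outdir", required=True)
        sub.add_argument("--scheduler", default=None)  # assumed default

    # render: draw graphical output for an analysis directory
    render = subparsers.add_parser("render")
    render.add_argument("indir")
    render.add_argument("--gmethod", default="seaborn")  # assumed default

    # classify: classification analysis of an analysis directory
    classify = subparsers.add_parser("classify")
    classify.add_argument("indir")

    # db: either set the current database or update it from results
    db = subparsers.add_parser("db")
    group = db.add_mutually_exclusive_group(required=True)
    group.add_argument("--setdb", metavar="PATH")
    group.add_argument("--update", metavar="DIR")

    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Each subparser keeps its own options, so a stage is run by naming it rather than by passing flags that enable or suppress stages of a single script.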
Some modifications to the options are also desirable:
- specify multiple input directories
- specify multiple class/label files
I aim to store all the comparison results in a persistent database, so that incremental additions to existing analyses are easier and partially complete jobs can be resumed.
- A specific sqlite3 database is designated as 'current' for any analysis (e.g. with `pyani.py db --setdb <location>`)
- The default database location could be `.pyani/pyanidb` in the root directory for the analysis (other configuration/debug information may go into `.pyani`)
- The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
  - The indexing may be performed during download with `pyani download`
  - Indexing may be forced with `pyani index <directory>`
- Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
- Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
- For each comparison, we will record in another table the values that are currently recorded in the output `.tab` files
- Anticipated tables:
  - `genomes`: hashes of genome sequences
  - `paths`: known paths for each hash, keyed by hash from `genomes`
  - `comparisons`: pairwise comparisons conducted, multikeyed by query and subject genomes from `genomes`, with a column describing the comparison (and options used)
  - `data`: pairwise comparison results: identity, coverage, mismatches, etc. (what is currently reported in `.tab` files)
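
One possible shape for the anticipated tables and the indexing step is sketched below with the standard-library `sqlite3` and `hashlib` modules. The table names follow the list above; every column name, the constraints, and the helper names (`open_db`, `md5_of_file`, `index_genome`) are assumptions for illustration, not a settled schema.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema: table names come from the notes, columns are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT NOT NULL REFERENCES genomes(hash),
    path TEXT NOT NULL,
    UNIQUE (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY,
    query_hash TEXT NOT NULL REFERENCES genomes(hash),
    subject_hash TEXT NOT NULL REFERENCES genomes(hash),
    tool TEXT NOT NULL,
    options TEXT NOT NULL DEFAULT '',
    run_date TEXT NOT NULL,
    UNIQUE (query_hash, subject_hash, tool, options)
);
CREATE TABLE IF NOT EXISTS data (
    comparison_id INTEGER PRIMARY KEY REFERENCES comparisons(comparison_id),
    identity REAL,
    coverage REAL,
    mismatches INTEGER
);
"""


def open_db(location: str) -> sqlite3.Connection:
    """Open the database at `location`, creating the tables if needed."""
    conn = sqlite3.connect(location)
    conn.executescript(SCHEMA)
    return conn


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_genome(conn: sqlite3.Connection, path: Path) -> str:
    """Record a genome's hash and its location; repeated calls are no-ops."""
    genome_hash = md5_of_file(path)
    conn.execute("INSERT OR IGNORE INTO genomes (hash) VALUES (?)", (genome_hash,))
    conn.execute(
        "INSERT OR IGNORE INTO paths (hash, path) VALUES (?, ?)",
        (genome_hash, str(path)),
    )
    conn.commit()
    return genome_hash
```

Keying `paths` on the hash means that the same sequence found in two directories yields one row in `genomes` and two rows in `paths`, which matches the idea of keeping previously-seen locations in a separate table.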
This database will allow rapid identification of which analyses have been performed before, removing the need to redo comparisons.
It will also provide a persistent record of comparisons that can be accessed for downstream analyses using, e.g., `pyani render` and a set of genome files (or a list of their hashes?). This will allow ready subsetting of outputs.
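
To make the "which analyses have been performed before" idea concrete, the sketch below lists the ordered genome pairs that still need a comparison for a given tool. It assumes a `comparisons` table with hypothetical columns `query_hash`, `subject_hash`, and `tool`; the function name `pending_comparisons` is also an assumption. Ordered pairs are used because query and subject are recorded separately; self-comparisons are left out for simplicity.

```python
import itertools
import sqlite3


def pending_comparisons(conn: sqlite3.Connection, hashes, tool: str):
    """Return ordered (query, subject) hash pairs not yet recorded for `tool`."""
    done = set(
        conn.execute(
            "SELECT query_hash, subject_hash FROM comparisons WHERE tool = ?",
            (tool,),
        )
    )
    # permutations() yields both (a, b) and (b, a), but never (a, a)
    return [
        pair
        for pair in itertools.permutations(sorted(hashes), 2)
        if pair not in done
    ]


if __name__ == "__main__":
    # Minimal in-memory table, standing in for the real database
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comparisons (query_hash TEXT, subject_hash TEXT, tool TEXT)"
    )
    conn.execute("INSERT INTO comparisons VALUES ('aaa', 'bbb', 'ANIm')")
    for query, subject in pending_comparisons(conn, ["aaa", "bbb", "ccc"], "ANIm"):
        print(query, subject)
```

A resumed or incrementally extended run would then only schedule the pairs this function returns, and a subset of outputs could be selected by passing a shorter list of hashes.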