# Roadmap
- updates to version 2 distribution on PyPI and bioconda
Issue | Branch | PR | Description
---|---|---|---
248 | | | Make version_0_2 the default branch; rename master
- new command-line API
- functionality similar to v0.2.x
Issue | Branch | PR | Description
---|---|---|---
123 | anib_123 | 338 | ANIb(lastall)
124 | | | TETRA
150 | | | Classify
378 | alembic_378 | 387 | Alembic
-- | compare | 364 | Compare
Issue | Branch | PR | Description
---|---|---|---
146 | | | Config file
215 | | | SLURM support
147 | | | Pipe 3rd party output to temp location
Issue | Branch | PR | Description
---|---|---|---
151 | | | ANIm metric status
Issue | Branch | PR | Description
---|---|---|---
373 | issue_373 | 376 | ANIm should not be symmetric
383 | issue_383 | 385 | try/except around extraction in pyani download
371 | | | ValueError: zero-size array to reduction operation minimum which has no identity
342 | issue_342, noextend_342 | | Use --noextend in NUCmer as a rule
340 | | | Alignment coverage >1.0
402 | issue_402 | 404 | test_cli_parsing() tests fail when pytest is run with no flags
Issue | Branch | PR | Description
---|---|---|---
145 | | | Warnings for 0-identity comparisons
188 | | | Propagate labels for taxon determination
392 | | | Rationalise documentation
152 | | | Update logging exceptions
194 | | | Adopt concurrent.futures in place of multiprocessing
Issue | Branch | PR | Description
---|---|---|---
221 | | | Missing labels and captions in plots with default settings
129 | | | ANIm: check class/label files before loading sequences
- Extension of pyani v0.3.0 to add new functionality and outputs
Issue | Branch | PR | Description
---|---|---|---
187 | tree_186 | 370 | Tree (branch named for a now-closed issue)
180 | evolve | | Evolve
135 | | | Subsample
362 | | | Add tests for --recovery mode
Issue | Branch | PR | Description
---|---|---|---
136 | | | Use JSON for labels/classes files
116 | | | Order rows and columns in clustering order like images
94 | | | Fetching only N genomes
343 | | | --dry-run flag
Issue | Branch | PR | Description
---|---|---|---
14 | | | Collating results is slow for large datasets (>1500 genomes)
306 | | | NUCmer job generation for large jobs slows down rapidly
- Extension of pyani v0.3.1 to accommodate alternative measures of similarity
Issue | Branch | PR | Description
---|---|---|---
156 | | | wANI
155 | | | gANI
137 | | | mash
16 | | | AAI
- Flask interface onto pyani database.
Issue | Branch | PR | Description
---|---|---|---
148 | | | Flask interface onto SQLite3 backend
This page contains notes on the planned future development of pyani.

The current interface for pyani scripts is to call either the `average_nucleotide_identity.py` or `genbank_get_genomes_by_taxon.py` script with a combination of arguments. For `average_nucleotide_identity.py` in particular, there are arguments that either perform a stage in the total analysis or prevent a stage from executing. I would like to change this interface to a `pyani.py COMMAND OPTIONS` structure, similar to `git` and other tools.
More specifically, I would like to enable operations such as:
- `pyani.py download -t 931 -o my_organism`: download all NCBI assemblies under taxon 931 to the directory `my_organism`
- `pyani.py index my_organism`: generate MD5 or other hashes for each genome in the directory `my_organism`
- `pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE`: conduct ANIm analysis on the genomes in the directory `my_organism`
- `pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE`: conduct ANIb analysis on the genomes in the directory `my_organism`
- `pyani.py render my_organism_ANIm --gmethod seaborn`: draw graphical output for the ANIm analysis in the directory `my_organism_ANIm`
- `pyani.py classify my_organism_ANIm`: conduct classification analysis of ANIm results in the directory `my_organism_ANIm`
- `pyani.py db --setdb my_db`: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
- `pyani.py db --update my_organism_ANIm`: update the current comparison database with the results contained in `my_organism_ANIm` - this might be useful after a partial run/failure
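The `COMMAND OPTIONS` structure above maps naturally onto `argparse` subparsers. The following is a minimal sketch of how a few of these subcommands might be declared; the option names follow the examples above, but the parser layout is an illustrative assumption, not the implementation.

```python
# Sketch of a "pyani.py COMMAND OPTIONS" interface using argparse subparsers.
# Subcommand and option names mirror the examples above; handlers are omitted.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # pyani.py download -t 931 -o my_organism
    dl = subparsers.add_parser("download", help="download NCBI assemblies by taxon")
    dl.add_argument("-t", "--taxon", required=True)
    dl.add_argument("-o", "--outdir", required=True)

    # pyani.py index my_organism
    ix = subparsers.add_parser("index", help="hash each genome in a directory")
    ix.add_argument("indir")

    # pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE
    anim = subparsers.add_parser("anim", help="conduct ANIm analysis")
    anim.add_argument("indir")
    anim.add_argument("-o", "--outdir", required=True)
    anim.add_argument("--scheduler", choices=["multiprocessing", "SGE"],
                      default="multiprocessing")

    return parser


args = build_parser().parse_args(
    ["anim", "my_organism", "-o", "my_organism_ANIm", "--scheduler", "SGE"]
)
print(args.command, args.indir, args.outdir, args.scheduler)
```

One advantage of this layout is that each subcommand carries only the options relevant to it, rather than the single flat argument list of `average_nucleotide_identity.py`.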
Some modifications to the options are also desirable:
- specify multiple input directories
- specify multiple class/label files
I have a goal to store all comparison results in a persistent database, so that incremental additions to existing analyses are made easier, and partially complete jobs can be resumed.
- A specific sqlite3 database is designated as 'current' for any analysis (e.g. with `pyani.py db --setdb <location>`)
- The default database location could be `.pyani/pyanidb` in the root directory for the analysis (other configuration/debug information may go into `.pyani`)
- The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
  - The indexing may be performed during download with `pyani download`
  - Indexing may be forced with `pyani index <directory>`
- Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
- Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
- For each comparison, we will record in another table the values that are currently recorded in the output `.tab` files
- Anticipated tables:
  - `genomes`: hashes of genome sequences
  - `paths`: known paths for each hash, keyed by hash from `genomes`
  - `comparisons`: pairwise comparisons conducted, multikeyed by query and subject genomes from `genomes`, with a column describing the comparison (and options used)
  - `data`: pairwise comparison results (identity, coverage, mismatches, etc.) - what is currently reported in `.tab` files
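The indexing step described above amounts to computing a stable hash for each genome file, so the hash can serve as the genome's unique key regardless of where the file lives. A minimal sketch, assuming FASTA input and MD5 (the function names are hypothetical, not pyani's actual API):

```python
# Sketch of genome indexing: compute an MD5 hash for each FASTA file in a
# directory, to be used as the genome's unique key in the database.
import hashlib
from pathlib import Path


def hash_genome(path: Path, blocksize: int = 65536) -> str:
    """Return the MD5 digest of a genome file, read in blocks."""
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(blocksize), b""):
            md5.update(block)
    return md5.hexdigest()


def index_directory(indir: Path) -> dict:
    """Map MD5 hash -> file path for every .fasta file in indir."""
    return {hash_genome(path): path for path in sorted(indir.glob("*.fasta"))}
```

Because the hash depends only on file content, moving or renaming a genome does not change its identity - the `paths` table can simply gain a new row for the same hash.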
This database will allow rapid identification of which analyses have been performed before, negating the need to redo comparisons. It will also provide a persistent record of comparisons that can be accessed for downstream analyses using, e.g., `pyani render` and a set of genome files (or a list of their hashes?). This will allow ready subsetting of outputs.
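The anticipated tables and the "has this comparison been done?" lookup can be sketched with the standard-library `sqlite3` module. The column names below are illustrative assumptions, not the actual pyani schema:

```python
# Minimal sqlite3 sketch of the anticipated tables (genomes, paths,
# comparisons, data) and a lookup that checks whether a comparison
# has already been recorded, so it need not be rerun.
import sqlite3

SCHEMA = """
CREATE TABLE genomes (hash TEXT PRIMARY KEY);
CREATE TABLE paths (hash TEXT REFERENCES genomes(hash), path TEXT);
CREATE TABLE comparisons (
    query   TEXT REFERENCES genomes(hash),
    subject TEXT REFERENCES genomes(hash),
    program TEXT,   -- e.g. MUMmer, BLAST+
    options TEXT,   -- options used for the comparison
    rundate TEXT,   -- may be used to force a recomparison
    PRIMARY KEY (query, subject, program, options)
);
CREATE TABLE data (
    query TEXT, subject TEXT,
    identity REAL, coverage REAL, mismatches INTEGER
);
"""


def comparison_done(conn, query_hash, subject_hash, program, options):
    """Return True if this exact comparison is already recorded."""
    cur = conn.execute(
        "SELECT 1 FROM comparisons WHERE query=? AND subject=? "
        "AND program=? AND options=?",
        (query_hash, subject_hash, program, options),
    )
    return cur.fetchone() is not None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO genomes VALUES (?)", ("abc123",))
conn.execute("INSERT INTO genomes VALUES (?)", ("def456",))
conn.execute(
    "INSERT INTO comparisons VALUES (?, ?, ?, ?, ?)",
    ("abc123", "def456", "MUMmer", "--noextend", "2020-01-01"),
)
print(comparison_done(conn, "abc123", "def456", "MUMmer", "--noextend"))  # True
```

Note that the composite primary key on `comparisons` is keyed by query and subject separately, which is consistent with ANIm results not being symmetric (issue 373 above): the (query, subject) and (subject, query) directions are distinct rows.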