[MRG] add picklists to selector protocol and provide initial Index support #1588

Merged
merged 47 commits into from
Jun 18, 2021
0997834
various cleanups of sourmash_args
ctb Jun 12, 2021
66b0599
cleanup flakes errors
ctb Jun 12, 2021
3a583a9
clean up sourmash.sig submodule
ctb Jun 12, 2021
bb794ec
initial picklist implementation
ctb Jun 12, 2021
3ecfb48
integrate picklists into sourmash sig extract
ctb Jun 12, 2021
505b04f
basic tests for picklist functionality
ctb Jun 12, 2021
74f31f5
track found etc
ctb Jun 12, 2021
b1fc982
add picklists to selectors
ctb Jun 12, 2021
a817843
split pickfile out a little bit
ctb Jun 12, 2021
def1933
split column_type out of SignaturePicklist a bit
ctb Jun 12, 2021
de6fc06
picklist tests for .signatures() methods on Index classes
ctb Jun 12, 2021
1bdf88e
split pickfile out a little bit
ctb Jun 12, 2021
3c05f95
split column_type out of SignaturePicklist a bit
ctb Jun 12, 2021
03cc61b
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 12, 2021
54407a3
test 'Index.find' on picklists for SBTs and LCAs
ctb Jun 12, 2021
a88b66d
factor out picklist checks to 'passes_all_picklists' fn
ctb Jun 13, 2021
031522c
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklist
ctb Jun 14, 2021
5ac4671
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 14, 2021
aaa4548
update comments, constructor, etc.
ctb Jun 16, 2021
9b50748
fix tests :)
ctb Jun 16, 2021
207a813
more picklist tests
ctb Jun 16, 2021
14a88a7
verify output
ctb Jun 16, 2021
3d23d87
add --picklist-require-all &c
ctb Jun 16, 2021
9d60e32
documentation
ctb Jun 16, 2021
8f65f22
test with --md5 selector
ctb Jun 16, 2021
4f8e20c
cover untested code with tests
ctb Jun 16, 2021
14b87d4
trap errors and be nice to users
ctb Jun 16, 2021
04c209c
remove comment
ctb Jun 16, 2021
8e5fb8d
Merge branch 'add/picklist' into add/picklist_selectors
ctb Jun 16, 2021
b8f4bb8
Merge branch 'latest' of github.com:dib-lab/sourmash into add/picklis…
ctb Jun 16, 2021
21ce4b7
fix tests for new SignaturePicklist
ctb Jun 16, 2021
b3c6bb9
move picklist.py from sourmash.sig into sourmash
ctb Jun 17, 2021
fddf141
move picklist reporting into sourmash_args
ctb Jun 17, 2021
984a557
fix space
ctb Jun 17, 2021
ced72d2
add picklist args throughout, eek.
ctb Jun 17, 2021
7a30b20
add picklists and tests for search, gather, index
ctb Jun 17, 2021
c0e5781
add picklists to prefetch
ctb Jun 17, 2021
a0335a3
add picklists to sourmash compare
ctb Jun 17, 2021
a074127
add picklists to lca index
ctb Jun 17, 2021
ba5c8bc
block multiple picklists on SBTs and LCAs, for now
ctb Jun 17, 2021
ca6ea4f
add picklist test that checks indexing-and-then-search == index
ctb Jun 17, 2021
c965648
add a test for using prefetch CSV as picklist
ctb Jun 17, 2021
ab286cf
remove debugging print
ctb Jun 17, 2021
4d156e9
add docs
ctb Jun 17, 2021
de6f3c4
remove order dependence from test
ctb Jun 17, 2021
8812142
further attempt to fix test
ctb Jun 17, 2021
565428b
Merge branch 'latest' into add/picklist_selectors
bluegenes Jun 18, 2021
99 changes: 61 additions & 38 deletions doc/command-line.md
@@ -177,15 +177,14 @@ sourmash compare file1.sig [ file2.sig ... ]
```

Options:
```
--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.
--containment -- calculate containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input

* `--output` -- save the distance matrix to this file (as a numpy binary matrix)
* `--ksize` -- do the comparisons at this k-mer size.
* `--containment` -- calculate containment instead of similarity; `C(i, j) = size(i intersection j) / size(i)`
* `--from-file` -- append the list of files in this text file to the input
signatures.
--ignore-abundance -- ignore abundances in signatures.
```
* `--ignore-abundance` -- ignore abundances in signatures.
* `--picklist` -- select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.

@@ -249,6 +248,9 @@ similarity match
...
```

Note, as of sourmash 4.2.0, `search` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash gather` - find metagenome members

The `gather` subcommand selects the best reference genomes to use for
@@ -289,6 +291,9 @@ which matches are no longer reported; by default, this is set to
50kb. see the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

Note:

Use `sourmash gather` to classify a metagenome against a collection of
@@ -350,6 +355,9 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

As of sourmash 4.2.0, `index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
@@ -375,6 +383,7 @@ Other options include:
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors;
* `--picklist` select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

### Alternative search mode for low-memory (but slow) search: `--linear`

@@ -589,6 +598,9 @@ see
You can use `--from-file` to pass `lca index` a text file containing a
list of file names to index.

As of sourmash 4.2.0, `lca index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -821,36 +833,8 @@ will extract the same signature, which has an accession number of
#### Using picklists with `sourmash sig extract`

As of sourmash 4.2.0, `extract` also supports picklists, a feature by
which you can select signatures based on values in a CSV file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
which you can select signatures based on values in a CSV file. See
[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.

### `sourmash signature flatten` - remove abundance information from signatures

@@ -963,6 +947,45 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.
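
As a rough illustration of the three-part argument format described above, the following sketch splits a `--picklist` string into its fields. The `PicklistSpec` and `parse_picklist_arg` names and the `picks.csv:ident:ident` value are hypothetical and used only for this example; this is not sourmash's own parsing code.

```python
from typing import NamedTuple


class PicklistSpec(NamedTuple):
    """Hypothetical container for the three parts of a --picklist value."""
    pickfile: str
    colname: str
    coltype: str


def parse_picklist_arg(argstr: str) -> PicklistSpec:
    """Split 'pickfile:colname:coltype' into its three fields."""
    parts = argstr.split(":")
    if len(parts) != 3:
        raise ValueError(
            f"invalid picklist argument {argstr!r}; "
            "expected 'pickfile:colname:coltype'"
        )
    return PicklistSpec(*parts)


# Illustrative value only: file 'picks.csv', column 'ident', coltype 'ident'.
spec = parse_picklist_arg("picks.csv:ident:ident")
print(spec.pickfile, spec.colname, spec.coltype)
```

A plain split on `:` like this would mis-handle file paths that themselves contain a colon, so a real implementation may need to be more careful.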

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.
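
To make the `coltype` list concrete, here is a minimal sketch of how a comparison value might be derived from a signature's name and md5sum for each type. The `picklist_key` function, the sample name, and the sample md5 string are made up for illustration, and treating "before '.'" as "before the first '.'" is an assumption; sourmash's real logic may differ.

```python
def picklist_key(coltype: str, name: str, md5: str) -> str:
    """Return the value that would be compared against the CSV column."""
    # Identifier: first space-delimited word of the signature name.
    ident = name.split(" ", 1)[0]

    if coltype == "name":
        return name
    if coltype == "md5":
        return md5
    if coltype in ("md5prefix8", "md5short"):
        return md5[:8]
    if coltype == "ident":
        return ident
    if coltype == "identprefix":
        # Assumption: keep everything before the first '.'.
        return ident.split(".", 1)[0]
    raise ValueError(f"unknown coltype {coltype!r}")


# Made-up example values, not taken from any real database.
name = "ABC_000123.1 Example organism strain X"
md5 = "0123456789abcdef0123456789abcdef"

for coltype in ("name", "md5", "md5prefix8", "md5short", "ident", "identprefix"):
    print(coltype, "->", picklist_key(coltype, name, md5))
```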

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.

In addition to `sig extract`, the following commands support
`--picklist` selection: `index`, `search`, `gather`, `prefetch`,
`compare`, and `lca index`.
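
The selection idea can be sketched in a few lines of standalone Python: read one column of a CSV file into a set, keep only the records whose value is in that set, and optionally fail when some picklist values were never matched (the behavior the `--picklist-require-all` help text describes). The `load_column` and `pick` helpers, the dictionary records, and the CSV contents are all hypothetical stand-ins; this is not how sourmash stores or filters signatures internally.

```python
import csv
import io


def load_column(fp, colname):
    """Collect the distinct non-empty values of one CSV column."""
    reader = csv.DictReader(fp)
    if reader.fieldnames is None or colname not in reader.fieldnames:
        raise ValueError(f"column {colname!r} not found in CSV header")
    return {row[colname] for row in reader if row[colname]}


def pick(records, pick_values, require_all=False):
    """Keep records whose md5 is in pick_values; report unmatched values."""
    selected = [rec for rec in records if rec["md5"] in pick_values]
    found = {rec["md5"] for rec in selected}
    missing = pick_values - found
    if require_all and missing:
        raise ValueError(f"{len(missing)} picklist value(s) not found")
    return selected, missing


# Made-up CSV and records, standing in for a picklist file and signatures.
csv_text = "name,md5\nsigA,aaaa1111\nsigC,cccc3333\n"
pick_values = load_column(io.StringIO(csv_text), "md5")

records = [
    {"name": "sigA", "md5": "aaaa1111"},
    {"name": "sigB", "md5": "bbbb2222"},
]

selected, missing = pick(records, pick_values)
print([rec["name"] for rec in selected], sorted(missing))
```

With `require_all=True`, the same call would raise because `cccc3333` is listed in the CSV but matches no record.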

### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
4 changes: 3 additions & 1 deletion src/sourmash/cli/compare.py
@@ -1,6 +1,7 @@
"""compare sequence signatures made by compute"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -47,6 +48,7 @@ def subparser(subparsers):
    subparser.add_argument(
        '-p', '--processes', metavar='N', type=int, default=None,
        help='Number of processes to use to calculate similarity')
    add_picklist_args(subparser)


def main(args):
9 changes: 6 additions & 3 deletions src/sourmash/cli/gather.py
@@ -1,6 +1,7 @@
"""search a metagenome signature against dbs"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -60,8 +61,6 @@ def subparser(subparsers):
        '--cache-size', default=0, type=int, metavar='N',
        help='number of internal SBT nodes to cache in memory (default: 0, cache all nodes)'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)

    # advanced parameters
    subparser.add_argument(
@@ -80,6 +79,10 @@ def subparser(subparsers):
        help="use prefetch before gather; see documentation",
    )

    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash
6 changes: 4 additions & 2 deletions src/sourmash/cli/index.py
@@ -25,7 +25,8 @@
---
"""

from sourmash.cli.utils import add_moltype_args, add_ksize_arg
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -44,7 +45,6 @@ def subparser(subparsers):
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    add_ksize_arg(subparser, 31)
    subparser.add_argument(
        '-d', '--n_children', metavar='D', type=int, default=2,
        help='number of children for internal nodes; default=2'
@@ -70,7 +70,9 @@ def subparser(subparsers):
        '--scaled', metavar='FLOAT', type=float, default=0,
        help='downsample signatures to the specified scaled factor'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
9 changes: 6 additions & 3 deletions src/sourmash/cli/lca/index.py
@@ -1,6 +1,7 @@
"""create LCA database"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -18,8 +19,6 @@ def subparser(subparsers):
    subparser.add_argument(
        '--scaled', metavar='S', default=10000, type=float
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
@@ -53,6 +52,10 @@ def subparser(subparsers):
        help='ignore signatures with no taxonomy entry'
    )

    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash
4 changes: 3 additions & 1 deletion src/sourmash/cli/prefetch.py
@@ -1,6 +1,7 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -63,6 +64,7 @@ def subparser(subparsers):
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
4 changes: 3 additions & 1 deletion src/sourmash/cli/search.py
@@ -1,6 +1,7 @@
"""search a signature against other signatures"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -59,6 +60,7 @@ def subparser(subparsers):
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
12 changes: 3 additions & 9 deletions src/sourmash/cli/sig/extract.py
@@ -2,7 +2,8 @@

import sys

from sourmash.cli.utils import add_moltype_args, add_ksize_arg
from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
                                add_picklist_args)


def subparser(subparsers):
@@ -25,16 +26,9 @@ def subparser(subparsers):
        '--name', default=None,
        help='select signatures whose name contains this substring'
    )
    subparser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    subparser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
10 changes: 10 additions & 0 deletions src/sourmash/cli/utils.py
@@ -50,6 +50,16 @@ def add_ksize_arg(parser, default=31):
        help='k-mer size; default={d}'.format(d=default)
    )

def add_picklist_args(parser):
    parser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    parser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )


def opfilter(path):
    return not path.startswith('__') and path not in ['utils']
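
For readers who want to see what the shared `add_picklist_args` helper does to a parser, here is a small standalone sketch. The function body is copied from the `utils.py` hunk above; the bare `argparse.ArgumentParser` named `demo` and the `picks.csv:ident:ident` value are illustrative stand-ins for a real sourmash subparser and picklist.

```python
import argparse


def add_picklist_args(parser):
    # Body copied from the utils.py hunk in this pull request.
    parser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    parser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )


# Stand-in parser; in sourmash this would be a subcommand's subparser.
parser = argparse.ArgumentParser(prog="demo")
add_picklist_args(parser)

args = parser.parse_args(
    ["--picklist", "picks.csv:ident:ident", "--picklist-require-all"]
)
print(args.picklist, args.picklist_require_all)

# Without the flags, both options fall back to their defaults.
defaults = parser.parse_args([])
print(defaults.picklist, defaults.picklist_require_all)
```

Because every subcommand calls the same helper, the two flags keep identical names, defaults, and help text across `compare`, `gather`, `index`, `lca index`, `prefetch`, `search`, and `sig extract`.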