[MRG] add --include-db-pattern and --exclude-db-pattern to search/gather #1871

Merged
merged 51 commits into from
Mar 10, 2022
Changes from 47 commits
30bf6b9
upgrade 'manifest' documentation, cli help
ctb Mar 5, 2022
f891e11
alias fileinfo to summarize
ctb Mar 5, 2022
4fb5f99
flakes cleanup
ctb Mar 5, 2022
7eab2f6
rescue shadowed tests
ctb Mar 5, 2022
7feaad7
rescue shadowed tests
ctb Mar 5, 2022
31d5586
rescue shadowed tests
ctb Mar 5, 2022
c7b63eb
add 'sig grep' command
ctb Mar 5, 2022
44979e5
add some basic tests
ctb Mar 5, 2022
ebe2334
fix get manifest stuff
ctb Mar 5, 2022
5a311c1
fail on no manifest
ctb Mar 5, 2022
9bbc3f6
check manifest req't
ctb Mar 5, 2022
591c352
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2022
5f6ad7f
test various combinations of zip, -v, -i
ctb Mar 5, 2022
5a2cce5
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2022
c19f31e
update with CSV output/manifest
ctb Mar 5, 2022
4ff79cf
added -c/--count
ctb Mar 6, 2022
3baa0e2
adjust output
ctb Mar 6, 2022
d2d600e
test fail extract
ctb Mar 6, 2022
8b0a815
comment tests better
ctb Mar 7, 2022
66232fc
add test for count
ctb Mar 7, 2022
0f248e5
update docs
ctb Mar 7, 2022
9a00d53
remove warnings
ctb Mar 7, 2022
00c3afb
cleanup; create CollectionManifest.filter_rows
ctb Mar 7, 2022
1072608
create CollectionManifest.filter_on_columns
ctb Mar 7, 2022
56a8992
minor cleanup
ctb Mar 7, 2022
4d460c1
Merge branch 'latest' into add/sig_grep
ctb Mar 7, 2022
88d95e0
add --include and --exclude to search
ctb Mar 7, 2022
7330603
add --include and --exclude to search and gather
ctb Mar 7, 2022
98d741a
add --include and --exclude to prefetch
ctb Mar 7, 2022
d93f93c
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 7, 2022
a2e9dca
add args to most set of commands
ctb Mar 8, 2022
a07bbf2
update docs
ctb Mar 8, 2022
3bf292c
more doc
ctb Mar 8, 2022
37b4d04
add --include fn to sig cat
ctb Mar 8, 2022
86fb362
add pattern tests for search and gather
ctb Mar 9, 2022
1ec3f6f
add pattern include/exclude to prefetch
ctb Mar 9, 2022
f304d4e
implement sig extract w/patterns
ctb Mar 9, 2022
6b7dd34
add pattern search to sig rename
ctb Mar 9, 2022
32d72d2
add --include/--exclude to sourmash compare
ctb Mar 9, 2022
c0d4654
update docs
ctb Mar 10, 2022
160ccb0
refactor picklist/pattern selection
ctb Mar 10, 2022
727f431
finish refactoring out picklist foo
ctb Mar 10, 2022
610969b
much refactoring wow
ctb Mar 10, 2022
218d1ba
check for various argument incompatibility
ctb Mar 10, 2022
fccb00a
test what happens when no manifest
ctb Mar 10, 2022
548d36c
fix grep
ctb Mar 10, 2022
318766f
cleanup and simplify
ctb Mar 10, 2022
8881233
change to load_include_exclude_db_patterns
ctb Mar 10, 2022
34f3ac9
adjust error message
ctb Mar 10, 2022
975b548
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 10, 2022
2e3fbdd
remove -f comment
ctb Mar 10, 2022
124 changes: 74 additions & 50 deletions doc/command-line.md
Expand Up @@ -57,6 +57,13 @@ species, while the third is from a completely different genus.

To get a list of subcommands, run `sourmash` without any arguments.

Please use the command line option `--help` to get more detailed usage
information for each command.

All signature saving commands can save to a variety of formats (we
suggest `.zip` files) and all signature loading commands can load
signatures from any of these formats.

There are seven main subcommands: `sketch`, `compare`, `plot`,
`search`, `gather`, `index`, and `prefetch`. See
[the tutorial](tutorials.md) for a walkthrough of these commands.
Expand Down Expand Up @@ -99,12 +106,6 @@ Finally, there are a number of utility and information commands:
* `watch` is an experimental command to classify a stream of sequencing data.
* `multigather` is an experimental command to run multiple gathers against the same collection of databases.

Please use the command line option `--help` to get more detailed usage
information for each command.

Note that as of sourmash v3.4, all commands should load signatures from
indexed databases (the SBT and LCA formats) as well as from signature files.

### `sourmash sketch` - make sourmash signatures from sequence data

Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways.
Expand Down Expand Up @@ -1406,12 +1407,33 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Selecting signatures

(sourmash v4.3.0 and later)

sourmash is built to work with very large collections of signatures,
and you may want to select (or exclude) specific signatures from
search or other operations, based on their name. This can be done
without modifying the collections themselves via the
`--include-db-pattern` and `--exclude-db-pattern` arguments to many
sourmash commands, including `search`, `gather`, `compare`, `prefetch`,
and `sig extract`.

In brief, `sourmash search ... --include-db-pattern <pattern>` will search
only those database signatures that match `<pattern>` in their `name`,
`filename`, or `md5` strings. Here, `<pattern>` can be either a
substring or a regular expression. Likewise, `sourmash search
... --exclude-db-pattern <pattern>` will search only those database
signatures that _don't_ match `<pattern>` in their `name`, `filename`, or
`md5` strings.
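
As a rough illustration of the matching rule described above, here is a minimal Python sketch. It is not sourmash's implementation: the `rows` dictionaries and the helper names `matches_pattern` and `select_rows` are hypothetical, and it assumes the pattern is applied with `re.search` to each of the three fields.

```python
import re

# Columns consulted when matching, per the description above.
MATCH_COLUMNS = ("name", "filename", "md5")


def matches_pattern(row, pattern):
    """Return True if `pattern` is found in any of the row's match columns."""
    regex = re.compile(pattern)
    return any(regex.search(row.get(col) or "") for col in MATCH_COLUMNS)


def select_rows(rows, include=None, exclude=None):
    """Keep rows matching `include` and drop rows matching `exclude`."""
    selected = []
    for row in rows:
        if include is not None and not matches_pattern(row, include):
            continue
        if exclude is not None and matches_pattern(row, exclude):
            continue
        selected.append(row)
    return selected


# Hypothetical manifest-like rows, for illustration only.
rows = [
    {"name": "GCF_1 Escherichia coli", "filename": "ecoli.sig", "md5": "abc123"},
    {"name": "GCF_2 Shewanella baltica", "filename": "shew.sig", "md5": "def456"},
]

print([r["name"] for r in select_rows(rows, include="shew")])
print([r["name"] for r in select_rows(rows, exclude="^GCF_1")])
```

A plain string with no regular expression metacharacters behaves as a substring search here, which is why one code path can serve both kinds of pattern.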

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file. This is typically used to index, extract, or search a subset of
a large collection where modifying the collection itself isn't desired.
(sourmash v4.2.0 and later)

Many commands support *picklists*, a feature by which you can select
or "pick out" signatures based on values in a CSV file. This is
typically used to index, extract, or search a subset of a large
collection where modifying the collection itself isn't desired.

For example,
```
Expand Down Expand Up @@ -1449,11 +1471,16 @@ The following `coltype`s are currently supported by `sourmash sig extract`:
Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` or `sourmash sig manifest -o out.csv
<filename_or_db>` to construct an initial CSV file that you can then
edit further; after editing, these can be passed in via the picklist
argument `--picklist out.csv::manifest`.
One way to build a picklist is to use `sourmash sig grep <pattern>
<collection> --csv out.csv` to construct a CSV file containing a list
of all sketches that match the pattern (which can be a string or
regexp). The `out.csv` file can be used as a picklist via the picklist
manifest format with `--picklist out.csv::manifest`.

You can also use `sourmash sig describe --csv out.csv <signatures>` or
`sourmash sig manifest -o out.csv <filename_or_db>` to construct an
initial CSV file that you can then edit further and use as a picklist
as above.

The picklist functionality also supports excluding (rather than
including) signatures matching the picklist arguments. To specify a
Expand Down Expand Up @@ -1494,32 +1521,40 @@ signatures using `zip -r collection.zip *.sig` and then specify

### Saving signatures, more generally

As of sourmash 4.1, most signature saving arguments (`--save-matches`
for `search` and `gather`, `-o` for `sourmash sketch`, and most of the
`sourmash signature` commands) support flexible saving of collections of
(sourmash v4.1 and later)

All signature saving arguments (`--save-matches` for `search` and
`gather`, `-o` for `sourmash sketch`, and `-o` for the `sourmash
signature` commands) support flexible saving of collections of
signatures into JSON text, Zip files, and/or directories.

This behavior is triggered by the requested output filename --

* to save to JSON signature files, use `.sig`; `-` will send JSON to stdout.
* to save to JSON signature files, use `.sig`; using the filename `-`
will send JSON to stdout.
* to save to gzipped JSON signature files, use `.sig.gz`;
* to save to a Zip file collection, use `.zip`;
* to save signature files to a directory, use a name ending in `/`; the directory will be created if it doesn't exist;

If none of these file extensions is detected, output will be written in the JSON `.sig` format, either to the provided output filename or to stdout.
If none of these file extensions is detected, output will be written
in the JSON `.sig` format, either to the provided output filename or
to stdout.
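
The filename-based dispatch described above can be sketched as follows. This is an illustration under stated assumptions, not sourmash's code: the function name `guess_save_format` and the returned labels are hypothetical.

```python
def guess_save_format(filename):
    """Map an output filename to a save format, following the rules listed above."""
    if filename == "-":
        return "json-stdout"   # the filename `-` sends JSON to stdout
    if filename.endswith("/"):
        return "directory"     # a trailing slash means a directory of signature files
    if filename.endswith(".zip"):
        return "zip"
    if filename.endswith(".sig.gz"):
        return "json-gzip"
    return "json"              # `.sig`, and the fallback for anything unrecognized


for name in ["out.sig", "out.sig.gz", "collection.zip", "sigs/", "-", "results.txt"]:
    print(f"{name!r} -> {guess_save_format(name)}")
```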

All of these save formats can be loaded by sourmash commands, too.
All of these save formats can be loaded by sourmash commands.

### Loading many signatures

### Loading all signatures under a directory
#### Loading signatures within a directory hierarchy

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

@CTB mention -f; does it find .zip files etc?

#### Passing in lists of files

Most sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
Most sourmash commands also accept `--from-file` or
`--query-from-file`, which take the path to a text file containing
a list of file paths. This is useful when you want
to specify thousands of queries, or a subset of signatures produced by
some other command.
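
A file of paths like this is simple to produce and consume. The sketch below assumes one path per line with blank lines ignored; that layout, and the helper name `load_pathlist`, are assumptions made for illustration rather than details confirmed by the text above.

```python
from pathlib import Path
import tempfile


def load_pathlist(pathlist_file):
    """Return the non-empty lines of a text file, one file path per line."""
    with open(pathlist_file, "rt") as fp:
        return [line.strip() for line in fp if line.strip()]


# Demonstration with a temporary list file.
with tempfile.TemporaryDirectory() as tmpdir:
    listfile = Path(tmpdir) / "queries.txt"
    listfile.write_text("a.sig\nb.sig\n\nsubdir/c.sig.gz\n")
    paths = load_pathlist(listfile)

print(paths)
```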
Expand All @@ -1531,36 +1566,30 @@ databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.

(LCA databases also permit taxonomic searches using `sourmash lca` functions.)
(LCA databases also directly permit taxonomic searches using `sourmash lca`
functions.)

The main point is that since all of these databases contain signatures,
as of sourmash 3.4, any command that takes more than one signature will
also automatically load all of the signatures in the database.
Commands that take multiple signatures or collections of signatures
will also work with databases.

Note that, for now, both SBT and LCA database can only contain one
"type" of signature (one ksize, one moltype, etc.) If the database
signature type is incompatible with the other signatures, sourmash
will complain. In contrast, signature files can
contain many different types of signatures, and compatible ones will
be discovered automatically.
One limitation of indexed databases is that both SBT and LCA databases
can only contain one "type" of signature (one ksize/one moltype at one
scaled value). If the database signature type is incompatible with the
other signatures, sourmash will complain appropriately.

In contrast, signature files, zip collections, and directory
hierarchies can contain many different types of signatures, and
compatible ones will be selected automatically.
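
To make "compatible ones will be selected" concrete, here is a minimal sketch of selecting by ksize and moltype from a mixed collection. The dictionaries and the function `select_compatible` are hypothetical stand-ins and do not reflect sourmash's internal data structures.

```python
# Hypothetical records standing in for sketches in a mixed collection.
sketches = [
    {"name": "genomeA", "ksize": 21, "moltype": "DNA"},
    {"name": "genomeA", "ksize": 31, "moltype": "DNA"},
    {"name": "genomeB", "ksize": 31, "moltype": "DNA"},
    {"name": "genomeB", "ksize": 10, "moltype": "protein"},
]


def select_compatible(sketches, ksize, moltype):
    """Return only the sketches whose ksize and moltype match the request."""
    return [s for s in sketches if s["ksize"] == ksize and s["moltype"] == moltype]


for s in select_compatible(sketches, ksize=31, moltype="DNA"):
    print(s["name"], s["ksize"], s["moltype"])
```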

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
combine multiple databases and signatures on the command line and get
the same answer as if you built a single large database from all of
them. The only caveat to this rule is that if you have multiple
identical matches, the first one to be found will differ depending on
the order that the files are passed in on the command line.

This can actually be pretty convenient for speeding up searches - for
example, if you're using `sourmash gather` and you want to find any
new results after a database update, you can provide a file containing
the previously found matches on the command line before the updated
database. Then `gather` will automatically "find" the previously found
matches before anything else, but only if there are no better matches to
be found in the updated database. (OK, it's a bit of a niche case, but it's
been useful. :)
identical matches present across the databases, the order in which
they are found will differ depending on the order that the files are
passed in on the command line.

### Using stdin

Expand All @@ -1570,8 +1599,3 @@ sig` commands will output to stdout. So, for example,

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.

(This is a relatively new feature as of 3.4 and our testing may need
some work, so please
[let us know](https://github.com/sourmash-bio/sourmash/issues) if there's
something that doesn't work and we will fix it :).
3 changes: 2 additions & 1 deletion src/sourmash/cli/compare.py
Expand Up @@ -27,7 +27,7 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -75,6 +75,7 @@ def subparser(subparsers):
'-p', '--processes', metavar='N', type=int, default=None,
help='Number of processes to use to calculate similarity')
add_picklist_args(subparser)
add_pattern_args(subparser)


def main(args):
5 changes: 3 additions & 2 deletions src/sourmash/cli/gather.py
Expand Up @@ -52,7 +52,8 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -130,10 +131,10 @@ def subparser(subparsers):
'--prefetch', dest="prefetch", action='store_true',
help="use prefetch before gather; see documentation",
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


4 changes: 3 additions & 1 deletion src/sourmash/cli/prefetch.py
@@ -1,7 +1,8 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -61,6 +62,7 @@ def subparser(subparsers):
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


4 changes: 3 additions & 1 deletion src/sourmash/cli/search.py
Expand Up @@ -39,7 +39,8 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -95,6 +96,7 @@ def subparser(subparsers):
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/cat.py
@@ -1,7 +1,7 @@
"""concatenate signature files"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -29,6 +29,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/extract.py
@@ -1,7 +1,7 @@
"""extract one or more signatures"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -34,6 +34,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


4 changes: 2 additions & 2 deletions src/sourmash/cli/sig/grep.py
Expand Up @@ -14,8 +14,8 @@
regexp searching. See https://docs.python.org/3/howto/regex.html and
https://docs.python.org/3/library/re.html for details.

The '-v' (exclude), '-i' (case-insensitive), and `-c` (count) options of 'grep' are
supported.
The '-v' (exclude), '-i' (case-insensitive), and `-c` (count) options
of 'grep' are supported.

'-o/--output' can be used to output matching signatures to a specific
location.
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/rename.py
@@ -1,7 +1,7 @@
"""rename signature"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
Expand Down Expand Up @@ -31,6 +31,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


15 changes: 14 additions & 1 deletion src/sourmash/cli/utils.py
Expand Up @@ -85,6 +85,19 @@ def add_picklist_args(parser):
)


def add_pattern_args(parser):
parser.add_argument(
'--include-db-pattern',
default=None,
help='search only signatures that match this pattern in name, filename, or md5'
)
parser.add_argument(
'--exclude-db-pattern',
default=None,
help='search only signatures that do not match this pattern in name, filename, or md5'
)
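
For readers who want to see how these two options parse, here is a small stand-alone sketch. It copies `add_pattern_args` from the diff above and attaches it to a throwaway `argparse` parser; the `demo` parser and the example pattern values are hypothetical, and this does not exercise sourmash itself.

```python
import argparse


def add_pattern_args(parser):
    # Same two options as in the diff above.
    parser.add_argument(
        '--include-db-pattern',
        default=None,
        help='search only signatures that match this pattern in name, filename, or md5'
    )
    parser.add_argument(
        '--exclude-db-pattern',
        default=None,
        help='search only signatures that do not match this pattern in name, filename, or md5'
    )


parser = argparse.ArgumentParser(prog='demo')  # hypothetical stand-alone parser
add_pattern_args(parser)

# argparse stores '--include-db-pattern' under the attribute 'include_db_pattern'.
args = parser.parse_args(['--include-db-pattern', 'shew'])
print(args.include_db_pattern, args.exclude_db_pattern)
```

In this stand-alone parser, `argparse`'s default prefix matching also accepts the shortened form `--include`; whether that holds for a real subcommand depends on which other options that subcommand defines.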


def opfilter(path):
return not path.startswith('__') and path not in ['utils']

Expand All @@ -108,4 +121,4 @@ def add_num_arg(parser, default=0):
parser.add_argument(
'-n', '--num-hashes', '--num', metavar='N', type=check_num_bounds, default=default,
help='num value should be between 50 and 50000'
)
)