built in workaround for sigs that are in the db? #175
Comments
sourmash-bio/sourmash#1477 added underlying support for this (see JaccardSearch variant https://github.com/dib-lab/sourmash/blob/latest/tests/test_index.py#L1202-L1219). A test shows how to set the
need this in |
there is a built-in hack in charcoal that does this... the error at the top of this issue is not an error about things being in the search database. it's an error for trying to add the same name into an LCA db, twice. the problem is that |
(for the hack in question, see comment "Hack for examining members of our search database: remove exact matches" in |
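For readers unfamiliar with the idea, "remove exact matches" can be sketched roughly as follows. This is only an illustration on plain Python sets of hashes, not charcoal's actual code: the helper names `jaccard` and `drop_exact_matches` and the `(name, hash_set)` layout are assumptions made up for this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets; two empty sets are treated as 0.0."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def drop_exact_matches(query_hashes, candidates):
    """Keep only candidates whose similarity to the query is below 1.0.

    candidates is a list of (name, hash_set) pairs; this layout is hypothetical.
    """
    return [
        (name, hashes)
        for name, hashes in candidates
        if jaccard(query_hashes, hashes) < 1.0
    ]


# Hypothetical data: the query itself is in the "database" under the name "self".
query = {1, 2, 3}
database = [("self", {1, 2, 3}), ("near", {1, 2, 4})]
remaining = drop_exact_matches(query, database)
```

Here `remaining` contains only the "near" entry, so a genome that is itself a database member is not reported as its own best match.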
whoops, sorry, I suppose that doesn't necessarily need updating for charcoal. I've been thinking of switching hacks for thumper, so I can run prefetch-gather as a standalone and then do taxonomic summarization from the csv. Realizing that doesn't really apply here, since prefetch is done with the full-file sketch, and then each contig sketch needs to be checked against the prefetched sigs. |
Got the same error on a different genome:
haven't tried to hunt it down yet, that's next on the agenda :) |
The charcoal code causing this problem is in
Adding a print statement,
And the gather output has this twice as well:
and there are more duplicated values in the gather matches file:
In total, there are 96 duplicated rows (i.e. 48 distinct rows, each appearing twice). |
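A count like this can be reproduced with a few lines of standard-library Python. The sketch below assumes the matches have already been read into a list of rows (for example with `csv.reader`); the function name `summarize_duplicates` is made up for this example.

```python
from collections import Counter


def summarize_duplicates(rows):
    """Return (distinct rows seen more than once, total rows in those groups)."""
    counts = Counter(tuple(row) for row in rows)
    distinct_dups = sum(1 for n in counts.values() if n > 1)
    rows_in_dups = sum(n for n in counts.values() if n > 1)
    return distinct_dups, rows_in_dups


# Hypothetical rows: two entries appear twice, one appears once.
rows = [("a", "1"), ("a", "1"), ("b", "2"), ("b", "2"), ("c", "3")]
distinct_dups, rows_in_dups = summarize_duplicates(rows)
```

With the toy rows above this yields 2 distinct duplicated rows covering 4 rows in total, the same shape as 48 distinct rows covering 96.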
The sigs are duplicated in the |
This is a problem caused by duplicate signatures in sourmash sbt databases (see sourmash-bio/sourmash#1171 (comment)). Using
There are 258,406 signatures in the sbt of which 250,886 are distinct based on identical rows in probably will add filtering of redundant md5 at @ctb's suggestion. @luizirber mentioned, "our MD5 calculation only take ksize and the list of hashes, which means every empty MH has the same MD5 (for a specific k)", so could also filter on exact row matches in prefetch output. TBD. |
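The difference between the two filtering options can be sketched as below. This is a rough illustration and not the fix that was eventually adopted: the rows are hypothetical dicts with `name` and `md5` keys, and both function names are made up for this example.

```python
def dedup_by_md5(rows):
    """Keep the first row seen for each md5 value."""
    seen = set()
    kept = []
    for row in rows:
        if row["md5"] not in seen:
            seen.add(row["md5"])
            kept.append(row)
    return kept


def dedup_exact(rows):
    """Keep the first copy of each row that is identical in every column."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows: one true duplicate plus two distinct, empty sketches
# that share an md5 because the md5 depends only on ksize and the hashes.
rows = [
    {"name": "genome_A", "md5": "m1"},
    {"name": "genome_A", "md5": "m1"},
    {"name": "empty_1", "md5": "e0"},
    {"name": "empty_2", "md5": "e0"},
]
by_md5 = dedup_by_md5(rows)
by_row = dedup_exact(rows)
```

Filtering on md5 alone collapses `empty_1` and `empty_2` into one entry, while exact-row filtering removes only the true duplicate of `genome_A`.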
As a temporary workaround in case anyone else is lurking around |
Does this mean that the non-sbt |
On Tue, Jun 01, 2021 at 09:25:27AM -0700, Tessa Pierce Ward wrote:
> As a temporary workaround in case anyone else is lurking around latest, this problem does not occur with non-sbt .zip databases
Does this mean that the `*zip` databases don't have duplicates, or just that the error doesn't occur with those?
non-SBT zip databases may or may not contain duplicates; that depends on
their construction.
sbt.zip files (and maybe non-SBT JSON databases?) may contain duplicates
created by sourmash code during the creation of the SBT database; that's
the bug referred to above.
|
right - just trying to confirm I don't need to go chase down a bug from original generation of the |
right! nothing needs to be done by you :) |
Using GTDB rs 202. On branch tr_update from #171.
I selected GTDB rs 202 reps using gather, and now want to check if they're contaminated. Obviously all of these sigs are going to be in the DB. I feel like I remember there being an in-built hack to ignore perfect matches and continue decontaminating, but this error suggests otherwise.