Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

built in workaround for sigs that are in the db? #175

Closed
taylorreiter opened this issue May 22, 2021 · 13 comments · Fixed by #179
Closed

built in workaround for sigs that are in the db? #175

taylorreiter opened this issue May 22, 2021 · 13 comments · Fixed by #179

Comments

@taylorreiter
Copy link
Member

Using GTDB rs 202. On branch tr_update from #171

I selected GTDB rs 202 reps using gather, and now want to check if they're contaminated. Obviously all of these sigs are going to be in the DB. I feel like I remember there being an in-built hack to ignore perfect matches and continue decontaminating, but this error suggests otherwise

examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
loaded 258406 tax assignments.
loaded 209 matches from 'output.genbank_genomes/stage1/GCA_000256725.2_genomic.fna.gz.matches.zip'
Traceback (most recent call last):
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 197,
in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 87, i
n _run_code
    exec(code, run_globals)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 130, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 125, in cmdline
    return main(args)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 78, in main
    lca_db.insert(ss, ident=ident)
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/site-packages/sourma$h/lca/lca_db.py", line 142, in insert
    raise ValueError("signature '{}' is already in this LCA db.".format(ident))
ValueError: signature 'GCF_000469345.1' is already in this LCA db.
@bluegenes
Copy link
Member

bluegenes commented May 22, 2021

ref sourmash-bio/sourmash#433

sourmash-bio/sourmash#1477 added underlying support for this (see JaccardSearch variant https://github.com/dib-lab/sourmash/blob/latest/tests/test_index.py#L1202-L1219).

A test shows how to set the search_fn:

...now search with something that should ignore sig47, the exact match.
search_fn = JaccardSearchBestOnly_ButIgnore([ss47])

need this inthumper too -- my plan was to write a script that uses the ButIgnore search_fn to ignore the matches already in the database during prefetch gather. Feel free to implement or suggest an alternate strategy!

@ctb
Copy link
Member

ctb commented May 22, 2021

there is a built-in hack in charcoal that does this...

the error at the top of this issue is not an error about things being in the search database. it's an error for trying to add the same name into an LCA db, twice. the problem is that output.genbank_genomes/stage1/GCA_000256725.2_genomic.fna.gz.matches.zip contains two signatures with the same identifier/name. I'm not sure why that is, tho. I would figure out why in case it's a symptom of some wider problem, but, in any case, you can change charcoal to ignore this error without affecting anything.

@ctb
Copy link
Member

ctb commented May 22, 2021

(for the hack in question, see comment "Hack for examining members of our search database: remove exact matches" in contigs_search_taxonomy.py, around line 44.)

@bluegenes
Copy link
Member

(for the hack in question, see comment "Hack for examining members of our search database: remove exact matches" in contigs_search_taxonomy.py, around line 44.)

whoops, sorry, I suppose that doesn't necessarily need updating for charcoal.

I've been thinking of switching hacks for thumper, so I can run prefetch-gather as a standalone and then do taxonomic summarization from the csv. Realizing that doesn't really apply here, since prefetch is done with the full-file sketch, and then each contig sketch needs to be checked against the prefetched sigs.

@taylorreiter
Copy link
Member Author

Got the same error on a different genome:


[Mon May 24 13:12:45 2021]
rule prefetch_all:
    input: output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.sig, /group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip
    output: output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.matches.csv, output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.matches.zip, output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.matches.txt
    jobid: 546
    wildcards: filename=GCF_003697165.2_genomic.fna.gz

Activating conda environment: /home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9

== This is sourmash version 4.1.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/tereiter/github/2020-ibd... (k=31, DNA)
downsampling query from scaled=1000 to 1000
all sketches will be downsampled to scaled=1000
saving all matching database signatures to 'output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.matches.zip'
loading signatures from '/group/ctbrowngrp/gtdb/databases/ctb/gtdb-rs202.genomic.k31.sbt.zip'
total of 55915 matching signatures.
saved 55915 matches to CSV file 'output.genbank_genomes/stage1/GCF_003697165.2_genomic.fna.gz.matches.csv'
of 4911 distinct query hashes, 4911 were found in matches above threshold.
a total of 0 query hashes remain unmatched.
[Mon May 24 13:39:06 2021]
Finished job 546.
1 of 1372 steps (0.07%) done

[Mon May 24 13:39:06 2021]
rule make_contigs_taxonomy_json:
    input: /home/tereiter/github/2020-ibd/genbank_genomes/GCF_000026325.1_genomic.fna.gz, output.genbank_genomes/stage1/GCF_000026325.1_genomic.fna.gz.sig, output.genbank_genomes/stage1/GCF_000026325.1_genomic.fna.gz.matches.zip, /group/ctbrowngrp/gtdb/gtdb-rs202.taxonomy.csv
    output: output.genbank_genomes/stage1/GCF_000026325.1_genomic.fna.gz.contigs-tax.json
    jobid: 1088
    wildcards: f=GCF_000026325.1_genomic.fna.gz

Activating conda environment: /home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
loaded 258406 tax assignments.
loaded 59725 matches from 'output.genbank_genomes/stage1/GCF_000026325.1_genomic.fna.gz.matches.zip'
removing an identical match: GCF_000026325.1 Escherichia coli UMN026 strain=UMN026, ASM2632v2
Traceback (most recent call last):
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 127, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 122, in cmdline
    return main(args)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 75, in main
    lca_db.insert(ss, ident=ident)
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/site-packages/sourmash/lca/lca_db.py", line 142, in insert
    raise ValueError("signature '{}' is already in this LCA db.".format(ident))
ValueError: signature 'GCF_001087165.1' is already in this LCA db.

haven't tried to hunt it down yet, that's next on the agenda :)

@taylorreiter
Copy link
Member Author

The charcoal code causing this problem is in contigs_search_taxonomy.py:

    # ...with specific matches.
    for ss in siglist:
        ident = get_ident(ss)
        lineage = tax_assign[ident]

        lca_db.insert(ss, ident=ident)
        lin_db.insert(ident, lineage)

Adding a print statement, ss is repeated:

    # ...with specific matches.
    for ss in siglist:
        print("ss is:", ss)
        ident = get_ident(ss)
        lineage = tax_assign[ident]

        lca_db.insert(ss, ident=ident)
        lin_db.insert(ident, lineage)
(charcoal) tereiter@bm18:~/github/charcoal$ python -m charcoal run genbank_genomes.conf --rerun-incomplete
** read 361 provided lineages
** config file checks PASSED!
** from here on out, it's all snakemake...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       combine_genome_summary
        1       combine_hit_list
        1       compare_taxonomy_single
        1       make_contigs_taxonomy_json
        1       make_index
        1       set_kernel
        7

[Tue May 25 08:53:40 2021]
rule make_contigs_taxonomy_json:
    input: /home/tereiter/github/2020-ibd/genbank_genomes/GCA_000210075.1_genomic.fna.gz, output.genbank_genomes/stage1/GCA_000210075.1_genomic.fna.gz.sig, output.genbank_genomes/stage1/GCA_000210075.1_genomic.fna.gz.matches.zip, /group/ctbrowngrp/gtdb/gtdb-rs202.taxonomy.csv
    output: output.genbank_genomes/stage1/GCA_000210075.1_genomic.fna.gz.contigs-tax.json
    jobid: 4
    wildcards: f=GCA_000210075.1_genomic.fna.gz

Activating conda environment: /home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
loaded 258406 tax assignments.
loaded 1774 matches from 'output.genbank_genomes/stage1/GCA_000210075.1_genomic.fna.gz.matches.zip'
removing an identical match: GCA_000210075.1 Bacteroides xylanisolvens XB1A strain=XB1A, ASM21007v1
ss is: GCF_900760745.1 Bacilli bacterium, SRS020869_41
ss is: GCA_900762515.1 uncultured Odoribacter sp., SRS144506_36
ss is: GCF_900765575.1 Bacteroidaceae bacterium, SRS294937_16
ss is: GCF_009891525.1 Enterococcus faecium strain=BIOML-A2, ASM989152v1
ss is: GCA_900556845.1 uncultured Bacteroides sp., UMGS1980
ss is: GCF_000191065.1 Prevotella multiformis DSM 16608 strain=DSM 16608, ASM19106v1
ss is: GCF_900128905.1 Bacteroides luti strain=DSM 26991, IMG-taxon 2695420932 annotated assembly
ss is: GCA_900765785.1 uncultured Bacteroides sp., SRS294954_23
ss is: GCF_900754285.1 Clostridia bacterium, ERS396455_74
ss is: GCA_004556565.1 Lachnospiraceae bacterium, ASM455656v1
ss is: GCF_900128495.1 Bacteroides ilei strain=Marseille-P3208, PRJEB18049
ss is: GCA_900556905.1 uncultured Clostridiales bacterium, UMGS1989
ss is: GCA_002633275.1 Bacteroidales bacterium Phil12, ASM263327v1
ss is: GCA_013316905.1 Alistipes sp., ASM1331690v1
ss is: GCF_900765255.1 Bacteroidaceae bacterium, SRS259526_9
ss is: GCA_900542985.1 uncultured Bacteroides sp., UMGS523
ss is: GCA_900540795.1 uncultured Bacteroides sp., UMGS277
ss is: GCA_900541935.1 uncultured Prevotellaceae bacterium, UMGS410
ss is: GCF_900759965.1 Lachnospiraceae bacterium, SRS011302_30
ss is: GCF_900753015.1 Lachnospiraceae bacterium, ERS396322_76
ss is: GCF_003473295.1 Parabacteroides sp. AM08-6 strain=AM08-6, ASM347329v1
ss is: GCF_003473295.1 Parabacteroides sp. AM08-6 strain=AM08-6, ASM347329v1
Traceback (most recent call last):
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 128, in <module>
    returncode = cmdline(sys.argv[1:])
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 123, in cmdline
    return main(args)
  File "/home/tereiter/github/charcoal/charcoal/contigs_search_taxonomy.py", line 76, in main
    lca_db.insert(ss, ident=ident)
  File "/home/tereiter/github/charcoal/.snakemake/conda/017135dd391ef3c0eb7f938c30cafcd9/lib/python3.9/site-packages/sourmash/lca/lca_db.py", line 142, in insert
    raise ValueError("signature '{}' is already in this LCA db.".format(ident))

And the gather output has this twice as well:

5000,0.0005047955577990914,0.001201923076923077,0.001201923076923077,0.0008695652173913044,/dev/fd/63,"GCF_003473295.1 Parabacteroides sp. AM08-6 strain=AM08-6, ASM347329v1",d4fad0cf,4160000,/home/tereiter/github/2020-ibd/genbank_genomes/GCA_000210075.1_genomic.fna.gz,,c08f8e47,5750000
5000,0.0005047955577990914,0.001201923076923077,0.001201923076923077,0.0008695652173913044,/dev/fd/63,"GCF_003473295.1 Parabacteroides sp. AM08-6 strain=AM08-6, ASM347329v1",d4fad0cf,4160000,/home/tereiter/github/2020-ibd/genbank_genomes/GCA_000210075.1_genomic.fna.gz,,c08f8e47,5750000

and there are more duplicated values in the gather matches file:

3000,0.0003169572107765452,0.0008068854222700376,0.0008068854222700376,0.0005217391304347826,/dev/fd/63,"GCF_902374495.1 Coprobacter fastidiosus, MGYG-HGUT-01391",ec0b7d93,3718000,/home/tereiter/github/2020-ibd/genbank_genomes/GCA_000210075.1_genomic.fna.gz,,c08f8e47,5750000
3000,0.0003169572107765452,0.0008068854222700376,0.0008068854222700376,0.0005217391304347826,/dev/fd/63,"GCF_902374495.1 Coprobacter fastidiosus, MGYG-HGUT-01391",ec0b7d93,3718000,/home/tereiter/github/2020-ibd/genbank_genomes/GCA_000210075.1_genomic.fna.gz,,c08f8e47,5750000

In total, there are 96 duplicated rows (e.g. 48 distinct things each appearing twice).

@taylorreiter
Copy link
Member Author

The sigs are duplicated in the matches.zip file output by prefetch as well.

@taylorreiter
Copy link
Member Author

This is a problem caused by duplicate signatures in sourmash sbt databases (see sourmash-bio/sourmash#1171 (comment)).

Using

sourmash sig describe --csv ~/gtdb-rs202.genomic.k31.sbt.zip.sig_describe.csv gtdb-rs202.genomic.k31.sbt.zip

There are 258,406 signtuares in the sbt of which 250,886 are distinct based on identical rows in sourmash sig describe output.

probably will add filtering of redundant md5 at @ctb's suggestion. @luizirber mentioned, "our MD5 calculation only take ksize and the list of hashes, which means every empty MH has the same MD5 (for a specific k)", so could also filter on exact row matches in prefetch output. TBD.

@taylorreiter
Copy link
Member Author

As a temporary workaround in case anyone else is lurking around latest, this problem does not occur with non-sbt .zip databases

@bluegenes
Copy link
Member

bluegenes commented Jun 1, 2021

As a temporary workaround in case anyone else is lurking around latest, this problem does not occur with non-sbt .zip databases

Does this mean that the non-sbt*zip databases don't have duplicates, or just that the error doesn't occur with those?

@ctb
Copy link
Member

ctb commented Jun 1, 2021 via email

@bluegenes
Copy link
Member

bluegenes commented Jun 1, 2021

non-SBT zip databases may or may not contain duplicates; that depends on
their construction.

right - just trying to confirm I don't need to go chase down a bug from original generation of the gtdb-rs202 databases. I'll assume not unless I hear otherwise.

@ctb
Copy link
Member

ctb commented Jun 1, 2021

non-SBT zip databases may or may not contain duplicates; that depends on
their construction.

right - just trying to confirm I don't need to go chase down a bug from original generation of the gtdb-rs202 databases. I'll assume not unless I hear otherwise.

right! nothing needs to be done by you :)

@ctb ctb closed this as completed in #179 Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants