duplication of signatures seen in large SBT databases #1171
note, all that I do note that there is no deduplication of filenames... we could maybe use a set of names or md5sums to track and remove duplicates. |
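A minimal sketch of the deduplication idea in that comment, assuming each signature can be reduced to a `(name, md5)` pair. The helper `dedupe_by_md5` and the sample records are hypothetical and not part of sourmash.

```python
def dedupe_by_md5(records):
    """Keep the first record seen for each md5sum, preserving input order."""
    seen = set()
    unique = []
    for name, md5 in records:
        if md5 in seen:
            continue  # same md5 already kept: treat as a duplicate
        seen.add(md5)
        unique.append((name, md5))
    return unique


# Hypothetical sample data: the third entry repeats the md5 of the first.
records = [
    ("genome_a.sig", "aaa111"),
    ("genome_b.sig", "bbb222"),
    ("genome_a_copy.sig", "aaa111"),
]
unique_records = dedupe_by_md5(records)
```

Tracking a set of names instead of md5sums would only need the membership test switched from `md5` to `name`.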
Good point. Just to make sure that there weren't duplicated filenames in the text file with which I observed the original behavior, I ran |
I added rules to work with the Next: I will short-circuit the |
@luizirber It's fascinating watching you troubleshoot in the snakefile. I am learning lots about the Python API from working through what you've laid out; thanks for the awesome side-effect! I have one quick question (not a big deal at all, please take your time or ignore completely): how does the cli |
I think this sort of "public debugging" fits at least a bit in each of research/teaching/service/scicomm, so seems like a good use of time to write it down =]
you mean the Main issue with |
Tried using the subset of signatures that are duplicated in my SBT in luizirber/2020-08-14-debug-sbt@d7f0147, but that didn't generate duplicated signatures in the new SBT. So, doesn't seem to be an issue with the signatures... |
Zounds! I didn't expect to see you posting on Saturday!
Thank you 😄
Yes, it's a mystery parameter to me, and I was wondering if it is something that I need to fiddle with, but it sounds like I should leave it alone for now. In the meantime, I'll check out your notebook link: thank you! I'm going to just try to follow what you're doing here if you don't mind, cheers! P.S. - Are you a fan of maté 🧉? |
After running off to learn snakemake and take care of some pressing tasks, I tried building an sbt using the Python API:
When I count the number of signatures in the "tree" object before saving the sbt, there aren't any duplicated signatures, but when I use the "tree.save" function to write the sbt to a file, then use |
Oh, great finding! I'm scratching my head to think what is happening on the
Yup! It's just hard to find good Erva Mate around here, but I still have some that I brought from Brazil =] |
Okay, thanks for your help!
Is it possible to order any good erva mate (why the "erva" transliteration instead of "yerba"?) online?
I will script this up by subsetting the input file list in Python using a binary approach (cut number of input signature files in half, test, then up or down by half, etc) and try running tree.save() to give you an estimate about the effect of number of signatures by next week. |
** See next comment below for a reduced unit test with just 46 signatures that gives the duplication behavior in the sbt index **
Hi, @luizirber, I found that the number of signatures alone might not be the trigger for the duplication behavior. I've uploaded 1,000 signatures and their paired fasta files into a folder on my Google Drive. Eleven of these signatures are duplicated in the sbt index, two are triplicated, one is replicated 5 times, and one is replicated 6 times. You can download them from here. They are about 2.6 GB total. Do you mind letting me know if you get duplication when building an sbt index from this set of signatures? Thank you!
Here is one more observation about this duplication behavior that might be helpful. I scripted up a recursive function to use a binary search-like approach to recursively increase or decrease the size of the signature input list ultimately given to Code for recursive function:
So there is a smallest unit test collection of 46 signatures that give duplication of a single signature, named |
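The recursive function from the comment above is not reproduced in this thread. As a rough illustration of the same halving idea (not the author's code), the sketch below binary-searches for the shortest prefix of an input list that still triggers a check. The predicate `toy_check` is a hypothetical stand-in for "build an SBT from these files and look for duplicates", and the search assumes that once a prefix triggers the behavior, every longer prefix does too.

```python
def smallest_failing_prefix(items, shows_duplication):
    """Return the shortest prefix of `items` for which the check is True.

    Returns None if even the full list does not trigger the check.
    """
    if not shows_duplication(items):
        return None
    lo, hi = 1, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if shows_duplication(items[:mid]):
            hi = mid        # still failing: try a shorter prefix
        else:
            lo = mid + 1    # not failing: the prefix must be longer
    return items[:lo]


# Toy stand-in predicate: the "bug" needs both of these inputs present.
def toy_check(subset):
    return "sig_03" in subset and "sig_06" in subset


inputs = [f"sig_{i:02d}" for i in range(20)]
prefix = smallest_failing_prefix(inputs, toy_check)
```

With a real build step as the predicate, each probe costs one index build, so the search needs roughly log2(N) builds for N input files.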
This caused some problems over in charcoal: dib-lab/charcoal#175. Just to record the info here, Using
There are 258,406 signatures in the sbt, of which 250,886 are distinct based on identical rows in sourmash sig describe output.
thanks! I note that for k=31 the duplicate md5 is ea8e0babc70f61011cbc15b453bb61ce, which is in |
catalog generation
I started with I then created a new
and then used
This yielded three
Also, the
so I will ignore the out.zip file.
differences
Unfortunately, there are differences between the .zip and the .sbt.zip file: they contain 7520 differences in signature names.
yielded 7520. When I then counted the duplicates like so,
I discovered that there were exactly 7520 counts of duplicated signatures. So what appears to be happening in the SBT code is that signatures are being replaced by duplicates. This also explains why some GTDB identifiers are not being found in the file - see #1511 (comment), where I used the .sbt.zip to construct an LCA database.
tl;dr
This puts a very different complexion on things - we're not just adding duplicates, we're actually replacing signatures with duplicates. It also starts to narrow down places in the SBT code where it could be happening...
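The commands used for this comparison are not preserved in the thread. A minimal sketch of the same kind of check, assuming the two catalogs are available as plain lists of signature names (the names below are made up, and the input names are assumed to be unique):

```python
from collections import Counter


def compare_catalogs(input_names, index_names):
    """Return (names missing from the index, number of surplus copies in the index)."""
    input_counts = Counter(input_names)
    index_counts = Counter(index_names)
    missing = sorted(set(input_counts) - set(index_counts))
    surplus = sum(n - 1 for n in index_counts.values() if n > 1)
    return missing, surplus


# Toy catalogs: the index has the same length as the input,
# but two names were replaced by extra copies of other names.
input_names = ["a", "b", "c", "d"]
index_names = ["a", "a", "c", "c"]
missing, surplus = compare_catalogs(input_names, index_names)
```

When the two catalogs have the same length and the number of missing names equals the number of surplus copies, that is the pattern expected from replacement rather than from simple addition of duplicates.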
OK, new theory 😓 . On a quick scan, it looks like all of the duplicates have the same md5sum! So, for example:
So it looks like this may only be losing information about duplicate md5sum signatures.
ok, this is interesting. In the indexed SBT zip file, there are many different versions of the same saved sig file;
produces
but in the .sbt.json directory file, these aren't referenced; so the command
produces
So it looks very much like there is some miscommunication between the storage and the SBT itself. |
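The exact commands and their output are missing from this thread. As a hedged illustration of the check being described, the sketch below lists files that are stored in a zip archive but never referenced by its JSON index. The layout is an assumption made for the example: a top-level `"signatures"` mapping whose entries carry a `"filename"` key, a `signatures/` directory in the archive, and the file names themselves are all hypothetical.

```python
import io
import json
import zipfile


def unreferenced_files(zip_bytes, index_name, prefix="signatures/"):
    """List stored files under `prefix` that the JSON index never points to."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        index = json.loads(zf.read(index_name))
        referenced = {leaf["filename"] for leaf in index["signatures"].values()}
        stored = {name for name in zf.namelist() if name.startswith(prefix)}
    return sorted(stored - referenced)


# Build a toy archive: two files were saved, but both index entries
# point at the first one (assumed layout, for illustration only).
toy_index = {
    "signatures": {
        "0": {"filename": "signatures/abc123"},
        "1": {"filename": "signatures/abc123"},
    }
}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("signatures/abc123", b"first signature")
    zf.writestr("signatures/abc123_0", b"second signature, same md5")
    zf.writestr("toy.sbt.json", json.dumps(toy_index))

orphans = unreferenced_files(buf.getvalue(), "toy.sbt.json")
```

Any file reported here was written to storage but is unreachable through the index, which matches the "miscommunication" described above.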
...aaaaaaand this is now straightforward to reproduce 😓 |
ok I think I figured it out over in #1568 - the JSON filename wasn't being updated with the correct actually-saved filename for duplicate md5 signatures. Will rework that into a proper PR with tests and everything. |
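A toy model of that failure mode, not sourmash's actual code: `ToyStorage`, `build_index`, and the `_0` suffix scheme are hypothetical. The storage picks a new name when a different payload arrives under an existing name, and the index is only correct if it records the name the storage actually returned.

```python
class ToyStorage:
    """In-memory storage that renames on name collisions."""

    def __init__(self):
        self.files = {}

    def save(self, name, content):
        """Store `content` and return the name actually used."""
        actual, n = name, 0
        while actual in self.files and self.files[actual] != content:
            actual = f"{name}_{n}"  # collision with different content: pick a new name
            n += 1
        self.files[actual] = content
        return actual


def build_index(storage, leaves, use_returned_name):
    """Map leaf position to the filename recorded in the index."""
    index = {}
    for position, (md5, content) in enumerate(leaves):
        actual = storage.save(md5, content)
        # Buggy variant records the requested name, fixed variant the real one.
        index[position] = actual if use_returned_name else md5
    return index


# Two leaves that share an md5 but carry different payloads.
leaves = [("abc123", "sig for genome A"), ("abc123", "sig for genome B")]
buggy = build_index(ToyStorage(), leaves, use_returned_name=False)
fixed = build_index(ToyStorage(), leaves, use_returned_name=True)
```

In the buggy variant both index entries point at the same stored file, so loading the index yields the first signature twice and never reaches the second one.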
ok, I rebuilt the GTDB SBT index with the fix in #1568, and the catalog of the SBT index now matches that of the input zipfile collection. 🎉 I provisionally declare this bug slain! |
Hi, I believe this is the same problem. Using Is there a way to force the usual Thanks in advance for any help with this!!!! |
hi @moorembioinfo, could you start a new issue, please? This particular problem was (we hope) fixed a while back and I'm guessing you're finding something new! I'd copy your question over to a new one myself, but I thought I'd take the opportunity to ask for more info -
It sounds like when you run
Then there's another problem, which is that you're not getting a comparison of all of your intersected sigs. That should be a different issue from the signature naming problem - the names are only used for display, nothing else. I'm not actually sure how to debug that, so any more info you can give on what you're trying to do would be helpful here! |
Hi @ctb, thanks for getting back to me! I've looked into this and haven't been able to reproduce the As for the naming in the first place, sourmash rename worked perfectly. Perhaps this could be combined with the required output (-o) flag of All the best! Matt |
excellent - very glad to hear it, if the bug crops up again don't worry too hard about replicating it before posting it here, it's always valuable to know where there are wibbly UX problems cropping up! rename stuff punted to issue here, #1801 |
start with #849 (comment) and go from there :)