ZipFile collection problems and experiences #1483
I ran this on the farm head node --
ZipFile generation issue -- to build that Zipfile, I used:
(file currently here: https://github.com/dib-lab/sourmash_databases/blob/auto-gtdb/sigs-to-zipfile.py) Using this code with all ~300k genomes in GTDB, I get 28,376 of the following warnings in my logfile:
By using md5sum as the name, are we losing ~28k genomes? I remember @luizirber implemented a fix for duplicate md5sums in SBTs (#994) -- I assume the same issue is happening here, though it's just getting ignored. I think #994 added a from #994:
yes, we're losing signatures w/duplicate md5sums... Separately, note that my zipfile creation code resulted in uncompressed zipfiles, and you'll probably want to change your code to use something like
updated code to handle duplicated md5sums & compress sigs with
updated based on a slack conversation with titus, luizirber, and bluegenes.
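For anyone landing here later, the two fixes discussed above (don't lose signatures whose md5sums collide, and actually compress the archive) can be sketched with nothing but Python's standard `zipfile` module. This is not the code from `sigs-to-zipfile.py`: the helper name `write_sigs_to_zip`, the `signatures/` prefix, and the numeric-suffix naming for duplicates are all assumptions made for illustration.

```python
import io
import zipfile
from collections import Counter


def write_sigs_to_zip(fileobj, md5_payload_pairs):
    """Write (md5sum, bytes) pairs to a compressed zip, keeping duplicates.

    Hypothetical helper for illustration only.
    """
    seen = Counter()
    names = []
    # ZIP_DEFLATED turns on compression; the default (ZIP_STORED) does not compress.
    with zipfile.ZipFile(fileobj, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for md5, payload in md5_payload_pairs:
            count = seen[md5]
            seen[md5] += 1
            # Assumed naming scheme: the first copy keeps the bare md5sum,
            # later copies get a numeric suffix so no entry shadows another.
            suffix = "" if count == 0 else f"_{count}"
            name = f"signatures/{md5}{suffix}.sig"
            zf.writestr(name, payload)
            names.append(name)
    return names


# Toy payloads standing in for signature JSON; two share an md5sum.
buf = io.BytesIO()
payloads = [
    ("abc123", b'[{"name": "genome A"}]' * 50),
    ("abc123", b'[{"name": "genome B"}]' * 50),
    ("def456", b'[{"name": "genome C"}]' * 50),
]
written = write_sigs_to_zip(buf, payloads)

with zipfile.ZipFile(buf) as zf:
    infos = zf.infolist()
```

With this scheme all three entries end up in the archive under distinct names, and each one is stored deflated rather than raw.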
#1495 fixed the LCA database issue. The only remaining issue is this one:
I will punt to new issue - #1506
(leaving this here so I don't forget - some of these are easily fixable.)
So, I did a few trials this morning with #1370 (`gather` functionality for speed & modularity; provide `prefetch` functionality) together with genome-grist (see dib-lab/genome-grist#68, "[MRG] use `sourmash prefetch` from sourmash v4.1.0"), using `/group/ctbrowngrp/gtdb/databases/gtdb-r202.genomic.zip` (on farm), and ran into various problems, many of which are fixed in 808ae37.
First, `ZipFileLinearIndex` didn't define `__bool__` but did define `__len__`, which loads all the signatures, so when doing truth testing in `_load_database` on the database object, it hung. Solution: provide `__bool__` as well! ref #271

Second, `_load_database` tried to load the 54 GB ZipFile as a JSON file, which failed with out-of-memory. That was fixed by looking at the first byte of the file contents and seeing if it looked like JSON, and failing if it didn't.

Third, and still a problem, there's a noticeable pause while `_load_sbt` in `_load_database` tries to load the .zip file as an SBT. Not sure what's going on here.

That is all. Other than the noticeable pause when it attempts to load an SBT, the changes permit a pleasurable "let's load signatures from this 54 GB file, kthxbye" experience...
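To make the first problem concrete: when a class defines `__len__` but not `__bool__`, Python's truth testing (`if db:`) falls back to `__len__`, which here means loading everything. Below is a toy sketch of the effect and of one way to avoid it. `LazyCollection` is a made-up stand-in, not `ZipFileLinearIndex`, and the "peek at the first item" strategy is an assumption; the actual fix in 808ae37 may do something different.

```python
_MISSING = object()


class LazyCollection:
    """Toy stand-in for an index whose items are expensive to load."""

    def __init__(self, n_items):
        self.n_items = n_items
        self.items_loaded = 0  # counts simulated expensive loads

    def __iter__(self):
        for i in range(self.n_items):
            self.items_loaded += 1  # stands in for parsing one signature
            yield i

    def __len__(self):
        # Has to walk every item, like loading all signatures from the zip.
        return sum(1 for _ in self)

    def __bool__(self):
        # Only look at the first item; without this method, `if db:`
        # would call __len__ and load the whole collection.
        return next(iter(self), _MISSING) is not _MISSING


db = LazyCollection(1000)
is_nonempty = bool(db)
loaded_by_bool = db.items_loaded

n = len(db)
loaded_by_len = db.items_loaded - loaded_by_bool
```

Here the truth test touches a single item, while `len()` still touches all 1000.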
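The second fix (check the first byte before attempting a JSON parse) might look roughly like the sketch below. The function name `load_json_if_plausible` and the choice of `[` and `{` as the accepted first bytes are assumptions, not necessarily what `_load_database` does; note that a check this strict would also reject JSON with leading whitespace.

```python
import io
import json


def load_json_if_plausible(fp):
    """Parse fp as JSON only if its first byte looks like JSON.

    Hypothetical helper: reads one byte before committing to read the rest,
    so a huge non-JSON file is rejected without being pulled into memory.
    """
    first = fp.read(1)
    if first not in (b"[", b"{"):
        raise ValueError("first byte does not look like JSON")
    return json.loads(first + fp.read())


# A small JSON payload is parsed as usual.
sigs = load_json_if_plausible(io.BytesIO(b'[{"name": "x"}]'))

# Zip files begin with the bytes "PK", so they are rejected immediately.
try:
    load_json_if_plausible(io.BytesIO(b"PK\x03\x04" + b"\x00" * 16))
    rejected = False
except ValueError:
    rejected = True
```

The point of the design is that the failure path costs one byte of I/O instead of a full read of a 54 GB file.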