Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZipFile collection problems and experiences #1483

Closed
ctb opened this issue Apr 24, 2021 · 5 comments
Closed

ZipFile collection problems and experiences #1483

ctb opened this issue Apr 24, 2021 · 5 comments

Comments

@ctb
Copy link
Contributor

ctb commented Apr 24, 2021

(leaving this here so I don't forget - some of these are easily fixable.)

So, I did a few trials this morning -

and ran into various problems, many of which are fixed in 808ae37.

First, ZipFileLinearIndex didn't define __bool__ but did define __len__, which loads all the signatures, so when doing truth testing in _load_database on the database object, it hung. Solution: provide __bool__ as well! ref #271

Second, _load_database tried to load the 54 GB ZipFile as a JSON file, which failed with out-of-memory. That was fixed by looking at the first byte of the file contents and seeing if it looked like JSON, and failing if it didn't.

Third, and still a problem, there's a noticeable pause while _load_sbt in _load_database tries to load the .zip file as an SBT. Not sure what's going on here.

That is all. Other than the noticeable pause when it attempts to load an SBT, the changes permit a pleasurable "let's load signatures from this 54 GB file, kthxbye" experience...

@ctb
Copy link
Contributor Author

ctb commented Apr 24, 2021

I ran this on the farm head node -- gtdb-r202.genomic.zip is a 54GB file.

% time sourmash prefetch signatures/12799206795148077d359094038e6f4b.sig gtdb-r202.genomic.zip  -o xyz.csv > out

== This is sourmash version 4.0.1.dev31+g7852fa10.d20210424. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=21 automatically.
loaded query: GCF_014075335.1 Escherichia co... (k=21, DNA)
all sketches will be downsampled to scaled=1000
loading signatures from 'gtdb-r202.genomic.zip'
total of 51136 matching signatures.
saved 51136 matches to CSV file 'xyz.csv'
of 4963 distinct query hashes, 4963 were found in matches above threshold.
a total of 0 query hashes remain unmatched.

real    38m58.108s
user    38m7.528s
sys     0m48.942s

@bluegenes
Copy link
Contributor

bluegenes commented Apr 26, 2021

ZipFile generation issue -- to build that Zipfile, I used:

#! /usr/bin/env python
# modified from: https://github.com/dib-lab/sourmash/pull/1349#issuecomment-787482809
import sys
import zipfile
import sourmash
import argparse

def main():
    p = argparse.ArgumentParser()
    p.add_argument('zipfile')
    p.add_argument('signatures', nargs='*')
    p.add_argument('--sig-pathlist')
    args = p.parse_args()

    zf = zipfile.ZipFile(args.zipfile, 'w')

    siglist = [x.rstrip() for x in open(args.sig_pathlist)]
    all_sigs = siglist + args.signatures

    n = 0
    for i, filename in enumerate(all_sigs):
        if n % 10000 == 0:
            print(f"... processing {n}th signature; currently reading signatures from '{filename}'")
        for sig in sourmash.load_file_as_signatures(filename):
            md5 = 'signatures/' + sig.md5sum() + '.sig'
            sigstr = sourmash.save_signatures([sig])
            zf.writestr(md5, sigstr)
            n += 1

    print(f"wrote {n} signatures to '{args.zipfile}'")

    return 0

if __name__ == '__main__':
    sys.exit(main())

(file currently here: https://github.com/dib-lab/sourmash_databases/blob/auto-gtdb/sigs-to-zipfile.py)

Using this code with all~300k genomes in GTDB, I get 28,376 of the following warnings in my logfile:

/home/ntpierce/miniconda3/350094955eecdc310e92f10a70282193/lib/python3.8/zipfile.py:1517: UserWarning: Duplicate name: 'signatures/1e02aab2995c30feb37536ce94fe8cc8.sig'
  return self._open_to_write(zinfo, force_zip64=force_zip64)
/home/ntpierce/miniconda3/350094955eecdc310e92f10a70282193/lib/python3.8/zipfile.py:1517: UserWarning: Duplicate name: 'signatures/3eef3257a786738a43ca974d6ef4bddc.sig'
  return self._open_to_write(zinfo, force_zip64=force_zip64)
/home/ntpierce/miniconda3/350094955eecdc310e92f10a70282193/lib/python3.8/zipfile.py:1517: UserWarning: Duplicate name: 'signatures/10dd706f3e5ab96fc91c7caaaef31bf8.sig'
  return self._open_to_write(zinfo, force_zip64=force_zip64)

By using md5sum as the name, are we losing ~ 28k genomes? I remember @luizirber implemented a fix for duplicate md5sums in SBTs (#994) -- I assume the same issues is happening here, though it's just getting ignored.

I think #994 added a _{number} to any duplicated md5sums. I'll try implementing this fix to the above code, but want to document here, as it'll be something to consider when we implement zipfile generation.

from #994:

while newpath is None:
    testpath = "{}_{}".format(fullpath, n)
    if os.path.exists(testpath):
         n += 1
    else:
        # testpath is available, use it as newpath
        newpath = "{}_{}".format(path, n)

@ctb
Copy link
Contributor Author

ctb commented Apr 27, 2021

yes, we're losing signatures w/duplicate md5sums...

Separately, note that my zipfile creation code resulted in uncompressed zipfiles, and you'll probably want to change your code to use something like ZIP_DEFLATED:

            self.zf = zipfile.ZipFile(self.filename, 'w',
                                      zipfile.ZIP_DEFLATED,
                                      compresslevel=9)

@bluegenes
Copy link
Contributor

bluegenes commented Apr 27, 2021

updated code to handle duplicated md5sums & compress sigs with save_signatures

#! /usr/bin/env python
# modified from: https://github.com/dib-lab/sourmash/pull/1349#issuecomment-787482809
import sys
import zipfile
import sourmash
import argparse

def main():
    p = argparse.ArgumentParser()
    p.add_argument('zipfile')
    p.add_argument('signatures', nargs='*')
    p.add_argument('--sig-pathlist')
    p.add_argument('--compression', type=int, default=9)
    args = p.parse_args()

    zf = zipfile.ZipFile(args.zipfile, 'w')

    siglist = [x.rstrip() for x in open(args.sig_pathlist)]
    all_sigs = siglist + args.signatures

    n = 0
    all_md5=set()
    for i, filename in enumerate(all_sigs):
        if n % 10000 == 0:
            print(f"... processing {n}th signature; currently reading signatures from '{filename}'")

        for sig in sourmash.load_file_as_signatures(filename):
            # zip needs a unique name for each signature. Use sig md5sum.
            md5= sig.md5sum()
            # if this is a duplicate md5sum, add _{number} to make it unique.
            if md5 in all_md5:
                sys.stderr.write(f"{str(sig)} has an md5sum identical to one already in the zipfile ({md5})")
                i=0
                full_md5 = f"{md5}_{i}"
                while full_md5 in all_md5:
                    i+= 1
                    full_md5 = f"{md5}_{i}"
                md5=full_md5
                sys.stderr.write(f"...adding unique md5 {md5} instead")

            all_md5.add(md5)
            md5_name = 'signatures/' + md5 + '.sig'
            sigstr = sourmash.save_signatures([sig], compression=args.compression)
            zf.writestr(md5_name, sigstr)
            n += 1

    print(f"wrote {n} signatures to '{args.zipfile}'")

    return 0


if __name__ == '__main__':
    sys.exit(main())

updated based on slack conversation below:

titus:speech_balloon: 9:34 AM
So, ok, to be clear: what you’re saying is that we should store .sig.gz files in uncompressed .zip files, right? Not uncompressed .sig files in compressed .zip files?

luizirber 9:35 AM
yup (with the caveat that the sig loading code doesn't care about extension, so you can name your compressed sig file as .sig and not .sig.gz if you want)

luizirber 9:37 AM
save_signatures has a compression argument:
https://github.com/dib-lab/sourmash/blob/e2dbe24fa6afd79ea82a79e2e1bc2fb6112a9119/src/sourmash/signature.py#L327

bluegenes:feet: 1:18 PM
so to be even more explicit … in save_signatures, use compression=1 for gzip compression?

luizirber 1:27 PM
1 is fastest/not much compression, 9 is slow/very compressed. SBTs default to compression=1:
https://github.com/dib-lab/sourmash/blob/e2dbe24fa6afd79ea82a79e2e1bc2fb6112a9119/src/sourmash/sbtmh.py#L48

luizirber 1:27 PM
(In fact 9 is not much smaller, but takes WAY longer to calculate)

luizirber 1:30 PM
#648 (comment)

I was running with 9, and it is a pretty big performance impact: almost 7x slower than no compression 0, and 5x slower than 1. All for a meager 6MB difference (from 252 MB in 1 to 246 MB in 9 )

@ctb
Copy link
Contributor Author

ctb commented May 8, 2021

sourmash_args.SaveSignaturesToLocation(...) now provides simple ZipFile output.

#1495 fixed the LCA database issue.

The only remaining issue is this one:

Third, and still a problem, there's a noticeable pause while _load_sbt in _load_database tries to load the .zip file as an SBT. Not sure what's going on here.

I will punt to new issue - #1506

@ctb ctb closed this as completed May 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants