Proof of concept: add support for zipfile.Path when loading raw data #11924

Draft: dmalt wants to merge 8 commits into main from zipfile

Conversation

@dmalt (Contributor) commented Aug 25, 2023

Following the discussion with @agramfort in #11870, I've looked into options to store splits in an archive while keeping preload=False possible. It turns out this is indeed possible with zip archives.
Moreover, since zipfile.Path implements the pathlib.Path interface, adding support for reading from zip archives is mostly a matter of relying on Path instead of str for file paths. With this PR we can now read directly from zip archives:

import pathlib
import zipfile
from zipfile import ZipFile

import mne
from mne.datasets import sample

# create archive with splits
sample_fpath = sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_fpath).crop(tmax=60)
raw_fname = "raw_meg.fif"
raw_zip_fname = raw_fname + ".zip"
raw.save(raw_fname, split_naming="bids", split_size="50MB")
with ZipFile(raw_zip_fname, "w") as zipper:
    zipper.write("raw_split-01_meg.fif")
    zipper.write("raw_split-02_meg.fif")
pathlib.Path("raw_split-01_meg.fif").unlink()
pathlib.Path("raw_split-02_meg.fif").unlink()

# read directly from zip archive
with ZipFile(raw_zip_fname) as zipper:
    fname = zipper.namelist()[0]  # pick raw_split-01_meg.fif
    raw = mne.io.read_raw_fif(zipfile.Path(zipper, fname))
    print(raw)
    # <Raw | raw_split-01_meg.fif, 376 x 36038 (60.0 s), ~3.2 MB, data not loaded>

That is, this PR makes it possible to use zip archives to avoid problems with splits, at least in client code.
@agramfort , @larsoner, @drammock what do you think?

@dmalt dmalt requested a review from sappelhoff as a code owner August 25, 2023 19:07
@dmalt dmalt marked this pull request as draft August 25, 2023 19:09
@dmalt dmalt marked this pull request as ready for review August 26, 2023 06:15
@agramfort (Member)

interesting! thx for looking into this @dmalt

can you test IO speed vs non-zip? Basically, how much slower is it due to compression? And how much is this affected by the choice of compression in the ZipFile object?

@dmalt (Contributor, Author) commented Aug 29, 2023

> can you test IO speed vs non-zip? Basically, how much slower is it due to compression? And how much is this affected by the choice of compression in the ZipFile object?

Sure! Here are the times for reading 2 splits of a small and a large data file with different compression variants. I also added times for loading '.fif.gz' for comparison.

| dataset | preload | variant | time, ms |
| --- | --- | --- | --- |
| small | False | base | 30.08 |
| small | False | zip_stored | 143.47 |
| small | True | base | 272.32 |
| small | True | zip_stored | 414.69 |
| small | False | zip_deflated | 1732.88 |
| small | True | gz | 2022.98 |
| small | False | gz | 2289.92 |
| small | True | zip_deflated | 2738.62 |
| small | False | zip_lzma | 5705.8 |
| small | True | zip_lzma | 8525.89 |
| small | False | zip_bzip2 | 11326.7 |
| small | True | zip_bzip2 | 16672.9 |
| large | False | base | 35.53 |
| large | False | zip_stored | 821.15 |
| large | True | base | 1883.77 |
| large | True | zip_stored | 2977.24 |
| large | False | zip_deflated | 18177.4 |
| large | False | gz | 20176.1 |
| large | True | gz | 20926.4 |
| large | True | zip_deflated | 28689.8 |
| large | False | zip_lzma | 85950 |
| large | False | zip_bzip2 | 107854 |
| large | True | zip_lzma | 126590 |
| large | True | zip_bzip2 | 164603 |

Same data on a plot (image not reproduced here).

I used two MNE datasets for this: sample (small) and one file from brainstorm.bst_resting (large).
The small file is ~250 MB, the large one ~1.6 GB. In both cases I picked split_size so that saving produces 2 splits.

base: simply loading with read_raw_fif()
gz: also read_raw_fif(), but on gzip-compressed ('.fif.gz') files
zip_stored: read from a zip archive with no compression
zip_<alg>: read from a zip archive compressed with <alg>
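For reference, the zip variants above map onto the standard library's compression constants; a quick way to see their effect on archive size (a minimal sketch on dummy data, not the actual benchmark script):

```python
import io
import zipfile

# The zip_* variants above correspond to zipfile's compression constants.
variants = {
    "zip_stored": zipfile.ZIP_STORED,      # no compression
    "zip_deflated": zipfile.ZIP_DEFLATED,  # zlib/deflate
    "zip_bzip2": zipfile.ZIP_BZIP2,
    "zip_lzma": zipfile.ZIP_LZMA,
}

def archive_size(compression, payload=b"\x00" * 4096):
    """Size of an in-memory archive holding `payload` with the given codec."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("data.bin", payload)
    return len(buf.getvalue())

for name, comp in variants.items():
    print(name, archive_size(comp))
```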

If you need more detail, I can post the scripts I used.

I also looked into what causes the slowdown when reading from archives even with preload=False: it's the fid.seek() call, which for archives has to read through the data first, so it's O(n). I kinda expected it to be faster for zips without compression (zip_stored), but it is what it is.
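The effect can be seen with the standard library alone (a minimal sketch using a deflate-compressed member): seeking forward in a file opened from a zip archive consumes the intervening bytes instead of jumping over them.

```python
import os
import tempfile
import zipfile

# Build a small archive with one deflate-compressed member.
zip_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
payload = os.urandom(1024 * 1024)  # 1 MB of (incompressible) random bytes
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("blob.bin", payload)

with zipfile.ZipFile(zip_path) as zf:
    with zf.open("blob.bin") as f:
        # For a compressed member, ZipExtFile implements forward seeks by
        # reading and discarding bytes from the decompression stream, so
        # this costs O(offset), unlike the O(1) seek on a plain disk file.
        f.seek(len(payload) // 2)
        tail = f.read()

assert tail == payload[len(payload) // 2 :]
```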

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> hum so if we use zip_stored then the difference is not huge, assuming we use preload=True. With preload=False I think it requires uncompressing the entire file to read the header.

It's not huge, but it still grows with the file size. In any case, ~800 ms for loading 1.6 GB with preload=False is not too bad, I think.

@dmalt (Contributor, Author) commented Aug 29, 2023

> With preload=False I think it requires uncompressing the entire file to read the header.

Yes, but it's only partly zip's fault. The slowdown comes more or less from a single line in read_tag_info():

    fid.seek(tag.size, 1)

The problem is that tag.size is on average larger when reading a larger file (I measured it). When we just read a file it's not a big deal, because seek() is O(1), but for zip (or gzip) it's O(n). I was a bit surprised when I saw it, because I thought the .fif header was just a fixed number of bytes, no matter the file size.
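For context, here is a simplified sketch of that traversal (hypothetical code, not MNE's read_tag_info(); it assumes a 16-byte big-endian tag header of four int32s (kind, type, size, next) followed by `size` payload bytes):

```python
import io
import struct

def walk_tags(fid):
    """Yield (kind, size) for each tag-like record in the stream.

    Hypothetical sketch: assumes each tag starts with four big-endian
    int32s (kind, type, size, next) followed by `size` payload bytes.
    """
    while True:
        header = fid.read(16)
        if len(header) < 16:
            break
        kind, dtype, size, nxt = struct.unpack(">iiii", header)
        yield kind, size
        # Skipping the payload is the costly step on archives: on a plain
        # file this seek is O(1), on a zip/gzip stream it reads `size` bytes.
        fid.seek(size, 1)

# Tiny demo on an in-memory stream holding two fake tags.
buf = io.BytesIO(
    struct.pack(">iiii", 100, 3, 8, 0) + b"x" * 8
    + struct.pack(">iiii", 101, 3, 4, -1) + b"abcd"
)
print(list(walk_tags(buf)))  # [(100, 8), (101, 4)]
```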

@dmalt (Contributor, Author) commented Aug 29, 2023

> I was a bit surprised when I saw it, because I thought the .fif header was just a fixed number of bytes, no matter the file size.

I'm also wondering: is there a way to load the header without increasing the number of bytes we have to look through? Or is that a quirk of the .fif format with no way around it?

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> makes sense. Where do we go from here?

For now, since adding support for zipfile.Path is almost free, I can add it to writing raw files as well as reading and writing epochs, and cover it with tests.

Then we can optionally add handling of the .zip extension inside mne-python's reading and writing functions for raw and epochs. Alternatively, we can add a recipe to the documentation for handling splits with zip archives, with code like:

with ZipFile(raw_zip_fname) as zipper:
    fname = zipper.namelist()[0]  # pick the split to open
    raw = mne.io.read_raw_fif(zipfile.Path(zipper, fname))

If you ask me, I'd rather go for the second option.
We also need to think about how to pick the first split inside the archive: sorting will only work for 'bids' naming. I guess this can also be the client code's responsibility.

As far as header reading performance goes, I'm not sure. I don't understand the '.fif' file structure well enough to see how this can be fixed, but it feels like it should be fixable.
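For the "pick the first split" question, a hypothetical helper along these lines could live in the documented recipe (it assumes the 'bids' naming scheme, where member names contain a 'split-NN' token):

```python
import re

def first_split_name(names):
    """Return the member name with the lowest split number.

    Hypothetical helper: assumes split_naming='bids' file names such as
    'raw_split-01_meg.fif'; names without a 'split-NN' token sort first.
    """
    def split_no(name):
        m = re.search(r"split-(\d+)", name)
        return int(m.group(1)) if m else 0
    return min(names, key=split_no)

# With an open archive this would be: first_split_name(zipper.namelist())
print(first_split_name(["raw_split-02_meg.fif", "raw_split-01_meg.fif"]))
# raw_split-01_meg.fif
```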

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> don't go down the fiff-understanding route. It can quickly become a rabbit hole.

Yeah, I got that impression.

> If you think you can do it with documentation then I think it's the safest option.

OK then, let's do it that way.

@dmalt dmalt marked this pull request as draft August 31, 2023 12:25
@dmalt dmalt force-pushed the zipfile branch 5 times, most recently from ccfdd5b to a5767bf Compare September 3, 2023 14:22
@dmalt dmalt mentioned this pull request Sep 4, 2023