Proof of concept: add support for zipfile.Path when loading raw data #11924

Draft: dmalt wants to merge 8 commits into main from zipfile

Conversation

@dmalt (Contributor) commented Aug 25, 2023

Following the discussion with @agramfort in #11870, I've looked into options to store splits in an archive while keeping preload=False possible. It turns out this is indeed possible with zip archives.
Moreover, since zipfile.Path implements the pathlib.Path interface, adding support for reading from zip archives is mostly a matter of relying on Path instead of str for file paths. With this PR we can now read directly from zip archives:

import pathlib
import zipfile
from zipfile import ZipFile

import mne
from mne.datasets import sample

# create archive with splits
sample_fpath = sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_fpath).crop(tmax=60)
raw_fname = "raw_meg.fif"
raw_zip_fname = raw_fname + ".zip"
raw.save(raw_fname, split_naming="bids", split_size="50MB")
with ZipFile(raw_zip_fname, "w") as zipper:
    zipper.write("raw_split-01_meg.fif")
    zipper.write("raw_split-02_meg.fif")
pathlib.Path("raw_split-01_meg.fif").unlink()
pathlib.Path("raw_split-02_meg.fif").unlink()

# read directly from zip archive
with ZipFile(raw_zip_fname) as zipper:
    fname = zipper.namelist()[0]  # pick raw_split-01_meg.fif
    raw = mne.io.read_raw_fif(zipfile.Path(zipper, fname))
    print(raw)
    # <Raw | raw_split-01_meg.fif, 376 x 36038 (60.0 s), ~3.2 MB, data not loaded>

That is, this PR makes it possible to use zip archives to avoid problems with splits, at least in client code.
@agramfort , @larsoner, @drammock what do you think?

@dmalt dmalt requested a review from sappelhoff as a code owner August 25, 2023 19:07
@dmalt dmalt marked this pull request as draft August 25, 2023 19:09
@dmalt dmalt marked this pull request as ready for review August 26, 2023 06:15
@agramfort (Member)

interesting! thx for looking into this @dmalt

can you test IO speed vs non-zip? Basically, how much slower is it due to compression? And how much is this affected by the choice of compression in the ZipFile object?

@dmalt (Contributor, Author) commented Aug 29, 2023

> can you test IO speed vs non-zip? Basically, how much slower is it due to compression? And how much is this affected by the choice of compression in the ZipFile object?

Sure! Here are the times for reading 2 splits of a small and a large data file with different compression variants. I also added times for loading '.fif.gz' for comparison.

| dataset | preload | variant | time, ms |
| --- | --- | --- | --- |
| small | False | base | 30.08 |
| small | False | zip_stored | 143.47 |
| small | True | base | 272.32 |
| small | True | zip_stored | 414.69 |
| small | False | zip_deflated | 1732.88 |
| small | True | gz | 2022.98 |
| small | False | gz | 2289.92 |
| small | True | zip_deflated | 2738.62 |
| small | False | zip_lzma | 5705.8 |
| small | True | zip_lzma | 8525.89 |
| small | False | zip_bzip2 | 11326.7 |
| small | True | zip_bzip2 | 16672.9 |
| large | False | base | 35.53 |
| large | False | zip_stored | 821.15 |
| large | True | base | 1883.77 |
| large | True | zip_stored | 2977.24 |
| large | False | zip_deflated | 18177.4 |
| large | False | gz | 20176.1 |
| large | True | gz | 20926.4 |
| large | True | zip_deflated | 28689.8 |
| large | False | zip_lzma | 85950 |
| large | False | zip_bzip2 | 107854 |
| large | True | zip_lzma | 126590 |
| large | True | zip_bzip2 | 164603 |

Same data on a plot (image not reproduced here).

I used two MNE datasets for this: sample (small) and one file from brainstorm.bst_resting (large).
The small file is ~250 MB, the large one ~1.6 GB. In both cases I picked split_size so that saving produces 2 splits.

base: simply loading with read_raw_fif()
gz: also read_raw_fif(), but on gzip-compressed ('.fif.gz') files
zip_stored: read from a zip archive with no compression
zip_<alg>: read from a zip archive compressed with <alg>
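For reference, the zip variants above map onto the standard library's compression constants; a quick way to see their effect on archive size (a minimal sketch on dummy data, not the actual benchmark script):

```python
import io
import zipfile

# The zip_* variants above correspond to zipfile's compression constants.
variants = {
    "zip_stored": zipfile.ZIP_STORED,      # no compression
    "zip_deflated": zipfile.ZIP_DEFLATED,  # zlib/deflate
    "zip_bzip2": zipfile.ZIP_BZIP2,
    "zip_lzma": zipfile.ZIP_LZMA,
}

def archive_size(compression, payload=b"\x00" * 4096):
    """Size of an in-memory archive holding `payload` with the given codec."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("data.bin", payload)
    return len(buf.getvalue())

for name, comp in variants.items():
    print(name, archive_size(comp))
```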

If you need more detail, I can post the scripts I used.

I also looked into what causes the slowdown when reading from archives even with preload=False: it's the fid.seek() call, which for archives has to read through the data first, so it's O(n). I kinda expected it to be faster for zips without compression (zip_stored), but it is what it is.
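The effect can be seen with the standard library alone (a minimal sketch using a deflate-compressed member): seeking forward in a file opened from a zip archive consumes the intervening bytes instead of jumping over them.

```python
import os
import tempfile
import zipfile

# Build a small archive with one deflate-compressed member.
zip_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
payload = os.urandom(1024 * 1024)  # 1 MB of (incompressible) random bytes
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("blob.bin", payload)

with zipfile.ZipFile(zip_path) as zf:
    with zf.open("blob.bin") as f:
        # For a compressed member, ZipExtFile implements forward seeks by
        # reading and discarding bytes from the decompression stream, so
        # this costs O(offset), unlike the O(1) seek on a plain disk file.
        f.seek(len(payload) // 2)
        tail = f.read()

assert tail == payload[len(payload) // 2 :]
```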

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> hum so if we use zip_stored then the difference is not huge, assuming we use preload=True. With preload=False I think it requires uncompressing the entire file to read the header.

It's not huge, but it still grows with the file size. In any case, ~800 ms for loading 1.6 GB with preload=False is not too bad, I think.

@dmalt (Contributor, Author) commented Aug 29, 2023

> With preload=False I think it requires uncompressing the entire file to read the header.

Yes, but it's only partly zip's fault. The slowdown comes more or less from a single line in read_tag_info():

    fid.seek(tag.size, 1)

The problem is that tag.size is on average larger when reading a larger file (I measured it). When we just read a file it's not a big deal, because seek() is O(1), but for zip (or gzip) it's O(n). I was a bit surprised when I saw it, because I thought the .fif header was just a fixed number of bytes, no matter the file size.
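For context, here is a simplified sketch of that traversal (hypothetical code, not MNE's read_tag_info(); it assumes a 16-byte big-endian tag header of four int32s (kind, type, size, next) followed by `size` payload bytes):

```python
import io
import struct

def walk_tags(fid):
    """Yield (kind, size) for each tag-like record in the stream.

    Hypothetical sketch: assumes each tag starts with four big-endian
    int32s (kind, type, size, next) followed by `size` payload bytes.
    """
    while True:
        header = fid.read(16)
        if len(header) < 16:
            break
        kind, dtype, size, nxt = struct.unpack(">iiii", header)
        yield kind, size
        # Skipping the payload is the costly step on archives: on a plain
        # file this seek is O(1), on a zip/gzip stream it reads `size` bytes.
        fid.seek(size, 1)

# Tiny demo on an in-memory stream holding two fake tags.
buf = io.BytesIO(
    struct.pack(">iiii", 100, 3, 8, 0) + b"x" * 8
    + struct.pack(">iiii", 101, 3, 4, -1) + b"abcd"
)
print(list(walk_tags(buf)))  # [(100, 8), (101, 4)]
```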

@dmalt (Contributor, Author) commented Aug 29, 2023

> I was a bit surprised when I saw it, because I thought the .fif header was just a fixed number of bytes, no matter the file size.

I'm also wondering: is there a way to load the header without increasing the number of bytes we have to look through? Or is that a quirk of the .fif format with no way around it?

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> makes sense. Where do we go from here?

For now, since adding support for zipfile.Path is almost free, I can add it to writing raw files as well as reading and writing epochs, and cover it with tests.

Then we can optionally add handling of the .zip extension inside mne-python's reading and writing functions for raw and epochs. Alternatively, we can add a recipe to the documentation for handling splits with zip archives, with code like:

with ZipFile(raw_zip_fname) as zipper:
    fname = zipper.namelist()[0]  # pick the split to open
    raw = mne.io.read_raw_fif(zipfile.Path(zipper, fname))

If you ask me, I'd rather go for the second option.
We also need to think about how to pick the first split inside the archive: sorting will only work for 'bids' naming. I guess this can also be the client code's responsibility.

As far as header reading performance goes, I'm not sure. I don't understand the '.fif' file structure well enough to see how this can be fixed, but it feels like it should be fixable.
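For the "pick the first split" question, a hypothetical helper along these lines could live in the documented recipe (it assumes the 'bids' naming scheme, where member names contain a 'split-NN' token):

```python
import re

def first_split_name(names):
    """Return the member name with the lowest split number.

    Hypothetical helper: assumes split_naming='bids' file names such as
    'raw_split-01_meg.fif'; names without a 'split-NN' token sort first.
    """
    def split_no(name):
        m = re.search(r"split-(\d+)", name)
        return int(m.group(1)) if m else 0
    return min(names, key=split_no)

# With an open archive this would be: first_split_name(zipper.namelist())
print(first_split_name(["raw_split-02_meg.fif", "raw_split-01_meg.fif"]))
# raw_split-01_meg.fif
```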

@agramfort (Member) commented Aug 29, 2023 via email

@dmalt (Contributor, Author) commented Aug 29, 2023

> don't go down the fiff-understanding route. It can quickly become a rabbit hole.

Yeah, I got that impression.

> If you think you can do it with documentation then I think it's the safest option.

OK then, let's do it that way.

@dmalt dmalt marked this pull request as draft August 31, 2023 12:25
@dmalt dmalt force-pushed the zipfile branch 5 times, most recently from ccfdd5b to a5767bf Compare September 3, 2023 14:22
@dmalt dmalt mentioned this pull request Sep 4, 2023