Reading single chunk takes 10x longer than remfile #74

Open
rly opened this issue May 27, 2024 · 5 comments
Labels
category: bug errors in the code or code behavior

Comments

Contributor

rly commented May 27, 2024

Using remfile as below:

import remfile
import h5py
import pynwb
import timeit

# URL to HDF5 NWB file
s3_url = "https://dandiarchive.s3.amazonaws.com/blobs/fec/8a6/fec8a690-2ece-4437-8877-8a002ff8bd8a"
byte_stream = remfile.File(url=s3_url)
file = h5py.File(name=byte_stream)
io = pynwb.NWBHDF5IO(file=file)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 0.2 seconds on my laptop.

Using lindi as below:

import lindi
import pynwb
import timeit

# URL to LINDI JSON of NWB file
s3_url = "https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/914/6aa/9146aa46-9c01-45be-9d2a-693e6a7bb778"
client = lindi.LindiH5pyFile.from_lindi_file(url_or_path=s3_url)
io = pynwb.NWBHDF5IO(file=client)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 2.4 seconds on my laptop.

The data chunk size is (13653, 384) with no compression. Nothing stands out in the LINDI JSON. I'm not sure if I am doing something wrong or if there is an inefficiency somewhere in the system.

I'll start looking into it. @magland, do you have any ideas about what might be going on?

@rly rly added the category: bug errors in the code or code behavior label May 27, 2024
Collaborator

magland commented May 27, 2024

@rly

I think what's going on here...

h5py can read partial chunks - and in this case there is no compression so this is possible

whereas lindi/zarr is set up to always read entire chunks

According to the lindi.json file, the chunk size is [13653, 384]

Maybe this is a zarr limitation/constraint/feature?
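For a rough sense of scale, here is a back-of-the-envelope sketch of how much data each approach has to transfer for the slice above. The 2-byte item size is an assumption (the dtype is not stated in this thread); the shapes are the ones reported above.

```python
# Assumption: 2-byte samples (e.g. int16). The thread does not state the dtype.
ITEMSIZE = 2
CHUNK_SHAPE = (13653, 384)  # chunk size reported in the lindi.json file
SLICE_SHAPE = (10, 384)     # data_to_slice[0:10, 0:384]


def nbytes(shape, itemsize=ITEMSIZE):
    """Bytes occupied by an uncompressed array of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


chunk_bytes = nbytes(CHUNK_SHAPE)  # what a whole-chunk read transfers
slice_bytes = nbytes(SLICE_SHAPE)  # what a partial read needs at minimum
ratio = chunk_bytes / slice_bytes

print(f"full chunk:      {chunk_bytes} bytes")
print(f"requested slice: {slice_bytes} bytes")
print(f"over-read:       {ratio:.0f}x")
```

Under that assumption a whole-chunk read moves roughly 10 MB to satisfy a request for a few kilobytes, which would be consistent with the gap in timings even though the measured slowdown is much smaller than the byte ratio (request latency dominates small reads).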

Contributor Author

rly commented May 27, 2024

Ah, that makes sense. After changing the slice size to equal the chunk size, lindi now takes only about 2x as long as remfile. While inspecting the execution, I noticed that zarr requests the key acquisition/ElectricalSeriesAp/data/0.0 twice. I'm trying to figure out why.

But also in digging through the Zarr code, I found that Zarr might be able to support partial reads:
https://github.com/zarr-developers/zarr-python/blob/b1f4c509abaee1cb8dec18e3a973e1199226011a/src/zarr/v2/core.py#L2054-L2095

Right now, execution goes through the else branch because "get_partial_values" is not an attribute of LindiReferenceFileSystemStore.
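To illustrate the general idea (not zarr's actual code path), here is a minimal sketch of feature-detecting a partial-read capability on a store and falling back to a whole-chunk read otherwise. Both store classes and the `read_range` method are hypothetical names made up for this example; they are not the zarr or lindi API, and the real `get_partial_values` signature may differ. The byte-range arithmetic assumes an uncompressed, C-ordered 2-D chunk with all columns selected.

```python
def row_byte_range(row_start, row_stop, n_cols, itemsize):
    """Return (offset, length) in bytes for rows [row_start, row_stop) of an
    uncompressed, C-ordered 2-D chunk when every column is selected."""
    row_bytes = n_cols * itemsize
    return row_start * row_bytes, (row_stop - row_start) * row_bytes


class FullChunkStore:
    """Hypothetical store that can only return whole chunks."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.bytes_served = 0  # stand-in for bytes transferred over the network

    def __getitem__(self, key):
        data = self._chunks[key]
        self.bytes_served += len(data)
        return data


class PartialReadStore(FullChunkStore):
    """Hypothetical store that can also serve a byte range of a chunk."""

    def read_range(self, key, offset, length):  # made-up method name
        data = self._chunks[key][offset:offset + length]
        self.bytes_served += len(data)
        return data


def read_rows(store, key, row_start, row_stop, n_cols, itemsize):
    """Read a block of rows, using a partial read when the store offers one."""
    offset, length = row_byte_range(row_start, row_stop, n_cols, itemsize)
    if hasattr(store, "read_range"):
        return store.read_range(key, offset, length)
    # Fallback: fetch the whole chunk, then slice locally.
    return store[key][offset:offset + length]


# Toy chunk: 100 rows x 4 columns of 2-byte items = 800 bytes.
chunk = bytes(range(200)) * 4
full = FullChunkStore({"0.0": chunk})
partial = PartialReadStore({"0.0": chunk})

rows_from_full = read_rows(full, "0.0", 0, 10, n_cols=4, itemsize=2)
rows_from_partial = read_rows(partial, "0.0", 0, 10, n_cols=4, itemsize=2)
print(full.bytes_served, partial.bytes_served)
```

Both calls return the same 80 bytes, but the store without the partial-read method has to serve the entire chunk to get there. For a remote store, the range read would presumably map onto an HTTP Range request.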

Collaborator

magland commented May 27, 2024

Ah. It will be good to figure out whether the duplicate request can be avoided... and/or whether we should implement some caching for this type of situation.
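One possible shape for such caching, as a sketch only: a small least-recently-used cache keyed by chunk key, placed in front of whatever function actually fetches the bytes. `CachedChunkReader` and `fake_fetch` are hypothetical names for this example and are not part of lindi or zarr.

```python
from collections import OrderedDict


class CachedChunkReader:
    """Hypothetical wrapper that keeps recently fetched chunks in memory."""

    def __init__(self, fetch, max_items=8):
        self._fetch = fetch          # callable: key -> bytes (e.g. a remote read)
        self._max_items = max_items  # maximum number of chunks kept
        self._cache = OrderedDict()

    def __getitem__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._fetch(key)
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict the least recently used
        return value


calls = []


def fake_fetch(key):
    """Stand-in for a remote read; records every key actually fetched."""
    calls.append(key)
    return b"chunk bytes for " + key.encode()


reader = CachedChunkReader(fake_fetch)
key = "acquisition/ElectricalSeriesAp/data/0.0"
first = reader[key]
second = reader[key]  # served from the cache, no second fetch
print(len(calls))
```

With chunks of this size, a limit expressed in bytes rather than in number of items would probably be the safer choice in practice.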

Do you think we should set the get_partial_values attribute somehow?

Contributor Author

rly commented May 28, 2024

"Do you think we should set the get_partial_values attribute somehow?"

Yeah, I think that would be nice, but not urgent. For most large reads, I think it would not make a big difference because the read will be mostly full chunks and some part of a chunk on each axis. And most big datasets are compressed.

If you have time, it would be great if you can take a look but no pressure. Otherwise, I'll try to take a look at it next week.

Collaborator

magland commented May 28, 2024

Makes sense. I'm not going to work on it right now.
