ElectricalSeries: important chunking considerations #52
I don't think it's just chunking that's at play here then. Chunking is a necessary step for enabling compression, which is likely what is actually causing the large slowdown compared to the uncompressed (and un-chunked) data.

To confirm, we would need to find or upload some example files that are chunked but not compressed. I doubt any of those exist on the archive, so I'll just make one and upload it to staging.

Note: it is, however, best practice to always use compression when storing data on the archive, and the NWB Inspector should warn people about it (though DANDI validation will not outright prevent upload or publishing). Otherwise the archive would be much larger than it is today.
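For concreteness, here is a minimal h5py sketch of producing such test files - the same (randomly generated, hypothetical) array written chunked without compression and chunked with compression. File names, shapes, and chunk sizes are placeholders, not the actual staging uploads:

```python
import numpy as np
import h5py

# Hypothetical stand-in for an ElectricalSeries-sized int16 array
data = np.random.randint(-1000, 1000, size=(300000, 96), dtype=np.int16)

with h5py.File("chunked_not_compressed.h5", "w") as f:
    # Chunked, but with no compression filter applied
    f.create_dataset("data", data=data, chunks=(30000, 96), compression=None)

with h5py.File("chunked_and_compressed.h5", "w") as f:
    # Same chunk shape, with gzip compression enabled
    f.create_dataset("data", data=data, chunks=(30000, 96),
                     compression="gzip", compression_opts=4)
```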
BTW, out of curiosity, is there any parallelization going on right now w.r.t. data access calls? While decompression isn't super heavy computationally, it might be beneficial to have multiple chunks requested and decompressed simultaneously, but I'm not familiar enough with how TypeScript/JavaScript interfaces with things like shared memory.
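For reference, the idea of issuing several chunk requests concurrently could be sketched like this in Python (the viewer itself is TypeScript/h5wasm, so this is only an illustration; the URL and byte ranges are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_range(url, start, stop):
    """Fetch bytes [start, stop) of a remote object with an HTTP Range request."""
    headers = {"Range": f"bytes={start}-{stop - 1}"}
    return requests.get(url, headers=headers).content

def fetch_chunks_parallel(url, byte_ranges, max_workers=4):
    """Request several byte ranges (e.g. HDF5 chunks) concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_range, url, a, b) for a, b in byte_ranges]
        return [f.result() for f in futures]

# Hypothetical usage: three 1 MiB ranges fetched concurrently
# chunks = fetch_chunks_parallel(s3_url, [(0, 2**20), (2**20, 2**21), (2**21, 3 * 2**20)])
```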
@CodyCBakerPhD to confirm that the first is using compression and the second is not (I added a couple of fields in the Python output): First (slow):
Second (fast):
Do you think there's a way (in the Python code) to test how much time compression is taking vs. network requests? I suppose one way is to download the file and try it locally -- but the first file is very large, so that's inconvenient. For reference, this is the Python script I am using:

```python
import time
import numpy as np
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

# nwb_object_id = 'c86cdfba-e1af-45a7-8dfd-d243adc20ced'
nwb_object_id = '0828f847-62e9-443f-8241-3960985ddab3'
s3_url = f'https://dandiarchive.s3.amazonaws.com/blobs/{nwb_object_id[:3]}/{nwb_object_id[3:6]}/{nwb_object_id}'
print(s3_url)

fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # cache_storage="nwb-cache",  # Local folder for the cache
)

print('Opening file for lazy reading...')
f = fs.open(s3_url, "rb")
file = h5py.File(f)

print('Getting data object')
x = file['acquisition']['ElectricalSeries']['data']
print(f'Shape of data: {x.shape}')
print('Chunking:', x.chunks)
print('Compression:', x.compression)
print('Compression opts:', x.compression_opts)

print('Getting first 10 rows of data')
timer = time.time()
data_chunk = x[:10, :]
print(f'Shape of data chunk: {data_chunk.shape}')
elapsed_sec = time.time() - timer
print(f'Elapsed time for reading data chunk: {elapsed_sec} sec')
```

Regarding parallelization in the browser: all the hdf5 access takes place in a worker thread, and that worker thread makes synchronous http requests and gunzip calls (I believe that's a requirement for using h5wasm), so there is no concurrency there within a single worker. However, I am using 2 worker threads, so that allows some concurrency at a higher level. Is it the network latency or the decompression? I'm just not sure. Tagging @garrettmflynn
Rigorous performance testing is a very nuanced topic; this is something we want to really sink our teeth into pending a grant we've applied for related to cloud computing, but since we haven't started that yet, I'll just keep things as general as I can to avoid making incorrect claims about specifics.

In general we follow the recommendations of the HDF5 Group on optimizing I/O, but I'll admit that white paper was put together quite a while ago and was likely not tuned specifically for our S3 situation.
The most sophisticated way would be to do a full profile across the underlying calls. The easiest way would be what I'm preparing for you now, which is to use Python timing similar to how you're currently doing it (with one difference elaborated below), but applied to several copies of the same data. Across these copies you have the same exact data written (i) chunked and compressed, (ii) chunked but not compressed, and (iii) neither chunked nor compressed, and you then test and compare timings of the same exact access patterns across each separate copy.

The access pattern itself will also vary a lot between the 3 cases, since how a given read maps onto chunks depends on the chunk shape. So I'd also recommend testing against a number of read patterns that subset the chunk shape, are equal to the chunk shape, and span multiple chunks, to get fairer comparisons.

Technically the type and/or options of compression can also make a big difference, which was the topic of a project by the LBNL team last summer, led by @rly. I'll set that aside for now since it gets even more complicated than this simpler discussion.
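A rough sketch of that comparison harness (the copy URLs are placeholders for the staging files mentioned below, and the "chunk-sized" read assumes a hypothetical 30000-frame chunk):

```python
import time

import fsspec
import h5py

# Placeholder URLs for copies of the same data written with different options
copies = {
    "chunked_and_compressed": "https://example.org/copy_i",
    "chunked_not_compressed": "https://example.org/copy_ii",
    "not_chunked_not_compressed": "https://example.org/copy_iii",
}

# Read patterns that subset a chunk, roughly match a chunk, and span multiple chunks
read_patterns = {
    "small_subset": (slice(0, 10), slice(None)),
    "one_chunk": (slice(0, 30000), slice(None)),
    "many_chunks": (slice(0, 300000), slice(None)),
}

fs = fsspec.filesystem("http")  # no caching, so each timing is a cold read
for copy_name, url in copies.items():
    with fs.open(url, "rb") as f:
        x = h5py.File(f)["acquisition"]["ElectricalSeries"]["data"]
        for pattern_name, sl in read_patterns.items():
            t0 = time.time()
            _ = x[sl]
            print(copy_name, pattern_name, f"{time.time() - t0:.2f} s")
```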
Thanks for sharing the code - I'm going to point you towards a section of a blog post about this. The short version is that timing things this way can be misleading once a local cache is involved: after the first run, reads may be served from disk rather than streamed.
There's also a potential I/O bottleneck if there's any caching to disk involved (and I see your example code is using a CachingFileSystem with fsspec) - that could especially be felt if more CPUs were thrown at the requests (and the bandwidth was confirmed to not be the current bottleneck).
From @bendichter
Sure, I can make a copy of each (i-iii) that has paging set (assuming I can get it to pass).
Thanks for doing this @CodyCBakerPhD. I think it will be important to put the datasets in a cloud bucket to test the remote access. Also, they should be large enough to prevent an artificially high level of cache hits when downloading chunks of the file (sorry if that is not expressed clearly).
Yeah, I'm copying the entire ElectricalSeries data from https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced as a simpler TimeSeries.
@magland The 2 new examples (chunked but uncompressed, and neither chunked nor compressed) should be up by morning at https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=, in a folder called 'for_jeremy'. Where applicable, all chunking and compression parameters are equivalent to the source data.
Thanks @CodyCBakerPhD! Here are some timing results, but a couple of notes first (see the comments in the script):

```python
import time
import numpy as np
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

# see https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=for_jeremy
s3_urls = {
    'chunked_but_not_compressed': 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/738/069/73806960-88ff-4d5e-9920-40d43186cf26',
    'not_chunked_and_not_compressed': 'https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/0c1/51d/0c151db4-27db-4c2f-8290-0626a66526c3'
}

fs = CachingFileSystem(
    fs=fsspec.filesystem("http")
)

timeseries_objects = {}
for k, s3_url in s3_urls.items():
    print('====================================')
    print(k)
    print('Opening file for lazy reading...')
    f = fs.open(s3_url, "rb")
    file = h5py.File(f)
    print('Getting data object')
    x = file['acquisition']['TimeSeries']['data']
    print(f'Shape of data: {x.shape}')
    print('Chunking:', x.chunks)
    print('Compression:', x.compression)
    print('Compression opts:', x.compression_opts)
    print('Getting first 10 rows of data')
    timeseries_objects[k] = x

for k, timeseries_object in timeseries_objects.items():
    print('====================================')
    print(f'Timing reading data for {k}')
    x = timeseries_object
    # timeit doesn't work here because the result is cached after the first run
    timer = time.time()
    data_chunk = x[:10, :]
    print(f'Shape of data chunk: {data_chunk.shape}')
    elapsed_sec = time.time() - timer
    print(f'Elapsed time for reading data chunk: {elapsed_sec} sec')
```
So there is a significant difference between the two schemes, even though both were not compressed. Looking at my network monitor, it seems the difference is the amount of data transferred - and this will depend on the relationship between the chunking scheme of the dataset and the requested slice. This difference is consistent with what I am seeing in neurosift in the browser.

chunked_but_not_compressed
not_chunked_and_not_compressed <------ I think this one actually is chunked
@magland The paginated versions of each combination are up (ignore the 'unchunked' ones of course; regenerating those now). This also includes a compressed + paginated version with the same chunking as the original data.
Looks like I used the wrong approach to disable chunking - regenerating now
Yeah, as I mentioned above, you'd need to remember to set unique cache locations each time to avoid subsequent runs pulling from cache instead of re-testing the stream time. Or not enable caching at all, since we're really trying to test the bandwidth speed vs. other factors. I still strongly recommend the more rigorous profiling approach described above.
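A sketch of those two options - a unique cache folder per run, or no caching layer at all (reusing the dataset path from the earlier script):

```python
import time
import uuid

import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem

def time_first_read(s3_url, use_cache=False):
    """Time a cold read of the first 10 rows, avoiding hits from earlier runs."""
    if use_cache:
        # Unique cache folder per run so nothing is served from a previous run
        fs = CachingFileSystem(
            fs=fsspec.filesystem("http"),
            cache_storage=f"nwb-cache-{uuid.uuid4()}",
        )
    else:
        # No caching layer at all: every byte comes over the network
        fs = fsspec.filesystem("http")
    with fs.open(s3_url, "rb") as f:
        file = h5py.File(f)
        x = file["acquisition"]["TimeSeries"]["data"]
        t0 = time.time()
        _ = x[:10, :]
        return time.time() - t0

# time_first_read(s3_url, use_cache=False)  # cold, fully streamed read
```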
Yes, with respect to the read access pattern (10, 96), I'd recommend trying a wider range of access patterns representing real-life situations, such as zooming in on several localized waveforms (a smaller range of channels, maybe a ~50 frame slice or so), monitoring global variations over longer time scales, and so on.

Despite my mistake, this is actually useful. What happened was that the write procedure defaulted to using the h5py default chunking rather than disabling chunking entirely. So the difference in speed reading uncompressed data is in line with my hypothesis that a data access request for any sub-slice of a chunk streams the entire chunk's worth of data. Not saying this is absolute confirmation, but it could certainly explain the current observation.
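Under that hypothesis, the bytes transferred by a read can be roughly estimated from how many chunks the slice touches; a back-of-the-envelope helper, assuming int16 samples and a hypothetical chunk shape for the example:

```python
import math

def estimated_bytes_streamed(read_shape, chunk_shape, itemsize=2):
    """Estimate bytes transferred if every touched chunk is streamed in full.

    read_shape  -- (frames, channels) requested, assumed aligned to the dataset start
    chunk_shape -- (frames, channels) of the HDF5 chunks
    itemsize    -- bytes per sample (2 for int16)
    """
    chunks_touched = 1
    for want, chunk in zip(read_shape, chunk_shape):
        chunks_touched *= math.ceil(want / chunk)
    chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize
    return chunks_touched * chunk_bytes

# A (10, 96) read against hypothetical (298022, 96) chunks pulls the whole ~57 MB chunk:
print(estimated_bytes_streamed((10, 96), (298022, 96)) / 1e6, "MB")
# The same read against chunks shaped close to the read would pull only a few KB.
```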
Thanks! I agree that we'll want to look at the various read access patterns. For sanity (this is getting pretty complicated), I suggest we focus first on one single measure: how fast is the initial load of the traces in neurosift?

For that question there is a very clear winner: unchunked_and_uncompressed_* (which actually is the default chunking from h5py) beats all the others by a long shot. Right now, neurosift is set to only load the first 15 channels, and the difference is waiting around 15 seconds with a 10 MB data download on the one hand, versus waiting around 1 second with less than 1 MB downloaded on the other. If I adjust the app to try to load ALL the channels, the one goes for more than 100 seconds before I closed the tab, whereas the other takes just a few seconds.

So, before getting into other use cases, there is an overwhelming winner here from the perspective of initial load of a small amount of data (which is the highest priority for me). The question remains whether it would still be performant if we added compression to the default h5py chunking. Lazy loading via Python is a different consideration, which is important, but I'm hoping to focus first on the neurosift performance. Also, the different pagination schemes don't seem to make an obvious difference - at least when loading into neurosift.
That's what I was trying to tell you lol 😆 this is a very deep rabbit hole, and premature optimization can be detrimental to rapid dev momentum. Just constraining to rough neurosift app speeds for now is fine by me.

I have now re-uploaded (replaced) the files with properly unchunked versions, though I had to constrain them to the first ~100 seconds or so because any attempt to buffer the data write triggered chunking to be enabled. Curious what you notice on the speeds when the data is completely unchunked (though I notice I can't seem to control the second dimension of the TimeSeries view).

For the final round of fair use-case tests I'll generate some with the default h5py chunking as well as modern NeuroConv chunking (which differs from the old source data), both compressed and uncompressed (and name the files accordingly). I'll probably skip pagination for now on those.
The last thing I'll mention here is: if we accept my notion of the correlation between streaming speed and chunk size, then we are left with the following argument.

a) published data assets have their dataset options, including chunking, frozen and cannot be changed to optimize streaming speeds

Towards the aim of (d), given the constraints of (a-c), I'm driven to the following conclusion: neurosift should attempt to work as best as it can around the chunk shape instead of being tuned to specific target ranges, even if those targets are more 'ideal' in general.

What I mean by this is: if streaming requests transfer an entire chunk's worth of data even if you didn't intend to use all the data in that chunk, the way to improve performance is for the default rendering to make the most of what's been returned. That is, instead of always having a default view that targets, say, (5 seconds, 15 channels) - which would only work well when source chunks are roughly (~moderate amount of time, ~small number of channels), as opposed to the poor performance seen with source chunks of (~small amount of time, all channels) or (~large amount of time, 1 channel) - instead make the default view always equal in shape to the first chunk, or the first few chunks that add up to some very small data threshold (such as ~5 MB), thus ensuring the user's bandwidth is being utilized as much as possible upon the first view of the data. What would you think of that approach?

Granted, if the user then selects different channel ranges or scans/plays or zooms in/out over time, that's when we get sent to the broader discussion of access patterns that can wait for another day - but there's still not too much we can do right away given the constraints (a-c).
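A sketch of that idea - deriving the initial view extent from the dataset's own chunk shape and a small byte budget rather than a fixed (seconds, channels) target; the ~5 MB budget and int16 itemsize are the illustrative assumptions:

```python
def default_view_shape(dataset_shape, chunk_shape, itemsize=2, budget_bytes=5_000_000):
    """Pick an initial (frames, channels) view that aligns with whole chunks.

    Starts from one chunk and extends along the time axis in whole-chunk steps
    until the byte budget would be exceeded.
    """
    chunk_frames, chunk_channels = chunk_shape
    total_frames, total_channels = dataset_shape
    view_channels = min(chunk_channels, total_channels)
    chunk_bytes = chunk_frames * view_channels * itemsize
    n_time_chunks = max(1, budget_bytes // chunk_bytes)
    view_frames = min(int(n_time_chunks) * chunk_frames, total_frames)
    return (view_frames, view_channels)

# Chunks of (30000, 64) int16 are ~3.8 MB each, so the default view is one chunk:
print(default_view_shape((30_000_000, 384), (30000, 64)))  # -> (30000, 64)
```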
Thanks @CodyCBakerPhD, what you say makes sense. A lot of this is pretty mysterious to me. I'll follow up on Monday.
There's a lot to discuss here. To start, here are some Python timings based on the new data you uploaded and your suggestion to look at reading other sizes of data. I ran each timing twice to get at least some idea of the variability.

For reading a (10, 96) segment:

For reading a (298022, 96) segment:

I chose (298022, 96) so that the 298022 matches the chunk size of the first two examples.
There's a great YouTube video on this: https://www.youtube.com/watch?v=rcS5vt-mKok. We can try to implement some of these features.
First steps are in this PR: hdmf-dev/hdmf#925 (exposure of pagination settings, with larger defaults). Partial chunk shaping given a size limit has been requested and will be given higher priority: catalystneuro/neuroconv#17 (the ability to say 'chunk this SpikeGLX series by a given number of channels up to a size limit').
I hesitate to add yet another consideration here, but... there is a tricky interplay between the emscripten lazy (virtual) file that h5wasm uses and the chunking and data layout in the remote h5 file. You need to specify a chunk size for the virtual file (completely different from the h5 dataset chunking). If it's too small, you end up with too many GET requests and a low transfer rate due to the latency of each request. If it's too large, you end up downloading a lot more data than is actually needed. For example, if the virtual file chunk size is 1 MB and you only need to read the attributes of an h5 group, then you waste nearly the entire 1 MB request. On the other hand, if the chunk size is 10 KB, then you need thousands of individual requests to download a 10 MB slice of a dataset.

The problem is that when implementing the virtual file (backed by the synchronous http requests in the worker thread), you don't know how many bytes are being requested in any given read operation. You only get a request to read the data starting at a certain file position. This makes it impossible to distinguish between when the system is requesting small or large blocks of data. And even if the virtual file chunk size matched the requested chunk size for datasets, there's no reason to expect that the chunks will align efficiently. Basically what I'm saying is that the h5wasm implementation is not as simple as requesting the needed data ranges in the h5 file. There is a virtual file chunking system at play, and it's not very flexible. I have found that a 100 KB virtual file chunk size is about optimal for these datasets, so that's what I am using now.

One issue is that all the meta information (attributes of groups and datasets, plus the data for small datasets) needs to be read very rapidly on page load. This is not efficient if that metadata is scattered randomly throughout a very large nwb file. To remedy this, I have been precomputing a meta-only version of the nwb file and storing it on a separate server. That file ends up being 2-10 MB in size (usually), so for it I use a larger virtual file chunk size (4 MB), and it loads all that initial information very quickly. For more info on how I am caching a meta-only version of the nwb files, see #7
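For readers unfamiliar with that pattern, here is a rough Python analogue of such a fixed-size virtual-file cache over HTTP range requests (the real implementation lives in the neurosift worker in TypeScript; the 100 KB size mirrors the value mentioned above):

```python
import requests

class ChunkedRemoteFile:
    """File-like reads backed by fixed-size HTTP range requests, mimicking
    the virtual-file chunking described above."""

    def __init__(self, url, chunk_size=100 * 1024):  # ~100 KB, as in the comment
        self.url = url
        self.chunk_size = chunk_size
        self.pos = 0
        self._cache = {}  # chunk index -> bytes

    def _get_chunk(self, index):
        if index not in self._cache:
            start = index * self.chunk_size
            stop = start + self.chunk_size - 1
            resp = requests.get(self.url, headers={"Range": f"bytes={start}-{stop}"})
            self._cache[index] = resp.content
        return self._cache[index]

    def seek(self, pos):
        self.pos = pos

    def read(self, n):
        # Note: the requested length n is known here, but in the emscripten
        # virtual file only the start position is visible, which is exactly
        # the problem described above.
        out = bytearray()
        while len(out) < n:
            index, offset = divmod(self.pos, self.chunk_size)
            piece = self._get_chunk(index)[offset:offset + n - len(out)]
            if not piece:
                break
            out.extend(piece)
            self.pos += len(piece)
        return bytes(out)
```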
It's fine to store intermediate files as a stop-gap, but ultimately we'd like to have default file settings that are optimized for cloud access so these types of methods are unnecessary. We have been experimenting with packing h5 files so that all of this metadata is together and can be read in one or a few requests. Can you point me to a file that is giving you trouble in this category? I'll use that to test this new approach.
There are certainly more extreme examples, but here's one where there is a noticeable difference in the initial load time.

Without using my meta nwb (slow):

With using my meta nwb (fast): (note the no-meta=1 query parameter)

For me the difference is 6 seconds versus <1 second.
I uploaded some ephys datasets to DANDI. Notice how fast this loads when I use no compression and no chunking. I would expect similar performance with large chunks (covering all channels) and no compression. The next best option (IMO) would be large chunks (covering all channels) with compression, although this is not ideal for viewing small time segments or random access - so the chunks should be around 5-10 MB, IMO. The least cloud-optimal options are small chunks (or chunks covering a small number of channels), with or without compression, or very large chunks with compression.
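A small helper for reasoning about those numbers - how much recording time a chunk spanning all channels covers for a given channel count and target chunk size (assuming int16 samples at 30 kHz):

```python
def chunk_duration_seconds(n_channels, target_mb=10, sampling_rate=30000, itemsize=2):
    """Seconds of recording covered by one chunk spanning all channels."""
    frames_per_chunk = (target_mb * 1024 * 1024) // (n_channels * itemsize)
    return frames_per_chunk / sampling_rate

# 32 channels: a 10 MB chunk spans ~5.5 seconds of 30 kHz data
print(round(chunk_duration_seconds(32), 1))
# 384 channels (NeuroPixels): the same 10 MB spans less than half a second
print(round(chunk_duration_seconds(384), 2))
```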
Even for NeuroPixels? By my calculations, a ~10 MB chunk over all 384 channels would only cover about 1/3 of a second of equivalent real-time data. I'm about to repackage some of the raw data from...
Oh yeah, I wasn't thinking of the large channel counts (I mainly work with 32 ch). I guess this depends on whether people will usually want to access all the channels or a subset. This is really difficult - I don't know a good way here. :)

With 32 ch, how big of a time window would you ever want to look at?

Usually just a few seconds, but perhaps up to 30 seconds.
Thanks, that's helpful. I'll add some chunk shapes along those lines to experiment with.
@bendichter Here's a more dramatic example highlighting the need for the meta nwb file: this one takes around 20 seconds to load, compared with just a second for this one. It's because of the 188 acquisition items.
Which file on DANDI is that? I can try repacking it really quickly to see if paginated metadata solves the initial loading problem.
It's this: https://dandiarchive.org/dandiset/000615/draft/files?location=sub-001 / sub-001_ses-20220411.nwb
@magland OK, here is that example paginated at 10 MiB: https://flatironinstitute.github.io/neurosift/?p=/nwb&url=https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/c61/76f/c6176f66-c606-44eb-8d68-85ef71c3808d (located in https://gui-staging.dandiarchive.org/dandiset/200560/draft/files?location=for_jeremy). Is that any faster for you?

One other thing is I do not know how, or if, the page buffer size can be set on your end. The analog in h5py must be set to a multiple of the page layout to work well; I'd recommend matching the 10 MiB page size.
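For reference, a rough h5py sketch of the paged file-space strategy being discussed; the `fs_strategy`/`fs_page_size` creation keywords and the `page_buf_size` open keyword are my assumption of a recent h5py built against HDF5 >= 1.10.1, so treat the exact names as such:

```python
import numpy as np
import h5py

# Writing: request the PAGE file-space strategy with ~10 MiB pages
# (assumed keywords; available in recent h5py with HDF5 >= 1.10.1)
with h5py.File("paged_example.h5", "w",
               fs_strategy="page", fs_page_size=10 * 1024 * 1024) as f:
    f.create_dataset("data", data=np.zeros((30000, 64), dtype=np.int16),
                     chunks=(30000, 64), compression="gzip")

# Reading: enable a page buffer at least as large as (a multiple of) the page size
f = h5py.File("paged_example.h5", "r", page_buf_size=10 * 1024 * 1024)
```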
@CodyCBakerPhD That makes a huge difference! Your paginated version loads in just a couple of seconds for me (notice no-meta=1), compared with this slow one. The meta nwb method still seems to be a bit faster than the paginated one, but both are fast enough.
Great to hear 👍 That alone is a pretty easy change to make, aside from the chunking problem, so we'll start incorporating it across the ecosystem.
The new round of NeuroPixel examples is up for qualitative evaluation. The data stream that was repacked is the raw ElectricalSeries, and the file names indicate the chunking options used for each. Let me know which you prefer so we can update that as the default for that and similar formats.
This is great, thank you @CodyCBakerPhD! They all seem to provide a better neurosift experience than the previous default, especially when num_channels per chunk >= 32. The ~10 MB chunks work well, I think. Here's a table showing how many chunks are needed for various reasonable loading patterns. The columns are the number of channels in a chunk.
From this it seems like 64, 96, or 192 are all good choices. The middle one is 96, but I'd be happy with 64 as well. One note: these are 10 MB chunks from the perspective of the uncompressed data; the compressed chunks will be smaller. In neurosift on my wi-fi laptop, each chunk takes around 5 seconds to load, so I don't think I'd want to go higher than that. Here's the code to generate the table above:

```python
import math

ff = 30000
ncs = [1, 16, 32, 64, 96, 192, 384]
rows = [
    ('1ch 1tp', 1, 1),
    ('1ch 50ms', 1, 0.05*ff),
    ('1ch 1sec', 1, 1*ff),
    ('1ch 10sec', 1, 10*ff),
    ('32ch 1ms', 32, 0.001*ff),
    ('32ch 50ms', 32, 0.05*ff),
    ('32ch 1sec', 32, 1*ff),
    ('32ch 10sec', 32, 10*ff),
    ('128ch 1ms', 128, 0.001*ff),
    ('128ch 10ms', 128, 0.01*ff),
    ('128ch 50ms', 128, 0.05*ff),
    ('128ch 1sec', 128, 1*ff),
]
print('| | ' + ' | '.join([str(x) for x in ncs]) + ' |')
print('| --- |' + ' | '.join(['---' for x in ncs]) + ' |')
for r in rows:
    a = []
    for nc in ncs:
        nf = (10 * 1024 * 1024 / (nc * 2))  # frames per 10 MB chunk (int16)
        a.append(math.ceil(r[2] / nf) * math.ceil(r[1] / nc))
    print('| ' + r[0] + '|' + ' | '.join([str(x) for x in a]) + ' |')
```
@magland I had to go into dandiset 000059 to fix some unrelated things, so while I was there I implemented all the practices suggested here. Try out any/all of the files here: https://dandiarchive.org/dandiset/000059/0.230907.2101/files?location=

The raw ones are not paginated, as pagination inflated the file size ~20%, pretty much undoing any gains from compression. The processed ones are paginated at 10 MiB. All chunk shapes for all data streams are around ~10 MiB, with the ElectricalSeries in particular chunked by 64 channels.
Thanks @CodyCBakerPhD. I took a look at a few of them, and they do seem pretty responsive.
Can this be closed? These examples and the greater discussion are making their way to https://github.com/NeurodataWithoutBorders/nwb_benchmarks. Also, NeuroConv defaults have used a 64-channel max up to 10 MB total size for a while now.
Closing.
This is more of a discussion than an issue.
Buzsaki dataset from 2021 (YutaMouse33-150222)
ElectricalSeries data loading in neurosift is very slow, especially for large numbers of channels. I looked into the chunking size and got the following:
Compare that with a much more recent dataset (000463 draft):
It appears that in the second case, chunking is turned off, and it makes a huge difference for remote reading of data (well, there must be some caching going on here).
You can also see a dramatic difference in loading time in neurosift:
The slow first one (with chunking):
https://flatironinstitute.github.io/neurosift/#/nwb?url=https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced
Takes 10 seconds to load the initial view
The fast second one (with no chunking):
https://flatironinstitute.github.io/neurosift/#/nwb?url=https://dandiarchive.s3.amazonaws.com/blobs/082/8f8/0828f847-62e9-443f-8241-3960985ddab3
Takes less than 1 second to load the initial view
Even if chunking is needed for very large datasets, I think that the chunks should be much larger in both the channel and time dimensions.
@bendichter