-
Thanks for the data example! A couple of things I noticed right away:
Those dimensions are incorrect for NWB; they should be time-first, i.e. (frames, rows, cols, planes). Running the NWB Inspector (used for DANDI validation as well) should help catch such things.
What version of NeuroConv are you using to run your ImagingExtractorInterface? I ask mainly because those chunks are smaller than the modern recommendation, and recent versions of NeuroConv should automatically choose better ones for you (namely, chunking by entire frames, i.e. all rows/cols, then including as many frames as possible up to the max chunk size; how frames are distributed over time or z-planes is a separate question that depends on the use case).
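For intuition, a rough sketch of that frame-wise heuristic (this is not NeuroConv's actual implementation, and the 10 MB chunk-size cap and function name are assumptions for illustration):

```python
# Rough sketch of frame-wise chunking: keep all rows/cols in each chunk,
# then add as many frames as fit under an assumed chunk-size cap.
import math
import numpy as np

def frame_chunk_shape(full_shape, dtype, chunk_mb=10.0):
    """Illustrative only; not NeuroConv's code. chunk_mb is an assumed cap."""
    n_frames, n_rows, n_cols = full_shape[:3]
    extra = full_shape[3:]                               # e.g. a trailing z-plane axis
    frame_bytes = n_rows * n_cols * math.prod(extra) * np.dtype(dtype).itemsize
    frames_per_chunk = max(1, int(chunk_mb * 1e6 // frame_bytes))
    return (min(frames_per_chunk, n_frames), n_rows, n_cols, *extra)

# With the dataset discussed here, (1200, 1188, 1213, 3) float32, a single frame
# is already ~17 MB, so the heuristic falls back to one frame per chunk:
print(frame_chunk_shape((1200, 1188, 1213, 3), "float32"))  # (1, 1188, 1213, 3)
```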
-
Hi Cody! Hmm, thanks for catching that dimensions oddity; I had some issues with transposed matrices from our .mat files which I thought I had sorted out, but apparently I didn't. I'll have to revisit that. I installed NeuroConv and roiextractors by pip-installing from clones of the GitHub repos, so they are at the respective repo heads. The ImagingExtractor I wrote is strongly based on Hdf5ImagingInterface and Hdf5ImagingExtractor, with a few tweaks to accommodate those transposed arrays and some custom metadata. I ended up not using your lazy_ops module, just plain h5py for data access in the .mat files. So I'm not sure why the chunking is off. Would you perhaps have a clue as to where this is determined in the code? I can debug some more. Thanks!
-
Sorry for the inaccuracy; I last pulled about two weeks ago, and I'm at this commit, so it's pretty recent. I must have gotten something wrong in my interface implementation. You can find my module with both the ImagingExtractor and ImagingExtractorInterface implementations here. The test was run as in the last two cells of this notebook; please ignore the rest of the notebook, that's just random test snippets. I didn't get any warnings, but I had verbose=False. I'll test some more tomorrow. Thank you!
-
Hi! I fixed my issue with the wrong dimension order (I got confused by the fact that the chunk iterator does its own transpose). I uploaded a new NWB example data file with the correct dimensions (frames, rows, cols, planes) = (1200, 1188, 1213, 3); the datatype is float32. The chunk size is still relatively small, (22, 22, 22, 3). I think this was chosen as an indirect consequence of the buffer size limit, which is 1 GB: the buffers used during data conversion were (44, 1188, 1213, 3), which is 760 MB, so probably as large as possible, and the chunk size was then chosen to efficiently tile the buffer while staying below 1 MB. In my case, this resulted in a relatively small chunk (127 kB). Would you recommend increasing the buffer and/or chunk size? What chunk size should I aim for? As for compression, is there anything you would recommend I try to get better compression efficiency, or do you think my current ratio (119.73%) is acceptable?
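For reference, my rough understanding of how chunk shape and compression could be pinned by hand when building the series with PyNWB's H5DataIO; this is a sketch with a small stand-in array and placeholder values, not my actual conversion code:

```python
# Sketch only: explicitly setting chunks and compression via H5DataIO.
# The array is a small stand-in; for the real data the chunks would be
# something like (1, 1188, 1213, 1).
import numpy as np
from pynwb import H5DataIO

data = np.zeros((50, 128, 128, 3), dtype="float32")   # stand-in for the real volume

wrapped = H5DataIO(
    data=data,
    chunks=(1, 128, 128, 1),   # one frame of one plane per chunk
    compression="gzip",        # deflate, as in my current files
    compression_opts=4,        # gzip level; higher is slower with little gain on noisy data
    shuffle=True,              # byte-shuffle can help float data compress slightly better
)
# `wrapped` would then be passed as the `data` argument of e.g. a TwoPhotonSeries.
```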
-
Hi Cody! Sure, here's my environment:
-
Alright, after upgrading to the latest main branches of neuroconv, roiextractors, and hdmf, the chunk size is now (1, 1188, 1213, 1) = 5,764,176 bytes. Compression utilization increased very slightly, to 122%. I don't have the time to try different compressors, so if you don't see a red flag then it is what it is. This is rather noisy data, so without doing something much smarter, I don't think the other generic compressors would do dramatically better. One last question: compression is single-threaded and thus quite a bottleneck. Would it be acceptable if, instead of packing our entire 4D dataset into a single NWB file, we generated a separate NWB file for each of our 30 planes? Is there anything we should watch out for in that case regarding metadata and file structure? Any example use cases out there that we should learn from? Thank you!
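If a quick sanity check on other generic codecs is ever worth doing without rerunning a conversion, something like the following should give a rough idea by compressing a single frame in memory; the file and dataset paths are placeholders:

```python
# Rough sketch: compress one frame of one plane from the existing HDF5/NWB file
# with a few generic codecs and compare ratios. Paths below are placeholders.
import bz2
import lzma
import zlib

import h5py
import numpy as np

with h5py.File("example.nwb", "r") as f:
    ds = f["acquisition/TwoPhotonSeries/data"]             # placeholder dataset path
    raw = np.ascontiguousarray(ds[0, :, :, 0]).tobytes()   # one frame of one plane

for name, compress in [
    ("zlib level 4", lambda b: zlib.compress(b, 4)),
    ("zlib level 9", lambda b: zlib.compress(b, 9)),
    ("bz2 level 9", lambda b: bz2.compress(b, 9)),
    ("lzma (xz)", lzma.compress),
]:
    print(f"{name}: {len(raw) / len(compress(raw)):.2f}x")
```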
-
Thank you, Cody! The way we analyze things will remain in flux for a good while. Currently, we first run source extraction plane-by-plane and then merge across planes in a separate step to create the final list of 3D neuron locations and activity time series. Meanwhile, we are also working on an inherently volumetric source extraction pipeline. So the most efficient way to organize the data during source extraction may change, and I think it is a somewhat separate problem from how we archive it. Thanks for the hint on the NWB Zarr backend; I'll look into it and decide based on that.

In general, I would expect that what 99% of data users will be interested in is not the monstrous 4D voxel dataset, but the output of the source extraction (locations, activities). And I'm very much inclined to keep that in a separate file anyway, to save people the trouble of dealing with a 150+ GB file of voxel data if all they want to access is a few MB of neuron activity time series, plus some metadata. So we will probably end up with more than one NWB file per dataset anyway, and I might just as well keep the planes separate too, to keep things a bit more lightweight at the file level.
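As far as I can tell, the Zarr route would go through hdmf-zarr's NWBZarrIO; a minimal sketch, assuming an already-built nwbfile object and that hdmf-zarr has been installed separately:

```python
# Minimal sketch of writing an NWBFile with the Zarr backend via hdmf-zarr.
# Assumes `nwbfile` was already built with PyNWB and `pip install hdmf-zarr`
# was run; the output path is a placeholder.
from hdmf_zarr.nwb import NWBZarrIO

with NWBZarrIO("example.nwb.zarr", mode="w") as io:
    io.write(nwbfile)
```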
-
Absolutely, thank you very much for your kind support.
-
Hi! We had a brief exchange a while back on whether Light Beads Microscopy data is suitable for DANDI. You offered to take a look at tuning compression parameters for our data back then.
Now I've finally gotten around to writing a preliminary implementation of an ImagingExtractorInterface that reads our (slightly preprocessed) 4D imaging data from .mat files and writes them into NWB. I've uploaded a 16 GB example dataset as an NWB file. This test dataset only contains three out of thirty z-planes, so it is about one tenth of a typical full dataset. Dimensions: (rows, cols, timesteps, z-planes) = (1213, 1188, 1200, 3).
Here's a snip from the output of h5ls showing dataset dimensions and compression ratio:

So, deflate is squeezing out about 20%. This data will always be pretty noisy, so I don't think that this is a particularly bad compression ratio for our data. But perhaps you'd have some time to look into this some more?
Let me know how I can help.
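For reference, roughly the same numbers can also be pulled out with h5py instead of h5ls; the file and dataset paths below are placeholders:

```python
# Rough sketch: report dataset shape, chunk shape, codec, and on-disk
# compression ratio with h5py, similar to the h5ls snip above.
import h5py
import numpy as np

with h5py.File("lbm_example.nwb", "r") as f:
    ds = f["acquisition/TwoPhotonSeries/data"]   # placeholder dataset path
    logical = int(np.prod(ds.shape)) * ds.dtype.itemsize
    on_disk = ds.id.get_storage_size()
    print(ds.shape, ds.dtype, ds.chunks, ds.compression)
    print(f"stored size is {on_disk / logical:.1%} of raw ({logical / on_disk:.2f}x ratio)")
```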