Encoded waveforms load slower than compressed waveforms #77
Comments
Forgot to mention that these files are the same size so there is no benefit to storage for using encoding over gzipping.
@gipert suggested trying this on a hard drive, so I tested it on NERSC. @gipert also suggested checking other compression filters, so I tested those as well.
Well, this cannot be...
Yes, I have also already noticed that ZigZag (ZZ) is significantly slower than David's code. I would originally have liked to use only Sigcompress, but it was then decided not to rescale the pre-summed waveform (i.e., its samples are of type uint32), and Sigcompress only works with 16-bit integers.
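For context, this is what the generic, textbook ZigZag mapping looks like (a sketch only, not necessarily the exact encoder implemented in legend-pydataobj): it interleaves signed integers into unsigned ones so that small magnitudes of either sign become small unsigned values.

```python
import numpy as np

def zigzag_encode(x):
    # Generic ZigZag: interleave signed int32 values into uint32 so that
    # small magnitudes of either sign map to small unsigned values.
    x = np.asarray(x, dtype=np.int32)
    return ((x << 1) ^ (x >> 31)).view(np.uint32)

def zigzag_decode(z):
    # Inverse mapping back to signed int32.
    z = np.asarray(z, dtype=np.uint32)
    return (z >> 1).astype(np.int32) ^ -(z & 1).astype(np.int32)
```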
Another thing: do you make sure you run the test scripts multiple times? After the first run the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clear the filesystem cache to ensure reproducibility (I forgot the command). Also make sure that the Numba cache is enabled, so it does not need to recompile after the first run!
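For reference, two practical details (assumptions about the setup, not the project's actual test code): on Linux the filesystem page cache can be dropped with `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`, and Numba's on-disk compilation cache is enabled per function via `cache=True`:

```python
import numba
import numpy as np

# With cache=True, the compiled machine code is written to disk, so a fresh
# Python session does not pay the JIT compilation cost again; otherwise the
# first call of each benchmark run is dominated by compilation, not I/O.
@numba.njit(cache=True)
def waveform_sum(wf):  # toy function, for illustration only
    total = 0.0
    for i in range(wf.size):
        total += wf[i]
    return total

waveform_sum(np.arange(1000, dtype=np.float64))  # first call compiles (and caches)
```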
If we confirm that LZF performance is this good, it could be a safe choice for the next re-processing of the raw tier. It's a well-known algorithm, so I'm sure it is well supported on the Julia side too.
I retried this, and it looks like I screwed up the test I made of … Here is my code: waveform_tests.txt. The default level of compression for …
I'm not sure that any caching is actually occurring, or whether something else is going on that I don't understand. Maybe the cache is getting cleared faster than I can read the files? This is the output from repeating the test three times in a simple loop; the read speeds of each loop are all nearly identical. It does look like the test I made of … I checked this by introducing a … I can perform some more rigorous tests if there are suggestions.
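For what it's worth, a minimal sketch of the kind of repeated-read loop described above (the file path and dataset name are placeholders; the actual script is in waveform_tests.txt):

```python
import time
import h5py

fname = "raw_file.lh5"                           # placeholder path
dset_name = "geds/raw/waveform_windowed/values"  # placeholder dataset

for trial in range(3):
    t0 = time.perf_counter()
    with h5py.File(fname, "r") as f:
        data = f[dset_name][...]                 # read the full dataset
    dt = time.perf_counter() - t0
    print(f"trial {trial}: {data.nbytes / 1e6:.1f} MB in {dt:.2f} s "
          f"({data.nbytes / 1e6 / dt:.1f} MB/s)")
```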
Is there something special I need to do to enable this? I am not very familiar with Numba; I thought this would be handled by whatever functions we have that use Numba.
Results from …
Results from …
The caching is always occurring, and the data is kept there until the cache is full (or I guess a significant amount of time has passed). Since the cache is shared among all processes and users, the time it takes for the file to be removed from the cache can vary a lot, but it's not particularly short, I'd say.
Caching is enabled by default in legend-pydataobj, unless overridden by the user. To conclude, let's re-discuss all of this when we are again close to re-generating the raw tier. We had to cut off the discussion last time because of time constraints, and we clearly did not study this well enough.
See https://legend-exp.atlassian.net/wiki/spaces/LEGEND/pages/991953524/Investigation+of+LH5+File+Structure+and+I+O.
After some profiling, I think the culprit here is awkward array. Here's a test of reading in a raw file:
It looks like the culprit is this line in …
This line is responsible for padding each vector to the same length and copying it into a numpy array. The problem is that this seems to involve iterating through the arrays in a slow, Python-level way. I think it could be made a lot quicker by constructing the numpy array first and then copying into it; doing this with Numba would be very fast. Awkward Array may also have a much more optimized way of doing this.
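A sketch of the pre-allocate-then-copy idea, assuming the jagged data is available as a flattened sample array plus cumulative lengths (the layout used by lgdo's VectorOfVectors); this is illustrative, not the actual lgdo code:

```python
import numba
import numpy as np

@numba.njit(cache=True)
def _fill_padded(out, flattened_data, cumulative_length):
    # Copy each vector into its row of the pre-allocated output array.
    start = 0
    for i in range(cumulative_length.size):
        stop = cumulative_length[i]
        out[i, : stop - start] = flattened_data[start:stop]
        start = stop
    return out

def jagged_to_padded(flattened_data, cumulative_length):
    """Pad a jagged array into a rectangular, zero-padded numpy array."""
    lengths = np.diff(cumulative_length, prepend=0)
    out = np.zeros((len(cumulative_length), int(lengths.max())),
                   dtype=flattened_data.dtype)
    return _fill_padded(out, flattened_data, cumulative_length)
```

Alternatively, Awkward Array's built-in route would be something like `ak.to_numpy(ak.fill_none(ak.pad_none(arr, target_length, axis=1), 0))`, which may already avoid the Python-level loop.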
I was asked to look into this again. Here's the update and recommendation. FYI, I did not look into @iguinn's previous comment. Code: test_compression_nersc.ipynb.txt

Presently, the raw tier germanium waveforms are compressed using custom encoding algorithms: radware-sigcompress for waveform_windowed and ZigZag for waveform_presummed. These custom algorithms achieve a good compression ratio (approximately a factor of 6 reduction in size) but are very slow. The remaining datasets are encoded with GZIP by default. HDF5 / h5py / HDF5.jl are distributed with standard compression filters (GZIP, SZIP, LZF, LZ4, etc.); we have compared the performance and compression ratio of these standard filters with our custom encoders.

Based on this study, we recommend changing the default compression algorithm to LZF and using this filter for all data, including germanium waveforms. We note that these are drop-in changes, and the typical analyst will see no impact other than a speed increase when accessing the raw waveforms.

These recommendations are based on the evaluation of two metrics: 1) compression ratio and 2) decompression speed. We are not concerned with the compression speed of the raw tier data, since it is infrequently generated. Tests were conducted on NERSC using the dvs_ro file system.

Compression ratio: LZF compresses worse than our custom encoders and increases the overall file size on disk by about 50%. However, coupled with other proposed changes to save disk space, we can accommodate this increase in size.

Decompression speed: LZF is about 7 times faster than ZigZag for waveform_presummed and about 3 times faster than radware-sigcompress for waveform_windowed. This is a large improvement and will aid in analyses and processing where waveforms must be repeatedly loaded from disk. There is only marginal improvement in performance (if any) on the remaining datasets.
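To make the recommendation concrete at the h5py level (the data, dataset path, and chunk shape below are placeholders, not the actual raw-tier schema):

```python
import h5py
import numpy as np

# Placeholder data: 1000 windowed waveforms of 1000 samples each.
waveforms = np.random.default_rng(0).integers(
    0, 2**16, size=(1000, 1000), dtype=np.uint16
)

with h5py.File("raw_lzf_example.lh5", "w") as f:
    # LZF ships with h5py (no external plugin needed) and trades some
    # compression ratio for much faster decompression than GZIP.
    f.create_dataset(
        "geds/raw/waveform_windowed/values",  # placeholder dataset path
        data=waveforms,
        compression="lzf",
        chunks=(100, 1000),                   # chunk shape for illustration only
    )
```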
Answering some questions from Slack and posting here for posterity. Slides: 250115_compression_settings-1.pdf. This proposal around compression settings does not touch on changing the structure of .lh5 files.
LZF file
I didn't specify the chunking for the LZF file, so it was chosen automatically by h5py/HDF5. In past testing, we have not observed chunking to have a particular impact unless the chunks are far too small or far too large; this is probably a reasonable size. It should not have much impact on accessing only parts of the dataset, and I would suspect it is faster, since it only has to access 1 dataset to get the values out instead of 3.
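For completeness, the automatically chosen chunk layout can be inspected directly (file and dataset names are placeholders, continuing the earlier example):

```python
import h5py

with h5py.File("raw_lzf_example.lh5", "r") as f:
    dset = f["geds/raw/waveform_windowed/values"]  # placeholder dataset path
    print("chunks:     ", dset.chunks)       # auto-chosen chunk shape, or None if contiguous
    print("compression:", dset.compression)  # e.g. "lzf"
```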
So, as I remember and have re-investigated, there are several different behaviors in lgdo depending on what arguments are provided and what data is requested. They have dramatically different reading times; this is a problem with lgdo and how it handles defaults, and the behavior should probably be unified. These access patterns are …
I have updated the slides to show these different scenarios; the color coding is somewhat arbitrary. Note that accessing 1 random waveform (or a random selection of waveforms) in lgdo presently requires reading the entire column into memory and then slicing it to get the ones you want (or you can use the HDF5 default behavior via the h5_idx flag). You can't read only 1 random row from disk; HDF5 does not support this, as far as I remember.

You can quickly see how this becomes extremely non-performant if you go to even 10 random rows, as I show in my slides. This is the reason for the PR from #29, where we sacrificed 30% in speed by default to avoid a 100x slow-down for users who access more than a few random rows. Advanced users who know what they are doing can use the h5_idx flag when accessing a contiguous block starting at the first row. Anyway, the custom encoders do perform better than the other filters when you are accessing a relatively small number of random rows (<100s).
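A sketch of the two access patterns described above, written directly against h5py since lgdo ultimately issues the same HDF5 reads (file path and dataset name are placeholders):

```python
import time
import h5py
import numpy as np

def compare_random_row_reads(path, dset_name, idx):
    # HDF5 fancy selections must be increasing and unique.
    idx = np.unique(np.asarray(idx))

    with h5py.File(path, "r") as f:
        dset = f[dset_name]

        t0 = time.perf_counter()
        rows_a = dset[...][idx]   # read the whole column, then slice in memory
        t1 = time.perf_counter()
        rows_b = dset[idx]        # let HDF5 select the rows (h5_idx-style)
        t2 = time.perf_counter()

    print(f"read-all-then-slice: {t1 - t0:.3f} s")
    print(f"HDF5 row selection:  {t2 - t1:.3f} s")
    return rows_a, rows_b
```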
Also, I really don't have time to keep working on this; these investigations take a long time to do in detail, since the behavior is complicated and deeply ingrained in how HDF5, LH5, and Awkward work and interact. My findings are largely the same as they were 10 months ago (#77), and the code is all posted there if other people want to look further. If we have more person-hours to devote to this, the right thing is probably to figure out how to convert awkward arrays to numpy arrays faster, or to stop using awkward arrays for the encoded waveforms. If we don't have more person-hours, the simplest thing is to switch to a different compression algorithm and accept the larger data.
See also legend-exp/legend-dataflow#73.
Encoded waveforms load 2-3x slower than gzipped waveforms. If I convert waveforms to gzipped instead of encoded (see #76), they load 2-3x faster.
gives