
Encoded waveforms load slower than compressed waveforms #77

Closed
lvarriano opened this issue Mar 14, 2024 · 11 comments · Fixed by #131
Labels: compression (Waveform Compression), performance (Code performance)

lvarriano (Contributor) commented Mar 14, 2024

Encoded waveforms load 2-3x slower than gzipped waveforms. If I convert the waveforms to be gzipped instead of encoded (see #76), they load 2-3x faster.

import lgdo
import h5py
import time

store = lgdo.lh5.LH5Store()

input_file = "/home/lv/Documents/uw/l200/l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5"
output_file = "output.lh5"

ch_list = lgdo.lh5.ls(input_file)[2:] # skip FCConfig and OrcaHeader
geds_list = []
with h5py.File(input_file, mode='r') as f:
    for ch in ch_list:
        if 'waveform_windowed' in f[ch]['raw'].keys():
            geds_list.append(ch)

# copy data
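# (note: store.write applies lgdo's default HDF5 dataset settings, which include gzip
#  compression, so the decoded waveforms end up gzipped in the output file)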
for ch in ch_list:
    chobj, _ = store.read(f'{ch}/raw/', input_file)
    store.write(chobj, 'raw', output_file, f'{ch}/')

# read radware-sigcompress encoded waveform_windowed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_windowed', input_file)
print('read radware-sigcompress encoded waveform_windowed ', time.time() - start)

# read gzipped waveform_windowed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_windowed', output_file)
print('read gzipped waveform_windowed ', time.time() - start)

# read ZigZag encoded waveform_presummed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_presummed', input_file)
print('read ZigZag encoded waveform_presummed ', time.time() - start)

# read gzipped waveform_presummed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_presummed', output_file)
print('read gzipped waveform_presummed ', time.time() - start)

gives

read radware-sigcompress encoded waveform_windowed  9.676453590393066
read gzipped waveform_windowed  6.671638011932373
read ZigZag encoded waveform_presummed  14.28751802444458
read gzipped waveform_presummed  5.223653078079224
lvarriano (Contributor, Author) commented

Forgot to mention that these files are the same size, so there is no storage benefit to using encoding over gzipping.

-rw-r--r-- 1 lv lv 1.6G Mar  8 15:45 ../l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5
-rw-rw-r-- 1 lv lv 1.6G Mar 14 09:00 output.lh5

gipert added the compression (Waveform Compression) and performance (Code performance) labels on Mar 14, 2024
lvarriano (Contributor, Author) commented Mar 15, 2024

@gipert suggested trying this on a hard drive, so I tested it on NERSC for a phy raw file. The results were a little different, with gzip being only marginally faster than radware-sigcompress, though gzip is 2x faster than ZigZag encoding. However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why). So I guess the lesson for me is to always check stuff on a hard drive.

@gipert also suggested checking other compression filters. I tested the lzf compression algorithm on NERSC (and on my laptop to double-check). This filter performs even better than gzip in terms of both compression and speed, which seems somewhat surprising, but I don't know what this filter is designed for. On NERSC (and my laptop), a file written with lzf takes up 2.1 GB compared to the original 1.6 GB. But the read speed is amazingly better!

read radware-sigcompress encoded waveform_windowed  19.01413583755493
read lzf waveform_windowed  5.321168899536133
read ZigZag encoded waveform_presummed  25.50418496131897
read lzf waveform_presummed  3.917027235031128

@lvarriano lvarriano changed the title Encoded waveforms load 2-3x slower than gzipped waveforms Encoded waveforms load 2-3x slower than compressed waveforms Mar 18, 2024
@lvarriano lvarriano changed the title Encoded waveforms load 2-3x slower than compressed waveforms Encoded waveforms load slower than compressed waveforms Mar 18, 2024
gipert (Member) commented Mar 19, 2024

However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why)

Well, this cannot be...

gzip is 2x faster than ZigZag encoding.

Yes, I have also already noticed that ZZ is significantly slower than David's code. I would have originally liked to use only Sigcompress, but it was then decided not to rescale the pre-summed waveform (i.e. its samples are of uint32 type), and Sigcompress only works with 16-bit integers.

I tested the lzf compression algorithm on NERSC (and on my laptop to double check.) This filter has even better performance than gzip both in terms of compression and speed, which seems somewhat surprising but I don't know what this filter is designed for.

In principle, lzf is optimized for speed and compresses less than GZip (at, I guess, a standard compression level of 5). I'm surprised that it compresses more in our case, but it could be something special about our waveforms (?)

Another thing: do you make sure to run the test scripts multiple times? After the first time the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clear the filesystem cache to ensure reproducibility (I forgot the command). Also make sure that the Numba cache is enabled, so it does not need to recompile after the first time!
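(For reference, the usual way to do this on Linux is the drop_caches interface; a minimal sketch, assuming a local Linux machine with sudo rights, so not directly usable on NERSC:)

import subprocess

# flush dirty pages, then drop the page cache so the next read really comes from disk
subprocess.run("sync && echo 3 | sudo tee /proc/sys/vm/drop_caches",
               shell=True, check=True)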

gipert (Member) commented Mar 19, 2024

If we confirm that LZF performance is so good, it could be a safe choice for the next re-processing of the raw tier. It's a well-known algorithm, so I'm sure it's well supported on the Julia side too.

lvarriano (Contributor, Author) commented

However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why)

Well, this cannot be...

I retried this and it looks like I messed up the gzip test on NERSC. What probably happened is that I accidentally copied the file twice and the rows were appended (hence why the file size was 2x and why the read speed was 2x longer than what I had previously seen). I redid this and the results are in line with what I originally found on my laptop. I have attached my code if anyone else wants to reproduce it (recommended). So both gzip and lzf seem faster than the encoding schemes we currently use.

Here is my code: waveform_tests.txt

The default compression level for gzip in h5py is 4 (https://docs.h5py.org/en/stable/high/dataset.html#dataset-compression). I don't know how much faster it might be to use a lower level of compression.
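For what it's worth, the level can be varied directly in h5py; a minimal sketch on synthetic data (file names and dataset shape made up), comparing the resulting file sizes:

import os
import h5py
import numpy as np

# synthetic noise-like 32-bit "waveforms", only to compare gzip levels on identical input
data = np.cumsum(np.random.randint(-5, 6, size=(2936, 1024)), axis=1).astype(np.int32)

for level in (1, 4, 9):
    fname = f"gzip_level{level}.lh5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("values", data=data, compression="gzip",
                         compression_opts=level, shuffle=True, chunks=True)
    print(level, round(os.path.getsize(fname) / 1e6, 1), "MB")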

Another thing: do you make sure you run the test scripts multiple times? After the first time the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clean the filesystem cache to ensure reproducibility (I forgot the command).

I'm not sure whether any caching is actually occurring, or whether something else is going on that I don't understand. Maybe the cache is getting cleared faster than I can read the files? This is the output from repeating the test three times in a simple loop. The read speeds of each loop are nearly identical. It does look like the lzf test with waveform_windowed sped up on the second and third loops.

I checked this by introducing a sleep of 2 minutes between for loops, assuming that this would be enough time for the cache to be cleared (?). I got the ~same results as I show below, roughly 6 seconds (5.4 s, 6.0 s, 7.5 s) to read the waveform_windowed with lzf.

I can perform some more rigorous tests if there are suggestions.

Also make sure that Numba cache is enabled, so it does not need to recompile after the first time!

Is there something special I need to do to enable this? I am not very familiar with Numba. I thought this would be handled by whatever functions we have that are using Numba.

Results from gzip compression on NERSC (all Ge channels for one phy raw file, ~3000 events).

removed /global/u1/v/varriano/legend/users/varriano/output_gzip.lh5
output file size: 1674.144667 MB
read radware-sigcompress encoded waveform_windowed  19.97586750984192
read gzipped waveform_windowed  10.338812589645386
read ZigZag encoded waveform_presummed  26.18614959716797
read gzipped waveform_presummed  7.355187892913818


read radware-sigcompress encoded waveform_windowed  19.864821910858154
read gzipped waveform_windowed  10.373951196670532
read ZigZag encoded waveform_presummed  26.007609128952026
read gzipped waveform_presummed  7.246658086776733


read radware-sigcompress encoded waveform_windowed  20.125010013580322
read gzipped waveform_windowed  10.180893898010254
read ZigZag encoded waveform_presummed  26.434375047683716
read gzipped waveform_presummed  6.846473217010498

Results from lzf compression on NERSC (all Ge channels for one phy raw file, ~3000 events).

removed /global/u1/v/varriano/legend/users/varriano/output_lzf.lh5
output file size: 2184.058997 MB
read radware-sigcompress encoded waveform_windowed  20.135162115097046
read lzf waveform_windowed  9.604149341583252
read ZigZag encoded waveform_presummed  26.367269039154053
read lzf waveform_presummed  4.854575872421265


read radware-sigcompress encoded waveform_windowed  20.017733097076416
read lzf waveform_windowed  6.0519983768463135
read ZigZag encoded waveform_presummed  26.321038007736206
read lzf waveform_presummed  4.818136930465698


read radware-sigcompress encoded waveform_windowed  20.06731939315796
read lzf waveform_windowed  5.865992546081543
read ZigZag encoded waveform_presummed  26.535961389541626
read lzf waveform_presummed  4.388535737991333

gipert (Member) commented Mar 20, 2024

I'm not sure that there is actually any caching occurring or something else is going on with it that I don't understand. Maybe the cache is getting cleared faster than I can read the files?

The caching is always occurring, and the data is kept there until the cache is full (or I guess a significant amount of time has passed). Since the cache is shared among all processes and users, the time it takes for the file to be removed from the cache can vary a lot, but it's not particularly short, I'd say.

Is there something special I need to do to enable this? I am not very familiar with Numba.

Numba caching is enabled by default in legend-pydataobj, unless overridden by the user.
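(As a generic illustration, not the lgdo source: with cache=True Numba writes the compiled machine code to disk, so only the very first run in a fresh environment pays the JIT compilation cost. A minimal sketch using a ZigZag-style decode:)

import numba

@numba.njit(cache=True)
def zigzag_decode(x):
    # ZigZag maps the unsigned encoding back to signed: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (x >> 1) ^ -(x & 1)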

To conclude, let's re-discuss all this when we are again close to re-generating the raw tier. We had to cut off the discussion last time because of time constraints, and we clearly did not study this well enough.

lvarriano (Contributor, Author) commented May 11, 2024

See https://legend-exp.atlassian.net/wiki/spaces/LEGEND/pages/991953524/Investigation+of+LH5+File+Structure+and+I+O.
script: test_compression_encoding.py.txt

input file size (encoded waveforms): 1695.61888 MB

no compression/encoding output file size: 9609.928534 MB
write time:  61.316158294677734
GZIP output file size: 1674.927994 MB
write time:  127.61616349220276
SZIP output file size: 1742.706066 MB
write time:  115.50626015663147
LZF output file size: 2184.357187 MB
write time:  75.75087308883667
LZ4 output file size: 2287.242122 MB
write time:  63.4920859336853
LZ4_nbytes4096 output file size: 2355.047048 MB
write time:  63.97322130203247

read no compression waveform_windowed  2.5585856437683105
read radware-sigcompress encoded waveform_windowed  9.69769835472107
read GZIP waveform_windowed  6.3117835521698
read SZIP waveform_windowed  12.409362554550171
read LZF waveform_windowed  3.7812297344207764
read LZ4 waveform_windowed  3.2639217376708984
read LZ4_nbytes4096 waveform_windowed  3.3276045322418213

read no compression waveform_presummed  1.6638846397399902
read ZigZag encoded waveform_presummed  13.844745635986328
read GZIP waveform_presummed  4.550369501113892
read SZIP waveform_presummed  10.218478441238403
read LZF waveform_presummed  2.895991086959839
read LZ4 waveform_presummed  2.7338368892669678
read LZ4_nbytes4096 waveform_presummed  2.821821689605713

read no compression file all else  3.2617311477661133
read input file all else  3.370767116546631
read GZIP file all else  3.445415735244751
read SZIP file all else  3.6045446395874023
read LZF file all else  3.429582357406616
read LZ4 all else  3.3269314765930176
read LZ4_nbytes4096 all else  3.4018807411193848

iguinn (Contributor) commented Jul 29, 2024

After some profiling, I think the culprit here is awkward array. Here's a test of reading in a raw file:

         85086 function calls (84980 primitive calls) in 6.409 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    6.409    6.409 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/core.py:18(read)
      9/1    0.001    0.000    6.403    6.403 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/composite.py:41(_h5_read_lgdo)
      3/1    0.002    0.001    6.402    6.402 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/composite.py:264(_h5_read_table)
        2    0.013    0.006    6.387    3.193 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/encoded.py:24(_h5_read_array_of_encoded_equalsized_arrays)
        2    0.000    0.000    6.374    3.187 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/encoded.py:44(_h5_read_encoded_array)
        2    0.000    0.000    6.234    3.117 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/compression/generic.py:42(decode)
        2    0.117    0.059    5.245    2.622 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/types/vectorofvectors.py:531(to_aoesa)
    18/10    0.000    0.000    5.002    0.500 /opt/anaconda3/lib/python3.11/site-packages/awkward/_dispatch.py:35(dispatch)
    30/20    0.000    0.000    5.001    0.250 {built-in method builtins.next}
        4    0.000    0.000    4.018    1.004 /opt/anaconda3/lib/python3.11/site-packages/awkward/operations/ak_fill_none.py:17(fill_none)
        2    0.000    0.000    4.018    2.009 /opt/anaconda3/lib/python3.11/site-packages/awkward/operations/ak_fill_none.py:71(_impl)
...

It looks like this line in VectorOfVectors.to_aoesa is causing the slowdown (to_aoesa is taking ~5 s out of ~6 s, and awkward.fill_none is responsible for ~4 s of these ~5 s):

            nda = ak.fill_none(
                ak.pad_none(ak_arr, max_len, clip=True), fill_val
            ).to_numpy(allow_missing=False)

This line is responsible for padding each vector to the same length and copying it into a numpy array. The problem is that this involves iterating through the arrays in a slow, Python-level way, it seems. I think it could be made a lot quicker by constructing the numpy array first and then copying into it. Doing this with numba would be very fast. Awkward array may also have a much more optimized way of doing this.
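A rough sketch of that idea (not the library's implementation; it assumes the usual VectorOfVectors layout of a flattened_data array plus a cumulative_length array):

import numba
import numpy as np

@numba.njit(cache=True)
def _fill_padded(flattened_data, cumulative_length, fill_val, out):
    # out is pre-allocated with shape (n_vectors, max_len); copy each vector in,
    # clipping to max_len and padding the remainder with fill_val
    max_len = out.shape[1]
    start = 0
    for i in range(len(cumulative_length)):
        stop = cumulative_length[i]
        n = min(stop - start, max_len)
        out[i, :n] = flattened_data[start : start + n]
        out[i, n:] = fill_val
        start = stop

def vov_to_aoesa(flattened_data, cumulative_length, max_len, fill_val=0):
    out = np.empty((len(cumulative_length), max_len), dtype=flattened_data.dtype)
    _fill_padded(flattened_data, cumulative_length, fill_val, out)
    return out

# example: vectors [1, 2, 3], [4], [5, 6] padded to width 4
fd = np.array([1, 2, 3, 4, 5, 6], dtype=np.int32)
cl = np.array([3, 4, 6], dtype=np.uint32)
print(vov_to_aoesa(fd, cl, 4))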

lvarriano (Contributor, Author) commented
I was asked to look into this again. Here's the update and recommendation. FYI, I did not look into @iguinn's previous comment.

Code: test_compression_nersc.ipynb.txt
Slides: 250115_compression_settings.pdf

Presently, the raw tier germanium waveforms are compressed using custom encoding algorithms: radware-sigcompress for waveform_windowed and ZigZag for waveform_presummed. These custom algorithms achieve a good compression ratio (approximately a factor of 6 reduction in size) but are very slow. The remaining datasets are encoded with GZIP by default. HDF5 / h5py / HDF5.jl are distributed with standard compression filters (GZIP, SZIP, LZF, LZ4, etc.); we have compared the performance and compression ratio of these standard distributed filters with our custom encoders. Based on this study, we make the following recommendations:

We recommend changing the default compression algorithm to LZF and to use this filter for all data, including germanium waveforms.
We recommend re-encoding previously taken ref-raw data with the LZF filter.

We note that these are drop-in changes, and the typical analyst will see no impact other than a speed increase when accessing the raw waveforms. These recommendations are based on the evaluation of two metrics: 1) compression ratio and 2) decompression speed. We are not concerned with the compression speed of the raw tier data since it is infrequently generated. Tests were conducted on NERSC using the dvs_ro file system.

Compression ratio - The compression ratio of LZF is worse than that of our custom encoders and increases the overall file size on disk by about 50%. However, coupled with other proposed changes to save disk space, we can accommodate this increase in size.

Decompression speed - LZF is about 7 times faster than ZigZag for waveform_presummed and about 3 times faster than radware-sigcompress for waveform_windowed. This is a large improvement and will aid in analyses and processing where waveforms must be repeatedly loaded from disk. There is only marginal improvement in performance (if any) on the remaining datasets.
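As a standalone illustration of the kind of comparison behind these numbers, a minimal sketch using plain h5py on synthetic waveforms (real raw-tier files will of course behave differently):

import time
import h5py
import numpy as np

# synthetic noise-like 32-bit waveforms as stand-ins for the real raw-tier data
wf = np.cumsum(np.random.randint(-5, 6, size=(2936, 1024)), axis=1).astype(np.int32)

for comp in ("gzip", "lzf"):
    fname = f"test_{comp}.lh5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("values", data=wf, compression=comp, shuffle=True, chunks=True)
    start = time.time()
    with h5py.File(fname, "r") as f:
        values = f["values"][...]
    print(comp, "read time:", time.time() - start)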

lvarriano (Contributor, Author) commented

Answering some questions from Slack and posting here for posterity.

slides: 250115_compression_settings-1.pdf
code: test_compression_nersc.ipynb.txt

This proposal around compression settings does not touch on changing the structure of .lh5 files.

  1. The LZ4 filter is distributed through the hdf5plugin package for Python or HDF5.jl for Julia, as Florian says. These are standard packages.
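A minimal usage sketch, as I understand the hdf5plugin API (file name and array shape made up; importing hdf5plugin registers the extra filters with HDF5, and its filter classes expand to the right create_dataset keywords):

import h5py
import hdf5plugin
import numpy as np

data = np.random.randint(0, 2**14, size=(2936, 1024), dtype=np.int32)
with h5py.File("output_lz4.lh5", "w") as f:
    f.create_dataset("values", data=data, shuffle=True, chunks=True,
                     **hdf5plugin.LZ4(nbytes=0))  # nbytes=0 lets the filter choose the block size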

  2. I don't know how to test build_dsp() in an I/O-dominated regime - someone else will have to do that. Also, note that in the times I am reporting I am ignoring the time it takes to get the file over the network and working only with the cached file. So again, this is mostly relevant for repeated access of the file. (Including transfer time, LZF is still the fastest, followed by GZIP, with the original input file slowest.)

  3. Here is the structure of the chunking:
    input file

├── waveform_presummed · table{t0,dt,values} 
│   ├── dt · array<1>{real} ── {'units': 'ns'}
│   |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
│   ├── t0 · array<1>{real} ── {'units': 'ns'}
│   |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
│   └── values · array_of_encoded_equalsized_arrays<1,1>{real} ── {'codec': 'uleb128_zigzag_diff'}
│       ├── decoded_size · real 
│       |       dtype=int64, shape=(), nbytes=8.0 B, numchunks=contiguous, filters=None
│       └── encoded_data · array<1>{array<1>{real}} 
│           ├── cumulative_length · array<1>{real} 
│           |       dtype=uint32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
│           └── flattened_data · array<1>{real} 
│                   dtype=uint8, shape=(4434558,), nbytes=4.2 MB, numchunks=128, chunkshape=(34645,), filters=(shuffle,)
└── waveform_windowed · table{t0,dt,values} 
    ├── dt · array<1>{real} ── {'units': 'ns'}
    |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
    ├── t0 · array<1>{real} ── {'units': 'ns'}
    |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
    └── values · array_of_encoded_equalsized_arrays<1,1>{real} ── {'codec': 'radware_sigcompress', 'codec_shift': -32768.0}
        ├── decoded_size · real 
        |       dtype=int64, shape=(), nbytes=8.0 B, numchunks=contiguous, filters=None
        └── encoded_data · array<1>{array<1>{real}} 
            ├── cumulative_length · array<1>{real} 
            |       dtype=uint32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,deflate,)
            └── flattened_data · array<1>{real} 
                    dtype=uint8, shape=(3413108,), nbytes=3.3 MB, numchunks=128, chunkshape=(26665,), filters=(shuffle,)

LZF file

├── waveform_presummed · table{t0,dt,values} 
│   ├── dt · array<1>{real} ── {'units': 'ns'}
│   |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,lzf,)
│   ├── t0 · array<1>{real} ── {'units': 'ns'}
│   |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,lzf,)
│   └── values · array_of_equalsized_arrays<1,1>{real} ── {'codec': 'uleb128_zigzag_diff'}
│           dtype=int32, shape=(2936, 1024), nbytes=11.5 MB, numchunks=256, chunkshape=(184, 64), filters=(shuffle,lzf,)
└── waveform_windowed · table{t0,dt,values} 
    ├── dt · array<1>{real} ── {'units': 'ns'}
    |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,lzf,)
    ├── t0 · array<1>{real} ── {'units': 'ns'}
    |       dtype=float32, shape=(2936,), nbytes=11.5 kB, numchunks=1, chunkshape=(2936,), filters=(shuffle,lzf,)
    └── values · array_of_equalsized_arrays<1,1>{real} ── {'codec': 'radware_sigcompress', 'codec_shift': -32768.0}
            dtype=int32, shape=(2936, 1400), nbytes=15.7 MB, numchunks=512, chunkshape=(92, 88), filters=(shuffle,lzf,)

I didn't specify the chunking for the LZF file so it was chosen automatically by h5py/HDF5. In past testing, we have not observed it to have a particular impact unless it is way too small or way too large. This is probably a reasonable size. It should not have much impact on accessing only parts of the dataset, and I would suspect it is faster since it only has to access 1 dataset to get out the values instead of 3 datasets.
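(For completeness, the chunk shape can also be fixed by hand at write time; a minimal h5py sketch with placeholder data and the chunk shape from the dump above:)

import h5py
import numpy as np

values = np.zeros((2936, 1024), dtype=np.int32)  # placeholder data
with h5py.File("output_lzf.lh5", "w") as f:
    f.create_dataset("waveform_presummed/values", data=values,
                     chunks=(184, 64),  # set explicitly instead of letting h5py guess
                     compression="lzf", shuffle=True)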

  4. Luigi asked about loading 1 waveform - this is a tricky issue that touches on a number of other performance issues that I haven't thought about in ~10 months. Note that I am working in Python with lgdo and don't know how things may have changed over that time - I'm using the version in the legendsw container. I thought at one point Ian was working on improving random access performance by skipping some of the h5py calls. It looks like the changes I'm most interested in have been made in https://github.com/legend-exp/legend-pydataobj/blob/main/src/lgdo/lh5/_serializers/read/ndarray.py.

So, as I remember and have re-investigated, there are several different behaviors in lgdo depending upon what arguments are provided and what data is requested. They have dramatically different reading times - this is a problem with lgdo and how it handles defaults, and the behavior should probably be unified. These access patterns are:

  1. accessing the whole data without using the idx or n_rows parameters
  2. accessing a row-contiguous subset of the data (i.e. providing n_rows or idx when idx is a contiguous range starting at 0, e.g. [0,1,2,...])
  3. accessing a random waveform in the middle, e.g. with idx
    4.-6. the same as 1-3 but using the flag h5_idx = True (see #29, "read_object() with idx parameter is slow")

I have updated the slides to show these different scenarios. The color coding is kind of arbitrary. Note that accessing 1 random waveform (or a random selection of waveforms) in lgdo presently requires reading the entire column into memory and then slicing it to get the ones you want. (Or you can use the HDF5 default behavior via the h5_idx flag.) You can't read only 1 random row from disk - HDF5 does not support this as far as I remember. You can quickly see how this becomes extremely non-performant if you go to even 10 random rows, as I show in my slides. This is the reason for the PR from #29, where we sacrificed 30% in speed by default to avoid a 100x slow-down for users who access more than a few random rows. Advanced users who know what they are doing can use the h5_idx flag when accessing a contiguous block starting at the first row.

Anyway, the custom encoders do perform better than other filters when you are accessing a relatively small number of random rows (<100s).
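To make the default-path access patterns above concrete, a rough sketch (the channel name is hypothetical; keyword names are the ones used earlier in this thread):

import lgdo

store = lgdo.lh5.LH5Store()
f = "l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5"
wf = "ch1234567/raw/waveform_windowed"  # hypothetical channel name

# 1. whole column
obj, n = store.read(wf, f)

# 2. contiguous block starting at row 0
obj, n = store.read(wf, f, n_rows=100)

# 3. arbitrary rows: by default the whole column is read, then sliced in memory
obj, n = store.read(wf, f, idx=[5, 1021, 2800])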

  5. Ian asked about the blosc2 filter with LZ4 and DEFLATE and clevel=9 (https://hdf5plugin.readthedocs.io/en/stable/usage.html). This is ~30-50% faster to decompress than LZF or LZ4. The compression ratio is a little worse (62% increase in file size). The compression write speed is quite slow, 2x slower than LZF, but this doesn't matter so much.

  6. I looked at the performance during decoding, as Ian did previously (see his comment above in this issue, #77). It still looks like converting the VoV to an awkward array and then to an AoESA is the problem, specifically this line: https://github.com/legend-exp/legend-pydataobj/blob/main/src/lgdo/types/vectorofvectors.py#L568. However, that also seems to be how the internet suggests transforming awkward arrays into numpy arrays, so I have no idea whether that behavior can be changed.

  7. Ian suggested using the delta-rice algorithm (https://www.theoj.org/joss-papers/joss.06598/10.21105.joss.06598.pdf, https://github.com/david-mathews-1994/deltarice). However, this algorithm only supports 16-bit integers, and we presently store 32-bit integers for the waveforms. I think that comes from the presumming - if the algorithm is amazing at compression and we can store the whole waveform without presumming, then that could be an option. Then the processing would either take longer, or we would presum on the fly.
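(Presumming on the fly is itself a cheap numpy operation; a toy sketch, assuming a presum factor of 4, which also shows why the presummed samples outgrow 16 bits:)

import numpy as np

# toy 16-bit traces; summing groups of 4 samples can exceed the uint16 range,
# which is why the presummed waveform is stored as 32-bit integers
wf = np.random.randint(0, 2**14, size=(2936, 8192), dtype=np.uint16)
presum = 4
wf_presummed = wf.reshape(wf.shape[0], -1, presum).sum(axis=2, dtype=np.uint32)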

Also, I really don't have time to keep working on this - it takes a lot of time to do these investigations in detail since the behavior is so complicated and deeply ingrained in how HDF5, LH5, and awkward work and interact. My findings are largely the same as I had 10 months ago (#77) and the code is all posted there if other people want to look further. If we have more person-hours to devote to this, probably the right thing is to figure out how to convert awkward arrays to numpy arrays faster or stop using awkward arrays for the encoded waveforms. If we don't have more person-hours, the simplest thing is to switch to a different compression algorithm and accept the larger data.

lvarriano (Contributor, Author) commented

See also legend-exp/legend-dataflow#73.
