Dat file compression #57

Closed · cx1111 opened this issue Jun 2, 2017 · 6 comments

cx1111 commented Jun 2, 2017

This may be a feature for way down the line.

Incorporate some form of file compression for dat files. The first level is to apply generic compression to save space, at the cost of slower load times. The next level would be to come up with wfdb's own compression scheme that also speeds up loading.

alistairewj commented

One relevant question: is on-the-fly decompression strictly sequential? It would be nice if, for an rdsamp call that reads up to sample N, you only had to decompress the stream up to sample N.
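For the common stream codecs the answer appears to be yes: gzip, bz2, and zstd all decompress in stream order, so an rdsamp-style reader could stop once it has produced sample N. A minimal sketch with gzip, assuming a hypothetical gzip-compressed dat file of little-endian int16 frames (not wfdb's actual API):

```python
import gzip

import numpy as np

def read_first_n(path, n_sig, n_samples):
    """Read only the first n_samples frames of a gzip-compressed dat
    file. gzip decompresses strictly in stream order, so only the
    prefix of the compressed file is ever decompressed."""
    n_bytes = n_samples * n_sig * 2  # 2 bytes per int16 sample
    with gzip.open(path, "rb") as f:
        raw = f.read(n_bytes)  # stops decompressing after n_bytes
    return np.frombuffer(raw, dtype="<i2").reshape(-1, n_sig)
```

The flip side is that reading a segment *starting* at sample M still requires decompressing everything before M, unless the stream is chunked into independently compressed blocks.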


cx1111 commented Mar 17, 2018

I ran some basic benchmarks on the dat files of the first 50 patients in the MIMIC-III matched waveform database, on my machine. This is in-memory compression/decompression, so these times exclude disk I/O and are faster than compressing to disk would be.

| fmt  | compress_level | n_files | uncompressed_total | compressed_total | compression_ratio | time_compress (s) | time_decompress (s) |
|------|----------------|---------|--------------------|------------------|-------------------|-------------------|---------------------|
| bz2  | 1              | 3026    | 9.86 G             | 5.04 G           | 1.96              | 1963.81           | 782.26              |
| bz2  | 9              | 3026    | 9.86 G             | 4.67 G           | 2.11              | 2232.27           | 1253.27             |
| gzip | 1              | 3026    | 9.86 G             | 6.19 G           | 1.59              | 486.09            | 186.93              |
| gzip | 9              | 3026    | 9.86 G             | 5.86 G           | 1.68              | 1434.51           | 175.94              |
| lz4  | 0              | 3026    | 9.86 G             | 8.42 G           | 1.17              | 57.10             | 23.51               |
| lz4  | 16             | 3026    | 9.86 G             | 6.73 G           | 1.47              | 1922.50           | 25.74               |
| zstd | 1              | 3026    | 9.86 G             | 6.40 G           | 1.54              | 106.53            | 52.95               |
| zstd | 22             | 3026    | 9.86 G             | 5.01 G           | 1.97              | 12940.08          | 93.34               |

All of these libraries have Python bindings, and bz2 and gzip are built into the CPython standard library.
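For reference, a minimal sketch of this kind of in-memory benchmark; lz4 and zstandard are third-party packages, and the dat file name is a hypothetical placeholder:

```python
import bz2
import gzip
import time

import lz4.frame  # pip install lz4
import zstandard  # pip install zstandard

def bench(data, name, compress, decompress):
    """Compress and decompress a byte string in memory, reporting the
    compression ratio and both timings."""
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data
    print(f"{name}: ratio={len(data) / len(blob):.2f} "
          f"compress={t1 - t0:.2f}s decompress={t2 - t1:.2f}s")

with open("p000020-2183-04-28-17-47.dat", "rb") as f:  # hypothetical file
    data = f.read()

bench(data, "bz2-9", lambda d: bz2.compress(d, 9), bz2.decompress)
bench(data, "gzip-1", lambda d: gzip.compress(d, 1), gzip.decompress)
bench(data, "lz4-0",
      lambda d: lz4.frame.compress(d, compression_level=0),
      lz4.frame.decompress)
cctx, dctx = zstandard.ZstdCompressor(level=1), zstandard.ZstdDecompressor()
bench(data, "zstd-1", cctx.compress, dctx.decompress)
```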


cx1111 commented May 24, 2018

Benchmark results for 5980 files with a total size of 21.01 G. Times are shown in H:MM:SS and are summed across all cores, so this is effectively a single-core benchmark.

| fmt  | compress_level | compression_ratio | time_compress (H:MM:SS) | time_decompress (H:MM:SS) |
|------|----------------|-------------------|-------------------------|---------------------------|
| bz2  | 1              | 2.01              | 0:41:48                 | 0:16:54                   |
| bz2  | 5              | 2.14              | 0:52:01                 | 0:22:37                   |
| bz2  | 9              | 2.18              | 1:11:38                 | 0:44:29                   |
| gzip | 1              | 1.60              | 0:10:08                 | 0:03:45                   |
| gzip | 5              | 1.69              | 0:19:50                 | 0:03:36                   |
| gzip | 9              | 1.70              | 0:37:15                 | 0:03:43                   |
| lz4  | 0              | 1.17              | 0:02:47                 | 0:02:12                   |
| lz4  | 10             | 1.48              | 0:23:51                 | 0:00:32                   |
| lz4  | 16             | 1.48              | 0:44:57                 | 0:00:30                   |
| zstd | 1              | 1.57              | 0:02:33                 | 0:01:17                   |
| zstd | 15             | 1.89              | 6:11:24                 | 0:06:48                   |
| zstd | 22             | 2.01              | 15:45:09                | 0:07:24                   |
| flac | 0              | 2.18              | 0:06:55                 | 0:04:19                   |
| flac | 5              | 2.22              | 0:06:48                 | 0:05:00                   |
| flac | 8              | 2.26              | 0:11:27                 | 0:04:47                   |
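A rough sketch of how the flac rows can be timed with the reference flac command-line encoder, whose -0 through -8 switches correspond to these compression levels. It assumes format 16 dat files (raw little-endian signed 16-bit samples); the channel count and sample rate here are hypothetical placeholders:

```python
import subprocess
import time
from pathlib import Path

def flac_bench(dat_path, channels=2, level=5):
    """Encode a raw little-endian int16 dat file with the flac CLI and
    report the timing and compression ratio. Raw input requires
    --force-raw-format plus the sample-layout flags."""
    out = Path(dat_path).with_suffix(".flac")
    t0 = time.perf_counter()
    subprocess.run(
        ["flac", f"-{level}", "--force-raw-format",
         "--endian=little", "--sign=signed", "--bps=16",
         f"--channels={channels}", "--sample-rate=125",
         "-f", "-o", str(out), str(dat_path)],
        check=True, capture_output=True)
    elapsed = time.perf_counter() - t0
    ratio = Path(dat_path).stat().st_size / out.stat().st_size
    print(f"flac -{level}: ratio={ratio:.2f}, time={elapsed:.2f}s")
```

Decoding back to raw samples for the time_decompress column works the same way with flac -d and the same raw-format flags.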


cx1111 commented May 24, 2018

Leaning towards FLAC. There may be some challenges: https://xiph.org/flac/format.html

  • It only supports 1-8 channels (see the sketch after this list for one conceivable workaround).
  • It only supports integer sampling frequencies (we can work around this by ignoring that value when reading).
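One conceivable workaround for the channel limit, as an illustration only rather than a committed design: split the digital signal into groups of at most 8 channels and write each group as its own FLAC stream. This sketch uses the third-party soundfile package; the function name and fs value are hypothetical:

```python
import numpy as np
import soundfile as sf  # pip install soundfile; FLAC support via libsndfile

def write_flac_groups(d_signal, base_name, fs=125):
    """Write an int16 digital signal of any width as one FLAC file per
    group of up to 8 channels, sidestepping FLAC's 1-8 channel limit."""
    assert d_signal.dtype == np.int16
    paths = []
    for i in range(0, d_signal.shape[1], 8):
        path = f"{base_name}_{i // 8}.flac"
        sf.write(path, d_signal[:, i:i + 8], int(fs), subtype="PCM_16")
        paths.append(path)
    return paths
```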


kuprel commented Jul 18, 2019

I found that first converting the float array to int16 using the adc_gain and baseline properties results in a compression ratio of about 2.7x. So far I have compressed 441 GB of dat files to 164 GB of flac files. My code is here: https://github.com/kuprel/flacdb
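For reference, a minimal sketch of that float-to-int16 conversion using the wfdb package; the record name is a hypothetical placeholder and NaN handling is omitted:

```python
import numpy as np
import soundfile as sf
import wfdb

# Per the WFDB header spec: physical = (digital - baseline) / adc_gain,
# so quantizing back is digital = physical * adc_gain + baseline.
record = wfdb.rdrecord("p000020-2183-04-28-17-47")  # hypothetical record
gain = np.asarray(record.adc_gain)
baseline = np.asarray(record.baseline)
digital = np.round(record.p_signal * gain + baseline).astype(np.int16)

sf.write("p000020.flac", digital, int(record.fs), subtype="PCM_16")
```

wfdb's own Record.adc() method performs this physical-to-digital conversion too, so it should be equivalent to the manual arithmetic above.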


cx1111 commented Jun 30, 2022

FLAC formats have been implemented.

cx1111 closed this as completed Jun 30, 2022
cbrnr mentioned this issue Oct 9, 2024