Dat file compression #57

Closed · cx1111 opened this issue Jun 2, 2017 · 6 comments

cx1111 commented Jun 2, 2017

This may be a feature for way down the line.

Incorporate some form of file compression for dat files. The first level is to apply generic compression to save space, at the cost of slower load times. The next level would be to come up with wfdb's own compression scheme that also speeds up loading.

alistairewj commented

One relevant question: is on-the-fly decompression strictly sequential? It would be nice if, for an rdsamp call that reads up to sample N, you only had to decompress the stream up to sample N.
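For the common stream codecs the answer appears to be yes: gzip, bz2, and zstd all decompress in stream order, so an rdsamp-style reader could stop once it has produced sample N. A minimal sketch with gzip, assuming a hypothetical gzip-compressed dat file of little-endian int16 frames (not wfdb's actual API):

```python
import gzip

import numpy as np

def read_first_n(path, n_sig, n_samples):
    """Read only the first n_samples frames of a gzip-compressed dat
    file. gzip decompresses strictly in stream order, so only the
    prefix of the compressed file is ever decompressed."""
    n_bytes = n_samples * n_sig * 2  # 2 bytes per int16 sample
    with gzip.open(path, "rb") as f:
        raw = f.read(n_bytes)  # stops decompressing after n_bytes
    return np.frombuffer(raw, dtype="<i2").reshape(-1, n_sig)
```

The flip side is that reading a segment *starting* at sample M still requires decompressing everything before M, unless the stream is chunked into independently compressed blocks.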


cx1111 commented Mar 17, 2018

I ran some basic benchmarks on the dat files of the first 50 patients in the MIMIC-III matched waveform database, on my machine. This is in-memory compression/decompression, so these times exclude disk I/O and are faster than compressing to disk would be.

| fmt  | compress_level | n_files | uncompressed_total | compressed_total | compression_ratio | time_compress (s) | time_decompress (s) |
|------|----------------|---------|--------------------|------------------|-------------------|-------------------|---------------------|
| bz2  | 1              | 3026    | 9.86 G             | 5.04 G           | 1.96              | 1963.81           | 782.26              |
| bz2  | 9              | 3026    | 9.86 G             | 4.67 G           | 2.11              | 2232.27           | 1253.27             |
| gzip | 1              | 3026    | 9.86 G             | 6.19 G           | 1.59              | 486.09            | 186.93              |
| gzip | 9              | 3026    | 9.86 G             | 5.86 G           | 1.68              | 1434.51           | 175.94              |
| lz4  | 0              | 3026    | 9.86 G             | 8.42 G           | 1.17              | 57.10             | 23.51               |
| lz4  | 16             | 3026    | 9.86 G             | 6.73 G           | 1.47              | 1922.50           | 25.74               |
| zstd | 1              | 3026    | 9.86 G             | 6.40 G           | 1.54              | 106.53            | 52.95               |
| zstd | 22             | 3026    | 9.86 G             | 5.01 G           | 1.97              | 12940.08          | 93.34               |

All of these libraries have Python bindings, and bz2 and gzip are built into the CPython standard library.
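For reference, a minimal sketch of this kind of in-memory benchmark; lz4 and zstandard are third-party packages, and the dat file name is a hypothetical placeholder:

```python
import bz2
import gzip
import time

import lz4.frame  # pip install lz4
import zstandard  # pip install zstandard

def bench(data, name, compress, decompress):
    """Compress and decompress a byte string in memory, reporting the
    compression ratio and both timings."""
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data
    print(f"{name}: ratio={len(data) / len(blob):.2f} "
          f"compress={t1 - t0:.2f}s decompress={t2 - t1:.2f}s")

with open("p000020-2183-04-28-17-47.dat", "rb") as f:  # hypothetical file
    data = f.read()

bench(data, "bz2-9", lambda d: bz2.compress(d, 9), bz2.decompress)
bench(data, "gzip-1", lambda d: gzip.compress(d, 1), gzip.decompress)
bench(data, "lz4-0",
      lambda d: lz4.frame.compress(d, compression_level=0),
      lz4.frame.decompress)
cctx, dctx = zstandard.ZstdCompressor(level=1), zstandard.ZstdDecompressor()
bench(data, "zstd-1", cctx.compress, dctx.decompress)
```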


cx1111 commented May 24, 2018

Benchmark results for 5980 files with a total size of 21.01 G. Times are shown in H:MM:SS and are summed across all cores, so this is effectively a single-core benchmark.

| fmt  | compress_level | compression_ratio | time_compress (H:MM:SS) | time_decompress (H:MM:SS) |
|------|----------------|-------------------|-------------------------|---------------------------|
| bz2  | 1              | 2.01              | 0:41:48                 | 0:16:54                   |
| bz2  | 5              | 2.14              | 0:52:01                 | 0:22:37                   |
| bz2  | 9              | 2.18              | 1:11:38                 | 0:44:29                   |
| gzip | 1              | 1.60              | 0:10:08                 | 0:03:45                   |
| gzip | 5              | 1.69              | 0:19:50                 | 0:03:36                   |
| gzip | 9              | 1.70              | 0:37:15                 | 0:03:43                   |
| lz4  | 0              | 1.17              | 0:02:47                 | 0:02:12                   |
| lz4  | 10             | 1.48              | 0:23:51                 | 0:00:32                   |
| lz4  | 16             | 1.48              | 0:44:57                 | 0:00:30                   |
| zstd | 1              | 1.57              | 0:02:33                 | 0:01:17                   |
| zstd | 15             | 1.89              | 6:11:24                 | 0:06:48                   |
| zstd | 22             | 2.01              | 15:45:09                | 0:07:24                   |
| flac | 0              | 2.18              | 0:06:55                 | 0:04:19                   |
| flac | 5              | 2.22              | 0:06:48                 | 0:05:00                   |
| flac | 8              | 2.26              | 0:11:27                 | 0:04:47                   |
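A rough sketch of how the flac rows can be timed with the reference flac command-line encoder, whose -0 through -8 switches correspond to these compression levels. It assumes format 16 dat files (raw little-endian signed 16-bit samples); the channel count and sample rate here are hypothetical placeholders:

```python
import subprocess
import time
from pathlib import Path

def flac_bench(dat_path, channels=2, level=5):
    """Encode a raw little-endian int16 dat file with the flac CLI and
    report the timing and compression ratio. Raw input requires
    --force-raw-format plus the sample-layout flags."""
    out = Path(dat_path).with_suffix(".flac")
    t0 = time.perf_counter()
    subprocess.run(
        ["flac", f"-{level}", "--force-raw-format",
         "--endian=little", "--sign=signed", "--bps=16",
         f"--channels={channels}", "--sample-rate=125",
         "-f", "-o", str(out), str(dat_path)],
        check=True, capture_output=True)
    elapsed = time.perf_counter() - t0
    ratio = Path(dat_path).stat().st_size / out.stat().st_size
    print(f"flac -{level}: ratio={ratio:.2f}, time={elapsed:.2f}s")
```

Decoding back to raw samples for the time_decompress column works the same way with flac -d and the same raw-format flags.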


cx1111 commented May 24, 2018

Leaning towards FLAC. There may be some challenges: https://xiph.org/flac/format.html

  • It only supports 1-8 channels (see the sketch after this list for one conceivable workaround).
  • It only supports integer sampling frequencies (we can work around this by ignoring that value when reading).
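One conceivable workaround for the channel limit, as an illustration only rather than a committed design: split the digital signal into groups of at most 8 channels and write each group as its own FLAC stream. This sketch uses the third-party soundfile package; the function name and fs value are hypothetical:

```python
import numpy as np
import soundfile as sf  # pip install soundfile; FLAC support via libsndfile

def write_flac_groups(d_signal, base_name, fs=125):
    """Write an int16 digital signal of any width as one FLAC file per
    group of up to 8 channels, sidestepping FLAC's 1-8 channel limit."""
    assert d_signal.dtype == np.int16
    paths = []
    for i in range(0, d_signal.shape[1], 8):
        path = f"{base_name}_{i // 8}.flac"
        sf.write(path, d_signal[:, i:i + 8], int(fs), subtype="PCM_16")
        paths.append(path)
    return paths
```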


kuprel commented Jul 18, 2019

I found that first converting the float array to int16 using the adc_gain and baseline properties results in a compression ratio of about 2.7x. So far I have compressed 441 GB of dat files to 164 GB of flac files. My code is here: https://github.com/kuprel/flacdb
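For reference, a minimal sketch of that float-to-int16 conversion using the wfdb package; the record name is a hypothetical placeholder and NaN handling is omitted:

```python
import numpy as np
import soundfile as sf
import wfdb

# Per the WFDB header spec: physical = (digital - baseline) / adc_gain,
# so quantizing back is digital = physical * adc_gain + baseline.
record = wfdb.rdrecord("p000020-2183-04-28-17-47")  # hypothetical record
gain = np.asarray(record.adc_gain)
baseline = np.asarray(record.baseline)
digital = np.round(record.p_signal * gain + baseline).astype(np.int16)

sf.write("p000020.flac", digital, int(record.fs), subtype="PCM_16")
```

wfdb's own Record.adc() method performs this physical-to-digital conversion too, so it should be equivalent to the manual arithmetic above.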


cx1111 commented Jun 30, 2022

FLAC formats have been implemented.

cx1111 closed this as completed Jun 30, 2022
cbrnr mentioned this issue Oct 9, 2024