Handle sample numbers > 2**31 in annotation files #328

bemoody · 2021-09-29T17:25:23Z

In an Annotation object, the event timestamps are stored as an array
of integers. (The array is called sample because one "tick" usually
corresponds to one sample in the signal file, though this is not
necessarily the case.)

It's uncommon, but by no means unheard of, to have a record longer
than 2147483647 (2**31 - 1) ticks. For example, the vendor annotations
for MIMIC-IV waveforms have 1-millisecond resolution, and many of those
records are longer than 24 days. Therefore, we must be sure that the
sample array is able to accommodate larger values, which means the
timestamps must be calculated and stored as 64-bit integers, rather
than numpy's default integer type.
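
To put numbers on this, here is a minimal sketch, assuming 1-millisecond ticks as in the MIMIC-IV example. It shows where a 32-bit tick counter runs out and that an explicitly requested int64 array holds larger values. The variable names are illustrative and are not taken from the library.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647

# With 1-millisecond ticks, a signed 32-bit counter lasts just under 25 days.
ticks_per_day = 1000 * 60 * 60 * 24
days_until_overflow = INT32_MAX / ticks_per_day

# Requesting int64 explicitly avoids depending on the platform default.
sample = np.array([0, INT32_MAX + 1, 10_000_000_000], dtype=np.int64)
print(days_until_overflow, sample.dtype, sample[-1])
```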

Furthermore, timestamps in WFDB annotation files are stored as an
offset from the previous annotation. If the offset is between 0 and
1023, then the offset along with the annotation type is stored as two
bytes. If the offset is negative or larger than 1023, then a "SKIP"
sequence is used to specify a 32-bit signed offset.

If the offset between two annotations is larger than 2147483647 (or
less than -2147483648), then multiple SKIPs must be used in a row.
Both rdann and wrann need to accommodate this case.

bemoody · 2021-09-29T18:01:29Z

Well, that's crazy:

2021-09-29T17:26:28.3441563Z ##[group]Run nosetests
2021-09-29T17:26:28.3442229Z nosetests
2021-09-29T17:26:28.3491355Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2021-09-29T17:26:28.3491907Z env:
2021-09-29T17:26:28.3492446Z   pythonLocation: C:\hostedtoolcache\windows\Python\3.6.8\x64
2021-09-29T17:26:28.3493080Z ##[endgroup]
2021-09-29T17:26:33.8184778Z ...D:\a\wfdb-python\wfdb-python\wfdb\io\annotation.py:1883: RuntimeWarning: overflow encountered in long_scalars
2021-09-29T17:26:33.8186197Z   sample_diff += skip_diff
2021-09-29T17:27:41.5789858Z F.........................................
2021-09-29T17:27:41.5790791Z ======================================================================
2021-09-29T17:27:41.5791673Z FAIL: Read and write annotations with large time skips
2021-09-29T17:27:41.5792556Z ----------------------------------------------------------------------
2021-09-29T17:27:41.5793298Z Traceback (most recent call last):
2021-09-29T17:27:41.5795825Z   File "D:\a\wfdb-python\wfdb-python\tests\test_annotation.py", line 198, in test_4
2021-09-29T17:27:41.5797271Z     self.assertEqual(annotation.sample[0], 10000000000)
2021-09-29T17:27:41.5798112Z AssertionError: 1410065408 != 10000000000
2021-09-29T17:27:41.5798595Z 
2021-09-29T17:27:41.5799226Z ----------------------------------------------------------------------
2021-09-29T17:27:41.5799952Z Ran 45 tests in 72.082s
2021-09-29T17:27:41.5800367Z 
2021-09-29T17:27:41.5800921Z FAILED (failures=1)
2021-09-29T17:27:41.9665736Z ##[error]Process completed with exit code 1.

Clearly something is trying to stuff timestamps into 32-bit integers (which is wrong, of course). The obvious culprit is this:

def lists_to_int_arrays(*args):
    return [np.array(a, dtype='int') for a in args]

but for me, that raises an OverflowError if the value is too large; it doesn't merely show a warning.
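
The wrong value in the failed assertion is consistent with truncation to 32 bits: 1410065408 is 10000000000 modulo 2**32. The check below is illustrative only; it confirms the arithmetic but does not identify which line of the library performs the truncation.

```python
import numpy as np

expected = 10_000_000_000

# Keep only the low 32 bits, as a 32-bit integer would.
wrapped = expected % 2**32

# Casting an int64 array down to int32 discards the high bits the same way.
as_int32 = np.array([expected], dtype=np.int64).astype(np.int32)
print(wrapped, int(as_int32[0]))
```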

This variable contains the complete contents of the input annotation file, as a numpy array of pairs of bytes (shape=(N,2), dtype='uint8'). It is neither a str nor a bytes object.

In WFDB-format annotation files, annotation timestamps are represented as an offset from the previous annotation. When this offset is less than 0 or greater than 1023, a SKIP pseudo-annotation is used; when the offset is greater than 2**31 - 1 or less than -2**31, multiple SKIPs must be used. Thus, proc_core_fields must be able to handle an arbitrary number of SKIPs in a row, preceding the actual annotation, and add all of the offsets together to obtain the final timestamp.

When reading an annotation file in WFDB format, the timestamp (sample number) must be computed by adding up the relative timestamp difference for each annotation. For long records, sample numbers can easily exceed 2**32. The input to proc_core_fields is a numpy array, so if we operate on the byte values with ordinary arithmetic operations, the result will be a numpy integer object with numpy's default precision (i.e., int32 on 32-bit architectures and int64 on most 64-bit architectures; on 64-bit Windows, numpy versions before 2.0 also default to int32). Instead, calculate the result as a Python integer, to avoid platform-dependent behavior and (possible) silent wrapping. Furthermore, use left-shift operations instead of multiplying by constants that are hard to remember.
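
A minimal sketch of this accumulation follows. It is not the library's proc_core_fields: the function name `read_offset` is hypothetical, and the byte layout is an assumption based on the WFDB annotation format as I understand it (SKIP is type code 59, followed by four bytes holding the high 16-bit word first, each word with its low byte first). Every byte is converted with int() before shifting, so all arithmetic happens in Python integers.

```python
import numpy as np

SKIP = 59  # type code of the SKIP pseudo-annotation (assumed)


def read_offset(pairs, i):
    """Return (offset, next_index) for the annotation starting at pairs[i].

    `pairs` is a uint8 array of shape (N, 2), like `filebytes`.
    """
    offset = 0  # plain Python int: never wraps
    # Consume any number of consecutive SKIPs.
    while int(pairs[i, 1]) >> 2 == SKIP:
        # int() first, so the shifts happen in Python, not in numpy.
        skip = ((int(pairs[i + 1, 0]) << 16) | (int(pairs[i + 1, 1]) << 24)
                | int(pairs[i + 2, 0]) | (int(pairs[i + 2, 1]) << 8))
        if skip >= 1 << 31:  # reinterpret as signed 32-bit
            skip -= 1 << 32
        offset += skip
        i += 3
    # The real annotation: 10-bit offset in the low bits of the pair.
    offset += int(pairs[i, 0]) | ((int(pairs[i, 1]) & 0x03) << 8)
    return offset, i + 1


# Two maximal SKIPs followed by an annotation of type 1 with offset 2.
max_skip = [[0, SKIP << 2], [0xFF, 0x7F], [0xFF, 0xFF]]
pairs = np.array(max_skip + max_skip + [[2, 1 << 2]], dtype=np.uint8)
offset, next_index = read_offset(pairs, 0)
print(offset, next_index)
```

In the usage example the two SKIPs each contribute 2**31 - 1 ticks and the annotation itself contributes 2, so the total is 2**32, a value that would not fit in an int32.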

For long records, annotation timestamps (sample numbers) can easily exceed the range of a numpy 'int' on 32-bit architectures. Therefore, store the 'sample' array as 'int64' instead.

If the gap between two consecutive annotation timestamps is greater than 2**31 - 1 ticks, it must be represented as two or more SKIP pseudo-annotations. Handle this correctly in field2bytes() (to actually generate the correct byte sequences) and in Annotation.check_field() (to permit the application to specify such a gap). Previously, a gap of exactly 2**31 ticks would not be caught by check_field, and field2bytes would incorrectly generate a SKIP of -2**31 instead.
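
The splitting can be sketched roughly as follows. This is not the library's field2bytes: `skip_bytes` and `encode_offset` are hypothetical names, the sketch handles only non-negative gaps, and the byte layout (SKIP type code 59, high 16-bit word first, low byte first within each word) is an assumption about the WFDB format.

```python
INT32_MAX = 0x7FFFFFFF  # largest offset a single SKIP can carry
SKIP = 59               # SKIP pseudo-annotation type code (assumed)


def skip_bytes(value):
    """Six bytes for one SKIP carrying a signed 32-bit `value`."""
    v = value & 0xFFFFFFFF
    return [0, SKIP << 2,
            (v >> 16) & 255, (v >> 24) & 255,  # high word, low byte first
            v & 255, (v >> 8) & 255]           # low word, low byte first


def encode_offset(gap, typecode):
    """Encode a non-negative gap (in ticks) plus an annotation type."""
    if gap < 0:
        raise ValueError("this sketch only handles non-negative gaps")
    out = []
    while gap > INT32_MAX:  # too big for one SKIP: emit a maximal one
        out += skip_bytes(INT32_MAX)
        gap -= INT32_MAX
    if gap > 1023:          # too big for the 10-bit field
        out += skip_bytes(gap)
        gap = 0
    # Low 8 bits, then the top 2 bits packed next to the type code.
    out += [gap & 255, ((gap >> 8) & 3) | (typecode << 2)]
    return out


print(encode_offset(2**31, 1))
```

A gap of exactly 2**31 becomes one maximal SKIP plus an ordinary annotation with offset 1, rather than a single SKIP whose 32-bit value would read back as -2**31.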

Make the test_annotation class a subclass of unittest.TestCase, allowing it to use standard unit testing utility methods, as well as setup and teardown functions. (nosetests will run "test" methods automatically even if the class is not a subclass of TestCase, but unittest won't.) Rename the class to TestAnnotation for consistency. Make the module executable (invoke unittest.main()) so it can be invoked simply using 'python3 -m tests.test_annotation'. Ensure that temporary files created by the annotation tests will be correctly cleaned up by TestAnnotation.tearDownClass() rather than by the unrelated TestRecord.tearDownClass(). (Presumably this only happened to work previously because "test_record" comes alphabetically after "test_annotation".)
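
As a rough illustration of that structure, here is a skeleton in which the class owns and removes its own output files. The test body is a placeholder rather than one of the wfdb tests, and the use of a temporary directory is an assumption made for this sketch; the actual tests may manage their files differently.

```python
import os
import tempfile
import unittest


class TestAnnotation(unittest.TestCase):
    """Skeleton only: the test body is a placeholder, not a wfdb test."""

    @classmethod
    def setUpClass(cls):
        # Give this class its own scratch directory for output files.
        cls.temp_dir = tempfile.TemporaryDirectory()

    @classmethod
    def tearDownClass(cls):
        # Clean up what this class created; no other class has to do it.
        cls.temp_dir.cleanup()

    def test_write_file(self):
        path = os.path.join(self.temp_dir.name, "example.atr")
        with open(path, "wb") as f:
            f.write(b"\x00\x00")
        self.assertEqual(os.path.getsize(path), 2)


def run_tests():
    """Run the class explicitly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAnnotation)
    return unittest.TextTestRunner().run(suite)
```

In a real test module, the `run_tests` helper would normally be replaced by an `if __name__ == "__main__": unittest.main()` guard at the end of the file, which is what makes 'python3 -m tests.test_annotation' work.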

Check that we can both read and write an annotation file containing a relative offset of more than 2**31 - 1 ticks, which necessitates the use of multiple SKIP pseudo-annotations.

cx1111

LGTM

cx1111 · 2021-10-22T19:16:31Z

wfdb/io/annotation.py

-            data_bytes = [sd & 255, ((sd & 768) >> 8) + 4*typecode]
+        data_bytes = []
+        # Add SKIP elements if value is too large
+        while sd > 0x7fffffff:


We now have this hex value and the equivalent decimal value 2147483647 in this file. Perhaps unify them under a module-level constant for clarity?

tompollard · 2022-03-03T17:18:03Z

Thanks Benjamin. Chen, it looks like your points are resolved, so I'm taking the liberty of merging!

Benjamin Moody added 7 commits September 30, 2021 12:31

Fix documentation of the internal variable 'filebytes'.

04ae55f

rdann: store sample as an array of int64.

c098e38

Add test cases for reading/writing huge skips.

aad7b14

bemoody force-pushed the huge-skip branch from 036bc22 to aad7b14 Compare September 30, 2021 16:42

bemoody changed the title ~~Handle large gaps in annotation files~~ Handle sample numbers > 2**31 in annotation files Oct 1, 2021

cx1111 approved these changes Oct 22, 2021

field2bytes: rearrange and add comments for clarity.

bfa0a37

tompollard merged commit 0d41603 into master Mar 3, 2022

tompollard deleted the huge-skip branch March 3, 2022 17:18
