Handle sample numbers > 2**31 in annotation files #328

Merged 8 commits on Mar 3, 2022


Collaborator

@bemoody bemoody commented Sep 29, 2021

In an Annotation object, the event timestamps are stored as an array
of integers. (The array is called sample because one "tick" usually
corresponds to one sample in the signal file, though this is not
necessarily the case.)

It's uncommon, but by no means unheard of, to have a record longer
than 2147483647 (2**31 - 1) ticks. For example, the vendor annotations
for MIMIC-IV waveforms have 1-millisecond resolution, and many of those
records are longer than 24 days. Therefore, we must be sure that the
sample array is able to accommodate larger values, which means the
timestamps must be calculated and stored as 64-bit integers, rather
than numpy's default integer type.
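
To make the numbers concrete, here is a minimal sketch of where the 24-day figure comes from and of an explicitly 64-bit array. The constant and variable names are illustrative only and do not come from wfdb.

```python
import numpy as np

# Largest value a signed 32-bit integer can hold.
INT32_MAX = np.iinfo(np.int32).max  # 2147483647

# At 1-millisecond resolution, one day is 86,400,000 ticks.
TICKS_PER_DAY = 1000 * 60 * 60 * 24

# A 32-bit tick counter runs out a little before day 25.
days_until_overflow = INT32_MAX / TICKS_PER_DAY  # about 24.86

# Requesting int64 explicitly avoids depending on the platform default.
sample = np.array([0, INT32_MAX + 1, 10_000_000_000], dtype=np.int64)
```

Naming the dtype explicitly, rather than passing `'int'`, is what keeps the result the same on every platform.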

Furthermore, timestamps in WFDB annotation files are stored as an
offset from the previous annotation. If the offset is between 0 and
1023, then the offset along with the annotation type is stored as two
bytes. If the offset is negative or larger than 1023, then a "SKIP"
sequence is used to specify a 32-bit signed offset.

If the offset between two annotations is larger than 2147483647 (or
less than -2147483648), then multiple SKIPs must be used in a row.
Both rdann and wrann need to accommodate this case.
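
The splitting rule described above can be sketched as follows. This is an illustration only, not the code in this pull request; `split_offset` and the constant names are hypothetical.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
MAX_INLINE_OFFSET = 1023  # largest offset that fits next to the annotation type


def split_offset(offset):
    """Split a relative offset into SKIP values plus an inline remainder.

    Returns (skips, remainder): each SKIP value fits in a signed 32-bit
    integer, the remainder is in 0..1023, and sum(skips) + remainder
    equals the original offset.
    """
    skips = []
    remaining = offset
    while remaining < 0 or remaining > MAX_INLINE_OFFSET:
        # Clamp each SKIP to the signed 32-bit range.
        chunk = max(INT32_MIN, min(remaining, INT32_MAX))
        skips.append(chunk)
        remaining -= chunk
    return skips, remaining
```

For example, `split_offset(500)` needs no SKIP at all, while `split_offset(2**31)` yields one SKIP of 2147483647 and an inline remainder of 1.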

@bemoody
Collaborator Author

bemoody commented Sep 29, 2021

Well, that's crazy:

2021-09-29T17:26:28.3441563Z ##[group]Run nosetests
2021-09-29T17:26:28.3442229Z nosetests
2021-09-29T17:26:28.3491355Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2021-09-29T17:26:28.3491907Z env:
2021-09-29T17:26:28.3492446Z   pythonLocation: C:\hostedtoolcache\windows\Python\3.6.8\x64
2021-09-29T17:26:28.3493080Z ##[endgroup]
2021-09-29T17:26:33.8184778Z ...D:\a\wfdb-python\wfdb-python\wfdb\io\annotation.py:1883: RuntimeWarning: overflow encountered in long_scalars
2021-09-29T17:26:33.8186197Z   sample_diff += skip_diff
2021-09-29T17:27:41.5789858Z F.........................................
2021-09-29T17:27:41.5790791Z ======================================================================
2021-09-29T17:27:41.5791673Z FAIL: Read and write annotations with large time skips
2021-09-29T17:27:41.5792556Z ----------------------------------------------------------------------
2021-09-29T17:27:41.5793298Z Traceback (most recent call last):
2021-09-29T17:27:41.5795825Z   File "D:\a\wfdb-python\wfdb-python\tests\test_annotation.py", line 198, in test_4
2021-09-29T17:27:41.5797271Z     self.assertEqual(annotation.sample[0], 10000000000)
2021-09-29T17:27:41.5798112Z AssertionError: 1410065408 != 10000000000
2021-09-29T17:27:41.5798595Z 
2021-09-29T17:27:41.5799226Z ----------------------------------------------------------------------
2021-09-29T17:27:41.5799952Z Ran 45 tests in 72.082s
2021-09-29T17:27:41.5800367Z 
2021-09-29T17:27:41.5800921Z FAILED (failures=1)
2021-09-29T17:27:41.9665736Z ##[error]Process completed with exit code 1.

Clearly something is trying to stuff timestamps into 32-bit integers (which is wrong, of course). The obvious culprit is this:

def lists_to_int_arrays(*args):
    return [np.array(a, dtype='int') for a in args]

but for me, that raises an OverflowError if the value is too large; it doesn't merely show a warning.
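
The value in the failed assertion is exactly what remains when 10000000000 is truncated to 32 bits, which supports the 32-bit explanation. A small sketch to check the arithmetic follows. It does not reproduce the CI environment, and the `dtype='int64'` line at the end is one possible remedy, not necessarily the change made in this pull request.

```python
import numpy as np

expected = 10_000_000_000

# Keep only the low 32 bits, as a 32-bit integer would.
wrapped = expected & 0xFFFFFFFF

# The same truncation happens silently when an int64 array is cast to int32.
as_int32 = np.array([expected], dtype=np.int64).astype(np.int32)[0]

# Asking for a fixed-width 64-bit dtype holds the value on any platform.
fixed_width = np.array([expected], dtype='int64')
```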

Benjamin Moody added 7 commits September 30, 2021 12:31
This variable contains the complete contents of the input annotation
file, as a numpy array of pairs of bytes (shape=(N,2), dtype='uint8').
It is neither a str nor a bytes object.

In WFDB-format annotation files, annotation timestamps are represented
as an offset from the previous annotation.  When this offset is less
than 0 or greater than 1023, a SKIP pseudo-annotation is used; when
the offset is greater than 2**31 - 1 or less than -2**31, multiple
SKIPs must be used.  Thus, proc_core_fields must be able to handle an
arbitrary number of SKIPs in a row, preceding the actual annotation,
and add all of the offsets together to obtain the final timestamp.

When reading an annotation file in WFDB format, the timestamp (sample
number) must be computed by adding up the relative timestamp
difference for each annotation.  For long records, sample numbers can
easily exceed 2**32.
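
As a rough illustration of the accumulation step, the sketch below sums relative offsets, including any run of consecutive SKIPs, using Python integers and only converts to a 64-bit array at the end. The `('skip', n)` / `('ann', n)` tuple form is a hypothetical, already-decoded representation chosen for this example; it is not how proc_core_fields receives its input.

```python
import numpy as np


def accumulate_samples(items):
    """Turn relative offsets into absolute sample numbers.

    items: iterable of ('skip', offset) or ('ann', offset) tuples, in
    file order. Any number of SKIPs may precede an annotation.
    """
    samples = []
    current = 0  # absolute time of the previous annotation (Python int)
    pending = 0  # sum of SKIP offsets seen since the previous annotation
    for kind, offset in items:
        if kind == 'skip':
            pending += offset
        else:
            current += pending + offset
            pending = 0
            samples.append(current)
    # Python ints never wrap; int64 keeps that true for the stored array.
    return np.array(samples, dtype=np.int64)


items = [('ann', 10),
         ('skip', 2147483647), ('skip', 2147483647), ('ann', 6)]
samples = accumulate_samples(items)  # second entry is 4294967310
```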

The input to proc_core_fields is a numpy array, so if we operate on
the byte values with ordinary arithmetic operations, the result will
be a numpy integer object with numpy's default precision (i.e., the
platform's C long: int32 on 32-bit architectures and on 64-bit
Windows, int64 on most other 64-bit platforms).

Instead, calculate the result as a Python integer, to avoid
architecture-dependent behavior and (possible) silent wrapping.
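
A minimal sketch of that idea, assuming the four data bytes of a SKIP are laid out as in the WFDB annotation format (high 16 bits first, low byte first within each 16-bit word). `decode_skip` is a hypothetical helper, not the function in annotation.py.

```python
import numpy as np


def decode_skip(data):
    """Decode the 4 data bytes of a SKIP into a signed Python int.

    data: sequence of 4 byte values (for example numpy uint8 scalars),
    ordered as: high word low byte, high word high byte,
    low word low byte, low word high byte.
    """
    # Convert to Python ints first so the shifts cannot overflow or wrap.
    b = [int(x) for x in data]
    value = (b[1] << 24) | (b[0] << 16) | (b[3] << 8) | b[2]
    # Reinterpret the unsigned 32-bit value as two's-complement signed.
    if value >= 1 << 31:
        value -= 1 << 32
    return value


largest = decode_skip(np.array([255, 127, 255, 255], dtype=np.uint8))
```

The `int(x)` conversion is the important step: once the operands are Python integers, the result no longer depends on the platform's default numpy integer width.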

(Furthermore, use left-shift operations instead of multiplying by
constants that are hard to remember.)

For long records, annotation timestamps (sample numbers) can easily
exceed the range of a numpy 'int' on 32-bit architectures.  Therefore,
store the 'sample' array as 'int64' instead.

If the gap between two consecutive annotation timestamps is greater
than 2**31 - 1 ticks, it must be represented as two or more SKIP
pseudo-annotations.  Handle this correctly in field2bytes() (to
actually generate the correct byte sequences) and in
Annotation.check_field() (to permit the application to specify such a
gap.)
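
To illustrate why the boundary matters, the sketch below shows what happens to a gap of exactly 2**31 ticks if it is written as a single SKIP, and how clamping the SKIP to 2**31 - 1 avoids it. `as_signed_32`, `skip_bytes`, and `SKIP_MAX` are hypothetical names; the byte layout assumes the WFDB SKIP format (a word with type 59 and zero offset, then the high and low 16-bit words, low byte first in each).

```python
import struct

SKIP_MAX = 2**31 - 1


def as_signed_32(n):
    """Reinterpret the low 32 bits of n as a signed integer."""
    return struct.unpack('<i', struct.pack('<I', n & 0xFFFFFFFF))[0]


def skip_bytes(n):
    """Return the six bytes of one SKIP carrying the offset n."""
    if not -2**31 <= n <= SKIP_MAX:
        raise ValueError('offset does not fit in a single SKIP')
    return [0, 59 << 2,
            (n >> 16) & 255, (n >> 24) & 255,
            n & 255, (n >> 8) & 255]


gap = 2**31

# Written as one SKIP, the gap would be read back with the wrong sign.
misread = as_signed_32(gap)

# Clamping the SKIP leaves a small remainder for the annotation word.
first = min(gap, SKIP_MAX)
remainder = gap - first
encoded = skip_bytes(first)
```

Here `misread` is -2147483648, while the clamped form carries 2147483647 in the SKIP and 1 in the annotation's own offset field.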

(Previously, if there was a gap of exactly 2**31 ticks, this would not
be caught by check_field, and field2bytes would incorrectly generate a
SKIP of -2**31 instead.)

Make the test_annotation class a subclass of unittest.TestCase,
allowing it to use standard unit testing utility methods, as well as
setup and teardown functions.  (nosetests will run "test" class
methods automatically even if they are not subclasses of TestCase, but
unittest won't.)  Rename the class to TestAnnotation for consistency.

Make the module executable (invoke unittest.main()) so it can be
invoked simply using 'python3 -m tests.test_annotation'.
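
For readers unfamiliar with the pattern, a test class built on unittest.TestCase with class-level setup and teardown might look roughly like this. The class name, file name, and test body are placeholders, not the contents of tests/test_annotation.py.

```python
import os
import shutil
import tempfile
import unittest


class TestAnnotationSketch(unittest.TestCase):
    """Placeholder test class that owns its temporary files."""

    @classmethod
    def setUpClass(cls):
        # One scratch directory shared by every test in this class.
        cls.temp_dir = tempfile.mkdtemp()

    @classmethod
    def tearDownClass(cls):
        # Runs once after the last test in this class, so cleanup does
        # not depend on any other test class running afterwards.
        shutil.rmtree(cls.temp_dir)

    def test_write_then_read(self):
        path = os.path.join(self.temp_dir, 'huge.txt')
        with open(path, 'w') as f:
            f.write(str(10000000000))
        with open(path) as f:
            self.assertEqual(int(f.read()), 10000000000)
```

Adding a call to `unittest.main()` under an `if __name__ == '__main__':` guard at the bottom of such a module is what makes it runnable with `python3 -m`.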

Ensure that temporary files created by the annotation tests will be
correctly cleaned up by TestAnnotation.tearDownClass() rather than by
the unrelated TestRecord.tearDownClass().  (Presumably this only
happened to work previously because "test_record" comes alphabetically
after "test_annotation".)

Check that we can both read and write an annotation file containing a
relative offset of more than 2**31 - 1 ticks, which necessitates the
use of multiple SKIP pseudo-annotations.
@bemoody bemoody changed the title from "Handle large gaps in annotation files" to "Handle sample numbers > 2**31 in annotation files" Oct 1, 2021
Member

@cx1111 cx1111 left a comment

LGTM

wfdb/io/annotation.py
data_bytes = [sd & 255, ((sd & 768) >> 8) + 4*typecode]
data_bytes = []
# Add SKIP elements if value is too large
while sd > 0x7fffffff:
Member

We now have this hex value and the same decimal value 2147483647 in this file. Perhaps unify under a constant module level variable for clarity?
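
One way the suggestion could look, as a sketch only: the constant and function names below are hypothetical and are not taken from wfdb.

```python
# Hypothetical module-level constants: the range of a single SKIP offset.
SKIP_OFFSET_MIN = -2**31     # -2147483648
SKIP_OFFSET_MAX = 2**31 - 1  # 2147483647, i.e. 0x7fffffff


def fits_in_one_skip(offset):
    """Return True if offset can be stored in a single SKIP."""
    return SKIP_OFFSET_MIN <= offset <= SKIP_OFFSET_MAX
```

With a single definition, the validation code and the byte-encoding code would compare against the same name instead of two spellings of the same number.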

@tompollard
Member

Thanks Benjamin. Chen, it looks like your points are resolved, so I'm taking the liberty of merging!

@tompollard tompollard merged commit 0d41603 into master Mar 3, 2022
@tompollard tompollard deleted the huge-skip branch March 3, 2022 17:18