-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle sample numbers > 2**31 in annotation files #328
Conversation
Well, that's crazy:
Clearly something is trying to stuff timestamps into 32-bit integers (which is wrong, of course). The obvious culprit is this:
but for me, that raises an OverflowError if the value is too large, it doesn't merely show a warning. |
This variable contains the complete contents of the input annotation file, as a numpy array of pairs of bytes (shape=(N,2), dtype='uint8'). It is neither a str nor a bytes object.
In WFDB-format annotation files, annotation timestamps are represented as an offset from the previous annotation. When this offset is less than 0 or greater than 1023, a SKIP pseudo-annotation is used; when the offset is greater than 2**31 - 1 or less than -2**31, multiple SKIPs must be used. Thus, proc_core_fields must be able to handle an arbitrary number of SKIPs in a row, preceding the actual annotation, and add all of the offsets together to obtain the final timestamp.
When reading an annotation file in WFDB format, the timestamp (sample number) must be computed by adding up the relative timestamp difference for each annotation. For long records, sample numbers can easily exceed 2**32. The input to proc_core_fields is a numpy array, so if we operate on the byte values with ordinary arithmetic operations, the result will be a numpy integer object with numpy's default precision (i.e., int32 on 32-bit architectures, int64 on 64-bit architectures.) Instead, calculate the result as a Python integer, to avoid architecture-dependent behavior and (possible) silent wrapping. (Furthermore, use left-shift operations instead of multiplying by constants that are hard to remember.)
For long records, annotation timestamps (sample numbers) can easily exceed the range of a numpy 'int' on 32-bit architectures. Therefore, store the 'sample' array as 'int64' instead.
If the gap between two consecutive annotation timestamps is greater than 2**31 - 1 ticks, it must be represented as two or more SKIP pseudo-annotations. Handle this correctly in field2bytes() (to actually generate the correct byte sequences) and in Annotation.check_field() (to permit the application to specify such a gap.) (Previously, if there was a gap of exactly 2**31 ticks, this would not be caught by check_field, and field2bytes would incorrectly generate a SKIP of -2**31 instead.)
Make the test_annotation class a subclass of unittest.TestCase, allowing it to use standard unit testing utility methods, as well as setup and teardown functions. (nosetests will run "test" class methods automatically even if they are not subclasses of TestCase, but unittest won't.) Rename the class to TestAnnotation for consistency. Make the module executable (invoke unittest.main()) so it can be invoked simply using 'python3 -m tests.test_annotation'. Ensure that temporary files created by the annotation tests will be correctly cleaned up by TestAnnotation.tearDownClass() rather than by the unrelated TestRecord.tearDownClass(). (Presumably this only happened to work previously because "test_record" comes alphabetically after "test_annotation".)
Check that we can both read and write an annotation file containing a relative offset of more than 2**31 - 1 ticks, which necessitates the use of multiple SKIP pseudo-annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
wfdb/io/annotation.py
Outdated
data_bytes = [sd & 255, ((sd & 768) >> 8) + 4*typecode] | ||
data_bytes = [] | ||
# Add SKIP elements if value is too large | ||
while sd > 0x7fffffff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have this hex value and the same decimal value 2147483647 in this file. Perhaps unify under a constant module level variable for clarity?
Thanks Benjamin. Chen, it looks like your points are resolved, so I'm taking the liberty of merging! |
In an Annotation object, the event timestamps are stored as an array
of integers. (The array is called
sample
because one "tick" usuallycorresponds to one sample in the signal file, though this is not
necessarily the case.)
It's uncommon, but by no means unheard of, to have a record longer
than 2147483647 ticks (for example, the vendor annotations for
MIMIC-IV waveforms have 1-millisecond resolution, and many of those
records are longer than 24 days.) Therefore, we must be sure that the
sample
array is able to accommodate larger values, which means thetimestamps must be calculated and stored as 64-bit integers, rather
than numpy's default integer type.
Furthermore, timestamps in WFDB annotation files are stored as an
offset from the previous annotation. If the offset is between 0 and
1023, then the offset along with the annotation type is stored as two
bytes. If the offset is negative or larger than 1023, then a "SKIP"
sequence is used to specify a 32-bit signed offset.
If the offset between two annotations is larger than 2147483647 (or
less than -2147483648), then multiple SKIPs must be used in a row.
Both
rdann
andwrann
need to accommodate this case.