
Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails #180

Open
jcushman opened this issue May 15, 2024 · 2 comments

Comments

@jcushman
Contributor

I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:

>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
# it does work correctly with the slower pure-Python diff algorithm, though that
# may not be reliable in general; it may just be how the diff shakes out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 <em>N. H.</em> 459}. 1 Bla'

The solution I found is to protect the strings removed in cleaning by temporarily encoding each of them as a character from the Unicode private use range:

import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace substrings in big_string with unique characters from the Unicode private use range."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)
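To see concretely what the encoding step produces, here is a self-contained round trip on the example strings above (the helpers are repeated so the snippet runs without eyecite):

```python
import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace each substring with a unique private-use character."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
encoded, char_mapping = encode_strings_as_unicode(html, strings_to_protect)

# Each tag (and the whole <a>...</a> element) is now a single atomic character,
# so a character-level diff can no longer split it:
print(encoded)  # '145, \ue00011 \ue001N. H.\ue002 459. 1 Bla'
assert decode_unicode_to_strings(encoded, char_mapping) == html
```

Note that the encoded string is just the cleaned text plus one placeholder character per removed tag, which is exactly the shape a diff handles well.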

This works and gives correct results, but it's a bit messy, especially since it requires some awkward code to use a custom annotation function:

from functools import partial
# assuming is_balanced_html is eyecite's HTML-balance helper from eyecite.utils
from eyecite.utils import is_balanced_html

def annotator(char_mapping, before, encoded_text, after):
    """Attach annotation tags to a stretch of citation text. If the text
    contains a link or an unbalanced tag, wrap around those tags instead."""
    text = decode_unicode_to_strings(encoded_text, char_mapping)
    if '<a' in text or not is_balanced_html(text):
        encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + encoded_text + after

# used as:
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))

The ideal solution would probably be a way to give hints to the diff function -- something like annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning) -- which would then ensure that every protected string ends up as a coherent insert in the diff generated by diff-match-patch.

@jcushman
Contributor Author

For anyone who ever digs into this, the underlying issue is the difference between these two possible diffs:

# under the hood when we fail:
>>> import fast_diff_match_patch; fast_diff_match_patch.diff(text, html)
[('=', 5), ('-', 8), ('+', 111), ('=', 11)]
# under the hood when we succeed:
>>> from difflib import SequenceMatcher; SequenceMatcher(a=text, b=html, autojunk=False).get_opcodes()
[('equal', 0, 5, 0, 5), ('insert', 5, 5, 5, 99), ('equal', 5, 8, 99, 102), ('insert', 8, 8, 102, 106), ('equal', 8, 13, 106, 111), ('insert', 13, 13, 111, 116), ('equal', 13, 24, 116, 127)]
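For intuition about why those opcode boundaries matter: annotation amounts to mapping each citation offset in the cleaned text through the diff into the HTML. A minimal sketch (`map_offsets` is a hypothetical helper, not eyecite's actual code, and it ignores boundary cases eyecite has to handle) that reproduces the successful difflib result:

```python
from difflib import SequenceMatcher

def map_offsets(a, b, offsets):
    """Map character offsets in `a` to offsets in `b` via 'equal' diff blocks.
    Hypothetical sketch: offsets that fall inside inserted or deleted
    stretches are simply skipped."""
    mapping = {}
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        if tag == 'equal':
            for off in offsets:
                if i1 <= off < i2:
                    mapping[off] = j1 + (off - i1)
    return mapping

text = '145, 11 N. H. 459. 1 Bla'
html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
start, end = 5, 17
m = map_offsets(text, html, [start, end])
annotated = html[:m[start]] + '{' + html[m[start]:m[end]] + '}' + html[m[end]:]
# with the fine-grained difflib diff, both braces land outside the tags
print(annotated)
```

This reproduces the use_dmp=False result above; the coarser DMP diff instead aligns offset 5 into the middle of the <a> tag, which is where the stray '{' comes from.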

It should be possible to pass in strings_to_protect and:

  • use strings_to_protect to manually generate a diff between html-without-strings-to-protect and html-with-strings-to-protect
  • generate a diff between text and html-without-strings-to-protect
  • merge the two diffs into one combined diff
  • use that combined diff in the existing algorithm
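The first bullet is cheap to implement, since the diff between html-without-strings-to-protect and html consists of pure inserts at known offsets. A rough sketch (function names are illustrative, not a proposed eyecite API):

```python
import re

def strip_and_record(full_text, protected):
    """Remove each protected string once, recording (offset, string) pairs
    in the coordinates of the stripped text."""
    inserts, stripped = [], full_text
    for s in protected:
        off = stripped.find(s)
        inserts.append((off, s))
        stripped = stripped[:off] + stripped[off + len(s):]
    return stripped, inserts

def diff_from_inserts(stripped, inserts):
    """Build a DMP-style diff (same shape as fast_diff_match_patch.diff
    output) from the stripped text to the full text: '=' runs interleaved
    with '+' inserts."""
    diff, pos = [], 0
    for off, s in inserts:
        if off > pos:
            diff.append(('=', off - pos))
        diff.append(('+', len(s)))
        pos = off
    if pos < len(stripped):
        diff.append(('=', len(stripped) - pos))
    return diff

html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
protected = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
stripped, inserts = strip_and_record(html, protected)
diff = diff_from_inserts(stripped, inserts)
# Same shape as the "good" difflib diff above -- the citation text is never
# split across an insert:
print(diff)  # [('=', 5), ('+', 94), ('=', 3), ('+', 4), ('=', 5), ('+', 5), ('=', 11)]
```

Merging this with the diff from the second bullet is then a matter of composing two edit scripts, which is the fiddly part.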

DMP is designed for this sort of thing, so that should work fine; it'll just be a little annoying to sort out the code to manually generate the needed diffs and merge them.

@mlissner
Member

OK, I sort of see what's happening here, but can you say how prevalent this is or what prerequisites it has? I'm trying to judge how much to worry about this vs. how much of a corner case it is.
