
Need ability to guard strings so annotations won't be inserted inside; example case where annotator fails #180

Open
jcushman opened this issue May 15, 2024 · 2 comments

Comments

@jcushman
Contributor

I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:

>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
# it does work correctly with the slower pure-Python diff algorithm, though that
# may not be reliable in general; it may just be how the diff shakes out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 <em>N. H.</em> 459}. 1 Bla'

The solution I found is to protect the strings removed in cleaning by temporarily encoding each of them as a character from the Unicode private use range:

import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace substrings in big_string with unique characters from the Unicode private use range."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)
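To see concretely what the encoding step produces, here is a self-contained round trip on the example strings above (the helpers are repeated so the snippet runs without eyecite):

```python
import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace each substring with a unique private-use character."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string

html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
encoded, char_mapping = encode_strings_as_unicode(html, strings_to_protect)

# Each tag (and the whole <a>...</a> element) is now a single atomic character,
# so a character-level diff can no longer split it:
print(encoded)  # '145, \ue00011 \ue001N. H.\ue002 459. 1 Bla'
assert decode_unicode_to_strings(encoded, char_mapping) == html
```

Note that the encoded string is just the cleaned text plus one placeholder character per removed tag, which is exactly the shape a diff handles well.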

This works and gives correct results, but it's a bit messy, especially since it requires some awkward code to use a custom annotation function:

from functools import partial
# assuming is_balanced_html is eyecite's HTML-balance helper from eyecite.utils
from eyecite.utils import is_balanced_html

def annotator(char_mapping, before, encoded_text, after):
    """Attach annotation tags to a stretch of citation text. If the text
    contains a link or an unbalanced tag, wrap around those tags instead."""
    text = decode_unicode_to_strings(encoded_text, char_mapping)
    if '<a' in text or not is_balanced_html(text):
        encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + encoded_text + after

# used as:
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))

The ideal solution would probably be a way to give hints to the diff function -- something like annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning) -- which would then ensure that every protected string ends up as a coherent insert in the diff generated by diff-match-patch.

@jcushman
Contributor Author

For anyone who ever digs into this, the underlying issue is the difference between these two possible diffs:

# under the hood when we fail:
>>> import fast_diff_match_patch; fast_diff_match_patch.diff(text, html)
[('=', 5), ('-', 8), ('+', 111), ('=', 11)]
# under the hood when we succeed:
>>> from difflib import SequenceMatcher; SequenceMatcher(a=text, b=html, autojunk=False).get_opcodes()
[('equal', 0, 5, 0, 5), ('insert', 5, 5, 5, 99), ('equal', 5, 8, 99, 102), ('insert', 8, 8, 102, 106), ('equal', 8, 13, 106, 111), ('insert', 13, 13, 111, 116), ('equal', 13, 24, 116, 127)]
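For intuition about why those opcode boundaries matter: annotation amounts to mapping each citation offset in the cleaned text through the diff into the HTML. A minimal sketch (`map_offsets` is a hypothetical helper, not eyecite's actual code, and it ignores boundary cases eyecite has to handle) that reproduces the successful difflib result:

```python
from difflib import SequenceMatcher

def map_offsets(a, b, offsets):
    """Map character offsets in `a` to offsets in `b` via 'equal' diff blocks.
    Hypothetical sketch: offsets that fall inside inserted or deleted
    stretches are simply skipped."""
    mapping = {}
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        if tag == 'equal':
            for off in offsets:
                if i1 <= off < i2:
                    mapping[off] = j1 + (off - i1)
    return mapping

text = '145, 11 N. H. 459. 1 Bla'
html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
start, end = 5, 17
m = map_offsets(text, html, [start, end])
annotated = html[:m[start]] + '{' + html[m[start]:m[end]] + '}' + html[m[end]:]
# with the fine-grained difflib diff, both braces land outside the tags
print(annotated)
```

This reproduces the use_dmp=False result above; the coarser DMP diff instead aligns offset 5 into the middle of the <a> tag, which is where the stray '{' comes from.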

It should be possible to pass in strings_to_protect and:

  • use strings_to_protect to manually generate a diff between html-without-strings-to-protect and html-with-strings-to-protect
  • generate a diff between text and html-without-strings-to-protect
  • merge the two diffs into one combined diff
  • use that combined diff in the existing algorithm
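The first bullet is cheap to implement, since the diff between html-without-strings-to-protect and html consists of pure inserts at known offsets. A rough sketch (function names are illustrative, not a proposed eyecite API):

```python
import re

def strip_and_record(full_text, protected):
    """Remove each protected string once, recording (offset, string) pairs
    in the coordinates of the stripped text."""
    inserts, stripped = [], full_text
    for s in protected:
        off = stripped.find(s)
        inserts.append((off, s))
        stripped = stripped[:off] + stripped[off + len(s):]
    return stripped, inserts

def diff_from_inserts(stripped, inserts):
    """Build a DMP-style diff (same shape as fast_diff_match_patch.diff
    output) from the stripped text to the full text: '=' runs interleaved
    with '+' inserts."""
    diff, pos = [], 0
    for off, s in inserts:
        if off > pos:
            diff.append(('=', off - pos))
        diff.append(('+', len(s)))
        pos = off
    if pos < len(stripped):
        diff.append(('=', len(stripped) - pos))
    return diff

html = ('145, <a id="p410" href="#p410" data-label="410" '
        'data-citation-index="1" class="page-label">*410</a>'
        '11 <em>N. H.</em> 459. 1 Bla')
protected = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
stripped, inserts = strip_and_record(html, protected)
diff = diff_from_inserts(stripped, inserts)
# Same shape as the "good" difflib diff above -- the citation text is never
# split across an insert:
print(diff)  # [('=', 5), ('+', 94), ('=', 3), ('+', 4), ('=', 5), ('+', 5), ('=', 11)]
```

Merging this with the diff from the second bullet is then a matter of composing two edit scripts, which is the fiddly part.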

DMP is designed for this sort of thing, so that should work fine; it'll just be a little annoying to sort out the code to manually generate the needed diffs and merge them.

@mlissner
Member

OK, I sort of see what's happening here, but can you say how prevalent this is or what prerequisites it has? I'm trying to judge how much to worry about this vs. how much of a corner case it is.
