I found an example case in our code where annotation fails. In this simplified example, we're trying to insert {brackets} around the citation, after removing page numbers in pre-processing:
>>> from eyecite.annotate import annotate_citations
# we start with this html:
>>> html = '145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459. 1 Bla'
# we remove page labels and other tags during cleaning:
>>> text = '145, 11 N. H. 459. 1 Bla'
# and then, having located citations, attempt to annotate them:
>>> annots = [((5, 17), '{', '}')]
# but look where the '{' ends up:
>>> annotate_citations(text, annots, html)
'145, <a id="p4{10" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
# it does work correctly with the slower python diff algorithm, though that may not be reliable, just how the diff shakes out in this case:
>>> annotate_citations(text, annots, html, use_dmp=False)
'145, <a id="p410" href="#p410" data-label="410" data-citation-index="1" class="page-label">*410</a>{11 <em>N. H.</em> 459}. 1 Bla'
The solution I found is to protect the strings removed in cleaning, by temporarily encoding them as characters from the Unicode private use area:
import re

def encode_strings_as_unicode(big_string, substrings):
    """Replace each substring in big_string with a unique character from the Unicode private use range."""
    char_mapping = []
    for i, substring in enumerate(substrings):
        unicode_char = chr(0xE000 + i)  # start of the private use range
        char_mapping.append([substring, unicode_char])
        big_string = big_string.replace(substring, unicode_char, 1)
    return big_string, char_mapping

def decode_unicode_to_strings(big_string, char_mapping):
    """Undo encode_strings_as_unicode by replacing each pair in char_mapping."""
    for s, c in char_mapping:
        big_string = big_string.replace(c, s)
    return big_string
strings_to_protect = re.findall(r'<a[^>]*>.*?</a>|<[^>]+>', html, flags=re.S)
new_html, char_mapping = encode_strings_as_unicode(html, strings_to_protect)
new_html = annotate_citations(text, annots, new_html)
new_html = decode_unicode_to_strings(new_html, char_mapping)
This works and gives correct results, but is a bit messy, especially since it requires some funky coding to use a custom annotation function:
from functools import partial

def annotator(char_mapping, before, encoded_text, after):
    """
    Attach annotation tags to a stretch of citation text. If the text contains a link or an
    unbalanced tag, wrap those tags as well.
    """
    text = decode_unicode_to_strings(encoded_text, char_mapping)
    if '<a' in text or not is_balanced_html(text):
        # close and reopen the annotation around each protected (encoded) tag
        encoded_text = re.sub(r'[\uE000-\uF8FF]', rf"{after}\g<0>{before}", encoded_text)
    return before + encoded_text + after

# used as:
new_html = annotate_citations(text, annots, new_html, annotator=partial(annotator, char_mapping))
The ideal thing would probably be to have a way to give hints to the diff function -- something like annotate_citations(text, annots, new_html, protected_strings=strings_removed_in_cleaning), and then ensure that all protected strings end up as coherent inserts in the diff generated by diff-match-patch.
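Purely as a sketch of what such a hint could do under the hood (the wrapper name and keyword here are hypothetical, not an existing eyecite API), it would roughly amount to reusing the encode/decode helpers above:

def annotate_with_protected_strings(plain_text, annotations, source_html, protected_strings):
    """Hypothetical wrapper: hide protected strings behind private-use characters so the diff
    cannot split them, annotate, then restore the original strings."""
    encoded_html, char_mapping = encode_strings_as_unicode(source_html, protected_strings)
    annotated = annotate_citations(plain_text, annotations, encoded_html)
    return decode_unicode_to_strings(annotated, char_mapping)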
For anyone who ever digs into this, the underlying issue is the difference between these two possible diffs:
# under the hood when we fail:
>>> import fast_diff_match_patch; fast_diff_match_patch.diff(text, html)
[('=', 5), ('-', 8), ('+', 111), ('=', 11)]
# under the hood when we succeed:
>>> from difflib import SequenceMatcher; SequenceMatcher(a=text, b=html, autojunk=False).get_opcodes()
[('equal', 0, 5, 0, 5), ('insert', 5, 5, 5, 99), ('equal', 5, 8, 99, 102), ('insert', 8, 8, 102, 106), ('equal', 8, 13, 106, 111), ('insert', 13, 13, 111, 116), ('equal', 13, 24, 116, 127)]
It should be possible to pass in strings_to_protect and:
1. use strings_to_protect to manually generate a diff between html-without-strings-to-protect and html-with-strings-to-protect
2. generate a diff between text and html-without-strings-to-protect
3. merge the two diffs into one combined diff
4. use that combined diff in the existing algorithm
DMP is designed for this sort of thing, so that should work fine; it'll just be a little annoying to sort out the code to manually generate the needed diffs and to merge them.
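For anyone sketching that out, here is roughly what the manual diff and the merge could look like. This is only a sketch: it assumes fast_diff_match_patch's default (op, length) tuple output as shown above, assumes each protected string occurs verbatim and in order in the html, and the helper names are made up for illustration.

import fast_diff_match_patch

def strip_diff(html, protected_strings):
    """Build a diff (stripped-html -> html) in which each protected string is its own coherent '+' insert."""
    diff, stripped_parts, pos = [], [], 0
    for s in protected_strings:
        i = html.index(s, pos)  # protected strings are assumed to occur in order
        if i > pos:
            diff.append(('=', i - pos))
            stripped_parts.append(html[pos:i])
        diff.append(('+', len(s)))
        pos = i + len(s)
    if pos < len(html):
        diff.append(('=', len(html) - pos))
        stripped_parts.append(html[pos:])
    return diff, ''.join(stripped_parts)

def merge_diffs(a, b):
    """Compose diff a (text -> stripped-html) with diff b (stripped-html -> html) into one diff (text -> html)."""
    a, b, out = list(a), list(b), []
    while a or b:
        if a and a[0][0] == '-':    # deletions from text don't consume stripped-html
            out.append(a.pop(0))
        elif b and b[0][0] == '+':  # insertions into html don't consume stripped-html
            out.append(b.pop(0))
        else:
            (op_a, n_a), (op_b, n_b) = a[0], b[0]
            n = min(n_a, n_b)       # both heads consume n chars of stripped-html
            if op_a == '=' and op_b == '=':
                out.append(('=', n))
            elif op_a == '=' and op_b == '-':
                out.append(('-', n))
            elif op_a == '+' and op_b == '=':
                out.append(('+', n))
            # op_a == '+' and op_b == '-': inserted then deleted again, emit nothing
            a[0], b[0] = (op_a, n_a - n), (op_b, n_b - n)
            if not a[0][1]:
                a.pop(0)
            if not b[0][1]:
                b.pop(0)
    return out

# rough flow: the combined diff keeps every protected string as a single insert
b_diff, stripped = strip_diff(html, strings_to_protect)
a_diff = fast_diff_match_patch.diff(text, stripped)
combined_diff = merge_diffs(a_diff, b_diff)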
OK, I sort of see what's happening here, but can you say how prevalent this is or what prerequisites it has? I'm trying to judge how much to worry about this vs. how much of a corner case it is.