Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a `beginbfchar` table in a
ToUnicode CMap.

The right way to fix this would be to replace all the string
manipulation with a formal grammar, but I don't have the skill or
capacity to do that right now.
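
For contrast, here is a rough sketch of what matching `<src> <dst>` pairs as tokens could look like, so that an empty dstString needs no special case. This is not what the commit does and it is far short of a real grammar; `BFCHAR_PAIR` and `parse_bfchar_pairs` are hypothetical names, and the pattern only covers the simple hex-pair form of a `beginbfchar` body.

```python
import re
from binascii import unhexlify

# One <src> <dst> pair. src needs at least one hex digit; dst may be empty.
# (Hypothetical pattern, covers only plain hex-string pairs.)
BFCHAR_PAIR = re.compile(rb"<([0-9A-Fa-f\s]+)>\s*<([0-9A-Fa-f\s]*)>")


def parse_bfchar_pairs(body: bytes) -> dict:
    """Map each source code (as int) to its decoded destination string."""
    mapping = {}
    for src, dst in BFCHAR_PAIR.findall(body):
        src_hex = b"".join(src.split())  # drop whitespace inside the brackets
        dst_hex = b"".join(dst.split())
        # unhexlify(b"") is b"", which decodes to "", so <> maps to "".
        mapping[int(src_hex, 16)] = unhexlify(dst_hex).decode(
            "utf-16-be", "surrogatepass"
        )
    return mapping


pairs = parse_bfchar_pairs(b"<0003> <>\n<0041> <0041>")
```

Because the empty destination is matched as a zero-length group, no placeholder is needed in this variant.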

Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a "." character.  When we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".".  Then, when we encounter a ".", we remember that it was
supposed to be an empty string.
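
A minimal standalone sketch of the placeholder trick, assuming a single line of CMap data. The names `EMPTY_PLACEHOLDER`, `normalize_hex_strings` and `decode_dst` are made up for illustration and do not exist in PyPDF2; the real change lives inline in `parse_to_unicode`.

```python
from binascii import unhexlify

# Hypothetical constant: "." is not a hex digit, so it cannot collide
# with the contents of any <...> hex string.
EMPTY_PLACEHOLDER = b"."


def normalize_hex_strings(cmap_line: bytes) -> bytes:
    """Drop the angle brackets, stashing a placeholder for empty strings."""
    parts = cmap_line.split(b"<")
    for i, part in enumerate(parts):
        j = part.find(b">")
        if j >= 0:
            if j == 0:
                content = EMPTY_PLACEHOLDER  # "<>" was an empty hex string
            else:
                content = part[:j].replace(b" ", b"")
            parts[i] = content + b" " + part[j + 1 :]
    return b" ".join(parts)


def decode_dst(token: bytes) -> str:
    """Turn a destination token back into text; the placeholder means ""."""
    if token == EMPTY_PLACEHOLDER:
        return ""
    return unhexlify(token).decode("utf-16-be", "surrogatepass")


tokens = normalize_hex_strings(b"<0003> <>").split()
```

Without the placeholder, the empty string would vanish when the line is split on spaces, and the source/destination pairing would shift by one token.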

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
`PageObject::process_operation` so that it doesn't try to read the
final character from an empty string.
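
The guard can be illustrated in isolation. `needs_space` is a hypothetical helper written only for this sketch; in the commit the check is added inline to the existing condition.

```python
def needs_space(text: str) -> bool:
    """Return True if a separating space should be appended to text."""
    # The length check comes first: `and` short-circuits, so text[-1]
    # is never evaluated for an empty string (it would raise IndexError).
    return len(text) > 0 and text[-1] != " "


def last_char_unguarded(text: str) -> str:
    """The unguarded access that fails on empty input."""
    return text[-1]
```

Calling `last_char_unguarded("")` raises `IndexError`, which is the failure the added `len(text) > 0` condition avoids.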

This is a hackish workaround for py-pdf#1111.
dkg committed Jul 15, 2022
1 parent 9bbe827 commit cba5b04
Showing 2 changed files with 15 additions and 4 deletions.
18 changes: 14 additions & 4 deletions PyPDF2/_cmap.py
@@ -191,7 +191,13 @@ def parse_to_unicode(
     for i in range(len(ll)):
         j = ll[i].find(b">")
         if j >= 0:
-            ll[i] = ll[i][:j].replace(b" ", b"") + b" " + ll[i][j + 1 :]
+            if j == 0:
+                # string is empty: stash a placeholder here (see below)
+                # see https://github.com/py-pdf/PyPDF2/issues/1111
+                content = b"."
+            else:
+                content = ll[i][:j].replace(b" ", b"")
+            ll[i] = content + b" " + ll[i][j + 1 :]
     cm = (
         (b" ".join(ll))
         .replace(b"[", b" [ ")
@@ -246,13 +252,17 @@ def parse_to_unicode(
         lst = [x for x in l.split(b" ") if x]
         map_dict[-1] = len(lst[0]) // 2
         while len(lst) > 1:
+            map_to = ""
+            # placeholder (see above) means empty string
+            if lst[1] != b".":
+                map_to = unhexlify(lst[1]).decode(
+                    "utf-16-be", "surrogatepass"
+                )  # join is here as some cases where the code was split
             map_dict[
                 unhexlify(lst[0]).decode(
                     "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                 )
-            ] = unhexlify(lst[1]).decode(
-                "utf-16-be", "surrogatepass"
-            )  # join is here as some cases where the code was split
+            ] = map_to
             int_entry.append(int(lst[0], 16))
             lst = lst[2:]
 for a, value in map_dict.items():
1 change: 1 addition & 0 deletions PyPDF2/_page.py
@@ -1377,6 +1377,7 @@ def process_operation(operator: bytes, operands: List) -> None:
 if (
     (abs(float(op)) >= _space_width)
     and (abs(float(op)) <= 8 * _space_width)
+    and (len(text) > 0)
     and (text[-1] != " ")
 ):
     process_operation(b"Tj", [" "])
