AssertionError: "a > b" was NOT fulfilled in parse_to_unicode #990

MartinThoma · 2022-06-14T16:13:52Z

When trying to extract the text from a PDF, I get an exception.

Environment

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.2.0

MCVE: Code + PDF

This is a minimal, complete example that shows the issue with 923767.pdf:

from PyPDF2 import PdfReader
reader = PdfReader("923767.pdf")
reader.pages[0].extract_text()

gives

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1124, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 225, in parse_to_unicode
    assert a > b
AssertionError

MartinThoma · 2022-06-14T16:15:07Z

Other examples:

pdf/2498bbc3c849fc85dc76d69690aeb68b.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/925/925036.pdf)
pdf/40527db1dc8c2e72d3992820d8ad6bae.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/914/914094.pdf)
pdf/43156e5b2526ce8a40a89da1a00b4e1f.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/939/939597.pdf)
pdf/609559bf77a0afaae7a99d8d895e7732.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/916/916267.pdf)
pdf/665b80b8bed00e0c37f0a5def835fa2e.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/948/948596.pdf)
pdf/b886c988bb147c4bdf9b2ab954f84e2e.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/986/986335.pdf)
pdf/bd086499b3e44799827206247eccce46.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/960/960844.pdf)
pdf/c33d0604d759e4c9556ac1f327cf278c.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/983/983944.pdf)
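
To check which of these files trigger the error, the MCVE above can be wrapped in a small loop. This is only a sketch: `extract_first_page` and `find_failures` are hypothetical helper names, not part of PyPDF2, and it assumes the listed PDFs have been downloaded to the local paths shown.

```python
from typing import Callable, Dict, Iterable


def extract_first_page(path: str) -> str:
    """Extract text from the first page, mirroring the MCVE."""
    # Imported lazily so find_failures() can be used with another extractor.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return reader.pages[0].extract_text()


def find_failures(
    paths: Iterable[str],
    extract: Callable[[str], str] = extract_first_page,
) -> Dict[str, str]:
    """Return {path: exception type name} for every file whose extraction raises."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            extract(path)
        except Exception as exc:  # broad on purpose: record the failure, keep going
            failures[path] = type(exc).__name__
    return failures
```

Calling `find_failures(["923767.pdf", "pdf/2498bbc3c849fc85dc76d69690aeb68b.pdf"])` would then map each failing file to the exception it raised, which on an affected version should be `AssertionError`.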

pubpub-zz · 2022-06-14T20:43:37Z

The error is in the assert itself. It is an extra check that does not improve performance, so I propose removing it (PR #995).

pubpub-zz · 2022-06-15T18:44:46Z

@MartinThoma
This issue should be closed too

MartinThoma · 2022-06-15T18:59:44Z

Closed by #995 :-)

MartinThoma self-assigned this Jun 14, 2022

MartinThoma closed this as completed Jun 15, 2022
