pypdf crashes when extracting text from pdf #2173

fstark · 2023-09-07T18:28:15Z

I am trying to extract the text from a set of PDFs. pypdf fails on some of them.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-82-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

from pypdf import PdfReader

if __name__ == '__main__':
    pdf = PdfReader("bug.pdf")
    for page_number, page in enumerate(pdf.pages, start=1):
        print( f" {page_number}", end="" )
        text = page.extract_text()

bug.pdf

The page is the first page of this PDF from archive.org: https://archive.org/download/1979-Fall-compute-magazine/Compute_Issue_001_1979_Fall.pdf

Let us know if we may add them to our tests!

Traceback

This is the complete Traceback I see:

 1Traceback (most recent call last):
  File "/home/fred/Development/extractpages/bug.py", line 7, in <module>
    text = page.extract_text()
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 234, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 309, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 340, in parse_bfrange
    a = int(lst[0], 16)
ValueError: invalid literal for int() with base 16: b'\t\t'

fstark · 2023-09-07T18:34:13Z

The cause is that the PDF uses tab characters instead of spaces in some places.
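A minimal sketch of the failure mode, independent of pypdf. The sample `line` is hypothetical (made up for illustration, not taken from bug.pdf); it only assumes that a tab-only chunk ends up separated from the hex values by a space, which is what the `b'\t\t'` in the traceback suggests:

```python
# Hypothetical CMap range line that uses tabs as padding (not from bug.pdf).
line = b"\t\t 0000 005E 0020"

# Splitting on a single space keeps the tab-only chunk as a token.
naive = [x for x in line.split(b" ") if x]

# int() cannot parse a token made only of tabs, so it raises ValueError.
try:
    int(naive[0], 16)
    error = None
except ValueError as exc:
    error = str(exc)

# split() with no argument splits on any run of ASCII whitespace,
# so only the hex values remain.
robust = [int(x, 16) for x in line.split()]

print(naive)
print(error)
print(robust)
```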

The following changes fixed the behavior:

diff --git a/pypdf/_cmap.py b/pypdf/_cmap.py
index 369ab19..01397d3 100644
--- a/pypdf/_cmap.py
+++ b/pypdf/_cmap.py
@@ -295,7 +295,7 @@ def process_cm_line(
     map_dict: Dict[Any, Any],
     int_entry: List[int],
 ) -> Tuple[bool, bool, Union[None, Tuple[int, int]]]:
-    if line in (b"", b" ") or line[0] == 37:  # 37 = %
+    if line in (b"", b" ",b"\t") or line[0] == 37:  # 37 = %
         return process_rg, process_char, multiline_rg
     if b"beginbfrange" in line:
         process_rg = True
@@ -318,7 +318,7 @@ def parse_bfrange(
     int_entry: List[int],
     multiline_rg: Union[None, Tuple[int, int]],
 ) -> Union[None, Tuple[int, int]]:
-    lst = [x for x in line.split(b" ") if x]
+    lst = [x for x in line.split() if x]
     closure_found = False
     if multiline_rg is not None:
         fmt = b"%%0%dX" % (map_dict[-1] * 2)

Two changes:

- added lines consisting of a single tab to the skip list
- split the data on any whitespace, not only on spaces (using split() instead of split(b" "))

Overall, while this code looks fragile (what if a line consists of two spaces, or two tabs?), this fixed every occurrence of the issue I had.
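One possible way to address the fragility mentioned above is to strip the line before testing it, rather than listing individual whitespace-only values. This is only a sketch: `is_skippable` is a hypothetical helper, not a pypdf function, and unlike the original `line[0] == 37` check it also treats a `%` comment preceded by whitespace as skippable.

```python
def is_skippable(line: bytes) -> bool:
    """Hypothetical helper: True for whitespace-only lines and % comments."""
    # bytes.strip() removes leading and trailing ASCII whitespace,
    # so b"", b"  ", b"\t" and b"\t\t" all become b"".
    stripped = line.strip()
    return stripped == b"" or stripped.startswith(b"%")


# The tuple check from the patch does not cover a line made of two tabs.
patched_check = b"\t\t" in (b"", b" ", b"\t")

print(patched_check)
print(is_skippable(b"\t\t"))
print(is_skippable(b"0000 005E 0020"))
```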

pubpub-zz · 2023-09-07T19:22:10Z

Thanks for your detailed analysis.
I had some trouble downloading the file, so I'm attaching its first page here:
1stPage.pdf

closes py-pdf#2173

pubpub-zz · 2023-09-07T19:38:22Z

@MartinThoma
Can you tag #2174 as co-authored with @fstark, so that he is added to the contributors?

Closes #2173

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 7, 2023

BUG: accept tabs in cmaps

fd07c36

closes py-pdf#2173

pubpub-zz mentioned this issue Sep 7, 2023

BUG: Accept tabs in cmaps #2174

Merged

MartinThoma closed this as completed in #2174 Sep 8, 2023

MartinThoma pushed a commit that referenced this issue Sep 8, 2023

BUG: Accept tabs in cmaps (#2174)

ad4f13d

Closes #2173
