pypdf crashes when extracting text from pdf #2173

Closed
fstark opened this issue Sep 7, 2023 · 3 comments · Fixed by #2174

Comments

fstark commented Sep 7, 2023

I am trying to extract the text from a set of PDFs. pypdf crashes on some of them.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-82-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

from pypdf import PdfReader

if __name__ == '__main__':
    pdf = PdfReader("bug.pdf")
    for page_number, page in enumerate(pdf.pages, start=1):
        print(f" {page_number}", end="")
        text = page.extract_text()

bug.pdf

The page is the first page of this PDF from archive.org: https://archive.org/download/1979-Fall-compute-magazine/Compute_Issue_001_1979_Fall.pdf

Let us know if we may add them to our tests!

Traceback

This is the complete Traceback I see:

 1Traceback (most recent call last):
  File "/home/fred/Development/extractpages/bug.py", line 7, in <module>
    text = page.extract_text()
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 234, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 309, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 340, in parse_bfrange
    a = int(lst[0], 16)
ValueError: invalid literal for int() with base 16: b'\t\t'
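
When running over a whole set of PDFs, it can help to record which pages fail instead of stopping at the first crash. The following is only a sketch: `extract_pages_tolerantly` is a hypothetical helper, not part of pypdf, and it assumes the failure surfaces as a `ValueError`, as in the traceback above. It accepts any iterable of objects that have an `extract_text()` method, such as `PdfReader("bug.pdf").pages`.

```python
from typing import Any, Iterable, List, Optional, Tuple

# One result per page: (page number, extracted text or None, error or None).
PageResult = Tuple[int, Optional[str], Optional[Exception]]


def extract_pages_tolerantly(pages: Iterable[Any]) -> List[PageResult]:
    """Hypothetical helper: extract text page by page and keep going on errors.

    Assumes the parsing failure is raised as ValueError, as in the traceback.
    """
    results: List[PageResult] = []
    for page_number, page in enumerate(pages, start=1):
        try:
            results.append((page_number, page.extract_text(), None))
        except ValueError as exc:
            # Record the failing page so it can be reported or retried later.
            results.append((page_number, None, exc))
    return results
```

With pypdf installed, a call such as `extract_pages_tolerantly(PdfReader("bug.pdf").pages)` would then list the failing page numbers alongside the error messages.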

fstark commented Sep 7, 2023

The cause is that the PDF uses tab characters instead of spaces in some places, so a token made only of tabs (b'\t\t') reaches int() in parse_bfrange.

The following changes fixed the behavior:

diff --git a/pypdf/_cmap.py b/pypdf/_cmap.py
index 369ab19..01397d3 100644
--- a/pypdf/_cmap.py
+++ b/pypdf/_cmap.py
@@ -295,7 +295,7 @@ def process_cm_line(
     map_dict: Dict[Any, Any],
     int_entry: List[int],
 ) -> Tuple[bool, bool, Union[None, Tuple[int, int]]]:
-    if line in (b"", b" ") or line[0] == 37:  # 37 = %
+    if line in (b"", b" ",b"\t") or line[0] == 37:  # 37 = %
         return process_rg, process_char, multiline_rg
     if b"beginbfrange" in line:
         process_rg = True
@@ -318,7 +318,7 @@ def parse_bfrange(
     int_entry: List[int],
     multiline_rg: Union[None, Tuple[int, int]],
 ) -> Union[None, Tuple[int, int]]:
-    lst = [x for x in line.split(b" ") if x]
+    lst = [x for x in line.split() if x]
     closure_found = False
     if multiline_rg is not None:
         fmt = b"%%0%dX" % (map_dict[-1] * 2)

Two changes:

  • added tab-only lines (b"\t") to the skip list
  • split the data on any whitespace, not only on spaces (using split() instead of split(b" "))
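
The effect of the second change can be reproduced in isolation. This is a minimal sketch: the `line` value is a made-up example with leading tabs, not a line taken from the actual PDF, and the two helper functions are hypothetical names that only mirror the old and new list comprehensions from the diff.

```python
from typing import List

# Hypothetical range line with leading tabs (not copied from the real file).
line = b"\t\t 0003 0003 0020"


def tokens_split_on_space(data: bytes) -> List[bytes]:
    """Old behaviour: split on single spaces only, dropping empty tokens."""
    return [x for x in data.split(b" ") if x]


def tokens_split_on_whitespace(data: bytes) -> List[bytes]:
    """New behaviour: split on any run of ASCII whitespace (spaces, tabs, ...)."""
    return [x for x in data.split() if x]


old_tokens = tokens_split_on_space(line)
new_tokens = tokens_split_on_whitespace(line)

try:
    int(old_tokens[0], 16)  # the first token is b"\t\t", which is not a hex number
    old_parse_failed = False
except ValueError:
    old_parse_failed = True

# With whitespace splitting, every token is a valid hexadecimal literal.
new_values = [int(token, 16) for token in new_tokens]
```

With the old splitting, the tab-only token survives the `if x` filter because it is not empty, which matches the `b'\t\t'` reported in the traceback.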

Overall, while this code still looks fragile (what if a line consists of two spaces, or two tabs?), it fixed every occurrence of the issue I encountered.
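
One way to cover the two-spaces and two-tabs cases would be to test whether the line is empty after stripping whitespace, rather than comparing it against a fixed list. This is a sketch of that idea only, not the change that was merged: `is_skippable` is a hypothetical name, and the comment check keeps the original rule that `%` must be the first byte of the line.

```python
def is_skippable(line: bytes) -> bool:
    """Hypothetical check: True for blank or whitespace-only lines and % comments."""
    # bytes.strip() with no argument removes leading and trailing ASCII whitespace,
    # so any mix of spaces and tabs collapses to an empty bytes object.
    if line.strip() == b"":
        return True
    # Same rule as the original `line[0] == 37`: a comment starts with "%".
    return line.startswith(b"%")
```

For example, `is_skippable(b"\t\t")` and `is_skippable(b"  ")` are both true, while a line containing hex tokens is not skipped.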


pubpub-zz commented Sep 7, 2023

Thanks for your detailed analysis.
I had some trouble downloading the file, so I'm attaching its first page here:
1stPage.pdf

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 7, 2023
pubpub-zz commented

@MartinThoma
Can you tag #2174 as co-authored with @fstark, so that they are added to the contributors?

MartinThoma pushed a commit that referenced this issue Sep 8, 2023