pypdf crashes when extracting text from pdf #2173
Comments
The reason is the use of tab characters instead of spaces in some places in the PDF. The following changes fixed the behavior:
diff --git a/pypdf/_cmap.py b/pypdf/_cmap.py
index 369ab19..01397d3 100644
--- a/pypdf/_cmap.py
+++ b/pypdf/_cmap.py
@@ -295,7 +295,7 @@ def process_cm_line(
     map_dict: Dict[Any, Any],
     int_entry: List[int],
 ) -> Tuple[bool, bool, Union[None, Tuple[int, int]]]:
-    if line in (b"", b" ") or line[0] == 37:  # 37 = %
+    if line in (b"", b" ",b"\t") or line[0] == 37:  # 37 = %
         return process_rg, process_char, multiline_rg
     if b"beginbfrange" in line:
         process_rg = True
@@ -318,7 +318,7 @@ def parse_bfrange(
     int_entry: List[int],
     multiline_rg: Union[None, Tuple[int, int]],
 ) -> Union[None, Tuple[int, int]]:
-    lst = [x for x in line.split(b" ") if x]
+    lst = [x for x in line.split() if x]
     closure_found = False
     if multiline_rg is not None:
         fmt = b"%%0%dX" % (map_dict[-1] * 2)
2 changes:
Overall, while this code still looks fragile (what if a line consists of two spaces, or two tabs?), these changes fixed every occurrence of the issue I had.
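For the "two spaces / two tabs" concern, here is a minimal sketch of a whitespace-agnostic check. The helper names `is_skippable` and `split_tokens` are hypothetical and are not part of pypdf; the sketch only shows how `bytes.strip()` and `bytes.split()` with no argument treat tabs and runs of spaces.

```python
from typing import List


def is_skippable(line: bytes) -> bool:
    """Hypothetical helper: True for empty or whitespace-only lines and for % comments."""
    # strip() with no argument removes spaces, tabs and other ASCII whitespace,
    # so b"  " and b"\t\t" are both treated as empty.
    return line.strip() == b"" or line[:1] == b"%"


def split_tokens(line: bytes) -> List[bytes]:
    """Hypothetical helper: split on any run of ASCII whitespace."""
    # split() with no argument never yields empty tokens, unlike split(b" ").
    return line.split()


tab_line = b"<0000>\t<005E>\t<0020>"
print(tab_line.split(b" "))    # the whole line stays a single token
print(split_tokens(tab_line))  # three separate tokens
```

Using `line[:1]` rather than `line[0]` keeps the comment check safe on an empty line without relying on the order of the two conditions.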
Thanks for your detailed analysis.
@MartinThoma
I am trying to extract the text from a set of PDFs. pypdf fails on some of them.
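One way to find which files in such a set fail is sketched below. It assumes pypdf is installed and uses its `PdfReader` and `extract_text()`; the function name `find_failing_pdfs` is hypothetical and not something from this report.

```python
import traceback
from typing import Dict, Iterable


def find_failing_pdfs(paths: Iterable[str]) -> Dict[str, str]:
    """Return {path: traceback text} for every file whose text extraction raises."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            # Imported lazily so that a missing pypdf install is recorded
            # like any other failure instead of aborting the whole run.
            from pypdf import PdfReader

            reader = PdfReader(path)
            for page in reader.pages:
                page.extract_text()
        except Exception:
            failures[path] = traceback.format_exc()
    return failures
```

The collected traceback text can then be pasted into a report for each failing file.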
Environment
Which environment were you using when you encountered the problem?
pypdf==3.15.5, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
bug.pdf
The page is the first page of this PDF from archive.org: https://archive.org/download/1979-Fall-compute-magazine/Compute_Issue_001_1979_Fall.pdf
Let us know if we may add them to our tests!
Traceback
This is the complete Traceback I see: