Not able to deal with errors in the bookmark structure #2236

Closed
PAlvesLancs opened this issue Oct 3, 2023 · 1 comment · Fixed by #2237

Comments

@PAlvesLancs

I am using the code below (taken from https://stackoverflow.com/questions/54303318/read-all-bookmarks-from-a-pdf-document-and-create-a-dictionary-with-pagenumber-a) as a starting point, and it crashes on several PDFs (see an example here: https://easyupload.io/7fsipz).

Apparently, the PDF itself has some structural errors, but pypdf is not able to ignore them.
The output:

"( ValueError: not enough values to unpack (expected 3, got 1)"
C:\Users\XXXXX\PycharmProjects\pythonProject\venv\Scripts\python.exe "C:\Google Drive\python\projects\Get bookmarks.py"
Traceback (most recent call last):
  File "C:\Google Drive\python\projects\Get bookmarks.py", line 24, in <module>
    bms = bookmark_dict(reader.outline, use_labels=False)
  File "C:\Users\XXXXX\PycharmProjects\pythonProject\venv\lib\site-packages\pypdf\_reader.py", line 844, in outline
    return self._get_outline()
  File "C:\Users\XXXXX\PycharmProjects\pythonProject\venv\lib\site-packages\pypdf\_reader.py", line 880, in _get_outline
    outline_obj = self._build_outline_item(node)
  File "C:\Users\XXXXX\PycharmProjects\pythonProject\venv\lib\site-packages\pypdf\_reader.py", line 1054, in _build_outline_item
    outline_item = self._build_destination(title, dest)
  File "C:\Users\XXXXX\PycharmProjects\pythonProject\venv\lib\site-packages\pypdf\_reader.py", line 1018, in _build_destination
    return Destination(title, page, Fit(fit_type=typ, fit_args=array)) # type: ignore
  File "C:\Users\XXXXX\PycharmProjects\pythonProject\venv\lib\site-packages\pypdf\generic\_data_structures.py", line 1495, in __init__
    (
ValueError: not enough values to unpack (expected 3, got 2)
Process finished with exit code 1

The code (taken directly from the thread mentioned above):

from typing import Dict, Union
from pypdf import PdfReader

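def bookmark_dict(
        bookmark_list, use_labels: bool = False
) -> Dict[Union[str, int], str]:
    # Map each bookmark's page (label or zero-based index) to its title.
    # Relies on the module-level `reader` created in the __main__ block.
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # Nested lists hold child bookmarks; pass use_labels down so the
            # children use the same key type as their parents.
            result.update(bookmark_dict(item, use_labels))
        else:
            page_index = reader.get_destination_page_number(item)
            if use_labels:
                result[reader.page_labels[page_index]] = item.title
            else:
                result[page_index] = item.title
    return result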

if __name__ == "__main__":
    folder = "x:\\"
    file = "TestPDF.pdf"
    reader = PdfReader(folder + file)
    bms = bookmark_dict(reader.outline, use_labels=False)
    for page_nb, title in sorted(bms.items(), key=lambda n: f"{str(n[0]):>5}"):
        print(f"{page_nb:>3}: {title}")

The PDF file that is giving me an error can be found here:

Thanks guys!

@pubpub-zz
Collaborator

The PDF has outlines in which the /XYZ destination has no top parameter. This does not conform to the PDF reference, but Acrobat Reader can still process them. I pushed the test further by also removing the left parameter, and the test is still good. I've added robustness handling for this case too.
Cleaned test file below:
tt1.pdf
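
For readers who want to see the failure mode in isolation, here is a minimal sketch in plain Python. It is not pypdf's code and not the change made in #2237: the names `strict_xyz` and `tolerant_xyz` are hypothetical, and it assumes that the missing values are simply absent from the end of the argument list, which may not match every malformed file.

```python
from typing import Optional, Sequence, Tuple

Number = Optional[float]


def strict_xyz(args: Sequence[Number]) -> Tuple[Number, Number, Number]:
    # Strict unpacking: raises ValueError unless exactly three values are present.
    left, top, zoom = args
    return left, top, zoom


def tolerant_xyz(args: Sequence[Number]) -> Tuple[Number, Number, Number]:
    # Assumption for this sketch: missing values are trailing ones.
    # Pad them with None and ignore anything beyond the third value.
    head = list(args[:3])
    left, top, zoom = head + [None] * (3 - len(head))
    return left, top, zoom


if __name__ == "__main__":
    broken = [0.0, 1.0]  # hypothetical /XYZ arguments with one value missing
    try:
        strict_xyz(broken)
    except ValueError as exc:
        print(f"strict: {exc}")
    print(f"tolerant: {tolerant_xyz(broken)}")
```

The strict version fails the same way as the last frame of the traceback above, while the tolerant version treats a missing value as "unspecified", which is roughly how a lenient viewer would have to behave to open such files.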
