PdfReader method `_get_outlines()` can produce outline items with incorrect "/Title" #1121

mtd91429 · 2022-07-17T00:24:49Z

When reading outlines from a PDF whose bookmarks were copied, pasted, and then re-titled, some of the returned outline items carry the wrong title.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.5.0 (commit ed5ecd9)
Python 3.10

Code + PDF

This is a minimal, complete example that shows the issue:

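from PyPDF2 import PdfReader

reader = PdfReader("mistitled_outlines_example.pdf")


def show_tree(outlines, indent=0):
    """Print outline titles, indenting each nested level by four spaces."""
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding item.
            show_tree(item, indent + 4)
        else:
            print(f'{" " * indent}{item.title}')


show_tree(reader.outlines)

Actual output: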

Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
        Twenty-third
        Twenty-fourth
    Twenty-fifth
        Twenty-sixth
        Twenty-seventh
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
Twenty-fourth
    Twenty-fifth
    Twenty-sixth
Twenty-seventh
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
    Twenty-fourth
    Twenty-fifth
    Twenty-sixth
    Twenty-seventh

The expected output for this file is:

First
    Second
    Third
    Fourth
        Fifth
        Sixth
    Seventh
        Eighth
        Ninth
Tenth
    Eleventh
    Twelfth
    Thirteenth
    Fourteenth
Fifteenth
    Sixteenth
    Seventeenth
Eighteenth
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
    Twenty-fourth
    Twenty-fifth
    Twenty-sixth
    Twenty-seventh

Here is the PDF:
mistitled_outlines_example.pdf

Here is a screenshot of the outline in Adobe Acrobat Reader:

MartinThoma · 2022-07-17T07:37:52Z

Thank you for the good example! I've added it to the unit tests. If anybody knows how to fix this, it's now easy to test it :-)
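A regression check for this report could compare the outline tree against the expected titles as flat `(depth, title)` pairs. The following is only a sketch, assuming the nested-list shape of `reader.outlines` shown in the report; `flatten_titles` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real outline items, not PyPDF2 classes.

```python
from types import SimpleNamespace


def flatten_titles(outlines, depth=0):
    """Return (depth, title) pairs for a nested outline list, in document order."""
    pairs = []
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the preceding item.
            pairs.extend(flatten_titles(item, depth + 1))
        else:
            pairs.append((depth, item.title))
    return pairs


# Hypothetical stand-in for the first few entries of reader.outlines.
sample = [
    SimpleNamespace(title="First"),
    [
        SimpleNamespace(title="Second"),
        SimpleNamespace(title="Third"),
        SimpleNamespace(title="Fourth"),
        [SimpleNamespace(title="Fifth"), SimpleNamespace(title="Sixth")],
    ],
]

expected = [
    (0, "First"),
    (1, "Second"),
    (1, "Third"),
    (1, "Fourth"),
    (2, "Fifth"),
    (2, "Sixth"),
]
assert flatten_titles(sample) == expected
```

Comparing flat pairs rather than printed text keeps the check independent of indentation width.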

mtd91429 · 2022-07-17T18:45:48Z

I think this has to do with how Python handles object references and the fact that several outline entries reuse the same named destination. Specifically, the "First" outline entry points to the named destination "section.1", as do the "Tenth" and "Nineteenth" entries; "Second", "Eleventh", and "Twentieth" all point to "section.2".

I've been stepping through the code to determine where the error is introduced, and I think it occurs at https://github.com/py-pdf/PyPDF2/blob/ae0ff49058e6c57a8edcfcd3d956665ddaa8a787/PyPDF2/_reader.py#L837

I think I have a fix and have opened pull request #1128.
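The suspected mechanism can be illustrated without PyPDF2. The sketch below uses plain dicts as hypothetical stand-ins for named destinations; it is not the library's code, and `make_named_dests`, `build_shared`, and `build_copied` are invented names. It assumes only what is described above: several outline entries resolve to the same named destination, and the title is written onto the object that the lookup returns.

```python
def make_named_dests():
    # Hypothetical stand-in: one mutable object per named destination.
    return {"section.1": {"/Page": 0}, "section.2": {"/Page": 1}}


# (title, named destination) pairs in document order.
entries = [
    ("First", "section.1"),
    ("Second", "section.2"),
    ("Tenth", "section.1"),
    ("Eleventh", "section.2"),
    ("Nineteenth", "section.1"),
    ("Twentieth", "section.2"),
]


def build_shared(entries, named_dests):
    """Reuse the looked-up object, so a later title overwrites earlier ones."""
    items = []
    for title, name in entries:
        item = named_dests[name]  # same object every time this name appears
        item["/Title"] = title    # also changes items already in the list
        items.append(item)
    return items


def build_copied(entries, named_dests):
    """Create a fresh object per outline entry, so titles stay independent."""
    items = []
    for title, name in entries:
        item = dict(named_dests[name])  # shallow copy of the destination
        item["/Title"] = title
        items.append(item)
    return items


shared_titles = [i["/Title"] for i in build_shared(entries, make_named_dests())]
copied_titles = [i["/Title"] for i in build_copied(entries, make_named_dests())]

print(shared_titles)
# ['Nineteenth', 'Twentieth', 'Nineteenth', 'Twentieth', 'Nineteenth', 'Twentieth']
print(copied_titles)
# ['First', 'Second', 'Tenth', 'Eleventh', 'Nineteenth', 'Twentieth']
```

In the shared variant the last title written for each destination wins, which matches the repeated "Nineteenth", "Twentieth", ... titles in the actual output.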

MartinThoma added the is-bug and Has MCVE labels Jul 17, 2022

MartinThoma added a commit to py-pdf/sample-files that referenced this issue Jul 17, 2022

ENH: Add mistitled-outlines example

2ce1098

This is in the context of py-pdf/pypdf#1121 Co-authored-by: mtd91429 <[email protected]>

MartinThoma added a commit to py-pdf/sample-files that referenced this issue Jul 17, 2022

ENH: Add mistitled-outlines example (#12)

6231cd5

This is in the context of py-pdf/pypdf#1121 Co-authored-by: mtd91429 <[email protected]>

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add MCVE of #1121

5037a67

MartinThoma mentioned this issue Jul 17, 2022

TST: Add MCVE showing outline title issue #1123

Merged

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add MCVE of #1121

81e4da1

MartinThoma added a commit that referenced this issue Jul 17, 2022

TST: Add MCVE showing outline title issue (#1123)

5ddf4cb

See #1121

MartinThoma added the MCVE in Tests label and removed the Has MCVE label Jul 17, 2022

mtd91429 mentioned this issue Jul 17, 2022

use build_destination for named destination outlines #1128

Merged

MartinThoma closed this as completed in 7fba86b Jul 17, 2022

mtd91429 mentioned this issue Jul 18, 2022

getOutlines() returns repeated items #381

Closed
