pdf.getDocumentInfo().title sometimes None #511

Closed
clach04 opened this issue Aug 9, 2019 · 11 comments · Fixed by #744
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-robustness-issue: From a user's perspective, this is about robustness

Comments

@clach04
Contributor

clach04 commented Aug 9, 2019

Test case in #511 (comment). The original test case is no longer available; its details were in https://github.com/mstamy2/PyPDF3/issues/13.

pdf = PdfFileReader(f)
info = pdf.getDocumentInfo()
info.title  # reports None
info['/Title']  # works
@clach04
Contributor Author

clach04 commented Apr 8, 2022

@MartinThoma is this now resolved? Or can it not be reproduced because the other project appears to have been removed? I might have the original bug script around if that would be helpful.

@MartinThoma
Member

#13 is the issue you've linked to. We moved PyPDF2 from mstamy2 -> me (MartinThoma) -> py-pdf (a Github Organization) this week.

As PyPDF2 has been inactive for a long time, I need to clean up a lot of things (PRs, issues, code, documentation, ...).

I'm sorry that I didn't add more details. Could you please check if the issue still persists? If so, please create an MCVE (including a PDF): https://stackoverflow.com/help/minimal-reproducible-example . Then I'll re-open :-)

@clach04
Contributor Author

clach04 commented Apr 9, 2022

That's great news @MartinThoma! Thanks for the quick status update. How about any future issues get closed with a link to #657 as an explanation?

Back to this specific issue, this issue has nothing to do with #13, it's specifically about the linked issue to a repo that no longer exists. I have the test code in claird#59, but the test PDF is only in the deleted repo (it was about 5 MB). @mstamy2 is the original owner; I noticed he responded to #657, so I'm pinging him in case the repo is simply private rather than deleted, as that's where I put all my notes. @mstamy2, do you still have access to https://github.com/mstamy2/PyPDF3/issues/13?

I can try and check an old hard drive to see if I still have it but I won't have access to it for a while :-(

@clach04
Contributor Author

clach04 commented Apr 9, 2022

New test case created based on original :-)

Overview

I've seen a number of PDF files where the title attribute/property is reported as None, yet accessing /Title directly returns content. I've no idea if this is a problem with the PDF(s) or with PyPDF. There is a workaround (which may indicate a potential change to PyPDF, but I'm unclear on what the correct thing to do here is).

The attached PDF is about 5 MB and is a sample of a document that exhibits this behavior. I did not create it (nor do I know how it was created), so the only information we have is the metadata inside.

Test case, along with workaround below:

Test PDF file

title_bug.pdf

Test case

EDIT: the inline version prints the PyPDF version (the attached version does not).

inline and attached (rename to .py)
pdf_title_bug.txt

#!/usr/bin/env python
# -*- coding: windows-1252 -*-
# vim:ts=4:sw=4:softtabstop=4:smarttab:expandtab
#

import os
import sys

ver_to_test = 2  # set to 3 or 4 to test PyPDF3 or PyPDF4 instead

if ver_to_test == 4:
    from pypdf import PdfFileReader  # https://github.com/claird/PyPDF4
    import pypdf as pypdf_lib
elif ver_to_test == 3:
    from PyPDF3 import PdfFileReader  # https://github.com/mstamy2/PyPDF3
    import PyPDF3 as pypdf_lib
else:
    from PyPDF2 import PdfFileReader  # https://github.com/py-pdf/PyPDF2 - nee https://github.com/mstamy2/PyPDF2 / https://pythonhosted.org/PyPDF2/
    import PyPDF2 as pypdf_lib


print('Python %s on %s' % (sys.version, sys.platform))
# pypdf_lib is now bound for every version; fall back if a fork lacks __version__
print(getattr(pypdf_lib, '__version__', 'unknown version'))

filename = 'title_bug.pdf'
f = open(filename, 'rb')
pdf = PdfFileReader(f)
info = pdf.documentInfo
#print(info)
print('title attribute %r' % info.title)  # reports None
print('title getText() %r' % info.getText("/Title"))  # this is what .title property calls
print('title get() %r' % info.get("/Title"))  # this is part of what dict[] does
print('title get().getObject() %r' % info.get("/Title").getObject())  # this is what dict[] does
print('/Title dict entry %r' % info['/Title'])  # with test pdf works
print('title attribute %r' % info.title)  # Sanity check it is still None
print('title Workaround %r' % (info.title or info['/Title']))  # Workaround


f.close()

output

Python 2

Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32
1.27.2
title attribute None
title getText() None
title get() IndirectObject(305, 0)
title get().getObject() u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
/Title dict entry u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
title attribute None
title Workaround u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'

Python 3

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
1.27.2
title attribute None
title getText() None
title get() IndirectObject(305, 0)
title get().getObject() 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
/Title dict entry 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
title attribute None
title Workaround 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
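
The output above suggests a mechanism: `get("/Title")` hands back an unresolved `IndirectObject`, `getText()` returns None for it, and only `getObject()` (or `info['/Title']`) yields the string. Below is a minimal, self-contained sketch of that pattern. It does not use PyPDF2 at all; `TextString`, `IndirectRef` and `Info` are hypothetical stand-ins, and the type check inside `getText` is an assumption inferred from the output, not a quote of the library's code.

```python
class TextString(str):
    """Stand-in for a decoded PDF text string (hypothetical)."""


class IndirectRef:
    """Stand-in for an indirect reference such as IndirectObject(305, 0)."""

    def __init__(self, target):
        self._target = target

    def getObject(self):
        # Resolve the reference to the object it points at
        return self._target


class Info(dict):
    """Stand-in for the document information dictionary (hypothetical)."""

    def getText(self, key):
        # Assumed behaviour: the raw entry is type-checked without being
        # resolved first, so an indirect reference falls through to None.
        value = self.get(key)
        return value if isinstance(value, TextString) else None

    @property
    def title(self):
        return self.getText("/Title")

    def __getitem__(self, key):
        # Item access resolves indirect references, like get().getObject()
        value = dict.__getitem__(self, key)
        return value.getObject() if isinstance(value, IndirectRef) else value


direct = Info({"/Title": TextString("Direct title")})
indirect = Info({"/Title": IndirectRef(TextString("Indirect title"))})

print(direct.title)        # Direct title
print(indirect.title)      # None
print(indirect["/Title"])  # Indirect title
```

If this reading is right, the title is only lost when the PDF stores /Title as an indirect reference rather than as a direct string, which would explain why only some files are affected.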

@clach04
Contributor Author

clach04 commented Apr 9, 2022

@MartinThoma hopefully this helps. I really know nothing about PDF internals which is why I've not attempted a fix :-( I have a workaround that seems to be effective but not sure if it is reasonable.

If you need anything else from me on this please ping me (I recall I had other PDFs with similar behaviors, this was one of the smaller ones).

Thanks for picking up the torch on this and trying to organize collaboration

@MartinThoma
Member

this issue has nothing to do with #13, it's specifically about the linked issue to a repo that no longer exists

The repo was moved; issue 13 from the linked repo is #13 here.

@MartinThoma MartinThoma reopened this Apr 9, 2022
@MartinThoma
Member

Thank you very much! I now hope that somebody will pick it up and dig into it :-)

MartinThoma added the labels is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and Has MCVE (a minimal, complete and verifiable example helps a lot to debug / understand feature requests) Apr 9, 2022
@clach04
Contributor Author

clach04 commented Apr 9, 2022

updated test case with PyPDF2 version

@clach04
Contributor Author

clach04 commented Apr 14, 2022

Located another PDF, appears to be created with the same PDF generator.

Adobe Reader reports: PDF Version 1.3 (Acrobat 4.x). InfoKey: Producer, InfoValue: macOS Version 10.10 Quartz PDFContext

This file is much smaller than the attached test PDF.

I also found some Microsoft Word generated PDFs but when I attempted to export/create PDFs from recent Word, the title worked fine.

@clach04
Contributor Author

clach04 commented Apr 14, 2022

@MartinThoma

this issue has nothing to do with #13, it's specifically about the linked issue to a repo that no longer exists

The repo was moved; issue 13 from the linked repo is #13 here.

This is not correct. I've copy/pasted the one email I have where someone posted to the issue I created (note: an issue, not a PR, in a completely different repo):

From: johns1c [email protected]
Sent: Wednesday, May 6, 2020 5:45 PM
To: mstamy2/PyPDF3 [email protected]
Cc: Chris Clark ; Author [email protected]
Subject: Re: [mstamy2/PyPDF3] pdf.getDocumentInfo().title sometimes None (#13)

Are you running Python 3 by any chance

You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.

clach04 added a commit to clach04/PyPDF2 that referenced this issue Apr 14, 2022
@MartinThoma MartinThoma linked a pull request Apr 14, 2022 that will close this issue
clach04 added a commit to clach04/PyPDF2 that referenced this issue Apr 15, 2022
Handle case when title really is None
@MartinThoma
Member

There is something really weird about that PDF:

  1. When you upload it to http://pdf-analyser.edpsciences.org/result/1e54b64d it also gives no title
  2. mutool clean -d title_bug.pdf title_bug.txt (from mupdf-tools) seems to get caught in an infinite loop
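
When a tool may hang on a file like this, it can help to run it under a time limit so the check terminates either way. A minimal sketch using only the standard library follows; `run_with_timeout` is a hypothetical helper name, and the commented-out call assumes `mutool` is installed and on the PATH.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed after `seconds`."""
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point
        return None
    return completed.returncode


# Example with the command from the list above (not executed here):
# run_with_timeout(["mutool", "clean", "-d", "title_bug.pdf", "title_bug.txt"], 60)

# Harmless demonstration that does not need mutool:
print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

A None result only says the command did not finish within the limit; it does not by itself prove an infinite loop.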

MartinThoma added the is-robustness-issue label (from a user's perspective, this is about robustness) and removed the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Apr 15, 2022
MartinThoma pushed a commit that referenced this issue Apr 15, 2022
2 participants