ExtractText yields nothing for apparently good PDF #168

chrisinmtown · 2015-01-08T16:10:22Z

PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
http://emma.msrb.org/EP295293-EP10300-EP632440.pdf

The file seems well-formed to me; both Acrobat and evince display it nicely. The linux utility pdftotext converts it to text and I see the expected content just fine.

Here's the relevant bit of my little script:

    from PyPDF2 import PdfFileReader

    with open(filename, "rb") as pdf_file:
        try:
            pdf_obj = PdfFileReader(pdf_file)
            # gather properties
            prop_en = pdf_obj.getIsEncrypted()
            err = ""
            if not prop_en:
                # Look for any text on the first N pages
                prop_img = True
                prop_pg = pdf_obj.getNumPages()
                for i in range(min(prop_pg, 3)):
                    pagei = pdf_obj.getPage(i)
                    pageitext = pagei.extractText()
                    # Set property and stop searching at first text found
                    if len(pageitext) > 0:
                        prop_img = False
                        break
        except Exception as e:
            # handler was not shown in the original post; assumed to record the error
            err = str(e)

Is there a gotcha here that I'm missing? Pls advise, thanks in advance for help.

chrisinmtown · 2015-01-14T14:23:48Z

I would like to mention that I have many unprotected, machine-searchable (i.e., non-image) PDF files like this - I just posted one link. Unlike the last issue I opened about a freak PDF with a botched header, in this case PyPDF2 fails to get text from annoyingly many of the files I'm trying to process. Thanks for listening.

zevaverbach · 2015-01-23T23:09:06Z

@chrisinmtown I ran into a similar issue today; my PDF and yours are "page extraction: not allowed" according to Adobe Reader. :(

chrisinmtown · 2015-01-24T13:41:28Z

@zevav thanks for the comment but please let's not confuse issues.

Protected files are a whole different ball of wax and I don't expect PyPDF2 to extract anything from such files given no password.

The link I provided above yields a PDF that is not password protected. On this document Adobe Acrobat makes no complaint about extracting text, it happily saves-as plain text and the result is totally usable.

zevaverbach · 2015-01-25T14:47:49Z

@chrisinmtown hey, sorry to send you in the wrong direction; in Acrobat on my machine that document does show (document info) as "page extraction not allowed."

chrisinmtown · 2015-01-25T15:33:35Z

Thanks for clarifying. Now I'm concerned, I don't want to waste anyone's time here on non-issues!

I am using Adobe Acrobat XI on Win7_x64. With this document open in Acrobat I pick File -> Properties, switch to the Security tab of the Document Properties dialog, and there I read "Security Method: No Security", and under the restrictions everything is allowed (Printing, Changing, Copying ...).

Could there possibly be a difference in behavior between Reader and Acrobat on this document?

zevaverbach · 2015-01-25T15:46:07Z

Okay, my bad: I wrote "Acrobat" in my second comment, but I meant "Reader." Here's a screenshot of your file's info in that, on OS X.10, Reader 11.0.10.

chrisinmtown · 2015-01-25T16:01:20Z

I see the exact same thing in the Win7 version of Acrobat Reader XI: Document Assembly and Page Extract Not Allowed; all the rest (Content Copying ..) are Allowed. FWIW, PyPDF2 declares this document unprotected.

I'm starting to think the Properties window is reflecting features of Acrobat Reader rather than the document, do you agree? In my tests of Reader on other PDF documents, it invariably declares "Page Extract Not Allowed". Reader by definition cannot extract pages, right?

Just to be clear, I am sticking to my position :) that the original document is a valid PDF, unprotected, with text content, and I really would like PyPDF2 to be extended so it can handle this doc.

zevaverbach · 2015-01-25T16:38:02Z

Dang, you're right! I didn't think to check a PDF that I know PyPDF2 can extract the text of; Reader does indeed show that property for all PDFs. :(

What method in PyPDF2 tells you whether or not a document is protected?

chrisinmtown · 2015-01-25T16:45:26Z

The relevant method on PdfFileReader is getIsEncrypted()
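
For readers following along, here is a minimal sketch of that check against the PyPDF2 1.x API used in this thread. The helper names `is_protected` and `check_file` are hypothetical and only for illustration; the only library calls assumed are `PdfFileReader(...)` and `getIsEncrypted()`, both mentioned above.

```python
def is_protected(reader):
    """Return True if the reader reports the document as encrypted.

    `reader` is expected to be a PyPDF2 1.x PdfFileReader, or any object
    exposing getIsEncrypted(). The helper name is hypothetical.
    """
    return bool(reader.getIsEncrypted())


def check_file(path):
    """Open `path` and report whether PyPDF2 considers it encrypted."""
    # Imported lazily so the helper above can be exercised without PyPDF2.
    from PyPDF2 import PdfFileReader  # PyPDF2 1.x API, as used in this thread

    with open(path, "rb") as pdf_file:
        return is_protected(PdfFileReader(pdf_file))
```

Note that this only reports encryption as PyPDF2 sees it; it says nothing about the permission flags that Reader lists in its Properties window.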

Rob1080 · 2016-02-20T20:45:46Z

I realise this is an old post, did you ever find the reason for text not being extracted?

droid-surbhi · 2018-05-04T08:23:54Z

Facing same problem. PyPDF2 version 1.26

MartinThoma · 2022-06-06T12:29:56Z

Sadly, the PDF mentioned above is no longer reachable. I think that #924 fixed the issue and hence I am closing this issue.

It might also be a duplicate of the underlying cause of #242.

If you face the same issue, please open a new bug ticket and upload a PDF with the issue (to which you must have the copyright).
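
Before opening a new ticket, it may help to re-run the original check against a newer release. The following is a sketch only, assuming PyPDF2 2.1.0 or later with the `PdfReader`, `reader.pages`, `reader.is_encrypted`, and `page.extract_text()` names; `first_text_page` and `check_pdf` are hypothetical helper names, and the 3-page limit mirrors the script at the top of this thread.

```python
def first_text_page(reader, max_pages=3):
    """Return the index of the first page with non-whitespace text.

    Only the first `max_pages` pages are inspected; None means that no
    text was found on any of them.
    """
    for index, page in enumerate(reader.pages):
        if index >= max_pages:
            break
        text = page.extract_text() or ""
        if text.strip():
            return index
    return None


def check_pdf(path, max_pages=3):
    """Return the first page index with text, or None (also if encrypted)."""
    # Imported lazily so first_text_page can be tested without PyPDF2.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.1.0

    reader = PdfReader(path)
    if reader.is_encrypted:
        return None
    return first_text_page(reader, max_pages)
```

Unlike the original script, this treats whitespace-only output as "no text", which is a design choice rather than something the thread prescribes.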

The highlight of the 2.1.0 release is the most massive improvement to the text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes to [pubpub-zz](https://github.com/pubpub-zz) who took a lot of time and knowledge about the PDF format to finally get those improvements into PyPDF2. Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old` for the old functionality. Please also open a bug ticket in that case.

There were several people who have attempted to bring similar improvements to PyPDF2. All of those were valuable. The main reason why they didn't get merged is the big amount of open PRs / issues. pubpub-zz was the most comprehensive PR which also incorporated the latest changes of PyPDF2 2.0.0. Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and [asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):

- Massive text extraction improvement (#924). Closed many open issues:
    - Exceptions / missing spaces in extract_text() method (#17) 🕺
    - Whitespace issues in extract_text() (#42) 💃
    - pypdf2 reads the hifenated words in a new line (#246)
    - PyPDF2 failing to read unicode character (#37)
    - Unable to read bullets (#230)
    - ExtractText yields nothing for apparently good PDF (#168) 🎉
    - Encoding issue in extract_text() (#235)
    - extractText() doesn't work on Chinese PDF (#252)
    - encoding error (#260)
    - Trouble with apostophes in names in text "O'Doul" (#384)
    - extract_text works for some PDF files, but not the others (#437)
    - Euro sign not being recognized by extractText (#443)
    - Failed extracting text from French texts (#524)
    - extract_text doesn't extract ligatures correctly (#598)
    - reading spanish text - mark convert issue (#635)
    - Read PDF changed from text to random symbols (#654)
    - .extractText() reads / as 1. (#789)
- Update glyphlist (#947) - inspired by #464
- Allow adding PageRange objects (#948)

Bug Fixes (BUG):

- Delete .python-version file (#944)
- Compare StreamObject.decoded_self with None (#931)

Robustness (ROB):

- Fix some conversion errors on non conform PDF (#932)

Documentation (DOC):

- Elaborate on PDF text extraction difficulties (#939)
- Add logo (#942)
- rotate vs Transformation().rotate (#937)
- Example how to use PyPDF2 with AWS S3 (#938)
- How to deprecate (#930)
- Fix typos on robustness page (#935)
- Remove scripts (pdfcat) from docs (#934)

Developer Experience (DEV):

- Ignore .python-version file
- Mark deprecated code with no-cover (#943)
- Automatically create Github releases from tags (#870)

Testing (TST):

- Text extraction for non-latin alphabets (#954)
- Ignore PdfReadWarning in benchmark (#949)
- writer.remove_text (#946)
- Add test for Tree and _security (#945)

Code Style (STY):

- black, isort, Flake8, splitting buildCharMap (#950)

Full Changelog: 2.0.0...2.1.0
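
The release notes above mention `_extract_text_old` as a fallback. A hedged sketch of how one might use it follows; `extract_with_fallback` is a hypothetical helper name, and because `_extract_text_old` is a private method that may not exist in every release, the code looks it up defensively rather than assuming it is present.

```python
def extract_with_fallback(page):
    """Try the new extractor first, then the legacy one if nothing came back.

    `page` is expected to be a PyPDF2 (>= 2.1.0) page object. The legacy
    method is private, so it is looked up with getattr and skipped if absent.
    """
    text = page.extract_text() or ""
    if text.strip():
        return text
    legacy = getattr(page, "_extract_text_old", None)
    if callable(legacy):
        return legacy() or ""
    return text
```

If the legacy path returns text where the new one does not, that difference is worth including in the bug ticket the notes ask for.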

Upliner mentioned this issue May 30, 2015

Initial support for CMap character translation #201

Closed

MartinThoma mentioned this issue Apr 6, 2022

extract_text works for some PDF files, but not the others #437

Closed

MartinThoma added the is-bug ("From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF") and Has MCVE ("A minimal, complete and verifiable example helps a lot to debug / understand feature requests") labels Apr 7, 2022

MartinThoma added the workflow-text-extraction ("From a users perspective, text extraction is the affected feature/workflow") label Apr 16, 2022
