Exceptions / missing spaces in extract_text() method #17

mstamy2 · 2013-07-30T19:30:08Z

extractText() method isn't broken, but throws some exceptions in these cases:

http://doctor12wer.blogspot.com/2013/06/extracttext-function-in-pypdf2-throws.html

http://stackoverflow.com/questions/17270387/pypdf2-typeerror-when-trying-to-extract-text

tnorth · 2013-11-07T19:09:52Z

Hello,

Works for me, but the extracted text contains no spaces :/

from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("foo.pdf", "rb"))
print(reader.getPage(0).extractText())

Is that a known issue ?

tnorth · 2013-11-07T19:19:44Z

Hmm, to make it clearer: the issue seems to appear for two-column papers, this one for example:
www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf

mstamy2 · 2013-11-07T23:21:21Z

The extractText method is probably a little crude, and definitely doesn't function well for PDFs with complicated text. It could use some work to return text in a more orderly fashion that more closely resembles the text you see in a PDF viewer.

alisufian · 2014-01-01T10:13:41Z

Another pdf where whitespace is not preserved in extracted text
http://webapp.psc.state.md.us/Intranet/Casenum/NewIndex3_VOpenFile.cfm?ServerFilePath=C:\Casenum\9100-9199\9155\\354.pdf

kursataker · 2015-03-17T21:15:23Z

I tried to extract Arabic text from a PDF file using the extractText() method. However, the Arabic text disappears in the output.

Lerchensporn · 2016-05-14T13:41:26Z

To resolve the problem of missing whitespace, I propose the following for-loop in the extractText method. The part below “text += i” is new. The threshold “i < -100”, below which a spacing value is treated as whitespace, is chosen arbitrarily; in a typical Springer PDF book, values of -300 to -200 indicate whitespace. Although this may look like a hack, I can think of no other criterion for whitespace in such documents.
edit: Furthermore, I suggest removing “text += "\n"” after the TJ operator, because it breaks words in some documents.
Handling of the TD, Td, and Tm operators still needs refinement.

        for operands, operator in content.operations:
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                    elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                        if i < -100:
                            text += " "
            elif operator == b_("TD") or operator == b_("Tm"):
                if len(text) > 0 and text[-1] != " " and text[-1] != "\n":
                    text += " "
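
The threshold heuristic above can be tried in isolation, without PyPDF2. The following is a minimal sketch, assuming the operand array of a TJ operator is modeled as a plain Python list of strings and numbers; `join_tj_array` and its `threshold` parameter are hypothetical names used only for illustration and are not part of PyPDF2.

```python
def join_tj_array(items, threshold=-100):
    """Join the strings of a TJ-style array, adding a space for large negative adjustments.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, (int, float)) and item < threshold:
            # A strongly negative adjustment pushes the next glyph far to the
            # right, which this heuristic interprets as a word gap.
            parts.append(" ")
    return "".join(parts)


sample = ["Hel", -20, "lo", -250, "wor", 15, "ld"]
print(join_tj_array(sample))  # prints "Hello world"
```

Small adjustments such as -20 or 15 are treated as kerning and ignored; only -250 falls below the default threshold. The right threshold probably varies between documents, as the comment above already notes.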

mborus · 2016-11-23T13:51:57Z

@woho's idea worked for me.
I got too many spaces, so I changed the code slightly...

                # add spaces
                # quick and dirty - https://github.com/mstamy2/PyPDF2/issues/17
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and text[-1] not in " \n":
                        text += " "
        elif operator == b_("TD") or operator == b_("Tm"):
            if text and text[-1] not in " \n":
                text += " "
        # end add spaces
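
The guard used in this variation can be sketched on its own. This is an illustrative example only, assuming plain Python strings and numbers in place of PyPDF2 objects; `add_space_once` is a hypothetical helper name.

```python
def add_space_once(text):
    """Append one space unless text is empty or already ends in a space or newline.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    if text and text[-1] not in " \n":
        text += " "
    return text


text = ""
for item in ["two", -300, -300, "columns"]:
    if isinstance(item, str):
        text += item
    else:
        # Any numeric entry requests a space, but repeated requests collapse to one.
        text = add_space_once(text)
print(text)  # prints "two columns"
```

Because this version has no threshold, every numeric entry in the array is a candidate for a space; the guard only prevents runs of consecutive spaces.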

chrisjcameron · 2017-12-20T18:48:50Z

Some PDFs apparently generate empty operands. If this condition is explicitly checked, then I can avoid some thrown exceptions:

_text = operands[0] throws an exception if operands is empty.

Quick fix:

        for operands, operator in content.operations:
            if not operands:          # Empty operands list contributes no text
                operands = [""]
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
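
The effect of the placeholder can be checked without a PDF. Below is a minimal sketch, assuming operations are modeled as `(operands, operator)` tuples with plain strings for text and bytes for operator names; `collect_text` is a hypothetical function that handles only the `Tj` and `T*` operators.

```python
def collect_text(operations):
    """Collect text from Tj operations and newlines from T*, tolerating empty operands.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    text = ""
    for operands, operator in operations:
        if not operands:
            # Placeholder so that operands[0] never raises IndexError.
            operands = [""]
        if operator == b"Tj":
            if isinstance(operands[0], str):
                text += operands[0]
        elif operator == b"T*":
            text += "\n"
    return text


operations = [(["Hello"], b"Tj"), ([], b"Tj"), ([], b"T*"), (["world"], b"Tj")]
print(repr(collect_text(operations)))  # prints 'Hello\nworld'
```

Substituting a placeholder rather than skipping the operation keeps operators that legitimately take no operands, such as `T*`, working as before.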

Tom-Evers · 2018-03-03T17:01:10Z

There should be a newline somewhere:

        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and (not text[-1] in " \n"):
                        text += " "
            text += "\n"

Tom-Evers · 2018-03-04T09:48:32Z

It seems that the value of the Float/NumberObject directly encodes the distance between two pieces of text, with the width of one space equaling -600:

                    if text and (not text[-1] in " \n"):
                        text += " " * int(i / -600)
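
The arithmetic behind this proposal can be illustrated with a few values. This is a sketch only, assuming the figure of -600 units per space observed above; that figure may well depend on the font, so it is exposed as a parameter. `spaces_for_adjustment` is a hypothetical name, not a PyPDF2 function.

```python
def spaces_for_adjustment(adjustment, space_width=-600):
    """Convert a TJ spacing adjustment into a run of spaces.

    Hypothetical helper for illustration; space_width is an assumed value
    taken from the observation above and may differ between fonts.
    """
    count = int(adjustment / space_width)  # int() truncates toward zero
    return " " * max(count, 0)             # positive adjustments yield no spaces


for value in (-300, -600, -1200, 200):
    print(value, repr(spaces_for_adjustment(value)))
```

With these inputs, -600 maps to one space and -1200 to two, while -300 and 200 map to none, since adjustments smaller than one full space width truncate to zero.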

MartinThoma · 2022-04-16T09:20:45Z

A lot of the whitespace issues got fixed via #569

MartinThoma · 2022-06-06T12:02:22Z

#924 Improved further on the whitespace issue

MartinThoma · 2022-06-06T12:07:01Z

I think it is fixed.

Minimal example

from PyPDF2 import PdfReader

reader = PdfReader("vollmer_fischer.pdf")  # www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf
text = reader.pages[0].extract_text()

text now is:

Ring-resonator-based frequency-domain opticalactivity measurements of a chiral liquid
Frank Vollmer and Peer Fischer
The Rowland Institute at Harvard, Harvard University, Cambridge, Massachusetts 02142Received September 22, 2005; revised November 11, 2005; accepted November 12, 2005; posted November 16, 2005 (Doc. ID 64961)
Chiral liquids rotate the plane of polarization of linearly polarized light and are therefore optically active.Here we show that optical rotation can be observed in the frequency domain. A chiral liquid introduced in aﬁber-loop ring resonator that supports left and right circularly polarized modes gives rise to relative fre-quency shifts that are a direct measure of the liquid’s circular birefringence and hence of its optical activity.The effect is in principle not diminished if the circumference of the ring is reduced. The technique is simi-larly applicable to refractive index and linear birefringence measurements.
© 2006 Optical Society ofAmericaOCIS codes:260.1440, 120.5410
.Natural optical activity arises because a medium hasdifferent refractive indices for left (/H11002) and right (/H11001)circularly polarized light. The optical rotation, in ra-dians, developed over a path lengthlis a function ofthe wavelength/H9261and is given by
/H9258=/H9266l

/H9261/H20851n/H20849−/H20850−n/H20849+/H20850/H20852./H208491/H20850The circular birefringence,n
/H20849−/H20850−n/H20849+/H20850, is, however,even in a pure chiral liquid small and at most a fewparts in 10
6. It is thus desirable to increase the effec-tive path length through the optically active mediumwithout the need for large sample volumes. This canbe achieved in an optical cavity as long as one en-sures that the optical rotation does not cancel on theround trip, which in practice one can accomplish byplacing quarter-wave plates in the cavity.
1Signiﬁ-cant enhancements in sensitivity compared withsingle-pass instruments have been reported for mea-surements that make use of Fabry–Perotresonators,
1–3including polarization-sensitive imple-mentations of cavity-ringdown spectroscopy,4,5aswell as laser cavities.6,7Both single-pass and multi-pass techniques typically determine the rotation inEq. (1) via intensity measurements that either re-quire rotating polarization optics or separate the or-thogonally polarized components of the light andtherefore require a balanced detection scheme.In this Letter we show that circular birefringence(optical rotation) can also be determined by fre-quency measurements. Left and right circularly po-larized modes acquire unequal phases when a chiralliquid is introduced into a resonator such that theirresonance frequencies shift relative to each other. Wedemonstrate the method, using a ﬁber optic ringresonator in combination with a narrow-linewidth cwlaser.A ﬁber-loop resonator
8,9may be considered to be aﬁber- or waveguide-based Fabry–Perot resonatorthat consists of a closed ﬁber loop in contact with alinear waveguide via a variable (directional) coupler.A resonance in the ring requires that the optical pathlength be a multiple of the wavelength of the light.Resonances are observed as minima in a transmis-sion spectrum whenever an integral multiple of thewavelength in the ring equals the circumference ofthe ﬁber loop. A shift in the resonance wavelength oc-curs if either the path length or the refractive indexchanges. Refractive indices may be measured by tun-ing the frequency of a laser with a sufﬁciently narrowlinewidth.Introduction of a sample with refractive indexnsinto the ring resonator will cause a wavelength shiftof the resonances relative to the reference mediumwith refractive indexn
0, which may, for instance, beair:/H9004/H9261
/H9261=ns−n0

nefff,/H208492/H20850wherefis the fraction of the total ring circumferencethat contains the optically active sample.n
effis an ef-fective refractive index used to describe the entireﬁber-loop resonator in the presence of the referencemedium and corresponds to the round-trip phase2
/H9266neffL//H9261acquired by a resonant mode at wave-length/H9261, where the circumference (ﬁber and free-space part) isL.The inherent birefringence of a bent optical ﬁberwill in general give rise to resonant modes with dif-ferent polarization states.
10These modes may beused to generate circularly polarized modes that aresensitive to chirality. A wavelength shift that is equalin magnitude and opposite in sign for the two circu-larly polarized modes is a direct function of the liq-uid’s circular birefringence and hence of its opticalactivity. Thus, particular interest are relativechanges in the resonance wavelengths of a pair of leftand right circularly polarized modes centered at/H9261:
/H20879/H9004/H9261/H20849−/H20850−/H9004/H9261/H20849+/H20850

/H9261/H20879=n/H20849−/H20850−n/H20849+/H20850

nefff,/H208493/H20850where any common mode noise is automaticallyeliminated. It can also be seen that the equation de-scribing optical activity in a ring resonator is inde-pendent of the actual dimension of the ring. For agiven ﬁnesse and a given fractionf, a reduction in thesize of the ring does not lead to a loss of sensitivity.February 15, 2006 / Vol. 31, No. 4 / OPTICS LETTERS4530146-9592/06/040453-3/$15.00 © 2006 Optical Society of America

The highlight of the 2.1.0 release is the most massive improvement to the text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes to [pubpub-zz](https://github.com/pubpub-zz) who took a lot of time and knowledge about the PDF format to finally get those improvements into PyPDF2. Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old` for the old functionality. Please also open a bug ticket in that case.

There were several people who have attempted to bring similar improvements to PyPDF2. All of those were valuable. The main reason why they didn't get merged is the big amount of open PRs / issues. pubpub-zz was the most comprehensive PR which also incorporated the latest changes of PyPDF2 2.0.0. Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and [asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):
- Massive text extraction improvement (#924). Closed many open issues:
  - Exceptions / missing spaces in extract_text() method (#17) 🕺
  - Whitespace issues in extract_text() (#42) 💃
  - pypdf2 reads the hifenated words in a new line (#246)
  - PyPDF2 failing to read unicode character (#37)
  - Unable to read bullets (#230)
  - ExtractText yields nothing for apparently good PDF (#168) 🎉
  - Encoding issue in extract_text() (#235)
  - extractText() doesn't work on Chinese PDF (#252)
  - encoding error (#260)
  - Trouble with apostophes in names in text "O'Doul" (#384)
  - extract_text works for some PDF files, but not the others (#437)
  - Euro sign not being recognized by extractText (#443)
  - Failed extracting text from French texts (#524)
  - extract_text doesn't extract ligatures correctly (#598)
  - reading spanish text - mark convert issue (#635)
  - Read PDF changed from text to random symbols (#654)
  - .extractText() reads / as 1. (#789)
- Update glyphlist (#947) - inspired by #464
- Allow adding PageRange objects (#948)

Bug Fixes (BUG):
- Delete .python-version file (#944)
- Compare StreamObject.decoded_self with None (#931)

Robustness (ROB):
- Fix some conversion errors on non conform PDF (#932)

Documentation (DOC):
- Elaborate on PDF text extraction difficulties (#939)
- Add logo (#942)
- rotate vs Transformation().rotate (#937)
- Example how to use PyPDF2 with AWS S3 (#938)
- How to deprecate (#930)
- Fix typos on robustness page (#935)
- Remove scripts (pdfcat) from docs (#934)

Developer Experience (DEV):
- Ignore .python-version file
- Mark deprecated code with no-cover (#943)
- Automatically create Github releases from tags (#870)

Testing (TST):
- Text extraction for non-latin alphabets (#954)
- Ignore PdfReadWarning in benchmark (#949)
- writer.remove_text (#946)
- Add test for Tree and _security (#945)

Code Style (STY):
- black, isort, Flake8, splitting buildCharMap (#950)

Full Changelog: 2.0.0...2.1.0

mstamy2 mentioned this issue Aug 21, 2013

Python Version Compatibility #16

Closed

Tom-Evers added a commit to Tom-Evers/PyPDF2 that referenced this issue Mar 4, 2018

Updated extractText() according to changes proposed in issue py-pdf#17

9217428

Tom-Evers mentioned this issue Mar 4, 2018

Updated extractText() #397

Closed

reginafcompton mentioned this issue Mar 30, 2018

Conversion script for LA Metro attachments datamade/django-councilmatic#193

Merged

MartinThoma added the workflow-text-extraction label (From a users perspective, text extraction is the affected feature/workflow) Apr 16, 2022

MartinThoma closed this as completed Jun 6, 2022

MartinThoma changed the title ~~extractText() method~~ Exceptions in extract_text() method Jun 6, 2022

MartinThoma changed the title ~~Exceptions in extract_text() method~~ Exceptions / missing spaces in extract_text() method Jun 6, 2022

MartinThoma added the whitespace label (While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.) Jan 14, 2023
