BUG: Added line-breaks at dashes #234

Closed
rnzucker opened this issue Nov 12, 2015 · 6 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@rnzucker

rnzucker commented Nov 12, 2015

I've been trying out PyPDF2 and encountered cases where it is skipping text. It has no problem with one file (https://github.com/rnzucker/MadLib/blob/master/test-1.pdf), beyond adding newlines at 80 characters. But with another one (https://github.com/rnzucker/MadLib/blob/master/test-2.pdf, the beginning of a newspaper editorial), it starts with the "-time" from "prime-time" in the first line. It also skipped other text in the file. My code is very simple:

from PyPDF2 import PdfReader

reader = PdfReader("test-1.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text()
print(text)
@JeremyMMulcahey

I'm having the same issue with transcripts. Some sections of dialogue are missing the first 1-3 lines when the speakers alternate in a conversation.

The conversational format is:
Speaker1:
Speaker2:
Speaker1:

Has there been any progress on this issue?
I'll poke around the package and see if anything jumps out.

@mstamy2 mstamy2 added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label May 19, 2016
@MartinThoma
Member

@rnzucker Would it be OK with you if I added those files to PyPDF2 (Resources) so that we can keep testing? (Under the package's BSD license)

@rnzucker
Author

Totally fine. They are just snippets of newspaper articles.

@MartinThoma MartinThoma added the Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests label Jun 6, 2022
@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Jun 26, 2022
@MartinThoma
Member

Note to myself: The test-2 causes a newline where it shouldn't be. No text is missing (anymore).

The test-2.pdf is the following article of the New York Times from 2015: https://www.nytimes.com/2015/11/12/opinion/waiting-for-the-republican-shakeout.html -- I'm uncertain if we may add it.
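
Until the extraction itself is fixed, the stray space or newline at dashes could be cleaned up after the fact. The following is only a minimal post-processing sketch, not part of PyPDF2: `rejoin_dashes` is a hypothetical helper, and it assumes that a hyphen between two word characters should never be preceded by a space or followed by a line break, which will not hold for every document (for example, text containing negative numbers such as "a -1").

```python
import re


def rejoin_dashes(text: str) -> str:
    """Hypothetical cleanup for extracted text; not a PyPDF2 function."""
    # "prime -time" -> "prime-time": drop spaces between a word and a
    # hyphen that is immediately followed by another word character.
    text = re.sub(r"(?<=\w) +-(?=\w)", "-", text)
    # "attention -\ngetting" -> "attention-getting": join a hyphen at the
    # end of a line with the word that starts the next line.
    text = re.sub(r"(?<=\w) ?-\s*\n\s*(?=\w)", "-", text)
    return text


if __name__ == "__main__":
    print(rejoin_dashes("the eight prime -time contenders"))
    print(rejoin_dashes("candidates' attention -\ngetting but illogical"))
```

The hyphen is kept in both cases, so a word that was merely hyphenated for line wrapping would still carry its hyphen after this step.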

@MartinThoma MartinThoma changed the title from "PyPDF2 at times skipping text" to "BUG: Added line-breaks at dashes" Jun 26, 2022
@pubpub-zz
Collaborator

pubpub-zz commented Jul 10, 2022

These are the results with PR #1084 for test-2:

Watching Tuesday’s Republican presidential debate, with the eight prime -time contenders 
talking over and past one another, the question arises: Should the party show a fe w of these 
candidates the door?  
Some fret that this mash -up lacks seriousness. The Republican National Committee says it won’t 
intervene. It is relying on voters to usher also -rans off the national stage , and that may be a good 
thing.  
Americans won’t pay full attention to the presidenti al campaign for weeks. By the time they do, 
debates and media exposure will have made for worthy vetting of these candidates’ attention -
getting but illogical tax plans, their dubious statements, and that most symbolic but ridiculous of 
qualifications, thei r early biographies. Gov. Scott Walker’s exit suggests that fears of “super 
PAC” money’s keeping flawed candida tes afloat may not materialize.  
A number of conservative thinkers believe the shedding of vestigial candidates will happen soon 
enough. In a com ing book, Henry Olsen of the Ethics and Public Policy Center in Washington 
divides the Republican electorate into “four discrete factions that are based primarily on 
ideology, with elements of class and religious background tempering that focus.”  

The extra spaces are introduced by Tm repositioning. I don't currently have an easy way to identify such a case as a 'simple' text repositioning that should not produce a space.
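
One way to think about the problem is to compare the horizontal gap left by a repositioning with the width of a space in the current font, and to emit a space only when the gap is wide enough. The sketch below illustrates that idea under stated assumptions and is not the code in PR #1084: `needs_space`, `join_runs`, the `(start_x, end_x, text)` run tuples, and the 0.5 threshold are all hypothetical, and it assumes the runs lie on one line, in reading order, with coordinates in the same units as `space_width`.

```python
from typing import List, Tuple

Run = Tuple[float, float, str]  # (start_x, end_x, text), hypothetical layout


def needs_space(prev_end_x: float, new_start_x: float,
                space_width: float, threshold: float = 0.5) -> bool:
    """Return True if the gap is wide enough to count as a space.

    The threshold is an assumed fraction of the font's space width.
    """
    gap = new_start_x - prev_end_x
    return gap > threshold * space_width


def join_runs(runs: List[Run], space_width: float) -> str:
    """Concatenate runs on one line, adding spaces only for wide gaps."""
    parts: List[str] = []
    prev_end = None
    for start_x, end_x, text in runs:
        if prev_end is not None and needs_space(prev_end, start_x, space_width):
            parts.append(" ")
        parts.append(text)
        prev_end = end_x
    return "".join(parts)


if __name__ == "__main__":
    # "-time" is repositioned almost exactly where "prime" ended,
    # while "contenders" starts a full space width further right.
    line = [(0.0, 50.0, "prime"), (50.2, 80.0, "-time"), (85.0, 130.0, "contenders")]
    print(join_runs(line, space_width=5.0))
```

A real implementation would also need the end position of each run, which depends on glyph widths, character spacing, and the text matrix, so this only covers the final decision step.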

MartinThoma pushed a commit that referenced this issue Jul 13, 2022
* ENH : extract width from CIDFontType0/2
* ENH  : improve cr/lf and space extraction
* BUG : fix error in decoding #1075
* FIX: in ToUnicode  ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
* ENH : extract width from CIDFontType0/2
* ENH  : improve cr/lf and space extraction
* BUG : fix error in decoding py-pdf#1075
* FIX: in ToUnicode  ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves py-pdf#234
Improves py-pdf#957
Closes py-pdf#1003
Closes py-pdf#1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
@MartinThoma MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Feb 28, 2023
@stefan6419846
Collaborator

According to #2882 (comment), this has just been fixed.


6 participants