BUG: Added line-breaks at dashes #234
Comments
I'm having the same issue with transcripts. Some sections of dialogue are missing the first 1-3 lines when the speakers alternate in a conversation. The conversational format is:
Has there been any progress on this issue?
@rnzucker Would it be OK with you if I added those files to PyPDF2 (Resources) so that we can keep testing? (Under the package's BSD license)
Totally fine. They are just snippets of newspaper articles.
Note to myself: test-2 causes a newline where there shouldn't be one. No text is missing (anymore). test-2.pdf is the following New York Times article from 2015: https://www.nytimes.com/2015/11/12/opinion/waiting-for-the-republican-shakeout.html -- I'm uncertain whether we may add it.
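As a stopgap on the caller's side, the stray newline after a dash can be removed in a post-processing pass over the extracted text. This is a minimal sketch, not anything PyPDF2 provides: `rejoin_dash_breaks` is a hypothetical helper, and it assumes the unwanted break always sits directly between a hyphen and the next word character. It will also join a legitimate line break that happens to follow a hyphen.

```python
import re


def rejoin_dash_breaks(text: str) -> str:
    """Drop a newline that sits between 'word-' and the next word character.

    Hypothetical workaround for already-extracted text; it does not change
    how the PDF itself is parsed.
    """
    # Lookbehind: a word character followed by a hyphen.
    # Lookahead: a word character right after the newline.
    return re.sub(r"(?<=\w-)\n(?=\w)", "", text)


if __name__ == "__main__":
    print(rejoin_dash_breaks("prime-\ntime"))  # prints "prime-time"
```

Other newlines are left alone, so paragraph and line structure away from dashes is unchanged.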
These are the results with PR #1084 for test-2:
The extra spaces are introduced by Tm repositioning. I don't currently have an easy way to identify this as a 'simple' text repositioning without a space.
* ENH: extract width from CIDFontType0/2
* ENH: improve cr/lf and space extraction
* BUG: fix error in decoding #1075
* FIX: in ToUnicode ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
According to #2882 (comment), this has just been fixed.
I've been trying out PyPDF2 and encountered cases where it is skipping text. It has no problem with one file (https://github.com/rnzucker/MadLib/blob/master/test-1.pdf), beyond adding newlines at 80 characters. But with another one (https://github.com/rnzucker/MadLib/blob/master/test-2.pdf, the beginning of a newspaper editorial), it starts with the "-time" from "prime-time" in the first line. It also skipped other text in the file. My code is very simple: