v2.1 extract_text() misses newline characters #957
Comments
Thank you for sharing and putting the time into writing an awesome bug report! In the meantime, you can use
Yeah, for the time being I constrained its version like
Might be related to #591
I've started to have a look at the file, and the PDF shows cases I would never have guessed. The Tm matrix shows an inversion, which means that the document is filled upside down. A correction is under analysis...
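To illustrate what an "inverted" text matrix can look like, here is a minimal sketch. It assumes the inversion mentioned above is a negative vertical scale in the six-number PDF matrix `[a, b, c, d, e, f]`; the comment does not say which component is affected, so treat that as a guess. The helper name `y_axis_flipped` is hypothetical and not part of any library.

```python
from typing import Sequence


def y_axis_flipped(tm: Sequence[float]) -> bool:
    """Return True if the matrix maps 'up' in text space to 'down' on the page.

    A PDF matrix [a, b, c, d, e, f] maps a point (x, y) to
    (a*x + c*y + e, b*x + d*y + f). The unit vector (0, 1) therefore
    moves the y coordinate by d, so a negative d means the vertical
    direction is reversed. (Assumption: this is the kind of inversion
    the comment above refers to.)
    """
    a, b, c, d, e, f = tm
    return d < 0


# Identity matrix: text space and page space agree.
print(y_axis_flipped([1, 0, 0, 1, 0, 0]))
# Vertical flip about a 792-point-high page: y grows downwards.
print(y_axis_flipped([1, 0, 0, -1, 0, 792]))
```

With a flipped axis, successive lines have increasing rather than decreasing y coordinates, which would plausibly confuse any newline detection that expects y to decrease from one line to the next.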
Glad it's an edge case, never would've guessed ;)
I just confirmed that this is still an issue with the current master (soon
The fix has not been released yet; still working on it...
Improved by PR #1084
* ENH : extract width from CIDFontType0/2
* ENH : improve cr/lf and space extraction
* BUG : fix error in decoding #1075
* FIX: in ToUnicode ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
Not sure if this is related. I was using 2.8.1 and everything worked perfectly, but any version above (2.9.0 and higher) had the same issue for me. With 2.9.0 and higher the output for
@creepiepanda Can you confirm this was detected with the same PDF? If so, can you provide it?
Yes, it's the same PDF every time. I switched pypdf versions multiple times while trying with the same file.
extract_text() now has a layout extraction_mode.
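For anyone who wants to compare the two modes side by side, a small sketch follows. It assumes a pypdf release recent enough to accept `extraction_mode="layout"` on `extract_text()`; the helper name `extract_plain_and_layout` is made up for illustration.

```python
from typing import Dict


def extract_plain_and_layout(page) -> Dict[str, str]:
    """Return the text of one page in both extraction modes.

    `page` is expected to be a pypdf page object, for example
    PdfReader("test.pdf").pages[0]. Any object with a compatible
    extract_text() method works as well.
    """
    return {
        # Default mode: text in content-stream order.
        "plain": page.extract_text(),
        # Layout mode: tries to reproduce the visual placement of the text.
        "layout": page.extract_text(extraction_mode="layout"),
    }
```

Printing both values for the affected page should show quickly whether layout mode restores the missing whitespace for a given file.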
Just for reference:

    import pypdf
    from pypdf import PdfReader

    print(f"pypdf=={pypdf.__version__}")
    print(PdfReader("test.pdf").pages[0].extract_text())

gives:
and gives:
Hey there,
when updating from v2.0 to v2.1, extracted words that were previously separated by whitespace are now glued together (see below for an example).
Environment
Machine:
Linux-5.17.5-76051705-generic-x86_64-with-glibc2.34
PyPDF:
2.1.0
Code
This is a minimal, complete example that shows the issue:
Now, output with v2.0 was like this:
Using v2.1, I get this:
PDF
The PDF file from the example can be found here. The names were redacted, so it contains no personal information, despite the looks of it.
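To check other files for the same regression, something like the following sketch could be used. It is not part of PyPDF2 or pypdf: `find_glued_words` is a hypothetical helper that takes the text extracted by a known-good version and the text extracted by the version under test, and reports tokens that look like several neighbouring words merged into one.

```python
from typing import List


def _is_concatenation(token: str, words: List[str]) -> bool:
    """True if `token` equals two or more consecutive entries of `words` joined."""
    for start in range(len(words)):
        joined = ""
        for end in range(start, len(words)):
            joined += words[end]
            if not token.startswith(joined):
                break
            # A single-word match cannot reach this point, because the
            # caller skips tokens that already appear in `words`.
            if joined == token:
                return True
    return False


def find_glued_words(expected: str, actual: str) -> List[str]:
    """Return tokens of `actual` that merge neighbouring tokens of `expected`."""
    words = expected.split()  # splits on spaces and newlines alike
    known = set(words)
    return [
        token
        for token in actual.split()
        if token not in known and _is_concatenation(token, words)
    ]


# Made-up sample text, not the output from the PDF in this issue.
good = "Name Date\nJohn Doe"
bad = "Name DateJohn Doe"
print(find_glued_words(good, bad))  # ['DateJohn']
```

Because `str.split()` treats newlines as separators, this catches missing newlines as well as missing spaces.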