Inconsistent hyphenation (and lost blanks) #2262
Comments
Please see the corresponding docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard
To summarize: text extraction is hard and involves quite a bit of guessing. By default you only have individual character positions; all remaining steps rely on heuristics to form words and so on, so they are not always correct. (Speaking of (Py)MuPDF: they provide commercial solutions as well and thus might have better general results.)
Thanks for sharing the file and some examples! This helps a lot to refine our heuristics. I agree with everything @stefan6419846 said. There is little hope of ever solving this completely for all PDF documents. Do you own the license for that file, or is it public domain? I'm always interested in refining my benchmark for text extraction.
It'd be nice to have a user-settable threshold for situations (not super common, but not exactly rare either, in my tests) where the words are not spaced far enough apart for the algorithm to make the right choice. Does such a setting exist?
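To make the request concrete, here is a minimal sketch of what a user-settable spacing threshold could look like. It is not pypdf's API or its actual algorithm: the `Glyph` type, the `join_glyphs` function, and the `space_ratio` parameter are all hypothetical names invented for illustration, and the rule (insert a space when the gap exceeds a fraction of the previous glyph's width) is just one plausible heuristic.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a text line."""
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def join_glyphs(glyphs, space_ratio=0.3):
    """Join the glyphs of one line, inserting a space wherever the horizontal
    gap to the previous glyph exceeds space_ratio * previous glyph width."""
    out = []
    prev = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = glyph.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                out.append(" ")
        out.append(glyph.char)
        prev = glyph
    return "".join(out)


# Made-up positions: "ab" and "cd" are separated by a gap of 4 units.
line = [Glyph("a", 0, 10), Glyph("b", 10, 10), Glyph("c", 24, 10), Glyph("d", 34, 10)]
print(join_glyphs(line, space_ratio=0.3))  # gap 4 > 3, so a space is inserted
print(join_glyphs(line, space_ratio=0.5))  # gap 4 <= 5, so the words run together
```

With tightly set text, lowering the ratio would recover spaces that a stricter default misses, at the cost of splitting words in loosely kerned text.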
Also, the inconsistent hyphenation (sometimes leading to extracted text with a newline and sometimes without) is a separate issue altogether. Maybe I ought to have started 2 separate discussion threads...
@MartinThoma - it's the PDF version of a book I used to own. I don't know if it's public domain. Doesn't a page extracted for technical tests qualify for "fair use"?
In that case I would advise against sharing it publicly. Private sharing might be OK, but I'm not a lawyer and I don't want to get into / cause issues 😅
The line breaks were not working in this PDF. Following the TL section of the PDF 1.7 specification, you need to multiply the font size when processing the TL operator.
To make this modification I just need to make the following changes to before:
after:
Yes, this deserves a corresponding test case if not already covered by the existing tests.
I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.
Some PDF files are parsed fine, but others (such as the attached "Ocean Currents.pdf") are disasters! Here's an example of the parsed text from the second page of the document:
Notice 2 problems:
For example (see screenshot below):
is extracted as
op-\nposite
(with a newline), while:
is extracted as
iso-bars
(no newline!)
Code + PDF
Ocean Currents.pdf
(full document attached; please add to your tests)
Thoughts
I suspect you'll say that the attached PDF is malformed. Maybe it is... but another library, PyMuPDF, parses it just fine.
In fact, I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files it's the opposite, and pypdf does far better.
Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other one does well, and vice versa. Maybe they use a different threshold for detecting blank spaces in sentences?
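Given that complementarity, one workaround on the indexing side is to run both extractors and keep whichever output looks less damaged. The sketch below is only an assumption about how that could be scored: `long_token_ratio` and `pick_extraction` are hypothetical helpers, the 20-character cutoff is arbitrary, and the sample strings are made up rather than taken from the attached PDF.

```python
def long_token_ratio(text, max_len=20):
    """Fraction of whitespace-separated tokens longer than max_len characters.
    A high value suggests that spaces between words were lost."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) > max_len for token in tokens) / len(tokens)


def pick_extraction(candidates, max_len=20):
    """Return the label of the candidate text with the lowest long-token ratio.
    `candidates` maps a label (e.g. an extractor name) to its extracted text."""
    return min(candidates, key=lambda name: long_token_ratio(candidates[name], max_len))


# Made-up outputs standing in for two extractors run on the same page.
candidates = {
    "extractor_a": "surfacecurrentsaredrivenbythewind and more",
    "extractor_b": "surface currents are driven by the wind",
}
print(pick_extraction(candidates))
```

This only detects lost spaces; it says nothing about spurious extra spaces inside words, which would need a different check.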
But the inconsistent hyphenation I mentioned at the beginning is another issue that seriously gets in the way of word extraction...
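Until the extraction itself is consistent, the hyphen-plus-newline case can be normalized in post-processing before indexing. The following is a minimal sketch using only the standard `re` module; the function names are hypothetical. It handles the `op-\nposite` case, but it cannot tell whether a hyphen without a newline (as in `iso-bars`) is a line-break hyphen or a genuine compound, so those are left untouched.

```python
import re

# A hyphen directly followed by a newline, with word characters on both sides.
_LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-\n(?=\w)")


def join_line_break_hyphens(text):
    """Remove hyphen + newline pairs that split a word across two lines."""
    return _LINE_BREAK_HYPHEN.sub("", text)


def extract_words(text):
    """Return the words of the text, keeping in-line hyphenated words whole."""
    return re.findall(r"\w+(?:-\w+)*", join_line_break_hyphens(text))


print(extract_words("op-\nposite iso-bars"))
```

Note that this also joins genuine compounds that happen to break at their hyphen (for example `well-\nknown` becomes `wellknown`); resolving that, like the `iso-bars` case, would need a dictionary lookup.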
Thanks!