Updated extractText() #397
Some lines contain multiple draw operations. For example, underlined text is drawn as the text first and the underlining ("________") second, both at the same vertical coordinates. The toggle 'skip_intertwining_text' will, by default, skip the next line if intertwining text is detected. Indentation is now also handled properly.
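For readers who want to see the idea in isolation, here is a minimal sketch of one possible reading of the behaviour described above. It is not the code from this PR and does not use PyPDF2's API: `DrawOp`, `is_underline`, `extract_lines`, and the `y_tolerance` parameter are hypothetical names invented for illustration. Only the `skip_intertwining_text` toggle name comes from the description. The sketch treats a run of underscores drawn at the same vertical coordinate as the previous text as "intertwining" and drops it.

```python
from typing import List, NamedTuple


class DrawOp(NamedTuple):
    """A hypothetical text draw operation: position plus the text drawn."""
    x: float
    y: float
    text: str


def is_underline(text: str) -> bool:
    """Return True if the fragment consists only of underscores."""
    stripped = text.strip()
    return bool(stripped) and set(stripped) == {"_"}


def extract_lines(
    ops: List[DrawOp],
    skip_intertwining_text: bool = True,
    y_tolerance: float = 0.5,  # assumed tolerance for "same vertical coordinate"
) -> List[str]:
    """Group draw operations into lines, in the order they appear in the stream."""
    lines: List[str] = []
    last_y = None
    for op in ops:
        same_line = last_y is not None and abs(op.y - last_y) <= y_tolerance
        if same_line and skip_intertwining_text and is_underline(op.text):
            # Underlining drawn over text that was already emitted: skip it.
            continue
        if same_line:
            lines[-1] += op.text
        else:
            lines.append(op.text)
        last_y = op.y
    return lines
```

With the toggle on, a title followed by an underline at the same height yields just the title. With the toggle off, the underscores are appended to the same line. Note that the sketch keeps the stream order, so it shares the limitation of any order-preserving approach: text stored out of order in the PDF stays out of order.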
Does this output the text in the correct order?
It should, yeah, but it has been some time since I worked on this... The problem is: if the PDF itself stores the text in the wrong order and repositions it with unusual offsets, there's a good chance the extracted order will still be wrong. Then again, the method that was used before my commit would handle that case even worse.
Good work though mate
(In reply to Tom-Evers's comment of Thu, Jul 12, 2018, 9:10 PM, quoted above.)
Can this be tested/pulled?
Looks like a good improvement. I'm also having white space issues that this should improve.
It should only improve things, and never break anything that isn't broken already. Could this be pulled please?
I would like to have the newest changes too, but it doesn't look like anyone will build a new package. Is there any PyPDF2 fork out there with newer packages than PyPDF2 itself? Even pdfrw is not maintained very well, so am I missing some brand new Python PDF engine on GitHub where all the effort goes?
@TZanke I just became the maintainer this month - and PyPDF2 is moving again 🚀
It seems like this PR breaks a couple of things. Could you please have a look?
Added changes proposed in issue #17