Updated extractText() #397

Closed
wants to merge 6 commits into from

Conversation

Tom-Evers

Added changes proposed in issue #17

@Tom-Evers
Author

Some lines contain multiple draw operations. For example, underlined text is drawn as the text first and the underlining ("________") second, both at the same vertical coordinate.

By default, the toggle 'skip_intertwining_text' skips the next line if intertwining text is detected.
When set to false, the text is simply inserted after the previous line.

Indentation is now also properly handled.
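
The behaviour described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: the `Fragment` tuple, the `merge_lines` function, and the reading of the toggle (a later draw operation that lands on the same vertical coordinate is either dropped or appended to the previous line) are assumptions made purely for illustration.

```python
from typing import List, NamedTuple, Optional


class Fragment(NamedTuple):
    """Hypothetical stand-in for one text draw operation on a page."""
    y: float   # vertical coordinate of the text baseline
    text: str


def merge_lines(fragments: List[Fragment],
                skip_intertwining_text: bool = True) -> List[str]:
    """Collapse consecutive draw operations that share a vertical coordinate."""
    lines: List[str] = []
    last_y: Optional[float] = None
    for frag in fragments:
        if last_y is not None and frag.y == last_y:
            # A second draw operation on the same line, e.g. the underlining.
            if not skip_intertwining_text:
                lines[-1] += frag.text  # insert after the previous text
            continue  # either skipped or merged, never a new line
        lines.append(frag.text)
        last_y = frag.y
    return lines


ops = [
    Fragment(700.0, "Title"),
    Fragment(700.0, "_____"),   # underlining drawn at the same height
    Fragment(680.0, "Body text"),
]
print(merge_lines(ops))
print(merge_lines(ops, skip_intertwining_text=False))
```

With the default setting the underlining fragment is dropped. With the toggle set to false it is appended to the line it overlaps.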

@deven96 deven96 left a comment

Does this output the text in the correct order?

@Tom-Evers
Author

It should, yeah, but it has been some time since I worked on this...

The problem is: if the PDF itself stores the text in the wrong order and repositions it with unusual offsets, there's a good chance the extracted order will still be wrong. Then again, the method that was used before my commit would be even worse in that case.
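
To make the ordering caveat concrete, here is a minimal sketch of how the order of draw operations in a content stream can differ from the visual reading order. The `(x, y, text)` triples and the `visual_order` helper are hypothetical, and the sketch assumes the default PDF coordinate system (y grows upwards) with no extra transformations. It is not what this PR implements.

```python
from typing import List, Tuple

# (x, y, text) triples in the order they appear in a hypothetical content stream.
stream_order: List[Tuple[float, float, str]] = [
    (72.0, 650.0, "second line"),
    (72.0, 700.0, "first line"),
    (150.0, 700.0, "continued"),
]


def visual_order(items: List[Tuple[float, float, str]]) -> List[str]:
    """Sort top-to-bottom, then left-to-right (larger y is higher on the page)."""
    return [text for _, _, text in sorted(items, key=lambda it: (-it[1], it[0]))]


print([text for _, _, text in stream_order])
print(visual_order(stream_order))
```

An extractor that follows stream order would emit "second line" first here, while sorting by position recovers the order a reader sees.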

@deven96

deven96 commented Jul 13, 2018 via email

@Tom-Evers
Author

Can this be tested/pulled?

@joegrange

Looks like a good improvement. I'm also having white space issues that this should improve.

@Tom-Evers
Author

It should only improve things, and never break anything that isn't broken already.

Could this be pulled please?

@TZanke

TZanke commented Nov 8, 2018

I would also like to have the newest changes, but it doesn't look like anyone will build a new package. Is there any PyPDF2 fork out there with newer packages than PyPDF2 itself? Even pdfrw is not maintained very well, so am I missing some brand-new Python PDF engine on GitHub where all the effort is going?

@MartinThoma added the labels "Tiny" (Pull requests that make a tiny change - and thus should be easy to merge) and "PdfReader" (The PdfReader component is affected) on Apr 6, 2022
@MartinThoma changed the title from "Updated extractText() according to changes proposed in issue #17" to "Updated extractText()" on Apr 16, 2022
PyPDF2/pdf.py (outdated review thread, resolved)
@MartinThoma
Member

@TZanke I just became the maintainer this month - and PyPDF2 is moving again 🚀

@MartinThoma
Member

It seems like this PR breaks a couple of things. Could you please have a look?

@MartinThoma added the labels "needs-change" (The PR/issue cannot be handled as issue and needs to be improved) and "workflow-text-extraction" (From a users perspective, text extraction is the affected feature/workflow) on Apr 16, 2022
@MartinThoma
Member

This PR addressed #17, but #924 fixed it (+ many other things), so I am closing it.

Thank you for the PR! I hope I can respond quicker in future to such improvements :-)

@MartinThoma MartinThoma closed this Jun 6, 2022
5 participants