Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add public layout-base extraction support on PDFToTextConverter #3137

Merged

Conversation

danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Sep 1, 2022

Related Issues

Proposed Changes:

Today, PDFToTextConverter has a private method which support turning on layout-based text extraction. The default option is stream ordered text extraction. Usually, this isn't an issue. However, some PDFs have a quite unfamiliar stream ordering, which is different from the physical layout.
This change implements a public parameter that makes possible to the user to choose between default stream-based extraction, or the layout-based extraction.

How did you test it?

Already existing PDFToTextConverter tests have been run again.
One specific scenario PDF file where the stream content is not the same as physical layout order has been added, and the test using the new parameter also has been included.

Notes for the reviewer

I have removed ancient comments, one doubled super().init (it happens at the start, and then it was being again called at the end, with useless effect)

Checklist

@danielbichuetti danielbichuetti requested review from a team as code owners September 1, 2022 16:18
@danielbichuetti danielbichuetti requested review from masci and removed request for a team September 1, 2022 16:18
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the additional cleaning :)

@danielbichuetti
Copy link
Contributor Author

The new documentation has been generated. CI is now green again.

@danielbichuetti
Copy link
Contributor Author

@masci Sorry to bother you, but I think this PR got lost in the Forgotten Lands. 😃

@masci
Copy link
Contributor

masci commented Sep 13, 2022

@danielbichuetti 🙈 apologies, merging now!

@masci masci merged commit df1f420 into deepset-ai:main Sep 13, 2022
@danielbichuetti danielbichuetti deleted the add_layout_parameter_to_pdf_converter branch September 13, 2022 15:00
brandenchan pushed a commit that referenced this pull request Sep 21, 2022
…#3137)

* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDFToTextConverter is not forwarding layout parameter to private method
3 participants