feat: add public layout-base extraction support on PDFToTextConverter #3137

danielbichuetti · 2022-09-01T16:18:09Z

Related Issues

fixes PDFToTextConverter is not forwarding layout parameter to private method #3131

Proposed Changes:

Today, PDFToTextConverter has a private method that supports turning on layout-based text extraction. The default is stream-ordered text extraction. Usually this isn't an issue, but some PDFs have an unusual stream ordering that differs from the physical layout.
This change adds a public parameter that lets the user choose between the default stream-based extraction and layout-based extraction.
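For illustration only, here is a minimal sketch of how a boolean layout switch could be forwarded to the `pdftotext` command-line tool, whose `-layout` flag keeps the physical layout of the page. The function names `build_pdftotext_command` and `extract_text` are hypothetical and are not taken from this PR; the sketch also assumes `pdftotext` is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pdftotext_command(file_path: Path, layout: bool = False, encoding: str = "UTF-8") -> List[str]:
    """Build the pdftotext argument list (hypothetical helper, not the PR's code)."""
    command = ["pdftotext", "-enc", encoding]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page
        # instead of emitting text in content-stream order.
        command.append("-layout")
    # "-" as the output file makes pdftotext write the text to stdout.
    command += [str(file_path), "-"]
    return command


def extract_text(file_path: Path, layout: bool = False, encoding: str = "UTF-8") -> str:
    """Run pdftotext and return the extracted text (requires pdftotext on PATH)."""
    result = subprocess.run(
        build_pdftotext_command(file_path, layout=layout, encoding=encoding),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode(encoding, errors="ignore")
```

Building the argument list separately from running the command keeps the flag handling easy to check without needing a PDF file or the binary.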

How did you test it?

The existing PDFToTextConverter tests were run again.
A PDF file whose stream content order differs from its physical layout order has been added, along with a test that uses the new parameter.
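To make the difference between the two orderings concrete, here is a toy sketch. It does not parse a real PDF and is not how the underlying extractor works internally; the `Fragment` type and the sample values are invented for illustration. It only shows that the order in which text fragments are stored can differ from the top-to-bottom, left-to-right order in which they appear on the page.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def stream_order(fragments: List[Fragment]) -> List[str]:
    # Keep the fragments in the order they were stored in the file.
    return [f.text for f in fragments]


def layout_order(fragments: List[Fragment]) -> List[str]:
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


# Invented sample: the second row was written to the stream before the first.
fragments = [
    Fragment("Total:", 50, 200),
    Fragment("Invoice", 50, 100),
    Fragment("42.00", 300, 200),
    Fragment("2022-09-01", 300, 100),
]

print(stream_order(fragments))  # ['Total:', 'Invoice', '42.00', '2022-09-01']
print(layout_order(fragments))  # ['Invoice', '2022-09-01', 'Total:', '42.00']
```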

Notes for the reviewer

I have removed outdated comments and a duplicated super().__init__() call (it was called at the start and then again at the end, to no effect).

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

masci

LGTM, thanks for the additional cleaning :)

danielbichuetti · 2022-09-05T14:20:08Z

The new documentation has been generated. CI is now green again.

danielbichuetti · 2022-09-13T10:43:29Z

@masci Sorry to bother you, but I think this PR got lost in the Forgotten Lands. 😃

masci · 2022-09-13T14:55:04Z

@danielbichuetti 🙈 apologies, merging now!

…#3137)

* feat(PDFToTextConverter): add option to get text in physical layout order
* test: add physical layout extraction test to PDFToTextConverter
* refactor: change layout parameter attribution places
* docs: manually trigger pre-commits
* docs: generate new docs to comply with pydoc-markdown style

danielbichuetti added 3 commits September 1, 2022 11:30

feat(PDFToTextConverter): add option to get text in physical layout order

3d30255

test: add physical layout extraction test to PDFToTextConverter

1e151b4

refactor: change layout parameter attribution places

575d3a0

danielbichuetti requested review from a team as code owners September 1, 2022 16:18

danielbichuetti requested review from masci and removed request for a team September 1, 2022 16:18

docs: manually trigger pre-commits

b4b3fb3

agnieszka-m approved these changes Sep 5, 2022

Merge branch 'main' into add_layout_parameter_to_pdf_converter

0bba8d5

masci approved these changes Sep 5, 2022

docs: generate new docs to comply with pydoc-markdown style

afd4451

Merge branch 'deepset-ai:main' into add_layout_parameter_to_pdf_converter

356ea9b

masci merged commit df1f420 into deepset-ai:main Sep 13, 2022

danielbichuetti deleted the add_layout_parameter_to_pdf_converter branch September 13, 2022 15:00
