Added option to strip carriage returns from extracted PDF text #4434

Closed
wants to merge 1 commit into from

@buchen buchen commented Dec 31, 2024

Apparently, some Quirin Bank documents contain single carriage returns. These carriage returns do not appear in the sample data because the CreateTextFromPDFHandler strips all carriage returns from the raw text.

The difference was introduced with commit
d88e931

To limit the impact, this change applies the stripping only to the Quirin extractor.
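
To illustrate the idea, here is a minimal sketch of stripping carriage returns from extracted text so that it matches sample data that never contained them. The class name `CarriageReturnSketch`, the method `stripCarriageReturns`, and the sample string are hypothetical and are not taken from the Portfolio Performance code base; the actual change may be structured differently.

```java
final class CarriageReturnSketch {
    private CarriageReturnSketch() {
    }

    /**
     * Hypothetical helper: removes every carriage return (U+000D) from the
     * text. Line feeds are left untouched, so "\r\n" becomes "\n" and a
     * single "\r" simply disappears.
     */
    static String stripCarriageReturns(String text) {
        return text.replace("\r", "");
    }

    public static void main(String[] args) {
        // Made-up sample: one Windows line ending and one single carriage return.
        String raw = "line one\r\nline two\rline three\n";
        String cleaned = stripCarriageReturns(raw);

        // After stripping, no carriage return is left in the text.
        System.out.println(cleaned.contains("\r"));
    }
}
```

Removing the character outright, rather than replacing it with a line feed, mirrors the description above, where the handler strips carriage returns from the raw text.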

@buchen buchen requested a review from Nirus2000 December 31, 2024 09:31
Nirus2000 commented Jan 2, 2025

Let's make a separate branch with PDFBox 3.x or higher.
We'll have to update it at some point and I'd like to do it now at the beginning of the year.
I'll certainly have to touch all the PDF importers anyway to edit the pre-packages etc. and at some point we'll have to update that too.
I would also run through all my test PDF documents to fix the first differences.

buchen added a commit that referenced this pull request Jan 3, 2025
Apparently, some Quirin Bank documents contain single carriage returns.
These carriage returns do not appear in the sample data because the
CreateTextFromPDFHandler strips all carriage returns from the raw text.

The difference was introduced with commit
d88e931

Issue: #4434
buchen commented Jan 3, 2025

I have updated the code to remove the carriage returns for all documents.
I checked: no example has a single carriage return in the text.
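
A check like the one described could be sketched as follows. This is an assumption about how one might test a text for a single carriage return (one that is not part of a `\r\n` pair); the class name `LoneCarriageReturnCheck` and its method are hypothetical, and the sketch does not claim to reproduce how the check was actually done.

```java
import java.util.regex.Pattern;

final class LoneCarriageReturnCheck {
    // Matches a carriage return that is not immediately followed by a line feed.
    private static final Pattern LONE_CR = Pattern.compile("\\r(?!\\n)");

    private LoneCarriageReturnCheck() {
    }

    /** Returns true if the text contains at least one single carriage return. */
    static boolean hasLoneCarriageReturn(String text) {
        return LONE_CR.matcher(text).find();
    }

    public static void main(String[] args) {
        // Made-up inputs: the first has only a "\r\n" pair, the second a single "\r".
        System.out.println(hasLoneCarriageReturn("first\r\nsecond"));
        System.out.println(hasLoneCarriageReturn("first\rsecond"));
    }
}
```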

Regarding:

> Let's make a separate branch with PDFBox 3.x or higher.

I have created issue #4449 to discuss how best to upgrade to PDFBox 3.x incrementally.

@buchen buchen closed this Jan 3, 2025