fix parsing spaces in russian language PDFs (#1987) #2427

Merged (1 commit) on Sep 14, 2024

Conversation

Hyperb0t
Contributor

What problem does this PR solve?

#1987

When scanning a PDF character by character, the parser dropped spaces whenever the text did not match a regex. Russian text needs those spaces, but it never matched because the regex only covered the Latin alphabet. As a result, Russian PDFs were parsed incorrectly and were almost unusable as a source. This PR fixes that by adding the Russian alphabet to the regex.
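
To illustrate the failure mode described here, below is a minimal sketch, not the parser's actual code. The function name `join_chars`, both patterns, and the rule "keep a space only when both neighbours match" are assumptions made for illustration. The explicit `ёЁ` is also an assumption of this sketch, since those two letters fall outside the `а-я` and `А-Я` ranges.

```python
import re

# Hypothetical patterns modelled on the PR description, not copied from the parser.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.?;:!%]")
# а-я and А-Я do not include ё/Ё, so they are listed explicitly (assumption).
LATIN_AND_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.?;:!%]")


def join_chars(chars, keep_space_near):
    """Join per-character text, keeping a space only when both
    neighbouring characters match the given pattern."""
    out = []
    for i, ch in enumerate(chars):
        if ch == " ":
            prev_ch = chars[i - 1] if i > 0 else ""
            next_ch = chars[i + 1] if i + 1 < len(chars) else ""
            # A space surrounded by non-matching text is dropped.
            if not (keep_space_near.match(prev_ch) and keep_space_near.match(next_ch)):
                continue
        out.append(ch)
    return "".join(out)


russian = list("договор оферты")
merged_words = join_chars(russian, LATIN_ONLY)          # space is lost
kept_words = join_chars(russian, LATIN_AND_CYRILLIC)    # space survives
```

With the Latin-only pattern the two Russian words run together, while the extended pattern keeps the space between them.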

Other languages with non-Latin alphabets may still have the same problem. I also tested a PDF in Spanish, and the old [a-zA-Z...] regex parses it correctly, with spaces preserved.
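
For the open question about other alphabets, one possible direction is to test characters with Python's Unicode-aware `str.isalnum()` instead of listing alphabets one by one. This is only a sketch under stated assumptions: `is_word_like`, `join_chars_unicode`, and the punctuation set are hypothetical names and choices, and this is not what the PR implements.

```python
# Hypothetical punctuation set, chosen for illustration only.
PUNCT = set(",.?;:!%")


def is_word_like(ch):
    """True for a single letter or digit in any script, or listed punctuation."""
    return len(ch) == 1 and (ch.isalnum() or ch in PUNCT)


def join_chars_unicode(chars):
    """Join per-character text, keeping a space only when both
    neighbouring characters are word-like in any alphabet."""
    out = []
    for i, ch in enumerate(chars):
        if ch == " ":
            prev_ch = chars[i - 1] if i > 0 else ""
            next_ch = chars[i + 1] if i + 1 < len(chars) else ""
            if not (is_word_like(prev_ch) and is_word_like(next_ch)):
                continue
        out.append(ch)
    return "".join(out)


samples = ["договор оферты", "está bien", "Καλή μέρα"]
results = [join_chars_unicode(list(s)) for s in samples]
```

One caveat: `str.isalnum()` is also true for CJK characters, so a parser that handles scripts written without spaces between words may prefer an explicit list of alphabets, as this PR does.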

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@KevinHuSh merged commit 7e75b9d into infiniflow:main on Sep 14, 2024
1 check passed
Halfknow pushed a commit to Halfknow/ragflow that referenced this pull request Nov 11, 2024
…flow#2427)

### What problem does this PR solve?

[infiniflow#1987](infiniflow#1987)

When scanning a PDF character by character, the parser dropped spaces
whenever the text did not match a regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs those spaces, but it never matched because the regex only covered
the Latin alphabet. As a result, Russian PDFs were parsed incorrectly and
were almost unusable as a source. This PR fixes that by adding the
Russian alphabet to the regex.

Other languages with non-Latin alphabets may still have the same
problem. I also tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old [a-zA-Z...] regex parses it correctly, with spaces preserved.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)