fix parsing spaces in russian language PDFs (#1987) #2427

Hyperb0t · 2024-09-14T04:13:38Z

What problem does this PR solve?

#1987

When scanning PDF files character by character, the parser dropped spaces if the string did not match the regex. Text from [Russian documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf) needs spaces, but it does not match the regex because it uses a different alphabet. As a result, these PDFs were parsed incorrectly and were almost unusable as a source. Fixed by adding the Russian alphabet to the regex.

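There might be similar problems with other languages that use different alphabets. I additionally tested a [PDF in Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816), and the old [a-zA-Z...] regex parses it correctly, with spaces.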
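
To illustrate the behavior described above, here is a minimal sketch, not the parser's actual code. The helper `join_chars` and both patterns are hypothetical; the sketch assumes a space is kept only when the character before it matches the regex, and that the fix amounts to adding the `а-яА-Я` range to the character class.

```python
import re

# Hypothetical patterns for illustration; the exact regex in the parser may differ.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.?;:!%]")
WITH_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-Я,.?;:!%]")


def join_chars(chars, keep_space_after):
    """Concatenate characters one by one, keeping a space only when the
    character before it matches the keep_space_after pattern."""
    out = ""
    for ch in chars:
        if ch == " ":
            # Assumed rule: a space survives only after a matching character.
            if out and keep_space_after.match(out[-1]):
                out += " "
        else:
            out += ch
    return out


sample = list("Договор оферты")
print(join_chars(sample, LATIN_ONLY))     # Договороферты (space dropped)
print(join_chars(sample, WITH_CYRILLIC))  # Договор оферты (space kept)
```

Note that in Unicode the letters `ё` and `Ё` sit outside the `а-я` and `А-Я` ranges, so a pattern like the one sketched here would still drop a space that follows them unless they are listed explicitly.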

Type of change

Bug Fix (non-breaking change which fixes an issue)

fix parsing spaces in russian language PDFs (infiniflow#1987)

8f3df0c

KevinHuSh merged commit 7e75b9d into infiniflow:main Sep 14, 2024
1 check passed
