Hebrew text displayed in reverse #398

Closed
bredisvictor opened this issue Mar 5, 2021 · 13 comments · Fixed by #634
Labels: bug, parsing fail (When (almost) nothing can be extracted from a given PDF)

Comments

@bredisvictor

Hello, first of all, thank you for the great parser.
The issue is that when I parse documents in Hebrew, all the text is displayed in reverse.

Thank you.

@k00ni
Collaborator

k00ni commented Mar 8, 2021

Can you provide a PDF (for our test suite) which causes this error? It must be free of charge and without any obligations.

k00ni added the "bug" and "parsing fail" (When (almost) nothing can be extracted from a given PDF) labels on Mar 8, 2021
@bredisvictor
Author

Hello Konrad,

Yes, sure, file attached.
test_hebrew.pdf

It would help me very much, because my own fix for it is not very good.

Thank you.

@k00ni
Collaborator

k00ni commented Mar 8, 2021

Thank you for the PDF.

> It would help me very much, because my own fix for it is not very good.

We had similar issues in the past. Can you post your solution here please? It might help to create a fix.

@bredisvictor
Author

bredisvictor commented Mar 8, 2021

My fix is very simple and not very good (it still needs to cover all cases); I didn't have much time for this issue and did it in an hour. I just filter the output text and reverse every word, except for English text and numbers. The problem with this solution is that some words stick together after parsing, and it is not easy to separate them correctly. I don't think that it will help you.
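
For illustration, a filter of this kind might look roughly like the sketch below. The actual code was not posted in this thread, so the function name `reverseHebrewWords` and the whitespace-based tokenization are assumptions. It reverses only tokens that contain Hebrew letters and leaves English words and numbers untouched.

```php
<?php
// Hypothetical helper, not part of smalot/pdfparser and not the author's code.
function reverseHebrewWords(string $text): string
{
    // Split on whitespace but keep the separators so spacing is preserved.
    $tokens = preg_split('/(\s+)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($tokens as $i => $token) {
        // Only touch tokens containing characters from the Hebrew block
        // (U+0590..U+05FF); English words and numbers pass through unchanged.
        if (preg_match('/[\x{0590}-\x{05FF}]/u', $token) === 1) {
            // Reverse by code point, not by byte, so the UTF-8 stays valid.
            $chars = preg_split('//u', $token, -1, PREG_SPLIT_NO_EMPTY);
            $tokens[$i] = implode('', array_reverse($chars));
        }
    }

    return implode('', $tokens);
}
```

As the comment notes, this breaks down when the parser glues words together: a token that mixes Hebrew with digits or Latin letters is reversed as a whole, which scrambles the non-Hebrew part.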

@k00ni
Collaborator

k00ni commented Mar 10, 2021

@bredisvictor can you please test again with #402?

> I don't think that it will help you.

In my opinion it is better to have half a solution than nothing. Your code showed a way to deal with the problem. I am not sure, but it might have helped @smalot to create a patch faster.

@bredisvictor
Author

bredisvictor commented Mar 10, 2021

> @bredisvictor can you please test again with #402?
>
> > I don't think that it will help you.
>
> In my opinion it is better to have half a solution than nothing. Your code showed a way to deal with the problem. I am not sure, but it might have helped @smalot to create a patch faster.

I tested it and got the same result as before the fix.
After debugging, I found the reason: execution never reaches this case and condition.
[Screenshot: Screen Shot 2021-03-10 at 12 11 24 PM]

As a result, it never enters the part of the code that does the reversal.
[Screenshot: Screen Shot 2021-03-10 at 12 12 21 PM]

I made a temporary solution that works:
[Screenshot: Screen Shot 2021-03-10 at 12 14 37 PM]

This condition checks for Hebrew characters in the string:
mb_ereg('[\x{0590}-\x{05FF}]', $text)
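
A minimal sketch of the same check as a standalone function, assuming UTF-8 input. It uses `preg_match` with the `u` modifier rather than `mb_ereg`; the function name `containsHebrew` is hypothetical and is not part of the parser.

```php
<?php
// Hypothetical helper, shown only to illustrate the condition above.
function containsHebrew(string $text): bool
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8, so
    // \x{0590}-\x{05FF} matches code points in the Unicode Hebrew block.
    return preg_match('/[\x{0590}-\x{05FF}]/u', $text) === 1;
}
```

Comparing against `1` also treats invalid UTF-8 input (where `preg_match` returns `false`) as "no Hebrew found" instead of raising an error later.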

@smalot
Owner

smalot commented Mar 10, 2021

Hi @bredisvictor
Did you try using the test_hebrew.pdf you provided?
If so, can you provide screenshots of what you expect to obtain and what you actually obtain using the issue-398 branch?
Many thanks

@bredisvictor
Author

bredisvictor commented Mar 10, 2021

> Hi @bredisvictor
> Did you try using the test_hebrew.pdf you provided?
> If so, can you provide screenshots of what you expect to obtain and what you actually obtain using the issue-398 branch?
> Many thanks

Hello Sebastien,

Yes, you are right, I tested it with a different file; test_hebrew.pdf parses fine. But some other files are not parsed.

Try this file:
resumes_sample_Servicerepresentative.pdf
Also, please look at the numbers; they look like they are randomly scattered.

In addition, I am adding one more example. In this case the reversal works, but words break into several parts.
hebrew_test_2.pdf

Thank you

@k00ni
Collaborator

k00ni commented Mar 11, 2021

Can we use these PDFs in our test environment? They must be free of charge and without any obligations.

@bredisvictor
Author

These are resume examples in Hebrew, downloaded from a free, open source.

@smalot
Owner

smalot commented Mar 19, 2021

Adding support for right-to-left languages is driving me crazy.
When I handle parts of the text, they seem to be good, but not when I echo them on the terminal.
And the mix of Hebrew and ASCII characters is really awful.

@bredisvictor
Author

> Adding support for right-to-left languages is driving me crazy.
> When I handle parts of the text, they seem to be good, but not when I echo them on the terminal.
> And the mix of Hebrew and ASCII characters is really awful.

Can I help somehow? Your parser is very important for my business.

@GreyWyvern
Contributor

The Hebrew text in the sample documents is displayed in the correct direction. The only remaining issue is that the words are spaced out unnaturally. This is a result of decodeText() not taking into account the scale factor from the text matrix Tm when deciding whether to add whitespace.
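
To illustrate why the Tm scale factor matters, here is a hypothetical sketch; it is not the library's decodeText() code. The function name `needsSpace`, its parameters, and the 0.25 ratio are assumptions made for this example. The idea is that a gap between two text chunks should be judged against the effective glyph size, which is the nominal font size multiplied by the horizontal scale of Tm = [a b c d e f].

```php
<?php
// Hypothetical helper; the name, parameters and 0.25 ratio are illustrative
// assumptions and are not taken from smalot/pdfparser.
function needsSpace(float $gap, float $fontSize, array $tm, float $ratio = 0.25): bool
{
    // Tm is [a, b, c, d, e, f]; the length of (a, b) is the horizontal scale.
    [$a, $b] = $tm;
    $scale = hypot($a, $b);

    // Effective glyph size on the page = nominal font size * matrix scale.
    $effectiveSize = $fontSize * $scale;

    // Insert a space only when the gap is large relative to that size.
    return $gap > $ratio * $effectiveSize;
}

// A font size of 1 scaled 12x by Tm: a gap of 2 units is small, so no space.
var_dump(needsSpace(2.0, 1.0, [12, 0, 0, 12, 0, 0])); // bool(false)
// The same gap with an identity matrix is twice the glyph size: add a space.
var_dump(needsSpace(2.0, 1.0, [1, 0, 0, 1, 0, 0]));   // bool(true)
```

If the scale is ignored, a document that sets a font size of 1 and enlarges the text through Tm makes every small gap look huge, which would produce the kind of spaced-out words described above.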
