Hebrew text displayed in reverse #398

bredisvictor · 2021-03-05T23:05:08Z

Hello, first of all, thank you for the great parser.
The issue is that when I parse documents in Hebrew, all the text is displayed in reverse.

Thank you.

k00ni · 2021-03-08T07:22:37Z

Can you provide a PDF (for our test suite) which causes this error? It must be free of charge and without any obligations.

bredisvictor · 2021-03-08T11:45:47Z

Hello Konrad,

Yes, sure, file attached.
test_hebrew.pdf

It will help me very much, because my own fix for it is not very good.

Thank you.

k00ni · 2021-03-08T11:49:06Z

Thank you for the PDF.

> It will help me very much, because my own fix for it is not very good.

We had similar issues in the past. Can you post your solution here please? It might help to create a fix.

bredisvictor · 2021-03-08T12:21:01Z

My fix is very simple and not very good (I still need to think through all the cases); I didn't have much time for this issue (I did it in an hour). I just filter the output text and reverse every word, except for English text and numbers. The problem with this solution is that some words stick together after parsing, and it is not easy to separate them correctly. I don't think that it will help you.
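
For readers following along, the workaround described above could be sketched roughly as follows. This is not the code from the comment (which was never posted) and not part of the parser; the helper names `reverseUtf8` and `reverseHebrewWords` are hypothetical, and the sketch assumes the input is UTF-8 and that words are separated by whitespace.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string code point by code point.
// (strrev() works on bytes and would corrupt multi-byte characters.)
function reverseUtf8(string $s): string
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}

// Hypothetical helper: reverse only the whitespace-separated words that
// contain at least one Hebrew code point (U+0590 to U+05FF). Words made
// of Latin letters or digits are left untouched.
function reverseHebrewWords(string $text): string
{
    // Keep the whitespace delimiters so the original spacing survives.
    $parts = preg_split('/(\s+)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if (preg_match('/[\x{0590}-\x{05FF}]/u', $part) === 1) {
            $parts[$i] = reverseUtf8($part);
        }
    }
    return implode('', $parts);
}

// The first word is "shalom" stored backwards; escapes avoid bidi confusion.
$fixed = reverseHebrewWords("\u{05DD}\u{05D5}\u{05DC}\u{05E9} abc 123");
```

As the comment notes, a per-word approach like this cannot help when the parser has already glued two words together, and a token that mixes Hebrew letters with digits would be reversed as a whole.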

k00ni · 2021-03-10T09:35:51Z

@bredisvictor can you please test again with #402?

> I don't think that it will help you.

In my opinion it is better to have half a solution than nothing. Your code showed a way to deal with the problem. I am not sure, but it might have helped @smalot create a patch faster.

bredisvictor · 2021-03-10T10:27:04Z

I tested it and got the same result as before the fix.
After debugging, I found the reason: execution never reaches this case and condition.

As a result, it never enters the part of the code that does the reversal.

I made a temporary solution that works.

This condition checks for Hebrew characters in the string:
mb_ereg('[\x{0590}-\x{05FF}]', $text)
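
A minimal, self-contained sketch of such a check is shown below. It swaps `mb_ereg` for `preg_match` with the `u` modifier, which performs the same range test on UTF-8 input without depending on the mbstring regex encoding setting. The function name `containsHebrew` is hypothetical and not part of the parser.

```php
<?php
// Hypothetical helper: true if $text contains at least one code point
// from the Unicode Hebrew block (U+0590 to U+05FF).
// Assumes $text is valid UTF-8; preg_match returns false on invalid
// input, which this function also reports as "no Hebrew".
function containsHebrew(string $text): bool
{
    return preg_match('/[\x{0590}-\x{05FF}]/u', $text) === 1;
}

var_dump(containsHebrew("\u{05E9}\u{05DC}\u{05D5}\u{05DD}")); // bool(true)
var_dump(containsHebrew('hello 123'));                        // bool(false)
```

Note that this range covers only the main Hebrew block; presentation forms in U+FB1D to U+FB4F would need a wider pattern.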

smalot · 2021-03-10T15:52:26Z

Hi @bredisvictor
Did you try using the test_hebrew.pdf you provided?
If so, can you provide screenshots of what you expect to get and what you actually get using the issue-398 branch?
Many thanks

bredisvictor · 2021-03-10T17:51:01Z

Hello Sebastien,

Yes, you are right, I tested it with a different file; test_hebrew.pdf is parsed fine. But some files are not parsed.

Try this file:
resumes_sample_Servicerepresentative.pdf
Also, please look at the numbers; they look like they are randomly scattered.

In addition, I am adding one more example. In this case the reversal works, but words are broken into several parts.
hebrew_test_2.pdf

Thank you

k00ni · 2021-03-11T08:43:39Z

Can we use these PDFs in our test environment? They must be free of charge and without any obligations.

bredisvictor · 2021-03-11T09:49:17Z

These are resume examples in Hebrew, downloaded from a free, open source.

smalot · 2021-03-19T08:04:11Z

Adding support for right-to-left languages is driving me crazy.
When I handle parts of the text, they seem to be good, but not when I echo them to the terminal.
And the mix of Hebrew and ASCII chars is really awful.

bredisvictor · 2021-03-19T09:51:34Z

Can I help somehow? Your parser is very important for my business.

GreyWyvern · 2023-08-10T17:57:12Z

The Hebrew text in the sample documents is displaying in the correct direction. The only remaining issue here is that it is spaced out unnaturally. This is a result of decodeText() not taking into account the scale factor from the text matrix Tm when deciding to add whitespace.
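
To illustrate why the scale factor matters, here is a rough sketch of one way a whitespace decision could account for it. This is not the logic of decodeText() or of the eventual fix; the function names, the idea of passing in a precomputed gap, and the 0.25 threshold are all assumptions made for illustration. The only PDF fact relied on is that `a b c d e f Tm` sets the text matrix, so the horizontal scale is sqrt(a² + b²).

```php
<?php
// Hypothetical helper: font size as it appears on the page, i.e. the
// size given to the Tf operator multiplied by the horizontal scale of
// the text matrix [a, b, c, d, e, f].
function effectiveFontSize(float $tfSize, array $tm): float
{
    [$a, $b] = $tm;
    return $tfSize * sqrt($a * $a + $b * $b);
}

// Hypothetical rule: treat a horizontal gap between two text runs
// (assumed to be measured already, in user-space units) as a word
// break only if it exceeds a fraction of the effective font size.
// The 0.25 default is an arbitrary illustrative value.
function needsSpace(float $gap, float $tfSize, array $tm, float $threshold = 0.25): bool
{
    return abs($gap) > $threshold * effectiveFontSize($tfSize, $tm);
}

// A font declared with size 1 but scaled 12x by Tm: a gap of 2 units is small.
$scaled = needsSpace(2.0, 1.0, [12, 0, 0, 12, 0, 0]);  // false
// Ignoring the scale (identity matrix), the same gap looks huge.
$unscaled = needsSpace(2.0, 1.0, [1, 0, 0, 1, 0, 0]);  // true
```

The second call shows the failure mode described above: comparing a gap against the unscaled font size inserts whitespace where none belongs.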

k00ni added the "parsing fail" (when (almost) nothing can be extracted from a given PDF) and "bug" labels Mar 8, 2021

smalot mentioned this issue Mar 9, 2021

Add support for Reversed Chars instruction in BMC blocs #402

Merged

GreyWyvern mentioned this issue Aug 10, 2023

PdfParser does not consider the entire document stream #628

Closed

GreyWyvern mentioned this issue Aug 18, 2023

Major Update to PDFObject.php + Ancillary #634

Merged

k00ni closed this as completed in #634 Nov 7, 2023
