Not working with Japanese #100

socrec · 2016-04-07T08:04:34Z

I need to extract text from a PDF file in Japanese, but pdfparser can't seem to recognize the file's charset encoding. It just outputs an unreadable string like

�\�w ��w �� /��y�y�y/� ��F�q��J��y�y�y/��S��M��d�y�y�y/��q�� C�y�y�y/��>�; ��C�y�y�y/��y�y�y/��]�b�;�t�K�h��y�y�y/�� y /��y�y�y�y/� �� C�y�y�y�y�y/� � a ��y�y�y�y/�� w�y�y�y/� a ��U�g�y�y�y/�� e�{�y�y�y�y/�2��" Copyright(c)2014 Daiichikizai.,Co.,Ltd All rights reserved.

Please help.


smalot · 2016-04-07T09:01:56Z

Hi,
Can you try this patch?
#96

smalot · 2016-04-07T09:05:20Z

I've just merged this PR.
So the "master" branch contains this patch.

socrec · 2016-04-07T09:18:52Z

Still, pdfparser outputs something like:

�\�w �Ö�”�´�w �Ä�¿�Ó � /£�é�¬� �ï�y�y�y/£ ý�F�q�»J�å�y�y�y/£�S�ð�M�ù�˜�d�y�y�y/£�q�þ Ø�C�y�y�y/£�>�; Ø�C�y�y�y/£�Ã�´�»�ç�Ò�¿�«�y�y�y/£�]�b�;�t�K�h�“�y�y�y/£�±� �Ä�Ú�¿�Ó�y /£�×�”�Ü�y�y�y�y/£ ý�£ Ø�C�y�y�y�y�y/£ ý a ¼�Šº�y�y�y�y/£�ª�»�î w�y�y�y/£ a ¼�U�g�y�y�y/£�§�»�é�¬ e�{�y�y�y�y/£2�Í" Copyright(c)2014

like so (image): https://www.dropbox.com/s/iq0y0w1q8qaqanr/Capture.PNG?dl=0

My page charset is UTF-8 already.
Here's the pdf file: https://www.dropbox.com/s/erj2x8c3ylfbf1b/10032016DKC_kensaku_a_0217.pdf?dl=0

sparx82 · 2016-04-07T12:52:56Z

I had a quick look at this as I'm into the whole encoding stuff at the moment anyway.

It doesn't seem to have anything to do with UTF-8 encoding (at least nothing that my patch would fix).

I copied one Japanese character out of the PDF into Notepad++ and had a look at its encoding:
-> E68EA1

UnicodeDatabase

I then saved all the commands of the PDF file and searched for the hex string above. The commands do not contain this string, so I think something goes wrong, or at least not quite right :-), while extracting the data from the PDF. But honestly, Japanese UTF-8 encodings aren't exactly my specialty...
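
For reference, E68EA1 is the UTF-8 byte sequence of a single kanji. The Python sketch below (the helper name `encodings_of` is made up for illustration) prints the same character under a few other encodings. It hints at one plausible reason the search came up empty: PDF content streams for CJK fonts usually carry font-specific character codes rather than UTF-8 text, so the UTF-8 bytes need not appear anywhere in the commands.

```python
# Decode the hex string copied from Notepad++ and show the same
# character under several encodings. Which encoding (if any) the PDF
# actually uses is NOT known from this thread; this is only a comparison.

def encodings_of(char, names=("utf-8", "utf-16-be", "shift_jis", "euc_jp")):
    """Return {encoding name: hex string} for a single character."""
    return {name: char.encode(name).hex().upper() for name in names}

utf8_hex = "E68EA1"
char = bytes.fromhex(utf8_hex).decode("utf-8")

print("code point: U+%04X" % ord(char))
for name, hex_bytes in encodings_of(char).items():
    print(f"{name:10s} {hex_bytes}")
```

If none of these byte sequences shows up in the dumped commands either, the text is probably stored as codes that only make sense through the font's CMap.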

commands.txt

socrec · 2016-04-08T02:29:33Z

I got an answer on Stack Overflow at http://stackoverflow.com/questions/36469985/extract-text-from-japanese-pdf-file?noredirect=1#comment60566488_36469985

But pardon my ignorance, I do not understand what "doesn't support predefined CMaps" means :( Please help.

sparx82 · 2016-04-08T05:12:55Z

First Google hit:
https://blog.idrsolutions.com/2012/05/understanding-the-pdf-file-format-embedded-cmap-tables/

And it seems that this pdfparser doesn't support those CMAP tables. Smalot, please correct me if I'm wrong.

smalot · 2016-04-08T06:50:46Z

I remember I worked on it.
So it should support cmap mapping tables.

sparx82 · 2016-04-08T09:03:16Z

What I found out:
PDFMiner seems to work. At least it shows me some Japanese characters.

Their CMap readme says:

[...] contains Adobe CMap resources. CMaps are required
to decode text data written in CJK (Chinese, Japanese, Korean) language.
CMap resources are now available freely from Adobe web site:
http://opensource.adobe.com/wiki/display/cmap/CMap+Resources

So these Adobe CMAP resources seem to be some kind of external translation table for CJK languages.
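
To make the "translation table" idea concrete, here is a toy Python sketch. It is not how pdfparser or PDFMiner is implemented, and the table contents are invented for illustration. It uses the `beginbfchar`/`endbfchar` syntax of an embedded ToUnicode CMap, which maps character codes directly to Unicode. The predefined Adobe CMaps discussed here instead map codes to CIDs, so a real decoder would need one more lookup step.

```python
import re

# Hypothetical ToUnicode-style fragment; the codes 0001 and 0002 are
# invented for this example and do not come from the PDF in this issue.
TOY_CMAP = """
2 beginbfchar
<0001> <63A1>
<0002> <7528>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(raw, mapping, width=2):
    """Split raw bytes into fixed-width codes and translate each one."""
    out = []
    for i in range(0, len(raw), width):
        code = int.from_bytes(raw[i:i + width], "big")
        # Codes missing from the table become U+FFFD, the replacement character.
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)

table = parse_bfchar(TOY_CMAP)
text = decode_codes(bytes.fromhex("00010002"), table)
print(text)
```

Without the right table, every code falls through to the replacement character, which looks a lot like the output posted at the top of this issue.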

jjhesk · 2016-08-12T03:24:44Z

Does it support GB or Big5? How can support for these be added? @sparx82

davispuh · 2019-09-17T13:10:17Z

Check whether PR #257 does any better.
