Not working with Japanese #100

Open
socrec opened this issue Apr 7, 2016 · 10 comments

Comments

@socrec

socrec commented Apr 7, 2016

I need to extract text from a PDF file in Japanese, but pdfparser can't seem to recognize the file's charset encoding. It just outputs an unreadable string like:

�\�w �������w ������ �� /����������y�y�y/� ��F�q�� J���y�y�y/��S���M�����d�y�y�y/��q�� ��C�y�y�y/��>�; ��C�y�y�y/����������������y�y�y/��]�b�;�t�K�h���y�y�y/����� ���������y /��������y�y�y�y/� ��� ��C�y�y�y�y�y/� � a ��� ��y�y�y�y/������� w�y�y�y/� a ��U�g�y�y�y/��������� e�{�y�y�y�y/�2��" Copyright(c)2014 Daiichikizai.,Co.,Ltd All rights reserved.

Please help.

@smalot
Owner

smalot commented Apr 7, 2016

Hi,
Can you try this patch?
#96

@smalot
Owner

smalot commented Apr 7, 2016

I've just merged this PR, so the "master" branch now contains this patch.

@socrec
Author

socrec commented Apr 7, 2016

Still, pdfparser outputs something like:

�\�w �Ö�”�´�w �Ä�¿�Ó � /£�é�¬� �ï�y�y�y/£ ý�F�q�» J�å�y�y�y/£�S�ð�M�ù�˜�d�y�y�y/£�q�þ Ø�C�y�y�y/£�>�; Ø�C�y�y�y/£�Ã�´�»�ç�Ò�¿�«�y�y�y/£�]�b�;�t�K�h�“�y�y�y/£�±� �Ä�Ú�¿�Ó�y /£�×�”�Ü�y�y�y�y/£ ý�£ Ø�C�y�y�y�y�y/£ ý a ¼�Š º�y�y�y�y/£�ª�»�î w�y�y�y/£ a ¼�U�g�y�y�y/£�§�»�é�¬ e�{�y�y�y�y/£2�Í" Copyright(c)2014

like so (image): https://www.dropbox.com/s/iq0y0w1q8qaqanr/Capture.PNG?dl=0

My page charset is UTF-8 already.
Here's the pdf file: https://www.dropbox.com/s/erj2x8c3ylfbf1b/10032016DKC_kensaku_a_0217.pdf?dl=0

@sparx82
Contributor

sparx82 commented Apr 7, 2016

I had a quick look at this as I'm into the whole encoding stuff at the moment anyway.

It doesn't seem to have anything to do with UTF-8 encoding (at least nothing that my patch would fix).

I copied one Japanese character out of the PDF into Notepad++ and had a look at its encoding:
(image of the character) -> E68EA1

UnicodeDatabase

I then saved all the commands of the PDF file and searched for the hex string above. The commands do not contain this string, so I think something goes wrong (or at least not correctly :-) ) while extracting the data from the PDF. But honestly, Japanese UTF-8 encodings aren't exactly my specialty...

commands.txt
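
For anyone following along, here is a small Python sketch of why that search can come up empty. The hex string is the character's UTF-8 form, but a PDF font is free to store text in a different byte encoding. The Shift-JIS case below is only an assumption chosen for illustration; I have not confirmed which encoding this particular file uses.

```python
# The hex string copied from Notepad++ is the character's UTF-8 byte sequence.
utf8_bytes = bytes.fromhex("E68EA1")
char = utf8_bytes.decode("utf-8")
print(f"U+{ord(char):04X}")  # Unicode code point of the copied character

# ASSUMPTION (illustration only): suppose the PDF font stored its text in a
# Shift-JIS based encoding rather than UTF-8.
sjis_bytes = char.encode("shift_jis")
print(sjis_bytes.hex().upper())

# The three UTF-8 bytes do not occur inside the two Shift-JIS bytes, so a
# search for "E68EA1" in such a content stream finds nothing.
print(utf8_bytes in sjis_bytes)  # False
```

So a missing hex string does not by itself prove that the extraction step dropped data; the same character may simply be stored under a different code.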

@socrec
Author

socrec commented Apr 8, 2016

I got an answer on Stack Overflow at http://stackoverflow.com/questions/36469985/extract-text-from-japanese-pdf-file?noredirect=1#comment60566488_36469985

But pardon my ignorance, I do not understand what "doesn't support predefined CMaps" means :( Please help.

@sparx82
Contributor

sparx82 commented Apr 8, 2016

First Google hit:
https://blog.idrsolutions.com/2012/05/understanding-the-pdf-file-format-embedded-cmap-tables/

And it seems that this pdfparser doesn't support those CMAP tables. Smalot, please correct me if I'm wrong.
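
As a rough illustration of the distinction: a composite (Type0) font's /Encoding entry is either the name of a predefined CMap, which the reader is expected to ship itself, or a reference to a CMap stream embedded in the file. The Python sketch below tells the two apart on a font dictionary given as text. The function name `classify_encoding` and both sample dictionaries are made up for this example; this is not how pdfparser works internally.

```python
import re

# Matches "/Encoding /SomeName" (a named encoding) or
# "/Encoding 12 0 R" (an indirect reference to an embedded stream).
ENCODING_RE = re.compile(
    r"/Encoding\s*(?:/(?P<name>[^\s/<>\[\]()]+)|(?P<obj>\d+)\s+(?P<gen>\d+)\s+R)"
)

def classify_encoding(font_dict):
    """Return (kind, value) for the /Encoding entry of a font dictionary string."""
    match = ENCODING_RE.search(font_dict)
    if match is None:
        return ("missing", None)
    if match.group("name"):
        # A name such as 90ms-RKSJ-H refers to a predefined CMap that is
        # NOT stored in the PDF; the parser must supply the table itself.
        return ("named", match.group("name"))
    # An object reference points at a CMap stream inside the PDF.
    return ("embedded", f"{match.group('obj')} {match.group('gen')} R")

# Hypothetical font dictionaries, written by hand for this example.
predefined = "<< /Type /Font /Subtype /Type0 /BaseFont /ExampleFont /Encoding /90ms-RKSJ-H /DescendantFonts [7 0 R] >>"
embedded = "<< /Type /Font /Subtype /Type0 /BaseFont /ExampleFont /Encoding 12 0 R /DescendantFonts [7 0 R] >>"

print(classify_encoding(predefined))  # ('named', '90ms-RKSJ-H')
print(classify_encoding(embedded))    # ('embedded', '12 0 R')
```

A parser that only handles the embedded case has nothing to look up when it meets a named CMap, which would explain garbage output for CJK text.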

@smalot
Owner

smalot commented Apr 8, 2016

I remember I worked on it.
So it should support cmap mapping tables.

@sparx82
Contributor

sparx82 commented Apr 8, 2016

What I found out: PDFMiner seems to work. At least it shows me some Japanese characters.

Their CMap README says:

[...] contains Adobe CMap resources. CMaps are required
to decode text data written in CJK (Chinese, Japanese, Korean) language.
CMap resources are now available freely from Adobe web site:
http://opensource.adobe.com/wiki/display/cmap/CMap+Resources

So these Adobe CMAP resources seem to be some kind of external translation table for CJK languages.
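
To make the "translation table" idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration: `TOY_CMAP`, `decode_with_cmap`, and the codes in the table are not taken from Adobe's CMap resources or from PDFMiner's code. Real CMaps also map character codes to CIDs through ranges rather than a flat dictionary, with a further step to reach Unicode.

```python
from typing import Mapping

# Toy stand-in for a CMap-style table: 2-byte character codes -> Unicode.
# ASSUMPTION: these codes are made up and do not come from any real CMap.
TOY_CMAP = {
    0x0001: "\u65e5",  # 日
    0x0002: "\u672c",  # 本
    0x0003: "\u8a9e",  # 語
}

def decode_with_cmap(data: bytes, cmap: Mapping[int, str]) -> str:
    """Decode big-endian 2-byte codes through a lookup table.

    Unknown codes become U+FFFD; a trailing odd byte is ignored.
    """
    chars = []
    for i in range(0, len(data) - 1, 2):
        code = int.from_bytes(data[i:i + 2], "big")
        chars.append(cmap.get(code, "\ufffd"))
    return "".join(chars)

raw = bytes.fromhex("000100020003")
print(decode_with_cmap(raw, TOY_CMAP))  # 日本語

# Without the table, every code is unknown and decodes to U+FFFD,
# similar to the replacement characters in the output reported above.
print(decode_with_cmap(raw, {}))
```

Without the right table the bytes are still all there, but nothing tells the parser which character each code stands for.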

@jjhesk

jjhesk commented Aug 12, 2016

Does it have GB or Big5 support? How can support for these be added? @sparx82

@davispuh

Check whether PR #257 does any better.


5 participants