Not working with Japanese #100
Comments
Hi,
I've just merged this PR.
still,
like so (image): https://www.dropbox.com/s/iq0y0w1q8qaqanr/Capture.PNG?dl=0 My page charset is
I got an answer on Stack Overflow at http://stackoverflow.com/questions/36469985/extract-text-from-japanese-pdf-file?noredirect=1#comment60566488_36469985 But pardon my ignorance, I do not understand what
First Google hit: And it seems that this pdfparser doesn't support those CMAP tables. Smalot, please correct me if I'm wrong.
I remember I worked on it.
What I found out: Their CMAP readme says:
So these Adobe CMAP resources seem to be some kind of external translation table for CJK languages.
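To make the "external translation table" idea a bit more concrete, here is a rough Python sketch of how such a table could be read. It assumes the `begincidrange` / `endcidrange` block syntax used in CMap resource files; the table contents (`TOY_CMAP`) and the helper names `parse_cidranges` and `code_to_cid` are made up for illustration and are not part of pdfparser or of any real Adobe CMap.

```python
import re

# Toy excerpt in the style of a CMap resource file.
# The ranges and CID numbers are illustrative assumptions, not real data.
TOY_CMAP = """
2 begincidrange
<20> <7e> 1
<8140> <817e> 633
endcidrange
"""


def parse_cidranges(cmap_text):
    """Return (first_code, last_code, first_cid) tuples from cidrange blocks."""
    ranges = []
    # Each block lists lines of the form: <lo> <hi> first_cid
    for block in re.findall(r"begincidrange(.*?)endcidrange", cmap_text, re.S):
        for lo, hi, cid in re.findall(
            r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(\d+)", block
        ):
            ranges.append((int(lo, 16), int(hi, 16), int(cid)))
    return ranges


def code_to_cid(code, ranges):
    """Map a character code to a CID, or None if no range covers it."""
    for lo, hi, first_cid in ranges:
        if lo <= code <= hi:
            # Codes inside a range map to consecutive CIDs.
            return first_cid + (code - lo)
    return None


ranges = parse_cidranges(TOY_CMAP)
print(code_to_cid(0x8141, ranges))
```

This ignores codespace ranges, which a real reader would need in order to decide whether a code is one or two bytes long, and it stops at the CID. Getting from a CID to a Unicode character would need a second table of the same kind.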
Does it have GB or Big5 support? How can support for these be added? @sparx82
Check whether PR #257 does any better.
I need to extract text from a PDF file in Japanese, but pdfparser can't seem to recognize the file's charset encoding. It just outputs an unreadable string like
Please help.