Does not parse text from pdf file #564
Comments
I've been having a similar problem with a PDF, but it turns out it's different from yours, @mapexpert. The problem in your case is how [...] My similar problem was a lot easier to fix, as it was caused by [...]

    if ($this->has('BaseEncoding')) {
        $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();
        // the code that loads Differences
    }

with

    if ($this->has('BaseEncoding')) {
        $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();
    }
    // the code that loads Differences

which is basically to take [...] I'm not capable enough at the moment to understand how to fix the problem you're having though. I gave your PDF to [...] I'm writing this hoping it will give someone a hint of where to look when they make an attempt to fix this.
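For anyone following along, the change in the comment above amounts to applying the font's Differences entries whether or not a BaseEncoding is present. Here is a minimal standalone sketch of that idea. The function name `buildEncodingTable` and the array shapes are assumptions made for illustration; this is not PdfParser's actual code or API.

```php
<?php
// Hypothetical helper, NOT part of PdfParser: builds a code => glyph-name table.
function buildEncodingTable(?array $baseEncoding, array $differences): array
{
    // Start from the base encoding when there is one, otherwise from an empty table.
    $table = $baseEncoding ?? [];

    // A PDF Differences array interleaves integer codes and glyph names:
    // an integer sets the current code, and each following name is assigned
    // to the current code, which then increases by one.
    $code = 0;
    foreach ($differences as $entry) {
        if (is_int($entry)) {
            $code = $entry;
        } else {
            $table[$code] = $entry;
            $code++;
        }
    }

    return $table;
}

// No BaseEncoding at all: the Differences entries are still applied.
$withoutBase = buildEncodingTable(null, [32, 'space', 65, 'A', 'B']);

// With a base encoding: Differences override the matching codes.
$withBase = buildEncodingTable([65 => 'X', 67 => 'C'], [65, 'A']);
```

If the Differences step only ran inside the `BaseEncoding` check, `$withoutBase` would stay empty, which matches the symptom described in the comment.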
@mapexpert, have you solved the issue?
No, I did not. I still have the issue.
@mapexpert can you try @Uplink03's suggestion and let us know whether it works: #564 (comment)
I tried @Uplink03's suggestion and it does not work in my case.
Definitely something weird going on here with fonts. If I save the file as a reduced-size PDF in Acrobat, the text issue remains. If I select all the text in the PDF in Adobe Acrobat, convert the font to Arial, then save the file, PdfParser parses the text properly. Edit: The translate table used by [...]
This appears to be fixed in the latest release, v2.7.0. Although a lot of spacing issues remain, the text is extracted successfully.
@GreyWyvern @mapexpert close this then?
Yes
Output:
b"%\t\n\x00X\x00\x00H\x00U\x00\x03\x005\x00H\x00F\x00H\x00L\x00S\x00W\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00\x06\t\n\x00\x14\t\n \x00\x1C\x00\x19\x00\x17\x00\x1A\x00\x19\x00\x17\x00\x13\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00'\x00D\x00W\x00H\t\n\x00\x14\t\n \x00\x14\x00\x12\x00\x15\x00\x16\x00\x12\x00\x15\x00\x13\x00\x15\x00\x15\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00Y\x00H\x00G\x00\x03\x00%\x00\t\n\x00$\t\n \x006\x00$\x003\x00\x03\x006\t\n\x006\t\n\n\x00R\x00O\x00G\x00\x03\x00$\x00W\x00\x03\x00%\x00U\x00D\x00Q\x00F\x00K\t\n\x00\x19\t\n
receipt1.pdf
IMHO the parser does not parse the font correctly and does not load the translate tables.
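To illustrate why this looks like a missing translation step: the output above reads as two-byte codes that were never mapped to characters. For this particular sample, adding 29 to each code happens to give readable ASCII. That offset is only an observation made on the bytes shown here, not something PdfParser does; a real fix would take the mapping from the font's own translation table (for example a ToUnicode CMap, assuming the font has one). The helper name `shiftGlyphCodes` is hypothetical.

```php
<?php
// Hypothetical helper for illustration only; NOT part of PdfParser.
function shiftGlyphCodes(string $raw, int $offset): string
{
    $text = '';
    // Read the string as big-endian 16-bit codes, two bytes at a time.
    for ($i = 0; $i + 1 < strlen($raw); $i += 2) {
        $code = (ord($raw[$i]) << 8) | ord($raw[$i + 1]);
        // Assumption: a constant offset stands in for the real translate table.
        $text .= chr($code + $offset);
    }

    return $text;
}

// Bytes copied from the output above.
$sample = "\x005\x00H\x00F\x00H\x00L\x00S\x00W";
echo shiftGlyphCodes($sample, 29), PHP_EOL; // prints "Receipt"
```

The same offset turns `\x00'\x00D\x00W\x00H` from the output into "Date", which suggests the text is present in the file and only the code-to-character mapping is missing.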