Issue parsing PDFs - UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined> #417
Comments
Update on this - the error cropped up again when I copied and pasted text from a PDF, so I dug into the code. This block appears to be the culprit (lines 324-329 of cli.py):
In my own copy, I added an ignore flag to the call that opens the text file. This skips characters that can't be decoded, which may lose some data, but I don't think that will be crippling for this package's use case.
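For anyone hitting the same thing, here is a minimal sketch of what such an ignore flag can look like. It assumes the flag is Python's built-in `errors="ignore"` argument to `open()`. The helper name `read_text_lenient` and the sample file are hypothetical and are not taken from cli.py.

```python
import tempfile
from pathlib import Path


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, silently dropping bytes that cannot be decoded.

    Hypothetical helper for illustration; not the package's own code.
    """
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        return handle.read()


# Demonstration: byte 0x8f has no mapping in cp1252, the codec that
# produces the 'charmap' error in the issue title on many Windows setups.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"river \x8f flow")

    try:
        sample.read_text(encoding="cp1252")  # strict decoding
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # Lenient decoding drops the undecodable byte and keeps the rest.
    lenient_text = read_text_lenient(sample, encoding="cp1252")
```

The trade-off is the one noted above: undecodable bytes are discarded without warning. Passing an explicit `encoding="utf-8"` is another option worth trying first, since the 'charmap' codec is usually just the platform default.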
Textract is still not working.
Might just fix this with #421.
Hi @timalamenciak - give PDF parsing a try in v1.0.2 (just released) - it now uses the option
Thrilling! That worked.
Thanks @caufieldjh !
Trying to pull in the PDF from this article throws the below error: https://onlinelibrary.wiley.com/doi/10.1002/eco.1705
I tested other PDFs as well, and they fail the same way.