Issue parsing PDFs - UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined> #417

timalamenciak · 2024-07-26T18:55:57Z

Trying to pull in the PDF from this article throws the below error: https://onlinelibrary.wiley.com/doi/10.1002/eco.1705

I tested other PDFs and got the same error.

ontogpt -vvv extract -t trek_2.yaml -i test1.pdf
INFO:root:Logger root set to level 10
INFO:root:Input file: test1.pdf
Traceback (most recent call last):
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Scripts\\ontogpt", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\Documents\Coding\TReK-OntoGPT\ontogpt\src\ontogpt\cli.py", line 329, in extract
    text = open(inputfile, "r").read()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined>

timalamenciak · 2024-08-01T17:32:51Z

Update on this: the error cropped up again when I copied and pasted text from a PDF, so I dug into the code. This block appears to be the cause (lines 324-329 of cli.py):


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r").read()

In my own copy, I added errors="ignore" to the open() call. This skips any bytes that cannot be decoded, which may lose data, but I think in this package's use case, that won't be crippling.


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r", errors="ignore").read()

Textract is still not working.
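
For anyone landing here with the same traceback, below is a minimal sketch of what is going on. On Windows, open(inputfile, "r") without an encoding argument falls back to the locale encoding (cp1252 in the traceback above), and byte 0x8f has no mapping in Python's cp1252 codec. The helper name read_text_lenient and the sample bytes are made up for illustration and are not part of ontogpt.

```python
import os
import tempfile

# Sample bytes (hypothetical): UTF-8 "café", then a stray 0x8f byte,
# which Python's cp1252 codec cannot map to any character.
RAW = b"caf\xc3\xa9 \x8f end"


def read_text_lenient(path, encoding="utf-8", errors="replace"):
    """Hypothetical helper: read text with an explicit encoding and error policy."""
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()


def demo():
    # Write the sample bytes to a temporary file so open() can be exercised.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(RAW)
        path = tmp.name
    try:
        # Strict cp1252 decoding reproduces the error from the issue.
        try:
            RAW.decode("cp1252")
            strict_failed = False
        except UnicodeDecodeError:
            strict_failed = True
        # errors="ignore" silently drops the undecodable byte.
        ignored = read_text_lenient(path, encoding="cp1252", errors="ignore")
        # errors="replace" keeps a visible U+FFFD marker where data was lost.
        replaced = read_text_lenient(path, encoding="utf-8", errors="replace")
    finally:
        os.remove(path)
    return strict_failed, ignored, replaced
```

With errors="ignore" the bad byte disappears without a trace; errors="replace" is a possible alternative if you would rather see where data was lost. Neither makes a binary PDF readable as text.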

caufieldjh · 2024-08-01T21:40:14Z

Might just fix this with #421.
In the meantime, I'll have a fix here shortly along the lines of what you suggest - though I don't recommend parsing entire PDFs with it unless you want to get a lot of unreadable characters.

Encoding errors will be ignored when parsing text from files.
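
The warning about unreadable characters follows from PDFs being binary containers whose page content is usually stored in compressed streams, so reading one as plain text mostly yields noise. As a rough sketch only, a caller could check for the PDF header before choosing the plain-text path. The function looks_like_pdf is a hypothetical name, not something ontogpt provides, and it assumes the header sits at the very start of the file.

```python
import os
import tempfile

# PDF files conventionally begin with this header, followed by a version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Hypothetical check: True if the file starts with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC


def demo():
    # Create one fake PDF header and one plain-text file, then classify both.
    results = {}
    samples = (("pdf", b"%PDF-1.7\nbinary data"), ("txt", b"plain text\n"))
    for name, payload in samples:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(payload)
        try:
            results[name] = looks_like_pdf(tmp.name)
        finally:
            os.remove(tmp.name)
    return results
```

A file that passes this check is better handed to a real PDF text extractor than to open(..., "r").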

caufieldjh · 2024-08-06T16:34:19Z

Hi @timalamenciak - give PDF parsing a try in v1.0.2 (just released) - it now uses the option --use-pdf instead of --use-textract

timalamenciak · 2024-08-06T19:00:13Z

Thrilling! That worked.

timalamenciak · 2024-08-06T19:00:27Z

Thanks @caufieldjh !

caufieldjh added the bug label Jul 30, 2024

caufieldjh linked a pull request Aug 1, 2024 that will close this issue

Partial fix for #417 #422

Merged

caufieldjh closed this as completed in #422 Aug 2, 2024

caufieldjh added a commit that referenced this issue Aug 2, 2024

Partial fix for #417 (#422)

371ba50
