[3rdparty]: paperless-ngx PDF Fails to Process with InputFileError: PDF content stream is corrupt #1413

singlatushar07 · 2024-10-30T05:34:29Z

Simple sanity checks

This is an issue with an app that uses OCRmyPDF for OCR
I am using a recent version of the third party app
I will include a file that reproduces the issue

Third party app name and version

paperless-ngx 2.13.2

Describe the bug

Description: I'm encountering an issue with ocrmypdf where a specific PDF fails to process with the error "InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF." The document opens without issues in other tools such as pikepdf and Firefox’s PDF viewer, so it appears to be well-formed. The error prevents ocrmypdf from performing OCR on the file, which blocks downstream processing in workflows like Paperless-ngx.

Steps to Reproduce:

Run ocrmypdf on the PDF file with standard options.
Observe the output error message.

Expected Behavior: The PDF should be processed without errors if it is valid, and OCR should be performed normally.

Actual Behavior: The process fails with an InputFileError: PDF content stream is corrupt - this PDF is malformed, which suggests that ocrmypdf interprets the document structure differently from other PDF readers.
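
For anyone trying to reproduce this outside Paperless-ngx, here is a minimal sketch that mirrors the first (non-fallback) argument set from the log below and passes it to `ocrmypdf.ocr`. The helper names `build_ocr_args` and `reproduce`, and the file paths in the usage note, are hypothetical; the keyword arguments are copied from the logged call, and the sketch assumes OCRmyPDF is installed with the `eng` and `hin` Tesseract language data.

```python
from pathlib import Path
from typing import Any, Dict


def build_ocr_args(input_file: str, output_dir: str) -> Dict[str, Any]:
    """Mirror the first argument set that Paperless-ngx logged for ocrmypdf.ocr."""
    out = Path(output_dir)
    return {
        "input_file": Path(input_file),
        "output_file": out / "archive.pdf",
        "use_threads": True,
        "jobs": 4,
        "language": "eng+hin",
        "output_type": "pdfa",
        "progress_bar": False,
        "color_conversion_strategy": "RGB",
        "skip_text": True,
        "clean": True,
        "deskew": True,
        "rotate_pages": True,
        "rotate_pages_threshold": 12.0,
        "sidecar": out / "sidecar.txt",
        "invalidate_digital_signatures": True,
    }


def reproduce(input_file: str, output_dir: str) -> None:
    """Call OCRmyPDF directly and report whether the same error is raised."""
    # Third-party imports are done lazily so build_ocr_args has no dependencies.
    import ocrmypdf
    from ocrmypdf.exceptions import InputFileError

    try:
        ocrmypdf.ocr(**build_ocr_args(input_file, output_dir))
    except InputFileError as exc:
        print(f"Reproduced: {exc}")
    else:
        print("OCR finished without InputFileError")
```

Usage would look like `reproduce("statement.pdf", "/tmp/ocr-test")`, with both paths replaced by real ones and the output directory created beforehand.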

Steps to reproduce

1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check log file

Files

The file triggering the issue is a bank statement.

OCRmyPDF version

16.5.0

Relevant log output

[2024-10-30 10:59:25,760] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar.txt'), 'invalidate_digital_signatures': True}

[2024-10-30 10:59:26,157] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.. Attempting force OCR to get the text.

[2024-10-30 10:59:26,158] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive-fallback.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar-fallback.txt'), 'invalidate_digital_signatures': True}

[2024-10-30 10:59:26,412] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-foh1iomn

[2024-10-30 10:59:26,416] [ERROR] [paperless.consumer] Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents

    ctm = Matrix(operands) @ ctm

          ^^^^^^^^^^^^^^^^

ValueError: ObjectList must have 6 elements

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 476, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

[2024-10-30 10:59:26,423] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents

    ctm = Matrix(operands) @ ctm

          ^^^^^^^^^^^^^^^^

ValueError: ObjectList must have 6 elements

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 476, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/tasks.py", line 148, in consume_file

    msg = plugin.run()

          ^^^^^^^^^^^^

  File "/usr/src/paperless/src/documents/consumer.py", line 508, in run

    self._fail(

  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail

    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception

documents.consumer.ConsumerError: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.


jbarlow83 · 2024-11-01T22:15:58Z

Are you able to run ocrmypdf directly on the file in question, and can you step through with a debugger to read the value of operands at the time the exception occurs?
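
As a rough alternative to a debugger session, the sketch below looks for the condition the traceback points at: a `cm` operator whose operand list does not have exactly 6 elements, which is what makes `Matrix(operands)` raise `ValueError`. The names `find_bad_cm` and `scan_pdf` are hypothetical; `pikepdf.parse_content_stream` is the pikepdf call used to read the instructions, and a reasonably recent pikepdf is assumed. It only inspects each page's own content stream, so a malformed `cm` inside a Form XObject would not be found this way.

```python
from typing import Any, Iterable, List, Sequence, Tuple


def find_bad_cm(
    instructions: Iterable[Tuple[Sequence[Any], Any]],
) -> List[Tuple[int, List[Any]]]:
    """Return (index, operands) for every `cm` operator without exactly 6 operands."""
    bad = []
    for index, (operands, operator) in enumerate(instructions):
        # A `cm` operator takes the six numbers of a transformation matrix.
        if str(operator) == "cm" and len(operands) != 6:
            bad.append((index, list(operands)))
    return bad


def scan_pdf(path: str) -> None:
    """Print suspicious `cm` instructions found in each page's content stream."""
    # Third-party import done lazily so find_bad_cm stays dependency-free.
    import pikepdf

    with pikepdf.open(path) as pdf:
        for pageno, page in enumerate(pdf.pages, start=1):
            parsed = pikepdf.parse_content_stream(page)
            pairs = [(ins.operands, ins.operator) for ins in parsed]
            for index, operands in find_bad_cm(pairs):
                print(f"page {pageno}, instruction {index}: cm with operands {operands}")
```

Running `scan_pdf` on the failing file and posting whatever it prints should give roughly the same information as reading `operands` in the debugger.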

jbarlow83 · 2024-11-03T22:55:49Z

16.6 is more aggressive about forcing repair of incoming files, but no promises, especially without test files.

singlatushar07 · 2024-11-04T04:30:15Z

I will try to run it directly and check for the reason this weekend. Will let you know about my findings.

singlatushar07 added the triage Issue needs triage label Oct 30, 2024

singlatushar07 assigned jbarlow83 Oct 30, 2024

jbarlow83 added bug need test file and removed triage Issue needs triage labels Nov 1, 2024
