[3rdparty]: paperless-ngx PDF Fails to Process with InputFileError: PDF content stream is corrupt #1413

Open
2 of 3 tasks
singlatushar07 opened this issue Oct 30, 2024 · 3 comments
@singlatushar07

Simple sanity checks

  • This is an issue with an app that uses OCRmyPDF for OCR
  • I am using a recent version of the third party app
  • I will include a file that reproduces the issue

Third party app name and version

paperless-ngx 2.13.2

Describe the bug

Description: I'm encountering an issue with ocrmypdf where a specific PDF fails to process with the error "InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF." The document opens without any issues in pikepdf and in Firefox's PDF viewer, so it appears to be well-formed. The error prevents ocrmypdf from performing OCR on the file, which blocks downstream processing in workflows like Paperless-ngx.

Steps to Reproduce:

Run ocrmypdf on the PDF file with standard options.
Observe the output error message.

Expected Behavior: The PDF should be processed without errors if it is valid, and OCR should be performed normally.

Actual Behavior: The process fails with an InputFileError: PDF content stream is corrupt - this PDF is malformed, indicating that ocrmypdf interprets the document structure differently from other PDF readers.

Steps to reproduce

1. Import attached file into Paperless-ngx
2. Trigger OCR
3. Check log file

Files

The file triggering the issue is a bank statement.

OCRmyPDF version

16.5.0

Relevant log output

[2024-10-30 10:59:25,760] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar.txt'), 'invalidate_digital_signatures': True}

[2024-10-30 10:59:26,157] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.. Attempting force OCR to get the text.

[2024-10-30 10:59:26,158] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive-fallback.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar-fallback.txt'), 'invalidate_digital_signatures': True}

[2024-10-30 10:59:26,412] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-foh1iomn

[2024-10-30 10:59:26,416] [ERROR] [paperless.consumer] Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents

    ctm = Matrix(operands) @ ctm

          ^^^^^^^^^^^^^^^^

ValueError: ObjectList must have 6 elements

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 476, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

[2024-10-30 10:59:26,423] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents

    ctm = Matrix(operands) @ ctm

          ^^^^^^^^^^^^^^^^

ValueError: ObjectList must have 6 elements

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline

    return _run_pipeline(options, plugin_manager)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline

    pdfinfo = get_pdfinfo(

              ^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo

    return PdfInfo(

           ^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__

    self._pages = _pdf_pageinfo_concurrent(

                  ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__

    self._execute(

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute

    result = future.result()

             ^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result

    return self.__get_result()

           ^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync

    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo

    for info in _process_content_streams(

                ^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents

    raise InputFileError(

ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap

    raise exc_info[1]

  File "/usr/src/paperless/src/documents/consumer.py", line 476, in run

    document_parser.parse(self.working_copy, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse

    raise ParseError(f"{e.__class__.__name__}: {e!s}") from e

documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/tasks.py", line 148, in consume_file

    msg = plugin.run()

          ^^^^^^^^^^^^

  File "/usr/src/paperless/src/documents/consumer.py", line 508, in run

    self._fail(

  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail

    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception

documents.consumer.ConsumerError: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
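
The innermost frame in the traceback above fails at `Matrix(operands)` with "ObjectList must have 6 elements", which suggests a `cm` (concatenate matrix) operator in the page content stream that was not preceded by exactly six numbers. The following is only a rough sketch of how one might look for that pattern in an already-decompressed content stream. The function name `bad_cm_operands` and the sample bytes are hypothetical, the tokenizer is deliberately simplified (it ignores strings, arrays, names and inline images), and it says nothing about what the affected file actually contains.

```python
import re

# Simplified tokenizer (assumption): only numbers and bare operator keywords.
# A real content stream also has strings, arrays, names and inline images,
# which this sketch does not handle.
TOKEN = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)|[A-Za-z*]+")
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)\Z")


def bad_cm_operands(stream: bytes) -> list[list[float]]:
    """Return the operand list of every `cm` that does not have six numbers."""
    bad = []
    operands = []
    for tok in TOKEN.findall(stream):
        if NUMBER.match(tok):
            operands.append(float(tok))
            continue
        if tok == b"cm" and len(operands) != 6:
            bad.append(operands)
        operands = []  # any operator consumes the pending operands
    return bad


# Hypothetical sample: the first `cm` has six operands, the second only four.
sample = b"q 1 0 0 1 72 720 cm Q q 0.5 0 0 0.5 cm Q"
print(bad_cm_operands(sample))  # [[0.5, 0.0, 0.0, 0.5]]
```

For a real file, a proper parser such as `pikepdf.parse_content_stream` would be a safer way to enumerate operators and their operands than this regex-based scan.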
@singlatushar07 singlatushar07 added the triage Issue needs triage label Oct 30, 2024
@jbarlow83
Collaborator

Are you able to run ocrmypdf directly on the file in question, and can you step through with a debugger to read the value of operands at the time the exception occurs?
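
If attaching an interactive debugger is awkward (for example inside a container), one possible alternative is to catch the exception and read the local variable out of the traceback frames. This is a minimal sketch using only the standard library. `find_local` and `interpret_contents_stand_in` are hypothetical names, and the stand-in merely imitates the shape of the failure seen in the log (a `ValueError` re-raised as another exception); it does not call ocrmypdf.

```python
def find_local(exc, name):
    """Return local `name` from the deepest frame of the root exception, or None."""
    # Walk back through "direct cause" / "during handling" links
    # to the exception that was raised first.
    while (earlier := exc.__cause__ or exc.__context__) is not None:
        exc = earlier
    found = None
    tb = exc.__traceback__
    while tb is not None:
        if name in tb.tb_frame.f_locals:
            found = tb.tb_frame.f_locals[name]  # deeper frames overwrite
        tb = tb.tb_next
    return found


def interpret_contents_stand_in(operands):
    # Hypothetical stand-in for the failing step in the traceback.
    try:
        if len(operands) != 6:
            raise ValueError("ObjectList must have 6 elements")
    except ValueError as e:
        raise RuntimeError("PDF content stream is corrupt") from e


try:
    interpret_contents_stand_in([1, 0, 0, 1])
except RuntimeError as err:
    print(find_local(err, "operands"))  # [1, 0, 0, 1]
```

In practice the `try` block would wrap the direct ocrmypdf call on the file, and the `except` clause would catch the `InputFileError` shown in the log.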

@jbarlow83 jbarlow83 added bug need test file and removed triage Issue needs triage labels Nov 1, 2024
@jbarlow83
Collaborator

16.6 is more aggressive about forcing repair of incoming files, but no promises, especially without test files.

@singlatushar07
Author

I will try to run it directly and check for the reason this weekend. Will let you know about my findings.
