This is an issue with an app that uses OCRmyPDF for OCR
I am using a recent version of the third party app
I will include a file that reproduces the issue
Third party app name and version
paperless-ngx 2.13.2
Describe the bug
Description: ocrmypdf fails to process a specific PDF and reports "InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF." The document opens without problems in pikepdf and in Firefox's PDF viewer, so it appears to be well-formed. The error prevents ocrmypdf from performing OCR on the file, which blocks downstream processing in workflows such as Paperless-ngx.
Steps to Reproduce:
Run ocrmypdf on the PDF file with standard options.
Observe the output error message.
Expected Behavior: The PDF should be processed without errors if it is valid, and OCR should be performed normally.
Actual Behavior: The process fails with an "InputFileError: PDF content stream is corrupt - this PDF is malformed", which suggests that ocrmypdf interprets the document's content stream differently from other PDF readers.
The file triggering the issue is a bank statement.
OCRmyPDF version
16.5.0
Relevant log output
[2024-10-30 10:59:25,760] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar.txt'), 'invalidate_digital_signatures': True}
[2024-10-30 10:59:26,157] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.. Attempting force OCR to get the text.
[2024-10-30 10:59:26,158] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngx9q0od5nk/3611XXXXXXXX67_02-10-2024-1_password_removed.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-foh1iomn/archive-fallback.pdf'), 'use_threads': True, 'jobs': 4, 'language': 'eng+hin', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-foh1iomn/sidecar-fallback.txt'), 'invalidate_digital_signatures': True}
[2024-10-30 10:59:26,412] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-foh1iomn
[2024-10-30 10:59:26,416] [ERROR] [paperless.consumer] Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents
ctm = Matrix(operands) @ ctm
^^^^^^^^^^^^^^^^
ValueError: ObjectList must have 6 elements
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline
return _run_pipeline(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
pdfinfo = get_pdfinfo(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
return PdfInfo(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__
self._pages = _pdf_pageinfo_concurrent(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent
executor(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
result = future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync
return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__
self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo
for info in _process_content_streams(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams
contentsinfo = _interpret_contents(container, initial_shorthand)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents
raise InputFileError(
ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline
return _run_pipeline(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
pdfinfo = get_pdfinfo(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
return PdfInfo(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__
self._pages = _pdf_pageinfo_concurrent(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent
executor(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
result = future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync
return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__
self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo
for info in _process_content_streams(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams
contentsinfo = _interpret_contents(container, initial_shorthand)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents
raise InputFileError(
ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
raise exc_info[1]
File "/usr/src/paperless/src/documents/consumer.py", line 476, in run
document_parser.parse(self.working_copy, mime_type, self.filename)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse
raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
[2024-10-30 10:59:26,423] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 243, in _interpret_contents
ctm = Matrix(operands) @ ctm
^^^^^^^^^^^^^^^^
ValueError: ObjectList must have 6 elements
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 376, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline
return _run_pipeline(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
pdfinfo = get_pdfinfo(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
return PdfInfo(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__
self._pages = _pdf_pageinfo_concurrent(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent
executor(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
result = future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync
return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__
self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo
for info in _process_content_streams(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams
contentsinfo = _interpret_contents(container, initial_shorthand)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents
raise InputFileError(
ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 427, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 223, in run_pipeline
return _run_pipeline(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
pdfinfo = get_pdfinfo(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
return PdfInfo(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1153, in __init__
self._pages = _pdf_pageinfo_concurrent(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 809, in _pdf_pageinfo_concurrent
executor(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
result = future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 758, in _pdf_pageinfo_sync
return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 873, in __init__
self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 924, in _gather_pageinfo
for info in _process_content_streams(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 662, in _process_content_streams
contentsinfo = _interpret_contents(container, initial_shorthand)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 245, in _interpret_contents
raise InputFileError(
ocrmypdf.exceptions.InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
raise exc_info[1]
File "/usr/src/paperless/src/documents/consumer.py", line 476, in run
document_parser.parse(self.working_copy, mime_type, self.filename)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 439, in parse
raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
documents.parsers.ParseError: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/documents/tasks.py", line 148, in consume_file
msg = plugin.run()
^^^^^^^^^^^^
File "/usr/src/paperless/src/documents/consumer.py", line 508, in run
self._fail(
File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail
raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: Error occurred while consuming document 3611XXXXXXXX67_02-10-2024-1_password_removed.pdf: InputFileError: PDF content stream is corrupt - this PDF is malformed. Use a PDF editor that is capable of visually inspecting the PDF.
Are you able to run ocrmypdf directly on the file in question? If so, can you step through with a debugger and read the value of operands at the time the exception occurs?
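If attaching a debugger inside the Paperless-ngx container is awkward, a rough alternative is to scan the content streams with pikepdf and list every `cm` operator that does not carry exactly six operands, since the traceback shows `Matrix(operands)` failing with "ObjectList must have 6 elements". This is only a sketch: `find_bad_cm` and `scan_pdf` are hypothetical helper names, not part of ocrmypdf or pikepdf. It checks page-level content streams only, so a bad operator inside a Form XObject would not be found, and `pikepdf.parse_content_stream` may itself raise if the stream is badly damaged.

```python
from typing import Any, Dict, Iterable, List, Tuple


def find_bad_cm(instructions: Iterable[Tuple[Any, Any]]) -> List[Tuple[int, List[str]]]:
    """Return (index, operands) for each 'cm' instruction without 6 operands.

    `instructions` is any iterable of (operands, operator) pairs, which is
    the shape pikepdf.parse_content_stream() yields.
    """
    bad = []
    for index, (operands, operator) in enumerate(instructions):
        # 'cm' concatenates a matrix onto the CTM and needs six numbers.
        if str(operator) == "cm" and len(operands) != 6:
            # Store reprs so the result stays usable after the PDF is closed.
            bad.append((index, [repr(o) for o in operands]))
    return bad


def scan_pdf(path: str) -> Dict[int, List[Tuple[int, List[str]]]]:
    """Map 1-based page numbers to the suspicious 'cm' instructions found."""
    import pikepdf  # imported here so find_bad_cm works without pikepdf

    results = {}
    with pikepdf.open(path) as pdf:
        for pageno, page in enumerate(pdf.pages, start=1):
            bad = find_bad_cm(pikepdf.parse_content_stream(page))
            if bad:
                results[pageno] = bad
    return results
```

Calling `scan_pdf("statement.pdf")` (a placeholder filename) and posting the returned dictionary would show which page and which operand list trigger the error, which is the same information the debugger session is meant to reveal.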