Missing Catalog - error message upon the OCR process in the workflow of a PDF #34

Closed
frankbootmaker opened this issue Nov 5, 2020 · 11 comments


@frankbootmaker

frankbootmaker commented Nov 5, 2020

I created the workflow, but when it tries to OCR a PDF with an image inside, the following block of errors is listed in the NC 20.0.1 log:

Error workflow_ocr Exception: Missing catalog.
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 71: Smalot\PdfParser\Document->getPages()
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
  /var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
  /var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
  /var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
  /var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error PHP Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#217
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 217: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 217, { id: "7_0", ... 2})
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
  /var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
  /var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
  /var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
  /var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
  /var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error PHP Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#217
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 217: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 217, { id: "7_0", ... 2})
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
  /var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
  /var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
  /var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
  /var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
  /var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error PHP Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#152
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 152: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 152, { id: "7_0", ... 2})
  /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
  /var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
  /var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
  /var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
  /var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
  /var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
  /var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
  /var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

The PHP errors are always reported in blocks, at these lines of Parser.php:

217
217
152

The PDF is not processed, and no OCR version is created.

@R0Wi
Contributor

R0Wi commented Nov 5, 2020

Thanks for reporting this. It seems there is an issue with the PDF parser component. Since we're currently migrating to OCRmyPDF in the backend (#32), I don't think we will invest much time in the old stack. Maybe you would like to try the new backend? If so, please install OCRmyPDF on your system and check out the branch https://github.com/R0Wi/workflow_ocr/tree/feature/support_ocrmypdf%2332.

Let me know if you need further assistance. @bahnwaerter FYI
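Before switching to the new backend, it can help to confirm that the `ocrmypdf` executable is actually on the PATH of the user running the background jobs. The following is a minimal sketch under assumptions, not part of workflow_ocr: the helper names `find_ocrmypdf` and `build_ocr_command` and the file names are hypothetical, and the sketch only builds a command line without running it. `--skip-text` is an OCRmyPDF option that leaves pages which already contain text untouched.

```python
import shutil
from typing import List, Optional


def find_ocrmypdf() -> Optional[str]:
    """Return the full path of the ocrmypdf executable, or None if it is not on PATH."""
    return shutil.which("ocrmypdf")


def build_ocr_command(input_pdf: str, output_pdf: str, skip_text: bool = False) -> List[str]:
    """Build the argument list for an OCRmyPDF run (hypothetical helper; nothing is executed)."""
    command = ["ocrmypdf"]
    if skip_text:
        # --skip-text tells OCRmyPDF not to OCR pages that already have text.
        command.append("--skip-text")
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    if find_ocrmypdf() is None:
        print("ocrmypdf not found on PATH; install it before trying the new backend")
    else:
        # scan.pdf and scan_ocr.pdf are placeholder file names.
        print("would run:", " ".join(build_ocr_command("scan.pdf", "scan_ocr.pdf")))
```

Whether the app itself passes any such option is not stated in this thread.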

@frankbootmaker
Author

frankbootmaker commented Nov 5, 2020 via email

@R0Wi
Contributor

R0Wi commented Nov 5, 2020

@frankbootmaker Which engine do you mean? Could you share a link, please?

@frankbootmaker
Author

frankbootmaker commented Nov 7, 2020 via email

@R0Wi
Contributor

R0Wi commented Nov 8, 2020

Well, usually files shouldn't be processed multiple times because we use the NC background job queue. Once a file is processed, it should be removed from that queue and therefore never be touched again unless there was an action on it that matches your workflow filter. Then, of course, it would be added to the queue again.

Let me investigate that; I'll keep you up to date. In the meantime, could you please share your workflow configuration settings?
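The queue behaviour described above can be illustrated with a toy model. This is a sketch only: `JobQueue` is a hypothetical stand-in, not the Nextcloud background job API, and the file name is a placeholder. It shows that a job leaves the queue when it runs, so a file is only processed again if something adds it back.

```python
from collections import deque
from typing import Deque, List


class JobQueue:
    """Toy stand-in for a background job queue (hypothetical, not the Nextcloud API)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def add(self, file_path: str) -> None:
        """Queue a file for processing, e.g. because a workflow rule matched."""
        self._pending.append(file_path)

    def run_all(self) -> List[str]:
        """Process every queued file once; each job is removed as it runs."""
        processed = []
        while self._pending:
            processed.append(self._pending.popleft())
        return processed


queue = JobQueue()
queue.add("scan.pdf")          # placeholder file name
first_run = queue.run_all()    # the file is processed and dropped from the queue
second_run = queue.run_all()   # nothing left to do unless the file is queued again
```

In this model a second run processes nothing, which is why repeated processing points at something re-adding the file rather than at the queue itself.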

@frankbootmaker
Author

frankbootmaker commented Nov 8, 2020 via email

@R0Wi
Contributor

R0Wi commented Nov 9, 2020

Unfortunately, I cannot see the image 🙈

@frankbootmaker
Author

Hi, I am pasting it in here as well.

image

@R0Wi
Contributor

R0Wi commented Nov 10, 2020

I think File updated is the issue here. Please remove it, since it retriggers the flow after the OCR process creates a new version of the file. I think there is an error in the documentation: now that we're using OCRmyPDF, we no longer skip PDF files which already have a text layer, so with the configuration you showed you'll get a kind of infinite loop.

Please change the "When" condition to only File created and let me know if this helps.
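The retrigger loop described above can be simulated in a few lines. This is a simplified model under assumptions, not the app's code: the event names `"created"` and `"updated"` and the function `count_ocr_runs` are hypothetical, and the model assumes every OCR run writes a new file version, which in turn emits an `"updated"` event. A cap (`max_runs`) stops the simulation so it terminates.

```python
from typing import Set


def count_ocr_runs(triggers: Set[str], max_runs: int = 10) -> int:
    """Simulate one uploaded file and return how many OCR runs happen, capped at max_runs."""
    pending = ["created"]  # uploading the file emits a single "created" event
    runs = 0
    while pending and runs < max_runs:
        event = pending.pop(0)
        if event in triggers:
            runs += 1
            # Assumption: saving the OCR result creates a new version,
            # which the server reports as an "updated" event.
            pending.append("updated")
    return runs


only_created = count_ocr_runs({"created"})                # 1: the update is ignored
with_updated = count_ocr_runs({"created", "updated"})     # hits the cap of 10
```

With only File created the file is processed once; adding File updated makes each run schedule the next one, so only the artificial cap ends the simulation.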

@R0Wi R0Wi mentioned this issue Nov 10, 2020
5 tasks
@frankbootmaker
Author

frankbootmaker commented Nov 10, 2020 via email

@R0Wi
Contributor

R0Wi commented Nov 19, 2020

We tackled the "infinite loop" issue in our implementation for #32, so the next release won't have this problem anymore and you should also be able to use the File updated condition. Closing this for now, since all issues described here seem to be fixed.

@R0Wi R0Wi closed this as completed Nov 19, 2020