Missing Catalog - error message upon the OCR process in the workflow of a PDF #34

frankbootmaker · 2020-11-05T09:40:37Z

I created the workflow, but when it tries to OCR a PDF with an image inside, the following block of errors is listed in the NC 20.0.1 log:

Error	workflow_ocr	Exception: Missing catalog.
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 71: Smalot\PdfParser\Document->getPages()
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
/var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
/var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
/var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
/var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error	PHP	Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#217
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 217: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 217, { id: "7_0", ... 2})
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
/var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
/var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
/var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
/var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
/var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error	PHP	Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#217
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 217: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 217, { id: "7_0", ... 2})
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
/var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
/var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
/var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
/var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
/var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

Error	PHP	Error: Trying to access array offset on value of type int at /var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php#152
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 152: OC\Log\ErrorHandler::onError(8, "Trying to a ... t", "/var/www/ht ... p", 152, { id: "7_0", ... 2})
/var/www/html/custom_apps/workflow_ocr/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php - line 109: Smalot\PdfParser\Parser->parseObject("7_0", [ "null","null",220450], Smalot\PdfParser\Document {})
/var/www/html/custom_apps/workflow_ocr/lib/Wrapper/PdfParserWrapper.php - line 44: Smalot\PdfParser\Parser->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 67: OCA\WorkflowOcr\Wrapper\PdfParserWrapper->parseContent(null)
/var/www/html/custom_apps/workflow_ocr/lib/OcrProcessors/PdfOcrProcessor.php - line 47: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->getPagesTextInfo(null)
/var/www/html/custom_apps/workflow_ocr/lib/Service/OcrService.php - line 42: OCA\WorkflowOcr\OcrProcessors\PdfOcrProcessor->ocrFile(null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 192: OCA\WorkflowOcr\Service\OcrService->ocrFile("application/pdf", null)
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 157: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->ocrFile(OC\Files\Node\File {})
/var/www/html/custom_apps/workflow_ocr/lib/BackgroundJobs/ProcessFileJob.php - line 96: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->processFile("/admin/file ... f")
/var/www/html/lib/private/BackgroundJob/Job.php - line 52: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob->run({ filePath: ... "})
/var/www/html/lib/private/BackgroundJob/QueuedJob.php - line 46: OC\BackgroundJob\Job->execute(OC\BackgroundJob\JobList {}, OC\Log {})
/var/www/html/cron.php - line 127: OC\BackgroundJob\QueuedJob->execute(OC\BackgroundJob\JobList {}, OC\Log {})

The errors are always reported in blocks, at these Parser.php line numbers:

217
217
152

The PDF is not processed, and no OCR version is created.

R0Wi · 2020-11-05T13:38:08Z

Thanks for reporting this. It seems there is an issue with the PDF Parser component. Since we're currently working on a migration to OCRmyPDF in the backend (#32), I don't think we will invest much time in the old stack. Would you like to try the new backend? If so, please install OCRmyPDF on your system and check out the branch https://github.com/R0Wi/workflow_ocr/tree/feature/support_ocrmypdf%2332.

Let me know if you need further assistance. @bahnwaerter FYI

frankbootmaker · 2020-11-05T14:36:50Z

Hi, OK, I will give it a try 😊 By the way, do you also plan a replacement for the File -> PDF generator app engine? Currently, for that to work, the entire LibreOffice suite has to be installed … Cheers, Feri

R0Wi · 2020-11-05T15:04:40Z

@frankbootmaker which engine do you mean? Could you share a link, please?

frankbootmaker · 2020-11-07T19:42:23Z

Hi R0wi, let's leave that topic for now… I installed the OCRmyPDF version of the workflow_ocr app, and it does OCR PDFs with an image inside. However, even after it has OCRed a PDF once and created a new version of it, it does so again on the next run… and the more files I upload, the longer processing takes, because all previously OCRed files are OCRed again. Isn't there a way to detect that a file was already processed and skip it on subsequent runs?

R0Wi · 2020-11-08T08:55:08Z

Well, usually files shouldn't be processed multiple times because we use the NC background job queue. Once a file is processed, it should be removed from that queue and therefore never be touched again unless an action on it matches your workflow filter. Then, of course, it would be added to the queue again.

Let me investigate that; I'll keep you up to date. In the meantime, could you please share your workflow configuration settings?
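The queue behaviour described above can be pictured with a small simulation. This is only a minimal sketch, not Nextcloud's actual job list code: `JobQueue`, `run_cron`, and the file path are hypothetical names chosen for illustration.

```python
from collections import deque


class JobQueue:
    """Toy stand-in for a background job queue (hypothetical, not Nextcloud's API)."""

    def __init__(self):
        self._jobs = deque()
        self.processed = []  # every path that was handed to the (simulated) OCR step

    def add(self, path):
        """Queue a file for processing, e.g. because a workflow rule matched."""
        self._jobs.append(path)

    def run_cron(self):
        """Process every queued job once; a job is removed as soon as it is taken."""
        while self._jobs:
            path = self._jobs.popleft()
            self.processed.append(path)

    def __len__(self):
        return len(self._jobs)


queue = JobQueue()
queue.add("/admin/files/scan.pdf")  # hypothetical path
queue.run_cron()  # processes the file and removes it from the queue
queue.run_cron()  # queue is empty, so nothing is processed a second time
```

In this model a file is only processed again if something calls `add` for it again, which is what a matching workflow event would do.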

frankbootmaker · 2020-11-08T21:47:04Z

Hi R0wi, I currently have the workflow below set up: [inline image attachment] Cheers, Feri

R0Wi · 2020-11-09T05:54:08Z

Unfortunately I cannot see the image 🙈

frankbootmaker · 2020-11-10T10:41:14Z

Hi, I am pasting it in here as well.

R0Wi · 2020-11-10T13:05:12Z

I think File updated is the issue here. Please remove it, since it retriggers the flow after the OCR process creates a new version. I think there is an error in the documentation: since we're now using OCRmyPDF, we no longer skip PDF files that already have a text layer, so with the configuration you mentioned you'll get a kind of infinite loop.

Please change the "When" condition to only File created and let me know if this helps.
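The retrigger loop described above can be sketched as a small simulation. It assumes, as the comment suggests, that each OCR run writes a new file version and that this write fires a "File updated" event. The `simulate` function, the event names, and the file name are hypothetical; this is not the app's real code.

```python
from collections import deque


def simulate(triggers, max_runs=5):
    """Return how often one uploaded PDF gets OCRed within max_runs cron runs."""
    queue = deque()
    ocr_count = 0

    def emit(event, path):
        # The flow queues a job whenever an event matches a configured trigger.
        if event in triggers:
            queue.append(path)

    emit("created", "scan.pdf")  # the user uploads the file once

    for _ in range(max_runs):
        pending = list(queue)  # jobs queued before this cron run
        queue.clear()
        for path in pending:
            ocr_count += 1          # OCR runs and writes a new file version...
            emit("updated", path)   # ...which fires a "File updated" event
    return ocr_count


print(simulate({"created", "updated"}))  # 5: re-queued on every run
print(simulate({"created"}))             # 1: processed once, then left alone
```

With both triggers configured, the count grows with every cron run, which matches the repeated processing reported earlier in the thread. With only File created, the file is OCRed once.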

frankbootmaker · 2020-11-10T23:17:29Z

Yes, this trick seems to work! 😊

R0Wi · 2020-11-19T06:26:52Z

We tackled the "infinite loop" issue in our implementation at #32 so the next release won't have this issue anymore and you should be able to also use the File updated condition. Closing this for now since all issues described here seem to be fixed.

R0Wi mentioned this issue Nov 10, 2020

Integrate OCRmyPDF #32

Closed

5 tasks

R0Wi closed this as completed Nov 19, 2020
