Missing Catalog - error message upon the OCR process in the workflow of a PDF #34
Comments
Thanks for reporting this. It seems there is an issue with the PDF Parser component. Since we're currently migrating the backend to OCRmyPDF (#32), I don't think we will invest much time in the old stack. Maybe you would like to try the new backend? If so, please install OCRmyPDF on your system and check out the branch https://github.com/R0Wi/workflow_ocr/tree/feature/support_ocrmypdf%2332. Let me know if you need further assistance. @bahnwaerter FYI
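A quick sanity check before switching branches could look like the sketch below. It only assumes that the OCRmyPDF command-line tool is installed under the name ocrmypdf; the helper name find_executable is made up for this illustration and is not part of workflow_ocr.

```python
import shutil


def find_executable(name: str = "ocrmypdf"):
    """Return the full path of `name` if it is on PATH, else None."""
    # shutil.which performs the same lookup a shell would do.
    return shutil.which(name)


path = find_executable()
print(path if path else "ocrmypdf not found on PATH")
```

If the lookup fails for the user the web server runs as, the app would most likely not find the tool either, so it is worth running the check as that user.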
Hi,
OK, I will give it a try 😊
By the way, do you also plan a replacement for the File -> PDF generator app engine?
Currently, the whole of LibreOffice has to be installed for that to work …
Cheers,
Feri
@frankbootmaker Which engine do you mean? Could you share a link, please?
Hi R0wi,
let’s leave that topic for now…
I installed the OCRmyPDF version of the workflow_ocr app, and it does OCR the PDFs with an image inside.
However, even though it OCRed the PDF once and created a new version of it, it does so again on the next run… and the more files
I upload, the longer processing takes, because all previously OCRed files are OCRed again.
Isn’t there a way to detect whether a file was already processed and skip it in subsequent runs?
Well, usually files shouldn't be processed multiple times because we use the NC background job queue. Once a file is processed, it is removed from that queue and should never be touched again unless an action on it matches your workflow filter; in that case it is, of course, added to the queue again. Let me investigate this; I'll keep you up to date. In the meantime, could you please share your workflow configuration?
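The queue behaviour described above can be sketched roughly as follows. This is a toy model, not the Nextcloud background job API: the class OcrQueue, its methods and the event names "created" and "updated" are all hypothetical and exist only for this illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OcrQueue:
    """Toy stand-in for a background job queue (not the Nextcloud API)."""
    triggers: set  # event names the workflow filter reacts to
    jobs: deque = field(default_factory=deque)
    processed: list = field(default_factory=list)

    def on_event(self, event: str, path: str) -> None:
        # Only events matching the workflow filter enqueue a job.
        if event in self.triggers and path not in self.jobs:
            self.jobs.append(path)

    def run_once(self) -> None:
        # Each job is popped before it runs, so it cannot run twice
        # unless a new matching event re-adds the file.
        while self.jobs:
            self.processed.append(self.jobs.popleft())


queue = OcrQueue(triggers={"created"})
queue.on_event("created", "scan.pdf")
queue.run_once()
queue.run_once()  # nothing left to do
print(queue.processed)  # ['scan.pdf']
```

In this model a file is only processed a second time when another event that matches the filter arrives for it.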
Hi R0wi,
I have the below workflow currently set up:
[inline image of the workflow configuration; not preserved]
Cheers,
Feri
Unfortunately I cannot see the image 🙈
I think File updated is the issue here. Please remove it, since it retriggers the flow after the OCR process creates a new version. I think there is an error in the documentation: since we now use OCRmyPDF, we no longer skip PDF files that already have a text layer, so with the configuration you mentioned you get a kind of infinite loop. Please change the "when" to only File created and let me know if this helps.
Yes, this trick seems to work! 😊
We tackled the "infinite loop" issue in our implementation at #32 so the next release won't have this issue anymore and you should be able to also use the |
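To illustrate the loop discussed in this thread, here is a small simulation. It is a toy model under stated assumptions, not the fix implemented in #32: the function simulate, the event names and the max_runs cap are invented for this example. The only behaviour taken from the thread is that every OCR run saves a new file version, which in turn counts as a file update.

```python
def simulate(triggers, max_runs=5):
    """Count OCR runs for one uploaded file under a given trigger set.

    Toy model: every OCR run saves a new file version, which fires an
    "updated" event. max_runs caps the otherwise endless loop.
    """
    pending = ["created"]  # the initial upload
    runs = 0
    while pending and runs < max_runs:
        event = pending.pop(0)
        if event not in triggers:
            continue  # the workflow filter ignores this event
        runs += 1                  # OCR processes the file ...
        pending.append("updated")  # ... and writes a new version
    return runs


print(simulate({"created", "updated"}))  # 5 (stopped only by the cap)
print(simulate({"created"}))             # 1
```

With both triggers enabled, only the artificial cap stops the processing; with the created trigger alone, the file is processed exactly once, which matches the workaround that worked here.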
I created the workflow, but when it processes a PDF with an image inside for OCR, this block of errors is listed in the NC 20.0.1 log:
The errors are always reported in blocks with the lines:
217
217
152
The PDF is not processed and no OCR version is created.