Missing languages #140
Comments
Hi, you're right, we already discussed that feature in #135 and #119. This is definitely a feature we want to add, but I can't give you a time horizon for it. Something we have to check first is how we could get all installed languages for OcrMyPdf in a system-independent manner. @bahnwaerter, any ideas for a command?
Thanks for the fast reply. Is there any other way I could change the installed app files and add the required languages manually in the meantime? I tried to alter the JavaScript files, but the changes do not show up in Nextcloud.
Tesseract does that nicely: |
Only changing the
Nice hint, will try this, thanks 👍 |
If you could add the Slovak language in the meantime, that would be very nice :)
@marian-code i added |
Thanks, everything works fine. I only had to bump the version because Nextcloud kept updating the app from the store.
I have nothing to add to @marian-code's suggestion using |
@R0Wi Hello, could you add the Ukrainian language? Thank you. |
@marian-code @melnyksergii @bahnwaerter I just implemented a first beta version in which the app only lists the Tesseract languages that are installed on the backend. I would love to get some feedback. The artifact can be found at https://github.com/R0Wi/workflow_ocr/suites/8119756582/artifacts/351406245. Please download and unpack (attention: packed twice...) the files into your NC. I will try to write some tests these days; then this could be released in the near future.
Well, I have tried it today and it gives me
Tesseract shows this output:
tesseract --list-langs
List of available languages (4):
ces
eng
osd
slk |
Thanks for testing. Could you check the server logs? Maybe also decreasing the loglevel could help. I will try to check it inside a fresh NC installation these days...
Well, there is nothing in the logs, even with the debug level setting. I see in the code that a log entry should be emitted on error. Maybe the command runs OK but something else fails? There is no log entry for a successful run of the command. I am not familiar with PHP; can I experiment a bit by adding some lines of code? As far as I know it is interpreted, so changes should work right away?
public function getInstalledLanguages() : array {
    $commandStr = 'tesseract --list-langs';
    $this->command->setCommand($commandStr);
    $success = $this->command->execute();
    $errorOutput = $this->command->getError();
    $stdErr = $this->command->getStdErr();
    $exitCode = $this->command->getExitCode();
    if (!$success) {
        throw new Exception('The command ' . $commandStr .' exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);
    }
    if ($stdErr !== '' || $errorOutput !== '') {
        $this->logger->warning('Tesseract list languages succeeded with warning(s): {stdErr}, {errorOutput}', [
            'stdErr' => $stdErr,
            'errorOutput' => $errorOutput
        ]);
    }
    $installedLangsStr = $this->command->getOutput();
    if (!$installedLangsStr) {
        throw new Exception('The command ' . $commandStr .' did not produce any output');
    }
    $lines = explode("\n", $installedLangsStr);
    return Chain::create($lines)
        ->slice(1) // Skip tesseract header line
        ->filter(function ($line) {
            return $line !== 'osd'; // Also skip "osd" (OSD is not a language)
        })
        ->array;
}
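As a rough illustration of the parsing step in the method above, here is a minimal JavaScript sketch. It is not part of workflow_ocr (the app does this in PHP); the function name `parseInstalledLanguages` is hypothetical, and trimming lines and dropping empty ones are assumptions added for robustness.

```javascript
// Hypothetical helper, not taken from workflow_ocr: parses the text printed by
// `tesseract --list-langs` into an array of language codes.
function parseInstalledLanguages(output) {
  return output
    .split("\n")
    .slice(1) // skip the "List of available languages (N):" header line
    .map((line) => line.trim()) // assumption: tolerate stray whitespace or "\r"
    .filter((line) => line !== "" && line !== "osd"); // "osd" is not a language
}

// Sample output as reported earlier in this thread.
const sample = "List of available languages (4):\nces\neng\nosd\nslk";
console.log(parseInstalledLanguages(sample)); // [ 'ces', 'eng', 'slk' ]
```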
@marian-code you're right, you can edit the file directly in your NC instance and changes will be reflected immediately. I'd suggest editing the mentioned code as follows:
public function getInstalledLanguages() : array {
    $this->logger->debug('getInstalledLanguages');
    $commandStr = 'tesseract --list-langs';
    $this->command->setCommand($commandStr);
    $this->logger->debug('executing command: ' . $commandStr);
    $success = $this->command->execute();
    $errorOutput = $this->command->getError();
    $stdErr = $this->command->getStdErr();
    $exitCode = $this->command->getExitCode();
    if (!$success) {
        throw new Exception('The command ' . $commandStr .' exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);
    }
    if ($stdErr !== '' || $errorOutput !== '') {
        $this->logger->warning('Tesseract list languages succeeded with warning(s): {stdErr}, {errorOutput}', [
            'stdErr' => $stdErr,
            'errorOutput' => $errorOutput
        ]);
    }
    $installedLangsStr = $this->command->getOutput();
    $this->logger->debug('got installed language string: ' . $installedLangsStr);
    if (!$installedLangsStr) {
        throw new Exception('The command ' . $commandStr .' did not produce any output');
    }
    $lines = explode("\n", $installedLangsStr);
    return Chain::create($lines)
        ->slice(1) // Skip tesseract header line
        ->filter(function ($line) {
            return $line !== 'osd'; // Also skip "osd" (OSD is not a language)
        })
        ->array;
}
When setting NC loglevel to
Here are the two relevant lines from the log. It seems to be working fine, so the problem must be elsewhere. {"user":"admin",
"app":"workflow_ocr",
"method":"GET",
"url":"/index.phpflow_ocr/ocrBackendInfo/installedLangs",
"message":"executing command: tesseract --list-langs",
"version":"24.0.5.1",
"data":{"app":"workflow_ocr"}}
{"user":"admin",
"app":"workflow_ocr",
"method":"GET",
"url":"/index.phpflow_ocr/ocrBackendInfo/installedLangs",
"message":"got installed language string: List of available languages (4):\nces\neng\nosd\nslk",
"version":"24.0.5.1",
"data":{"app":"workflow_ocr"}} |
You're right, that's all looking good so far. Could you open the dev-tools of your browser and see if any errors are reported?
I think I've got it. Dev-tools console shows this error:
Which points to here:
beforeMount: async function() {
    const t = await async function() {
        const t = (0, n.generateUrl)("/apps/workflow_ocr/ocrBackendInfo/installedLangs");
        return (await o.default.get(t)).data
    }();
    this.availableLanguages = e.filter((e=>t.includes(e.langCode)))
},
Apparently t is not a string or an array, so the JS errors here.
Thanks for digging into that @marian-code! So it really seems to be a frontend error then. I'll try to reproduce it (even though I'm wondering why it worked on my dev build). Could you do me a favor and also send me a screenshot of you
Thanks for all the fast replies. Here is what I found. I am only vaguely familiar with web development, so I hope I understood correctly and this is what you wanted. If anything more is needed, I am happy to help.
Response
The response I am seeing is: {"0":"ces","1":"eng","3":"slk"}
This is in line with what I would expect, since I have installed these 3 languages.
Headers
|
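To make the failure concrete, the following sketch parses the response body quoted above and calls `includes` on it, as the frontend snippet does. It only assumes a standard JavaScript runtime; the variable names are illustrative and not taken from the app.

```javascript
// Response body as reported in this thread: a JSON object, not a JSON array.
const body = '{"0":"ces","1":"eng","3":"slk"}';
const data = JSON.parse(body);

console.log(Array.isArray(data)); // false
console.log(typeof data.includes); // "undefined": plain objects have no includes()

// Calling the missing method throws, which is what breaks beforeMount.
let errorName = null;
try {
  data.includes("eng");
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // "TypeError"
```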
That's exactly the line I needed. Looks like the server serialization is a bit buggy. According to https://stackoverflow.com/a/11722121 PHP will turn any associative array into a JSON object,
I think this did not happen in my test cases because the languages array had consecutive keys in my installation. Could you please try to patch the following line: to return array_values($this->ocrBackendInfoService->getInstalledLanguages()); ? Thanks again for your participation!
Well, that is certainly interesting serialization :D
It is 😄 Since I'm also not a "real" PHP developer, I think I missed this detail. Nevertheless, I'm glad to hear that things are working correctly now. I will integrate the changes into the code and also add additional tests. This issue will then be closed once the code is merged. Thanks again (hopefully for the last time 😸 )
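The fix proposed here is server-side (`array_values` re-indexes the PHP array so it serializes as a JSON array). Purely as an illustration, a client could also normalize the payload defensively. The sketch below is an assumption, not what the app does: `toLanguageArray`, `filterAvailableLanguages`, and the entries in `supportedLanguages` are hypothetical names and sample data.

```javascript
// Hypothetical helper: accept either a JSON array or an object with numeric
// keys (what PHP emits when the array keys have gaps) and return a plain array.
function toLanguageArray(data) {
  if (Array.isArray(data)) return data;
  if (data !== null && typeof data === "object") return Object.values(data);
  return [];
}

// Hypothetical list of languages the UI knows about.
const supportedLanguages = [
  { label: "Czech", langCode: "ces" },
  { label: "English", langCode: "eng" },
  { label: "German", langCode: "deu" },
  { label: "Slovak", langCode: "slk" },
];

// Keep only the supported languages that are installed on the backend.
function filterAvailableLanguages(supported, installed) {
  const codes = new Set(toLanguageArray(installed));
  return supported.filter((lang) => codes.has(lang.langCode));
}

// Payload shape from this thread: key "2" is missing because "osd" was filtered out.
const result = filterAvailableLanguages(supportedLanguages, { 0: "ces", 1: "eng", 3: "slk" });
console.log(result.map((lang) => lang.langCode)); // [ 'ces', 'eng', 'slk' ]
```

Fixing the serialization on the server, as suggested in the comment above, keeps the API contract simple; a client-side guard like this would only be a safety net.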
Yeah, no problem. Glad to help. |
Get installed tesseract languages from backend Signed-off-by: Robin Windey <[email protected]>
* Implement #140 Get installed tesseract languages from backend Signed-off-by: Robin Windey <[email protected]> * Fix OcrBackendInfoServiceTest for #140 Signed-off-by: Robin Windey <[email protected]> * Introduce specific CommandException Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]>
* Feature/impl#144 (#145) * Register EventService class * Fire TextRecognizedEvent * Add TextRecognizedEvent class * Create sidecar and add recognized text to result * Added PdfOcrProcessor constructor argument * Added recognizedText variable to class * Added EventService * Refactored TextRecognizeEvent * Added EventService * Fixed tests * composer run cs:fix * Basic code cleanup Signed-off-by: Robin Windey <[email protected]> * Adjustments for #144 * Add additional tests * Refactor code to use more "high-level" SidecarFileAccessor Signed-off-by: Robin Windey <[email protected]> * Add docs for #144 * Add section for events to README.md * Remove TOC workflow Signed-off-by: Robin Windey <[email protected]> * Fix php7.4 syntax Signed-off-by: Robin Windey <[email protected]> * Add check if event is emitted Signed-off-by: Robin Windey <[email protected]> * Change TextRecognizedEvent interface to be more generic Linked to #144 * Adjust docs to match new interface Signed-off-by: Robin Windey <[email protected]> * Fix codecov Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]> Co-authored-by: Guido Schmitz <[email protected]> Co-authored-by: Robin Windey <[email protected]> * Implement #140 (#148) * Implement #140 Get installed tesseract languages from backend Signed-off-by: Robin Windey <[email protected]> * Fix OcrBackendInfoServiceTest for #140 Signed-off-by: Robin Windey <[email protected]> * Introduce specific CommandException Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]> Co-authored-by: g-schmitz <[email protected]> Co-authored-by: Guido Schmitz <[email protected]>
Hey,
This is closely related to #119 concerning OCR language selection. The documentation is a bit misleading, stating that you just have to install the corresponding language package.
But this holds true only for the hardcoded languages in these lists:
https://github.com/R0Wi/workflow_ocr/blob/bba5551a22b4a74a0ae23812449a9eea73b7fed0/lib/OcrProcessors/OcrMyPdfBasedProcessor.php#L39-L49
https://github.com/R0Wi/workflow_ocr/blob/bba5551a22b4a74a0ae23812449a9eea73b7fed0/src/components/WorkflowOcr.vue#L47-L57
Could you please consider creating these lists dynamically based on the installed languages, or at least include all languages available for Tesseract by default? This limitation holds back an otherwise perfect tool :)